Importing data into HBase with the Java API

阿新 • Published: 2019-02-06

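We use HBase to store a celebrity dataset made up of text information and images for each celebrity.
The text is crawled from Wikipedia with the Scrapy framework and saved as CSV.
The images are crawled from Baidu Images, 30 per person, and saved in a folder named after the celebrity.
This article therefore covers three parts:
- the spider that crawls the text
- the crawler that downloads the images
- importing the data into HBase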

Crawling Wikipedia with Scrapy

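First create a new Scrapy project.
Configure items.py; the spider below fills three fields: name, title and info.
Then add the following to settings.py: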

FEED_URI = u'file:///F:/pySpace/celebrity/info1.csv'
FEED_FORMAT = 'CSV'

These settings export the crawled items in CSV format and set where the file is saved.

Add the following to main.py (the code in this article is Python 2):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
sys.getdefaultencoding()
from scrapy import cmdline
cmdline.execute("scrapy crawl celebrity".split())

<python>
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from celebrity.items import CelebrityItem
from scrapy.http import Request
import pandas as pd
# Read the list of celebrity names to crawl
with open(r'F:\pySpace\celebrity\name_lists1.txt','r') as f:
    url_list = f.read()
url_list = url_list.split('\n')

class Celebrity(CrawlSpider):
    len_url = len(url_list)
    num =1
    name = "celebrity"
    front_url = 'https://zh.wikipedia.org/wiki/' 

    start_urls = [front_url + url_list[num].encode('utf-8')]

    def parse(self, response):
        item = CelebrityItem()
        selector = Selector(response)
        body = selector.xpath('//*[@id="mw-content-text"]')[0]
        Title = body.xpath('//span[@class="mw-headline"]/text()').extract()
        titles = ['簡介']  # '簡介' (introduction) labels the lead paragraphs
        # Skip the reference, notes and external-link sections
        skipped = ('參考文獻', '註釋', '外部連結', '參考資料')
        for i in range(len(Title)):
            if Title[i] not in skipped:
                titles.append(Title[i])
        Passage = selector.xpath('//*[@id="mw-content-text"]/p')
        all_info = []
        for eachPassage in Passage:
            info =''.join(eachPassage.xpath('.//text()').extract())
            if info!= '':
                all_info.append(info.strip())
        Ul_list = selector.xpath('//*[@id="mw-content-text"]/ul')
        for eachul in Ul_list:
            info = ''.join(eachul.xpath('.//text()').extract())
            if info != '' and info!= '\n' and info != ' ':
                all_info.append(info)

        # Output with section titles: split the paragraphs evenly among the titles
        k = 0
        epoch = len(all_info) / len(titles)
        i=0
        if epoch >0:
            for i  in range(len(titles)):

                if i == len(titles)-1:
                    item['name'] = url_list[self.num].encode('utf-8')
                    item['title'] = titles[i]
                    item['info'] = ''.join(all_info[k:])
                else :
                    item['name'] = url_list[self.num].encode('utf-8')
                    item['title'] = titles[i]
                    item['info'] = ''.join(all_info[k:k+epoch])
                    k = k+epoch
                yield item
        else :
            for j in range(len(all_info)):
                item['name'] = url_list[self.num].encode('utf-8')
                item['title'] = titles[j]
                item['info'] = all_info[j]
                yield item


        # Output without section titles
        # for j in range(len(all_info)):
        #     item['name'] = url_list[self.num].encode('utf-8')
        #     item['info'] = all_info[j]
        #     yield item

        # Print the name from the list; item['name'] is unset when the page had no text
        print url_list[self.num]
        self.num = self.num + 1
        print self.num
        if self.num < self.len_url:
            nextUrl =self.front_url + url_list[self.num].encode('utf-8')
            yield Request(nextUrl,callback=self.parse)
</python>
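
The spider pairs text with headings by a simple heuristic rather than by the page's real section boundaries: it divides the number of collected paragraphs by the number of titles (`epoch`), gives each title that many paragraphs, and hands whatever is left to the last title. The standalone sketch below reproduces that logic outside Scrapy so it can be tried on small inputs. It is written for Python 3, and the function name `pair_titles_with_text` is made up for illustration.

```python
def pair_titles_with_text(titles, paragraphs):
    """Mirror the spider's heuristic: each title gets an equal share of paragraphs."""
    if not titles:
        return []
    share = len(paragraphs) // len(titles)  # called `epoch` in the spider
    if share == 0:
        # Fewer paragraphs than titles: pair them one to one, as the spider does.
        return [(titles[j], paragraphs[j]) for j in range(len(paragraphs))]
    pairs = []
    start = 0
    for i, title in enumerate(titles):
        if i == len(titles) - 1:
            chunk = paragraphs[start:]            # last title takes the remainder
        else:
            chunk = paragraphs[start:start + share]
        pairs.append((title, ''.join(chunk)))
        start += share
    return pairs
```

For two titles and five paragraphs, the first title receives two paragraphs and the second receives the remaining three, so the resulting `info` text can be attached to the wrong heading when sections differ in length.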

Crawling the images

import urllib2
import re
import os
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

def img_spider(name_file):

    user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    headers = {'User-Agent':user_agent}

    with open(name_file) as f:
        name_list = [name.rstrip().decode('utf-8') for name in f.readlines()]

    for name in name_list:
        if not os.path.exists('F:/pySpace/celebrity/img_data/' + name):
            os.makedirs('F:/pySpace/celebrity/img_data/' + name)
            try:
                url = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=" + name.replace(' ','%20') + "&cg=girl&rn=60&pn=60"
                req = urllib2.Request(url, headers=headers)
                res = urllib2.urlopen(req)
                page = res.read()
                #print page
                img_srcs = re.findall('"objURL":"(.*?)"', page, re.S)
                print name,len(img_srcs)
            except:
                print name," error:"
                continue
            j = 1
            src_txt = ''

            for src in img_srcs:
                with open('F:/pySpace/celebrity/img_data/' + name + '/' + str(j)+'.jpg','wb') as p:
                    try:
                        print "downloading No.%d"%j
                        req = urllib2.Request(src, headers=headers)
                        # Pass the Request object so the User-Agent header is actually sent
                        img = urllib2.urlopen(req,timeout=3)
                        p.write(img.read())
                    except:
                        # On failure keep j unchanged so the next image overwrites this file
                        print "No.%d error:"%j
                        continue
                src_txt = src_txt + src + '\n'
                if j==30:
                    break
                j = j+1
            # Save the source URLs to a txt file
            with open('F:/pySpace/celebrity/img_data/' + name + '/' + name +'.txt','wb') as p2:
                p2.write(src_txt)
                print "save %s txt done"%name


if __name__ == '__main__':
    name_file = "name_lists1.txt"
    img_spider(name_file)
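
Two steps of the image crawler are easy to isolate: building the search URL for a name and pulling the `objURL` values out of the response. The sketch below does both in Python 3 (the crawler above is Python 2). The helper names are hypothetical, and percent-encoding the whole name with `urllib.parse.quote` is an assumption; the original only replaces spaces with `%20`.

```python
import re
from urllib.parse import quote

# Same endpoint and query parameters as the crawler above.
SEARCH_URL = ("http://image.baidu.com/search/avatarjson"
              "?tn=resultjsonavatarnew&ie=utf-8&word={word}&cg=girl&rn=60&pn=60")


def build_search_url(name):
    """Return the search URL for one name, with the name percent-encoded."""
    return SEARCH_URL.format(word=quote(name))


def extract_image_urls(page):
    """Return every objURL value found in the raw response text, in order."""
    return re.findall(r'"objURL":"(.*?)"', page, re.S)
```

Keeping these two pieces as pure functions makes them easy to check against a saved response before running the full download loop.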

Importing the data into HBase through the Java API

Create two tables in HBase: celebrity (stores the images) and celebrity_info (stores the text). In both tables the celebrity's name is the rowkey.
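
With this layout, one CSV record becomes one cell: the `name` column is the rowkey, `title` is the column qualifier inside the `cf1` family, and `info` is the value. Records for the same person therefore end up in the same row, and two records with the same name and title land in the same cell, where HBase keeps them as versions up to the family's configured maximum. The Python 3 sketch below models that mapping with plain dictionaries. It does not talk to HBase, it keeps only the latest value per cell, and the function names are illustrative only.

```python
import csv
import io


def csv_to_cells(csv_text, family="cf1"):
    """Turn rows of the crawled CSV into (rowkey, family, qualifier, value) cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["name"].encode("utf-8"), family.encode("utf-8"),
         row["title"].encode("utf-8"), row["info"].encode("utf-8"))
        for row in reader
    ]


def group_by_rowkey(cells):
    """Group cells the way HBase would: one row per rowkey, one entry per column."""
    table = {}
    for rowkey, family, qualifier, value in cells:
        table.setdefault(rowkey, {})[(family, qualifier)] = value
    return table
```

Encoding every string explicitly as UTF-8 in the sketch is deliberate: HBase stores raw bytes, so the writer and the reader have to agree on the charset used for names and titles.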

<java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import com.csvreader.CsvReader;
import java.nio.charset.Charset;
import java.io.*;
/**
 * Created by mxy on 2016/10/31.
 */
public class CelebrityDataBase {

    /* Create the table; if it already exists, drop it first and then recreate it */
    public void createTable(String tablename)throws Exception{
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum","node4,node5,node6");
        HBaseAdmin admin = new HBaseAdmin(config);
        String table = tablename;

        if(admin.isTableAvailable(table)){
            admin.disableTable(table);
            admin.deleteTable(table);
        }
        HTableDescriptor t = new HTableDescriptor(table.getBytes());
        HColumnDescriptor cf1 = new HColumnDescriptor("cf1".getBytes()) ;
        cf1.setMaxVersions(10);
        t.addFamily(cf1);
        admin.createTable(t);
        admin.close();
    }
    // Insert the text data from the CSV file
    public void putInfo()throws Exception{
        CsvReader r = new CsvReader("F://pySpace//celebrity//info.csv",',', Charset.forName("utf-8"));
        r.readHeaders();
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum","node4,node5,node6");
        HTable table = new HTable(config,"celebrity_info");
        while(r.readRecord()){

            System.out.println(r.get("name"));
//          String rowkey = r.get("name");
            Put put = new Put(r.get("name").getBytes());
            put.add("cf1".getBytes(),r.get("title").getBytes(),r.get("info").getBytes());
            table.put(put);

        }
        r.close();
        table.close();

    }

    // Read an image back from HBase
    public void getImage(String celebrity_name,String img_num)throws Exception{
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum","node4,node5,node6");
        HTable table = new HTable(config,"celebrity");
        Get get = new Get(celebrity_name.getBytes());
        Result res = table.get(get);
        Cell c1 = res.getColumnLatestCell("cf1".getBytes(),img_num.getBytes());
        File file=new File("D://"+celebrity_name+img_num);// output path for the image rebuilt from the stored bytes
        FileOutputStream fos=new FileOutputStream(file);
        fos.write(CellUtil.cloneValue(c1));
        fos.flush();
        System.out.println(file.length());
        fos.close();
        table.close();
    }

    // Query the text data
    public void getInfo(String name) throws Exception{
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum","node4,node5,node6");
        HTable table = new HTable(config,"celebrity_info");

        Get get = new Get(name.getBytes());
        Result result = table.get(get);
        for(Cell cell : result.rawCells()){
            System.out.println("rowKey:" + new String(CellUtil.cloneRow(cell))
                    + " cfName:" + new String(CellUtil.cloneFamily(cell))
                    + " qualifierName:" + new String(CellUtil.cloneQualifier(cell))
                    + " value:" + new String(CellUtil.cloneValue(cell)));
        }
        table.close();
    }

    // Insert an image
    public void putImage(String each_celebrity,String each_img)throws Exception{

        String str = String.format("F://pySpace//celebrity//img_data//%s//%s",each_celebrity,each_img);
        File file = new File(str);
        // Read the whole file; a single InputStream.read() call may return fewer bytes
        byte[] bbb = java.nio.file.Files.readAllBytes(file.toPath());
        System.out.println(bbb.length);

        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum","node4,node5,node6");
        HTable table = new HTable(config,"celebrity");
        // rowkey = celebrity name, qualifier = image file name, value = image bytes
        String rowkey = each_celebrity;
        Put put = new Put(rowkey.getBytes());
        put.add("cf1".getBytes(),each_img.getBytes(),bbb);
        table.put(put);
        table.close();

    }

    public static void main(String args[]){
        CelebrityDataBase pt = new CelebrityDataBase();
        try {
            pt.createTable("celebrity");
            pt.createTable("celebrity_info");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("createTable error");
        }
        String root_path = "F://pySpace//celebrity//img_data";
        File file = new File(root_path);
        File[] files = file.listFiles();

        for(int i = 0;i < files.length;i++){
            String each_path = root_path +"//"+ files[i].getName();
            File celebrity_file = new File(each_path);
            File[] celebrity_files = celebrity_file.listFiles();
            System.out.println(each_path);
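            for(int j = 0;j<celebrity_files.length;j++){
                // Skip the .txt file that lists the source URLs; only images are stored
                if(!celebrity_files[j].getName().endsWith(".jpg")){
                    continue;
                }
                try {
                    pt.putImage(files[i].getName(),celebrity_files[j].getName());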
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("putImage error");
                }
            }

        }
        // Store the text information
        try {
            pt.putInfo();
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Read an image back
        try {
            pt.getImage("龔照勝","13.jpg");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("getImage error");
        }
        // Read the text back
        try {
            pt.getInfo("成龍");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

</java>
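
The `main` method above walks `img_data`, treats each sub-folder name as a rowkey and each image file name as a column qualifier, and stores the file's bytes as the value. As a rough cross-check of what should end up in the `celebrity` table, the Python 3 sketch below builds the same (rowkey, qualifier, size) listing from a folder tree. It only reads the local disk; `list_image_cells` is a hypothetical helper, and filtering on the `.jpg` suffix is an assumption based on how the crawler names its files.

```python
from pathlib import Path


def list_image_cells(root):
    """Return (rowkey, qualifier, size_in_bytes) for every .jpg under root/<name>/."""
    cells = []
    for person_dir in sorted(Path(root).iterdir()):
        if not person_dir.is_dir():
            continue  # ignore stray files next to the per-person folders
        for image in sorted(person_dir.glob("*.jpg")):
            cells.append((person_dir.name, image.name, image.stat().st_size))
    return cells
```

Comparing the sizes in this listing with the value lengths printed by `putImage` gives a quick way to spot images that were truncated or skipped during the import.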
