Consuming Kafka data with Python and bulk-inserting it into ES

阿新 • Published: 2019-02-17

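1. Bulk insertion into ES

To make later configuration changes easier, the logging configuration is kept in logging.conf.
The elasticsearch client is used for the bulk operations. Install the dependency first: sudo pip install Elasticsearch2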
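
The contents of logging.conf are not shown in this post. The following is only a minimal sketch of what such a file might look like, loaded the same way the classes below load it. The logger names `msg` and `data` come from the code in this post; the handler name, formatter name, levels and the helper `load_logging_conf` are assumptions made for illustration.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical logging.conf: the real file is not shown in the post.
# Only the logger names "msg" and "data" are taken from the post's code.
LOGGING_CONF = """
[loggers]
keys=root,msg,data

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=WARNING
handlers=console

[logger_msg]
level=INFO
handlers=console
qualname=msg
propagate=0

[logger_data]
level=ERROR
handlers=console
qualname=data
propagate=0

[handler_console]
class=StreamHandler
level=DEBUG
formatter=simple
args=(sys.stderr,)

[formatter_simple]
format=%(asctime)s %(name)s %(levelname)s %(message)s
"""


def load_logging_conf(text=LOGGING_CONF):
    """Write the sample config to a temporary file and load it with fileConfig."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
        f.write(text)
        path = f.name
    try:
        logging.config.fileConfig(path, disable_existing_loggers=False)
    finally:
        os.remove(path)
    return logging.getLogger("msg"), logging.getLogger("data")
```

Keeping levels and handlers in the file means they can be changed without touching the consumer code.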

import logging
import logging.config

from elasticsearch import Elasticsearch


class ImportEsData:

    logging.config.fileConfig("logging.conf")
    logger = logging.getLogger("msg")

    def __init__(self, hosts, index, type):
        # hosts is a comma-separated string of ES nodes
        self.es = Elasticsearch(hosts=hosts.strip(',').split(','), timeout=5000)
        self.index = index
        self.type = type

    def set_date(self, data):
        # Index a single document (the bulk version follows in section 2)
        # es.index(index="test-index",doc_type="test-type",id=42,body={"any":"data","timestamp":datetime.now()})
        self.es.index(index=self.index, doc_type=self.type, body=data)
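
For bulk insertion, each document is wrapped in an action dict that names its target index and type. A minimal sketch of that wrapping step, assuming the Kafka message values are JSON strings; the function name `build_actions` is hypothetical, and parsing with `json.loads` is an addition here (the consumer code in section 2 passes `message.value` through unparsed):

```python
import json


def build_actions(values, index_name, type_name):
    """Wrap raw JSON strings in action dicts of the shape used for bulk indexing."""
    actions = []
    for value in values:
        actions.append({
            "_index": index_name,
            "_type": type_name,
            # Assumption: parse early so malformed JSON fails before the bulk call
            "_source": json.loads(value),
        })
    return actions
```

For example, `build_actions(['{"a": 1}'], "idx", "doc")` yields one action whose `_source` is the parsed dict.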

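2. Consuming Kafka with pykafka

1. Because Kafka is version 0.8 and pykafka does not support zk (ZooKeeper) in this setup, only get_simple_consumer can be used.
2. So that several applications can consume at the same time without consuming the same data twice, each application consumes exactly one partition.
3. To make sure data still reaches ES within a bounded time when fewer than the batch size of 10000 messages have arrived, consumer_timeout_ms is set so that the consumer stops blocking after that wait.
4. Once the blocking wait ends, the iteration over the consumer stops and no more data is consumed, so the loop over self.consumer is wrapped in an outer while True loop.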
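
The interplay of points 3 and 4 can be sketched without Kafka or ES. In this simplified model, each inner iterable stands in for one pass over a consumer that stops iterating when consumer_timeout_ms elapses; the function name `consume_in_batches` and its parameters are hypothetical.

```python
def consume_in_batches(poll_rounds, batch_size, flush):
    """Buffer messages; flush on a full batch or when a poll round ends.

    poll_rounds: iterable of iterables, one per pass over the consumer.
    flush: callable that receives the list of buffered messages.
    """
    buffer = []
    for round_messages in poll_rounds:      # plays the role of `while True`
        for message in round_messages:      # plays the role of `for message in consumer`
            buffer.append(message)
            if len(buffer) >= batch_size:   # full batch: flush immediately
                flush(buffer)
                buffer = []
        if buffer:                          # timeout reached with a partial batch
            flush(buffer)
            buffer = []
```

With a batch size of 2 and rounds `[[1, 2, 3, 4, 5], [], [6]]`, the flushes are `[1, 2]`, `[3, 4]`, `[5]` and `[6]`: full batches go out at once, and leftovers go out when a round ends.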

#!/usr/bin/python
# -*- coding: UTF-8 -*- 

from pykafka import KafkaClient
import logging
import logging.config
from ConfigUtil import ConfigUtil
import datetime


class KafkaPython:
    logging.config.fileConfig("logging.conf")
    logger = logging.getLogger("msg")
    logger_data = logging.getLogger("data")

    def __init__(self):
        self.server = ConfigUtil().get("kafka","kafka_server")
        self.topic  = ConfigUtil().get("kafka","topic")
        self.group = ConfigUtil().get("kafka","group")
        self.partition_id = int(ConfigUtil().get("kafka","partition"))
        self.consumer_timeout_ms = int(ConfigUtil().get("kafka","consumer_timeout_ms"))
        self.consumer = None
        self.hosts = ConfigUtil().get("es","hosts")
        self.index_name = ConfigUtil().get("es","index_name")
        self.type_name = ConfigUtil().get("es","type_name")


    def getConnect(self):
        client = KafkaClient(self.server)
        topic = client.topics[self.topic]
        p = topic.partitions
        ps={p.get(self.partition_id)}

        self.consumer = topic.get_simple_consumer(
            consumer_group=self.group,
            auto_commit_enable=True,
            consumer_timeout_ms=self.consumer_timeout_ms,
            # num_consumer_fetchers=1,
            # consumer_id='test1',
            partitions=ps
            )
        self.starttime = datetime.datetime.now()


    def beginConsumer(self):
        print("beginConsumer kafka-python")
        imprtEsData = ImportEsData(self.hosts,self.index_name,self.type_name)
        # Build the list of bulk actions
        count = 0
        ACTIONS = [] 

        while True:
            endtime = datetime.datetime.now()
            print((endtime - self.starttime).seconds)
            for message in self.consumer:
                if message is not None:
                    try:
                        count = count + 1
                        # print(str(message.partition.id)+","+str(message.offset)+","+str(count))
                        # self.logger.info(str(message.partition.id)+","+str(message.offset)+","+str(count))
                        action = {  
                            "_index": self.index_name,  
                            "_type": self.type_name,  
                            "_source": message.value
                        }
                        ACTIONS.append(action)
                        if len(ACTIONS) >= 10000:
                            imprtEsData.set_date(ACTIONS)
                            ACTIONS = []
                            self.consumer.commit_offsets()
                            endtime = datetime.datetime.now()
                            print((endtime - self.starttime).seconds)
                            #break
                    except (Exception) as e:
                        # self.consumer.commit_offsets()
                        print(e)
                        self.logger.error(e)
                        self.logger.error(str(message.partition.id)+","+str(message.offset)+","+str(message.value)+"\n")
                        # self.logger_data.error(message.value+"\n")
                    # self.consumer.commit_offsets()


            if len(ACTIONS) > 0:
                self.logger.info("consumer_timeout_ms elapsed; inserting the buffered data into es")
                imprtEsData.set_date(ACTIONS)
                ACTIONS = []
                self.consumer.commit_offsets()




    def disConnect(self):
        self.consumer.close()


from elasticsearch import Elasticsearch  
from elasticsearch.helpers import bulk
class ImportEsData:

    logging.config.fileConfig("logging.conf")
    logger = logging.getLogger("msg")

    def __init__(self,hosts,index,type):
       self.es = Elasticsearch(hosts=hosts.strip(',').split(','), timeout=5000)
       self.index = index
       self.type = type


    def set_date(self,data):  
        # Bulk insert
        success = bulk(self.es, data, index=self.index, raise_on_error=True)  
        self.logger.info(success)
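
The ConfigUtil module imported above is not shown in this post. A minimal sketch of what it could look like, assuming an INI file read with the standard configparser module; the default file name `config.ini` and the `path` parameter are assumptions, while the section and key names in the docstring are the ones the consumer code reads.

```python
import configparser


class ConfigUtil:
    """Hypothetical stand-in for the ConfigUtil module used by KafkaPython.

    Expected layout, based on the keys the consumer reads:
        [kafka]  kafka_server, topic, group, partition, consumer_timeout_ms
        [es]     hosts, index_name, type_name
    """

    def __init__(self, path="config.ini"):  # file name is an assumption
        self._parser = configparser.ConfigParser()
        # read() returns the list of files it could parse; empty means failure
        if not self._parser.read(path, encoding="utf-8"):
            raise IOError("cannot read config file: %s" % path)

    def get(self, section, key):
        """Return the raw string value; callers convert with int() as needed."""
        return self._parser.get(section, key)
```

Because KafkaPython calls `ConfigUtil()` once per setting, this sketch re-reads the file each time; creating one instance and reusing it would avoid that.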

3. Running

if __name__ == '__main__':
    kp = KafkaPython()
    kp.getConnect()
    kp.beginConsumer()
    # kp.disConnect()

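Note: this is a simple plugin that reads data from Kafka into a list and, once the list reaches a threshold, bulk-inserts it into ES.
Stress testing of the bulk insertion is still in progress.
Discussion is welcome.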
