Kafka Cluster Broker-Side Parameter Settings and Tuning Guidelines - Kafka in Commercial Production Practice
阿新 • Published: 2018-12-20
1 Distributed streaming platform
Apache Kafka® is a distributed streaming platform. What exactly does that mean? A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up. First a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
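To make the concepts above concrete, here is a minimal in-memory sketch of records grouped into topics. It is not Kafka's API: `Record`, `ToyCluster`, `publish`, and `read` are hypothetical names chosen for illustration, and the model ignores partitions, replication, and persistence.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Illustrative stand-in for a Kafka record: key, value, timestamp."""
    key: Optional[bytes]
    value: bytes
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


class ToyCluster:
    """Hypothetical in-memory model: topic name -> append-only list of records."""

    def __init__(self) -> None:
        self._topics = defaultdict(list)

    def publish(self, topic: str, record: Record) -> int:
        """Append a record and return its position (offset) in the topic."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def read(self, topic: str, offset: int = 0) -> list:
        """Return every record at or after `offset`; reading does not remove data."""
        return list(self._topics[topic][offset:])


cluster = ToyCluster()
cluster.publish("page-views", Record(key=b"user-1", value=b"/home"))
cluster.publish("page-views", Record(key=b"user-2", value=b"/cart"))
views = cluster.read("page-views")
```

Reading leaves the records in place, which is the main difference from a traditional queue where a consumed message disappears.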
2 Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
- Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
- The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
- As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
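The two storage ideas above, acknowledging a write only after replication and letting the client choose its read position, can be sketched with a toy model. This is a simplification under stated assumptions: `ReplicatedLog` is a hypothetical class, and the synchronous copy loop stands in for Kafka's real replication protocol, in which followers fetch from the leader and the producer opts in to waiting with the `acks=all` setting.

```python
class ReplicatedLog:
    """Toy commit log: a write is acknowledged only after every replica has it."""

    def __init__(self, replica_count: int = 3) -> None:
        # Index 0 plays the leader; the rest play followers.
        self.replicas = [[] for _ in range(replica_count)]

    def append(self, message: bytes) -> int:
        """Copy the message to all replicas, then return its offset as the ack."""
        for replica in self.replicas:
            replica.append(message)
        return len(self.replicas[0]) - 1

    def read(self, offset: int, max_messages: int = 10, replica: int = 0) -> list:
        """Serve a read from any replica; the caller chooses the offset."""
        return self.replicas[replica][offset:offset + max_messages]


log = ReplicatedLog(replica_count=3)
acked_offsets = [log.append(m) for m in (b"a", b"b", b"c")]

# The consumer owns its position, so it can advance or rewind at will.
position = 0
first_pass = log.read(position)
position += len(first_pass)
replayed = log.read(0, max_messages=2)

# An acknowledged message is present on every replica, not only the leader.
survivor_copy = log.read(0, replica=2)
```

Because the log never tracks who has read what, rewinding to offset 0 costs nothing on the storage side.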
3 The Secret Behind Kafka's High Throughput
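- Suppose a user program wants to send the contents of a file over the network. The program runs in user space, while the file and the network socket are hardware resources, with kernel space sitting between the two. Inside the operating system, the whole process is therefore: the data is read from disk into a kernel buffer, copied from there into a user-space buffer, copied back into the kernel's socket buffer, and finally sent to the network card.
Since Linux kernel 2.2, a system call mechanism known as "zero-copy" has been available. It skips the copy through the user buffer by establishing a direct mapping between disk space and memory, so the data is no longer copied into a user-space buffer.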
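- A Kafka queue (topic) is divided into multiple partitions, and each partition is further divided into multiple segments, so the messages in one queue are actually stored across many segment files. Because of this segmentation, every file operation works on a small file, which increases the capacity for parallel processing.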
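- Kafka allows messages to be sent in batches: messages are first buffered in memory and then sent in bulk with a single request. For example, you can send once the buffered messages reach a certain count, or once they have been buffered for a fixed time, such as every 100 messages or every 5 seconds. This strategy greatly reduces the number of I/O operations on the server side.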
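- Kafka also supports compressing message sets. A producer can compress a message set with GZIP, Snappy, or LZ4. The benefit of compression is that less data is transferred, which eases the load on the network.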
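The zero-copy point in the first bullet can be tried out from Python, whose standard `socket.sendfile()` method uses the operating system's `sendfile` call where available. This is only an illustration of the technique: Kafka itself is a JVM application and is generally described as reaching the same system call through Java's `FileChannel.transferTo`. The helper name `send_file_zero_copy` and the payload are made up for the demo.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a file over a connected stream socket and return the bytes sent.

    socket.sendfile() uses os.sendfile() where the platform supports it and
    falls back to ordinary send() calls otherwise.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Demo on a local socket pair so the example needs no network setup.
payload = b"kafka-log-segment-bytes" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

sender, receiver = socket.socketpair()
try:
    sent = send_file_zero_copy(sender, segment_path)
    sender.close()  # signals end-of-stream to the receiver
    chunks = []
    while True:
        chunk = receiver.recv(65536)
        if not chunk:
            break
        chunks.append(chunk)
    received = b"".join(chunks)
finally:
    sender.close()
    receiver.close()
    os.unlink(segment_path)
```

Note that the sending side never holds the file contents in a Python object; only the receiver materializes the bytes.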
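The segment point in the second bullet can be sketched as a log that rolls to a new segment once the current one is full and finds the right segment for an offset by binary search. `SegmentedLog` is a hypothetical in-memory model; real segments roll by size or age rather than message count, and the only detail borrowed from Kafka here is naming a segment file after its zero-padded base offset.

```python
import bisect


class SegmentedLog:
    """Toy partition log split into fixed-size segments keyed by base offset."""

    def __init__(self, max_messages_per_segment: int = 3) -> None:
        self.max_per_segment = max_messages_per_segment
        self.base_offsets = [0]   # sorted base offset of each segment
        self.segments = [[]]      # segments[i] holds messages for base_offsets[i]
        self.next_offset = 0

    def append(self, message: bytes) -> int:
        """Append to the active (last) segment, rolling a new one when full."""
        if len(self.segments[-1]) >= self.max_per_segment:
            self.base_offsets.append(self.next_offset)
            self.segments.append([])
        self.segments[-1].append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset: int) -> bytes:
        """Binary-search for the segment that owns `offset`, then index into it."""
        if not 0 <= offset < self.next_offset:
            raise IndexError(f"offset {offset} out of range")
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i][offset - self.base_offsets[i]]

    def segment_file_name(self, index: int) -> str:
        """Name a segment by its 20-digit zero-padded base offset."""
        return f"{self.base_offsets[index]:020d}.log"


log = SegmentedLog(max_messages_per_segment=3)
for n in range(8):
    log.append(f"msg-{n}".encode())
```

A lookup touches only one small segment, and old segments can be deleted whole without rewriting the rest of the log.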
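The batching policy in the third bullet (send after 100 messages or after 5 seconds, whichever comes first) can be sketched as a small buffer class. `BatchingSender` and its parameters are hypothetical; the real Kafka producer exposes comparable knobs as `batch.size` (measured in bytes, not messages) and `linger.ms`. In this sketch the time check only runs when `maybe_flush()` is called, so a real sender would also call it from a timer.

```python
import time
from typing import Callable, List


class BatchingSender:
    """Buffer messages and flush when a count or age threshold is reached."""

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_messages: int = 100, max_wait_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_messages = max_messages
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buffer: List[bytes] = []
        self._oldest = 0.0  # arrival time of the first buffered message

    def add(self, message: bytes) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(message)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the batch is full or its oldest message has waited too long."""
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_messages
        expired = self._clock() - self._oldest >= self._max_wait_s
        if full or expired:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)  # one request carries the whole batch


requests: List[List[bytes]] = []
sender = BatchingSender(requests.append, max_messages=100, max_wait_s=5.0)
for n in range(250):
    sender.add(f"event-{n}".encode())
sender.flush()  # push out the 50 messages still buffered
```

Here 250 messages reach the "server" in 3 requests instead of 250, which is the I/O reduction the bullet describes.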
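The compression point in the last bullet is easy to check with the standard library's `gzip` module: compressing a whole message set once is far more effective than compressing each small message separately, because repeated field names are shared across the batch. The message contents below are invented sample data. In the Kafka producer the codec is chosen with the `compression.type` setting; Snappy and LZ4 are not in Python's standard library, so only GZIP is shown.

```python
import gzip

# Invented sample data: 1000 small JSON-like events with repeated field names.
messages = [
    f'{{"user_id": {n}, "event": "page_view", "url": "/products/{n % 10}"}}'.encode()
    for n in range(1000)
]

# Option 1: compress each message on its own.
per_message_bytes = sum(len(gzip.compress(m)) for m in messages)

# Option 2: compress the whole batch once, so repeated field names are shared.
batch = b"\n".join(messages)
batch_bytes = len(gzip.compress(batch))

raw_bytes = sum(len(m) for m in messages)

# Decompression restores the original message set exactly.
restored = gzip.decompress(gzip.compress(batch)).split(b"\n")
```

Comparing `raw_bytes`, `per_message_bytes`, and `batch_bytes` shows why compression pays off mainly when it is combined with batching.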