1. 程式人生 > >DataFu's Hourglass: Incremental Data Processing in Hadoop

DataFu's Hourglass: Incremental Data Processing in Hadoop

For a large scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this dashboard up to date, we can schedule a query that runs daily and gathers the stats for the last 30 days. However, this simple implementation would be wasteful: only one day of data has changed, but we'd be consuming and recalculating the stats for all 30.

A more efficient solution is to make the query incremental: using basic arithmetic, we can update the output from the previous day by adding and subtracting input data. This enables the job to process only the new data, significantly reducing the computational resources required. Unfortunately, although there are many benefits to the incremental approach, getting incremental jobs right is hard:

  • The job must maintain state to keep track of what has already been done, and compare this against the input to determine what to process.
  • If the previous output is reused, then the job needs to be written to consume not just new input data, but also previous outputs.
  • There are more things that can go wrong with an incremental job, so you typically need to spend more time writing automated tests to make sure things are working.

To solve these problems, we are happy to announce that we have open sourced Hourglass, a framework that makes it much easier to write incremental Hadoop jobs. We are releasing Hourglass under the Apache 2.0 License as part of the DataFu project. We will be presenting our "Hourglass: a Library for Incremental Processing on Hadoop" paper at the IEEE BigData 2013 conference on October 9th.

In this post, we will give an overview of the basic concepts behind Hourglass and walk through examples of using the framework to solve processing tasks incrementally. The first example presents a job that counts how many times a member has logged in to a site. The second example presents a job that estimates the number of members who have visited in the past thirty days. Lastly, we will show you how to get the code and start writing your own incremental hadoop jobs.

Basic Concepts

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.

We designed Hourglass with these two use cases in mind. Our goal was to design building blocks that could efficiently solve these problems while maintaining a programming model familiar to developers of MapReduce jobs.

The two major building blocks of incremental processing with Hourglass are a pair of Hadoop jobs:

  • Partition-preserving: consumes partitioned input data and produces partitioned output.
  • Partition-collapsing: consumes partitioned input data and merges it to produce a single output.

We'll discuss these two jobs in the next two sections.

Partition-preserving job

Concepts, Partition Preserving Job

In the partition-preserving job, input data that is partitioned by day is consumed and output data is produced that is also partitioned by day. This is equivalent to running one MapReduce job separately for each day of input data. Suppose that the input data is a page view event and the goal is to count the number of page views by member. This job would produce the page view counts per member, partitioned by day.

Partition-collapsing job

Concepts, Partition Collapsing Job

In the partition-collapsing job, input data that is partitioned by day is consumed and a single output is produced. If the input data is a page view event and the goal is to count the number of page views by member, then this job would produce the page view counts per member over the entire n days.

Functionally, the partition-collapsing job is not too different from a standard MapReduce job. However, one very useful feature it has is the ability to reuse its previous output, enabling it to avoid reprocessing input data. So if day n+1 arrives, it can merge it with the previous output, without having to reprocess days 1 through n. For many aggregation problems, this makes the computation much more efficient.

Concepts, Partition Collapsing Job with reuse

Hourglass programming model

The Hourglass jobs are implemented as MapReduce jobs for Hadoop:

Concepts, MapReduce

The map method receives values from the input data and, for each input, produces zero or more key-value pairs as output. Implementing the map operation is similar to implementing the same for Hadoop, just with a different interface. For the partition-preserving job, Hourglass automatically keeps the data partitioned by day, so the developer can focus purely on the application logic.

The reduce method receives each key and a list of values. For each key, it produces the same key and a single new value. In some cases, it could produce no output at all for a particular key. In a standard reduce implementation in Hadoop, the programmer is provided an interface to the list of values and is responsible for pulling each value from it. With an accumulator, this is reversed: the values are passed in one at a time and at most one value can be produced as output.

The combine method is optional and can be used as an optimization. Its purpose is to reduce the amount of data that is passed to the reducer, limiting I/O costs. Like the reducer, it also uses an accumulator. For each key, it produces the same key and a single new value, where the input and output values have the same type.

Hourglass uses Avro for all of the input and output data types in the diagram above, namely k, v, v2, and v3. One of the tasks when programming with Hourglass is to define the schemas for these types. The exception is the input schema, which is implicitly determined by the jobs when the input is inspected.

Example 1: Counting Events Per Member

With the basic concepts out of the way, let's look at an example. Suppose that we have a website that tracks user logins as an event, and for each event, the member ID is recorded. These events are collected and stored in HDFS in Avro under paths with the format /data/event/yyyy/MM/dd. Suppose for this example our Avro schema is:

The goal is to count how many times each member has logged in over the entire history and produce a daily report containing these counts. One solution is to simply consume all data under /data/event each day and aggregate by member ID. While this solution works, it is very wasteful (and only gets more wasteful over time), as it recomputes all the data every day, even though only 1 day worth of data has changed. Wouldn't it be better if we could merge the previous result with the new data? With Hourglass you can.

To continue our example, let's say there are two days of data currently available, 2013/03/15 and 2013/03/16, and that their contents are:

Let's aggregate the counts by member ID using Hourglass. To perform the aggregation we will use PartitionCollapsingIncrementalJob, which takes a partitioned data set and collapses all the partitions together into a single output. The goal is to aggregate the two days of input and produce a single day of output, as in the following diagram:

Example 1, Step 1

First, create the job:

Next, we will define schemas for the key and value used by the job. The key affects how data is grouped in the reducer when we perform the aggregation. In this case, it will be the member ID. The value is the piece of data being aggregated, which will be an integer representing the count.

This produces the following representation:

Now we can tell the job what our schemas are. Hourglass allows two different value types. One is the intermediate value type that is produced by the mapper and combiner. The other is the output value type, the product of the reducer. In this case we will use the same value type for each.

Next, we will tell Hourglass where to find the data, where to write the data, and that we want to reuse the previous output.

Now let's get into some application logic. The mapper will produce a key-value pair from each input record, consisting of the member ID and a count, which for each input record will just be 1.

An accumulator is responsible for aggregating this data. Records will be grouped by member ID and then passed to the accumulator one-by-one. The accumulator keeps a running total and adds each input count to it. When all data has been passed to it, the getFinal() method will be called, which returns the output record containing the count.

Since the intermediate and output values have the same schema, the accumulator can also be used for the combiner, so let's indicate that we want it to be used for that:

Finally, we run the job.

When we inspect the output we find that the counts match what we expect:

Now suppose that a new day of data becomes available:

Let's run the job again. Since Hourglass already has a result for the previous day, it consumes the new day of input and the previous output, rather than all the input data it already processed.

Example 1, Step 2

The previous output is passed to the accumulator, where it is aggregated with the new data. This produces the output we expect:

In this example, we only have a few days of input data, so the impact of incrementally processing the new data is small. However, as the size of the input data grows, the benefit of incrementally processing data becomes very significant.

Example 2: Cardinality Estimation

Suppose that we have another event that tracks every page view that occurs on the site. One piece of information recorded in the event is the member ID. We want to use this event to tackle another problem: a daily report that estimates the number of members who are active on the site in the past 30 days.

The straightforward approach is to read in the past 30 days of data, perform a distinct operation on the member IDs, and then count the IDs. However, as in the previous case, the input data day-to-day is practically the same. It only differs in the days at the beginning and end of the window. This means that each day we are repeating much of the same work. But unlike the previous case, we cannot simply merge in the new day of data with the previous output because the window length is fixed and we want the oldest day to be removed when the window advances. The PartitionCollapsingIncrementalJob class alone from the previous example will not solve this problem.

Hourglass includes another class to address the fixed-length sliding window use case: the PartitionPreservingIncrementalJob. This type of job consumes partitioned input data, just like the collapsing version, but unlike the other job its output is partitioned. It keeps the data partitioned as it is processing it and uses Hadoop's multiple-outputs feature to produce data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient.

With the PartitionPreservingIncrementalJob, we can perform aggregation per day and then use the PartitionCollapsingIncrementalJob to produce the final result. For basic arithmetic-based operations like summation, we could even save ourselves more work by reusing the output, subtracting off the oldest day and adding the newest one.

So how can we use the two jobs together to get the cardinality of active members over the past 30 days? One solution is to use PartitionPreservingIncrementalJob to produce daily sets of distinct member IDs. That is, each day of data produced has all the IDs for members that accessed the site that day. In other words, this is a distinct operation. Then PartitionCollapsingIncrementalJob can consume this data, perform distinct again, and count the number of IDs. The benefit of this approach is that when a new day of data arrives, the partition-preserving job only needs to process that new day and nothing else, as the previous days have already been processed. This idea is outlined below.

Example 2, Architecture

This solution should be more efficient than the naive solution. However, if an estimation of member cardinality is satisfactory, then we could make the job even more efficient. HyperLogLog is capable of estimating the cardinality of large data sets very accurately using a relatively small amount of memory. For example, cardinalities in the billions can be estimated to within 2% accuracy using only 1.5kb of memory. It's also friendly to distributed computing as multiple HyperLogLog estimators can be merged together.

HyperLogLog is a good fit for this use case. For this example, we will use HyperLogLogPlus from stream-lib, an implementation based on this paper that includes some enhancements to the original algorithm. We can use a HyperLogLogPlus estimator for each day of input data in the partition-preserving job and serialize the estimator's bytes as the output. Then the partition-collapsing job can merge together the estimators for the time window to produce the final estimate.

Let's start by defining the mapper. The key it uses is just a dummy value, as we are only producing a single statistic in this case. For the value we use a record with two fields: one is the count estimate; the other we'll just call "data", which can be either a single member ID or the bytes from the serialized estimator. For the map output we use the member ID.

Next, we'll define the accumulator, which can be used for both the combiner and the reducer. This accumulator can handle either member IDs or estimator bytes. When it receives a member ID it adds it to the HyperLogLog estimator. When it receives an estimator it merges it with the current estimator to produce a new one. To produce the final result, it gets the current estimate and also serializes the current estimator as a sequence of bytes.

So there you have it. With the mapper and accumulator now defined, it is just a matter of passing them to the jobs and providing some other configuration. The key piece is to ensure that the second job uses a 30 day sliding window:

Try it yourself!

Here is how you can start using Hourglass. We'll test out the job from the first example against some test data we'll create in a Hadoop. First, clone the DataFu repository and navigate to the Hourglass directory:

Build the Hourglass JAR, and in addition build a test jar that contains the example jobs above.

Define some variables that we'll need for the hadoop jar command. These list the JAR dependencies, as well as the two JARs we just built.

Generate some test data under /data/event using a generate tool. This will create some random events for dates between 2013/03/01 and 2013/03/14. Each record consists of just a single long value in the range 1-100.

Just to get a sense for what the data looks like, we can copy it locally and dump the first several records.

Now run the countbyid tool, which executes the job from the first example that we defined earlier. This will count the number of events for each ID value. In the output you will notice that it reads all fourteen days of input that are available.

We can see what this produced by copying the output locally and dumping the first several records. Each record consists of an ID and a count.

Now let's generate an additional day of data for 2013/03/15.

We can run the incremental job again. This time it will reuse the previous output and will only consume the new day of input.

We can download the new output and inspect the counts:

Both of the examples in this post are also included as unit tests in the Example class within the source code. Some code has been omitted from the examples in this post for sake of space, so please check the original source if you're interested in more of the details.

If you're interested in the project, we also encourage you to try running the unit tests, which can be run in Eclipse once the project is loaded there, or by running ant test at the command line.

Conclusion

We hope this whets your appetite for incremental data processing with DataFu's Hourglass. The code is available on Github in the DataFu repository under an Apache 2.0 license. Documentation is available here. We are accepting contributions, so if you are interesting in helping out, please fork the code and send us your pull requests!

Update: DataFu is now an Apache Incubator project. All links in this blog post have been updated to point to the new Apache DataFu homepage. This blog post has also been cross-posted on the Apache DataFu Blog.

相關推薦

DataFu's Hourglass: Incremental Data Processing in Hadoop

For a large scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that

翻譯-In-Stream Big Data Processing 流式大數據處理

rto 風格 需要 最重要的 建立 reference 處理器 web 用戶id 相當長一段時間以來,大數據社區已經普遍認識到了批量數據處理的不足。很多應用都對實時查詢和流式處理產生了迫切需求。最近幾年,在這個理念的推動下,催生出了一系列解決方案,Twitter Storm

time bushfire alerting with Complex Event Processing in Apache Flink on Amazon EMR and IoT sensor network | AWS Big Data Blog

Bushfires are frequent events in the warmer months of the year when the climate is hot and dry. Countries like Australia and the United States are

hadoop錯誤:java.io.IOException: Incompatible namespaceIDs in /hadoop/data: namenode namespaceID

hadoop的datanode啟動失敗,檢視日誌顯示:的2018-06-12 13:49:03,075 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatib

Data manipulation in python (module 3)

asp ins ide multipl method cts python str tro 1. Visualization wheel dimensions Abstraction - Figuration boxes and charts(abstraction) o

Data manipulation in python (module 4)

lin cal tween idt panda char depth nds packer 1. Matplotlib Backend Layer Deals with th e rendering of plots to screen or files In jup

What is “passive data structure” in Android/Java?

nag holding bstr say roi containe ive ces get From the Android developer web link: http://developer.android.com/reference/android/content

數據處理不等式:Data Processing Inequality

ext right 工程 log src enter 可用 proc 互信 我是在差分隱私下看到的,新解決方案的可用性肯定小於原有解決方案的可用性,也就是說信息的後續處理只會降低所擁有的信息量。 那麽如果這麽說的話為什麽還要做特征工程呢,這是因為該不等式有一個巨大

SDP(0):Streaming-Data-Processor - Data Processing with Akka-Stream

數據庫管理 新的 集成 部分 ont lock 感覺 sharding 數據源 再有兩天就進入2018了,想想還是要準備一下明年的工作方向。回想當初開始學習函數式編程時的主要目的是想設計一套標準API給那些習慣了OOP方式開發商業應用軟件的程序員們,使他們能用一種接近

Set replication in Hadoop

doc new txt nbsp on() word dfs ant exp I was trying loading file using hadoop API as an experiment. I want to set replication to minimum

Display certain line(s) from a text file in Linux.

statistic role n-1 mark art 2.7 box rtai ase Purpose: Display certain line or line

What’s new for Spark SQL in Apache Spark 1.3(中英雙語)

block htm park -h apache HA log -a -- 文章標題 What’s new for Spark SQL in Apache Spark 1.3 作者介紹 Michael Armbrust 文章正文 參考文獻

netstat -n | awk ‘/^tcp/ {++S[$NF]} END {for(a in

netstat#netstat -n | awk ‘/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}’ netstat -n|grep ^tcp | awk ‘{++S[$NF]} END {for(a in S) print a, S[a]}’LAST_

Survey Report on Data Skew in Big Data

range 如何 長時間 變量 延遲 過濾 gas 而且 允許 1 Introduction 信息時代產生了大量的數據,運用和使用數據已經成為一個公司乃至一個國家核心實力的重要組成部分。當代大數據一般指的是:數據量巨大,需要運用新處理模式才能具有更強的決策力、洞察力和流程優

Facebook is spending $1 billion on its first data center in Asia

sof ces south ora force www. ice ini at-t 此文章源自:https://money.cnn.com/2018/09/06/technology/facebook-singapore/index.html Facebook i

SDPA: Toward a Stateful Data Plane in Software-Defined Networking

文章名稱:SDPA: Toward a Stateful Data Plane in Software-Defined Networking 發表時間:2017 期刊來源:IEEE/ACM Transactions on Networking 摘要 OpenFlow僅僅提供簡單的“mat

《AppIntent - Analyzing Sensitive Data Transmission in Android for Privacy Leakage Detection》論文閱讀筆記

AppIntent: Analyzing Sensitive Data Transmission in Android for Privacy Leakage Detection APPIntent:分析敏感資料傳播在Android裝置中隱私洩露的檢測 文獻引

IE 7275 Data Mining in Engineering

代寫IE 7275作業、代做R程式語言作業、代寫k-NN留學生作業、代做R課程設計作業Homework 4IE 7275 Data Mining in EngineeringNote: Read the materials “Tutorial on k-NN with R.pdf.” and “Tutoria

Is my business data safe in cloud? NetSuite到底安全嗎?

NetSuite 產品群QQ:779253701 73% companies are planning to move to cloud in 2 years. Why? Oracle + NetSuite, 4 powerful layers of security 1. Multi

DPAPI Secrets. Security analysis and data recovery in DPAPI

轉載自:https://www.passcape.com/index.php?section=docsys&cmd=details&id=28 DPAPI Secrets Introduction   Part 1. DPAPI architecture and