Building a Big Data Pipeline With Airflow, Spark and Zeppelin
In this data-driven era, hardly any piece of information is useless. Every bit of data stored on your company's systems, whatever its field of activity, is valuable. Making the most of this new black gold is one of the fastest routes to success, because data holds an enormous number of answers, even to questions you haven't thought of yet.
Luckily for us, setting up a Big Data pipeline that scales efficiently with the size of your data is no longer a major challenge, since the main technologies in the Big Data ecosystem are all open source.
No matter which technology you use to store data, whether it’s a powerful Hadoop cluster or a trusted RDBMS (Relational Database Management System), connecting it to a fully-functioning pipeline is a project that’ll reward you with invaluable insights. One pipeline that can be easily integrated within a vast range of data architectures is composed of the following three technologies: Apache Airflow, Apache Spark, and Apache Zeppelin.
First, let Airflow organize things for you
Apache Airflow is one of those rare technologies that are easy to put in place yet offer extensive capabilities. The workflow management system, first open-sourced by Airbnb back in 2015, has gained a lot of popularity thanks to its powerful user interface and the fact that its pipelines are written in plain Python.
Airflow relies on four core elements that allow it to simplify any given pipeline:
- DAGs (Directed Acyclic Graphs): Airflow uses this concept to structure batch jobs efficiently; DAGs give you plenty of flexibility to organize your pipeline in whatever way suits it best (see the sketch after this list)
- Tasks: this is where the fun happens; each DAG is divided into tasks, and all of the work is done by the code you write in those tasks (and yes, you can do virtually anything within an Airflow task)
- Scheduler: unlike some other workflow management tools in the Big Data universe (notably Luigi), Airflow ships with its own scheduler, which makes setting up the pipeline even easier
- XCom (cross-communication): in a wide array of business cases, your pipeline may require passing information between tasks. With Airflow that is easily done through XCom, which relies on Airflow's metadata database to store the data you need to pass from one task to another
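To make these building blocks concrete, here is a minimal sketch of a DAG with two PythonOperator tasks that pass a value through XCom. The DAG ID, task names, and schedule are purely illustrative, and the imports assume Airflow 2.x:

```python
# A minimal sketch of a two-task DAG; the DAG ID, task names, and schedule are
# illustrative, and the imports assume Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pretend we computed something worth sharing; push it to XCom.
    row_count = 42
    context["ti"].xcom_push(key="row_count", value=row_count)


def report(**context):
    # Pull the value the previous task pushed through XCom.
    row_count = context["ti"].xcom_pull(task_ids="extract", key="row_count")
    print(f"Extracted {row_count} rows")


with DAG(
    dag_id="example_pipeline",          # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",         # picked up by Airflow's scheduler
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    report_task = PythonOperator(task_id="report", python_callable=report)

    extract_task >> report_task  # the DAG: extract runs before report
```

Dropping a file like this into your dags folder is all it takes for the scheduler to pick it up and for the DAG to show up in the UI.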
Getting an Airflow webserver and scheduler up and running is only a few commands away, and within minutes you can find yourself navigating the friendly user interface of your own Airflow webserver, which is quite easy to master.
The next step consists of connecting Airflow to your database or data management system; fortunately, Airflow offers a pretty straightforward way to do that through the Connections screen of the UI.
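Once a connection is saved, your tasks can refer to it by its connection ID through one of Airflow's hooks. As a hedged sketch, assuming a Postgres database registered under the hypothetical connection ID my_warehouse (the table name is made up too):

```python
# A hedged sketch: querying through a connection defined in the Airflow UI.
# Assumes a Postgres database saved under the hypothetical connection ID
# "my_warehouse"; swap in the hook that matches your own system.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_todays_orders():
    hook = PostgresHook(postgres_conn_id="my_warehouse")
    # get_pandas_df runs the query and returns the result as a pandas DataFrame.
    return hook.get_pandas_df("SELECT * FROM orders WHERE order_date = CURRENT_DATE")
```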
With the server running and your connections defined, Airflow is fully integrated within your data architecture, and you can use its capabilities to manage your data pipelines by expressing them as DAGs.
Then, let Spark do the hard work
Spark no longer needs an introduction, but in case you’re unfamiliar with the distributed data-processing framework that has taken the world by storm since it became an Apache project in 2013, this 15-minute tutorial by Simplilearn will surely get you up to speed.
As long as you’re running it on a cluster sized appropriately for your data, Spark offers ridiculously fast processing. And through Spark SQL, it lets you query your data as if you were writing SQL or HiveQL.
Now all you need to do is use Spark within your Airflow tasks to process your data according to your business needs. I strongly recommend the PySpark module combined with Airflow’s PythonOperator for your tasks; that way you execute your Spark jobs directly inside the Airflow Python callables.
I also recommend relying on helper functions so that you don’t find yourself copy-pasting the same bits of code across different tasks, as in the sketch below.
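Here is a rough sketch of that approach, assuming PySpark is installed where your Airflow tasks run; the paths, app name, and master URL are placeholders you would adapt to your own cluster:

```python
# A rough sketch of the helper-function approach, assuming PySpark is installed
# where your Airflow tasks run; paths, app name, and master URL are placeholders.
from pyspark.sql import SparkSession


def get_spark_session(app_name="airflow_pipeline"):
    """Shared helper so every task builds its SparkSession the same way."""
    return (
        SparkSession.builder
        .appName(app_name)
        .master("local[*]")  # point this at your actual cluster in production
        .getOrCreate()
    )


def process_daily_data(**context):
    """Callable for a PythonOperator: the Spark job runs inside the Airflow task."""
    spark = get_spark_session()
    df = spark.read.parquet("/data/raw/events")  # illustrative input path
    purchases = df.filter(df.event_type == "purchase")
    purchases.write.mode("overwrite").parquet("/data/processed/purchases")
    spark.stop()
```

Wiring it into the DAG is then a one-liner: PythonOperator(task_id="process_daily_data", python_callable=process_daily_data).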
Spark SQL offers equivalents to virtually all of the operations present in typical queries, so the transition should be close to seamless. Use PySpark to restructure your data according to your needs, apply its processing power to compute the aggregations you care about, and then store the output in your database through an Airflow hook.
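As an illustration of that last step, here is a hedged sketch that computes a daily aggregate with Spark SQL and pushes the result back through the Postgres hook; the table names, columns, and connection ID are all hypothetical, and get_spark_session is the helper sketched above:

```python
# A hedged sketch of the aggregation step: Spark SQL computes a daily aggregate
# and an Airflow hook writes it back to the database. Table names, columns, and
# the connection ID are hypothetical; get_spark_session is the helper from above.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def aggregate_and_store(**context):
    spark = get_spark_session()
    spark.read.parquet("/data/processed/purchases").createOrReplaceTempView("purchases")

    daily_revenue = spark.sql("""
        SELECT purchase_date, SUM(amount) AS revenue, COUNT(*) AS orders
        FROM purchases
        GROUP BY purchase_date
    """)

    # For modest result sizes, collect the aggregates and insert them via the hook;
    # for very large outputs, Spark's own JDBC writer is a better fit.
    rows = [tuple(r) for r in daily_revenue.collect()]
    PostgresHook(postgres_conn_id="my_warehouse").insert_rows(
        table="daily_revenue",
        rows=rows,
        target_fields=["purchase_date", "revenue", "orders"],
    )
    spark.stop()
```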
Finally, enjoy the results through Zeppelin
Apache Zeppelin is another Apache Software Foundation project that has been gaining a lot of popularity. Thanks to its notebook concept, it has become a go-to data visualization tool in the Hadoop ecosystem.
Zeppelin lets you visualize your data dynamically and in real time. Through the forms you can add to a Zeppelin note, you can easily write dynamic scripts that take the form input and run a specific set of operations on whichever dataset the user selects.
Thanks to these dynamic forms, a Zeppelin dashboard becomes an efficient way to give even users who have never written a line of code instant and complete access to the company’s data.
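As a sketch of what such a paragraph could look like, assuming the Spark interpreter is configured and the hypothetical daily_revenue table produced by the pipeline is reachable from Zeppelin:

```python
%pyspark
# A sketch of a Zeppelin paragraph with a dynamic form, assuming the Spark
# interpreter is configured and the hypothetical daily_revenue table produced
# by the pipeline is reachable from Zeppelin.

# z is Zeppelin's built-in context; z.textbox renders a text box above the
# paragraph (older releases expose the same thing as z.input).
min_revenue = float(z.textbox("Minimum revenue", "1000"))

df = spark.read.table("daily_revenue")  # or spark.read.jdbc(...) for an RDBMS
filtered = df.filter(df.revenue >= min_revenue)

# z.show renders the DataFrame with Zeppelin's standard table/chart toolbar.
z.show(filtered)
```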
Just like with Airflow, setting up a Zeppelin server is pretty straightforward. Then you just need to configure the Spark interpreter so that you can run PySpark scripts within Zeppelin notes on the data you already prepared via the Airflow-Spark pipeline.
Additionally, Zeppelin offers a large number of interpreters that allow its notes to run many types of scripts (with the Spark interpreter being the most popular).
Once your data is loaded, visualizing it with the built-in chart types takes just a click in each paragraph of the note.
And since the release of Zeppelin 0.8.0 in 2018, you can extend its capabilities (for example with custom visualizations) through Helium, its plugin system.
To integrate Zeppelin into the pipeline, all you need to do is configure the Spark interpreter. And if you would rather read the data Spark calculated from your database instead, that’s also possible through the appropriate Zeppelin interpreter (JDBC, for example).
That’s it!
That’s all you need to do to have an up and running Big Data pipeline that allows you to extract and visualize enormous amounts of information from your data.
Start by putting in place an Airflow server that organizes the pipeline, then rely on a Spark cluster to process and aggregate the data, and finally let Zeppelin guide you through the multiple stories your data can tell.
For any questions or if you need some help with one of these technologies, you could email me directly and I’ll get back to you as soon as possible.