
Run Concurrent EMR Jobs using Data Pipeline

AWS Data Pipeline supports parallel or concurrent job submission through HadoopActivity. To maximize cluster resource utilization you can choose either a fair scheduler, which divides cluster resources evenly among the running jobs, or a capacity scheduler, which guarantees each queue a configured share of the cluster; the best choice depends on your workload.

Data Pipeline uses the Amazon DynamoDB storage handler, a MapReduce application that imports and exports DynamoDB tables. The following example uses HadoopActivity to export the specified tables to Amazon S3.

Note: The DynamoDB tables, EMR cluster, and S3 resources for this backup must be in the same AWS Region.

You can either write the pipeline definition yourself using the JSON syntax for HadoopActivity (a sketch appears after the console steps below), or use the AWS console:

  1. Sign in to the Data Pipeline console and choose Create Pipeline.
  2. Fill in the fields with the following:
    For Name, add something meaningful to you.
    For Source, choose Build using Architect.
    For Run, choose On pipeline activation.
    For Logging, add an S3 location to copy execution logs to, or choose Disabled.
  3. Choose Edit in Architect.
  4. Choose Add, and select HadoopActivity.
  5. Fill in the Jar URI field with s3://dynamodb-emr-<region>/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar, adding the region that your resources are in.
  6. From the Add an optional field menu, choose Argument. Add a new Argument field for each of the four values.
  7. Fill in the newly created Argument fields with the following values, substituting the names and locations of your resources as necessary (the final value, 0.25, is the DynamoDB read throughput ratio the export job is allowed to consume):
    [org.apache.hadoop.dynamodb.tools.DynamoDbExport, <s3 output directory path>, <table name>, 0.25]
  8. From the Add an optional field menu, choose Runs On.
  9. Open the newly created Runs On drop-down menu, and choose Create new: EMR Cluster.
  10. Repeat steps 4 through 8 for each additional DynamoDB table you want to back up, choosing the EMR cluster you created in step 9 from the Runs On menu so that all of the jobs run on the same cluster.
  11. Open the Resource drop-down on the right side of the screen and choose the EmrCluster panel.
  12. From the Add an optional field menu, choose the following options and settings:
    Choose Release Label, and enter emr-4.7.2.
    Choose Master Instance Type, and enter an instance size that meets your needs.
    Choose Core Instance Type, and enter an instance size that meets your needs.
    Choose Hadoop Scheduler Type, and enter PARALLEL_CAPACITY_SCHEDULING or PARALLEL_FAIR_SCHEDULING, depending on whether you want a capacity or fair scheduler.
  13. Choose Activate.
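
If you prefer the JSON route mentioned above, the sketch below shows a definition roughly equivalent to the console steps: two HadoopActivity objects, one per table, sharing a single EmrCluster set to fair scheduling. The table names, S3 buckets, IAM roles, instance types, and object IDs are illustrative placeholders, and the field set is abridged; see the HadoopActivity documentation for the full syntax.

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "ondemand",
      "failureAndRerunMode": "CASCADE",
      "pipelineLogUri": "s3://my-log-bucket/logs/",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "EmrClusterForBackup",
      "name": "EmrClusterForBackup",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.7.2",
      "masterInstanceType": "m3.xlarge",
      "coreInstanceType": "m3.xlarge",
      "coreInstanceCount": "2",
      "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING"
    },
    {
      "id": "ExportTable1",
      "name": "ExportTable1",
      "type": "HadoopActivity",
      "runsOn": { "ref": "EmrClusterForBackup" },
      "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
      "argument": [
        "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
        "s3://my-backup-bucket/table1/",
        "Table1",
        "0.25"
      ]
    },
    {
      "id": "ExportTable2",
      "name": "ExportTable2",
      "type": "HadoopActivity",
      "runsOn": { "ref": "EmrClusterForBackup" },
      "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
      "argument": [
        "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
        "s3://my-backup-bucket/table2/",
        "Table2",
        "0.25"
      ]
    }
  ]
}
```

Because both activities reference the same EmrCluster, Data Pipeline submits them as concurrent YARN applications rather than as serial steps.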

If your EMR cluster has an attached EC2 key pair, you can log in to the master node and run the yarn application -list command to see the number of running applications.
