
Run Concurrent EMR Jobs using Data Pipeline

AWS Data Pipeline supports parallel or concurrent job submission through HadoopActivity. To maximize cluster resource utilization you can choose either a fair scheduler, which divides cluster resources evenly among the running jobs, or a capacity scheduler, which guarantees each queue a configured share of the cluster; the best choice depends on your workload.

Data Pipeline uses the Amazon DynamoDB storage handler, a MapReduce application that imports and exports DynamoDB tables. The following example uses HadoopActivity to export the specified tables to Amazon S3.

Note: The DynamoDB tables, EMR cluster, and S3 resources for this backup must be in the same AWS Region.

You can either write the pipeline definition yourself using the JSON syntax for HadoopActivity (a sketch appears after the console steps below), or use the AWS console:

  1. Sign in to the Data Pipeline console and choose Create Pipeline.
  2. Fill in the fields with the following:
    For Name, add something meaningful to you.
    For Source, choose Build using Architect.
    For Run, choose On pipeline activation.
    For Logging, add an S3 location to copy execution logs to, or choose Disabled.
  3. Choose Edit in Architect.
  4. Choose Add, and select HadoopActivity.
  5. Fill in the Jar URI field with s3://dynamodb-emr-<region>/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar, adding the region that your resources are in.
  6. From the Add an optional field menu, choose Argument. Add a new Argument field for each of the four values.
  7. Fill in the newly created Argument fields with the following values, substituting the names and locations of your resources as necessary (the final value, 0.25, is the DynamoDB read throughput ratio the export job is allowed to consume):
    [org.apache.hadoop.dynamodb.tools.DynamoDbExport, <s3 output directory path>, <table name>, 0.25]
  8. From the Add an optional field menu, choose Runs On.
  9. Open the newly created Runs On drop-down menu, and choose Create new: EMR Cluster.
  10. Repeat steps 4 through 8 for each additional DynamoDB table you want to back up, choosing the EMR cluster you created in step 9 from the Runs On menu so that all of the jobs run on the same cluster.
  11. Open the Resource drop-down on the right side of the screen and choose the EmrCluster panel.
  12. From the Add an optional field menu, choose the following options and settings:
    Choose Release Label, and enter emr-4.7.2.
    Choose Master Instance Type, and enter an instance size that meets your needs.
    Choose Core Instance Type, and enter an instance size that meets your needs.
    Choose Hadoop Scheduler Type, and enter PARALLEL_CAPACITY_SCHEDULING or PARALLEL_FAIR_SCHEDULING, depending on whether you want a capacity or fair scheduler.
  13. Choose Activate.
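
If you prefer the JSON route mentioned above, the sketch below shows a definition roughly equivalent to the console steps: two HadoopActivity objects, one per table, sharing a single EmrCluster set to fair scheduling. The table names, S3 buckets, IAM roles, instance types, and object IDs are illustrative placeholders, and the field set is abridged; see the HadoopActivity documentation for the full syntax.

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "ondemand",
      "failureAndRerunMode": "CASCADE",
      "pipelineLogUri": "s3://my-log-bucket/logs/",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "EmrClusterForBackup",
      "name": "EmrClusterForBackup",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.7.2",
      "masterInstanceType": "m3.xlarge",
      "coreInstanceType": "m3.xlarge",
      "coreInstanceCount": "2",
      "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING"
    },
    {
      "id": "ExportTable1",
      "name": "ExportTable1",
      "type": "HadoopActivity",
      "runsOn": { "ref": "EmrClusterForBackup" },
      "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
      "argument": [
        "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
        "s3://my-backup-bucket/table1/",
        "Table1",
        "0.25"
      ]
    },
    {
      "id": "ExportTable2",
      "name": "ExportTable2",
      "type": "HadoopActivity",
      "runsOn": { "ref": "EmrClusterForBackup" },
      "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
      "argument": [
        "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
        "s3://my-backup-bucket/table2/",
        "Table2",
        "0.25"
      ]
    }
  ]
}
```

Because both activities reference the same EmrCluster, Data Pipeline submits them as concurrent YARN applications rather than as serial steps.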

If your EMR cluster has an attached EC2 key pair, you can log in to the master node and run the yarn application -list command to see the number of running applications.
