
Amazon EMR FAQs

Q: How does Amazon EMR use Amazon EC2 and Amazon S3?

Customers upload their input data and a data processing application into Amazon S3. Amazon EMR then launches a number of Amazon EC2 instances as specified by the customer. The service begins the cluster execution while pulling the input data from Amazon S3 into the launched Amazon EC2 instances using the S3N protocol. Once the cluster is finished, Amazon EMR transfers the output data to Amazon S3, where customers can then retrieve it or use it as input in another cluster.

Q: How is a computation done in Amazon EMR?

Amazon EMR uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, consisting of one master and multiple slaves. Amazon EMR runs Hadoop software on these instances. The master node divides input data into blocks and distributes the processing of the blocks to the slave nodes. Each slave node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted, partitioned, and sent to processes that apply the reducer function to it. These processes also run on the slave nodes. Finally, the output from the reducer tasks is collected in files. A single “cluster” may involve a sequence of such MapReduce steps.
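The map, sort, and reduce flow described above can be illustrated with a small single-process sketch. This is not Hadoop code and does not use any EMR API; the names map_fn, reduce_fn, and run_job are hypothetical, and the word-count task is only an assumed example of a MapReduce algorithm.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(blocks):
    """Run map over every block, sort and group the pairs, then reduce."""
    # Map phase: on a real cluster each block would be handled by a slave node.
    intermediate = []
    for block in blocks:
        for line in block:
            intermediate.extend(map_fn(line))

    # Sort the intermediate pairs by key so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: apply the reducer once per distinct key.
    results = {}
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        key, total = reduce_fn(word, (count for _, count in pairs))
        results[key] = total
    return results


# Two input blocks, as if the master node had split the input in two.
blocks = [["the quick fox", "the lazy dog"], ["the end"]]
word_counts = run_job(blocks)
print(word_counts)
```

In a real cluster the blocks, the intermediate pairs, and the reducer output live on different machines; the sketch only shows the order of the phases.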

Q: How reliable is Amazon EMR?

Amazon EMR manages an Amazon EC2 cluster of compute instances using Amazon’s highly available, proven network infrastructure and datacenters. Amazon EMR uses industry-proven, fault-tolerant Hadoop software as its data processing engine. Hadoop splits the data into multiple subsets and assigns each subset to more than one Amazon EC2 instance. So, if an Amazon EC2 instance fails to process one subset of data, the results of another Amazon EC2 instance can be used.

Q: How quickly will my cluster be up and running and processing my input data?

Amazon EMR starts resource provisioning of Amazon EC2 On-Demand instances almost immediately. If the instances are not available, Amazon EMR will keep trying to provision the resources for your cluster until they are provisioned or you cancel your request. Instance provisioning is done on a best-effort basis and depends on the number of instances requested, the time when the cluster is created, and the total number of requests in the system. After resources have been provisioned, it typically takes fewer than 15 minutes to start processing.

In order to guarantee capacity for your clusters at the time you need it, you may pay a one-time fee for Amazon EC2 Reserved Instances to reserve instance capacity in the cloud at a discounted hourly rate. Like On-Demand Instances, customers pay usage charges only for the time when their instances are running. In this way, Reserved Instances enable businesses with known instance requirements to maintain the elasticity and flexibility of On-Demand Instances, while also reducing their predictable usage costs even further.

Q: Which Amazon EC2 instance types does Amazon EMR support?

Amazon EMR supports 12 EC2 instance types including Standard, High CPU, High Memory, Cluster Compute, High I/O, and High Storage. Standard Instances have memory to CPU ratios suitable for most general-purpose applications. High CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. High Memory instances offer large memory sizes for high throughput applications. Cluster Compute instances have proportionally high CPU with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. High Storage instances offer 48 TB of storage across 24 disks and are ideal for applications that require sequential access to very large data sets such as data warehousing and log processing. See the EMR pricing page for details on available instance types and pricing per region.

Q: How do I select the right Amazon EC2 instance type?

When choosing instance types, you should consider the characteristics of your application with regard to resource utilization and select the optimal instance family. One of the advantages of Amazon EMR with Amazon EC2 is that you pay only for what you use, which makes it convenient and inexpensive to test the performance of your clusters on different instance types and quantities. One effective way to determine the most appropriate instance type is to launch several small clusters and benchmark them.

Q: How do I select the right number of instances for my cluster?

The number of instances to use in your cluster is application-dependent and should be based on both the amount of resources required to store and process your data and the acceptable amount of time for your job to complete. As a general guideline, we recommend that you use no more than 60% of your disk space for storing the data you will be processing, leaving the rest for intermediate output. Hence, given 3x replication on HDFS, if you were looking to process 5 TB on m1.xlarge instances, which have 1,690 GB of disk space, we recommend that your cluster contain at least (5 TB * 3) / (1,690 GB * 0.6) ≈ 14.8, rounded up to 15 m1.xlarge core nodes. You may want to increase this number if your job generates a high amount of intermediate data or has significant I/O requirements. You may also want to include additional task nodes to improve processing performance. See Amazon EC2 Instance Types for details on local instance storage for each instance type configuration.
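The sizing guideline above can be written as a short calculation. The function name core_nodes_needed and its parameter names are hypothetical, and the sketch assumes 1 TB = 1,000 GB, which is the convention under which the example comes out at 15 nodes.

```python
import math


def core_nodes_needed(data_gb, disk_per_node_gb, replication=3, usable_fraction=0.6):
    """Return the minimum core node count for the given data volume."""
    # Total bytes stored in HDFS after replication.
    stored_gb = data_gb * replication
    # Only part of each disk is used for input data; the rest is
    # left for intermediate output.
    usable_per_node_gb = disk_per_node_gb * usable_fraction
    return math.ceil(stored_gb / usable_per_node_gb)


# 5 TB of input on instances with 1,690 GB of local disk each.
nodes = core_nodes_needed(data_gb=5000, disk_per_node_gb=1690)
print(nodes)
```

Treat the result as a lower bound: jobs with heavy intermediate output or I/O may need a smaller usable_fraction or more nodes.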

Q: How long will it take to run my cluster?

The time to run your cluster will depend on several factors including the type of your cluster, the amount of input data, and the number and type of Amazon EC2 instances you choose for your cluster.

Q: If the master node in a cluster goes down, can Amazon EMR recover it?

No. If the master node goes down, your cluster will be terminated and you’ll have to rerun your job. Amazon EMR currently does not support automatic failover of the master node or master node state recovery. In case of master node failure, the AWS Management Console displays a “The master node was terminated” message, which is an indicator for you to start a new cluster. Customers can implement checkpointing in their clusters to save intermediate data (data created in the middle of a cluster that has not yet been reduced) on Amazon S3. This allows the cluster to resume from the last checkpoint in case of failure.
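One way to picture checkpointing is a driver that saves the output of each stage and skips stages whose output already exists. The sketch below is an illustration only: it uses a local directory as a stand-in for an Amazon S3 location, and run_with_checkpoints, the step names, and the JSON file layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_dir):
    """Run named steps in order, skipping any whose output is already saved."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    data = None
    executed = []
    for name, step in steps:
        marker = checkpoint_dir / f"{name}.json"
        if marker.exists():
            # Resume: reuse the saved intermediate result instead of recomputing.
            data = json.loads(marker.read_text())
            continue
        data = step(data)
        marker.write_text(json.dumps(data))
        executed.append(name)
    return data, executed


steps = [
    ("load", lambda _: [3, 1, 2]),
    ("sort", lambda values: sorted(values)),
    ("total", lambda values: sum(values)),
]

with tempfile.TemporaryDirectory() as tmp:
    first_result, first_run = run_with_checkpoints(steps, tmp)
    # Simulate a failure after "sort" by discarding only the last checkpoint.
    (Path(tmp) / "total.json").unlink()
    second_result, second_run = run_with_checkpoints(steps, tmp)
```

On the second run only the "total" step executes, because the earlier outputs are read back from the checkpoint directory.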

Q: If a slave node goes down in a cluster, can Amazon EMR recover from it?

Yes. Amazon EMR is fault tolerant for slave failures and continues job execution if a slave node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.

Q: Can I SSH onto my cluster nodes?

Yes. You can SSH onto your cluster nodes and execute Hadoop commands directly from there. If you need to SSH into a slave node, you have to first SSH to the master node, and then SSH into the slave node.

Q: What is Amazon EMR Bootstrap Actions?

Bootstrap Actions is a feature in Amazon EMR that provides users a way to run custom set-up prior to the execution of their cluster. Bootstrap Actions can be used to install software or configure instances before running your cluster. You can read more about bootstrap actions in EMR's Developer Guide.

Q: How can I use Bootstrap Actions?

You can write a Bootstrap Action script in any language already installed on the cluster instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a cluster. Please refer to the Developer’s Guide (http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/) for details on how to use Bootstrap Actions.
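As a rough idea of what such a script might contain, here is a minimal Python sketch that prepares a directory and writes a settings file. Everything in it is an assumption for illustration: the file name app.properties, the setting it writes, and the convention of passing the target directory as the first argument are hypothetical and do not come from the EMR documentation.

```python
import sys
import tempfile
from pathlib import Path


def bootstrap(install_dir):
    """Create a working directory and drop a small settings file into it."""
    target = Path(install_dir)
    target.mkdir(parents=True, exist_ok=True)
    settings = target / "app.properties"  # hypothetical file name
    settings.write_text("log.level=INFO\n")
    return settings


def main(argv):
    """Entry point: argv[1], if present, is the directory to prepare."""
    default_dir = Path(tempfile.gettempdir()) / "my-app"  # hypothetical default
    install_dir = argv[1] if len(argv) > 1 else str(default_dir)
    try:
        path = bootstrap(install_dir)
    except OSError as err:
        print(f"bootstrap failed: {err}", file=sys.stderr)
        return 1
    print(f"wrote {path}")
    return 0


# When saved as a standalone script, finish with:
#     sys.exit(main(sys.argv))
```

A real bootstrap action would typically install packages or adjust configuration; the structure (do the setup, report failure through the return code) stays the same.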

Q: How do I configure Hadoop settings for my cluster?

The EMR default Hadoop configuration is appropriate for most workloads. However, based on your cluster’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your cluster tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your cluster on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.

Q: Can I modify the number of slave nodes in a running cluster?

Yes. Slave nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks, and (2) task nodes, which only run Hadoop tasks. While a cluster is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, the Java SDK, or through the command line client. Please refer to the Resizing Running Clusters section in the Developer’s Guide for details on how to modify the size of your running cluster.
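The resizing rules above can be captured in a small check that a tool might run before submitting a resize request. This is a sketch of the rule only; validate_resize and the "core" / "task" labels are hypothetical names, not part of the EMR API.

```python
def validate_resize(group_type, current_count, requested_count):
    """Return True if the requested size change is allowed for the node type."""
    if requested_count < 0:
        raise ValueError("requested_count must not be negative")
    if group_type == "core":
        # Core nodes hold HDFS data, so their count may only stay or grow.
        return requested_count >= current_count
    if group_type == "task":
        # Task nodes hold no HDFS data and may grow or shrink.
        return True
    raise ValueError(f"unknown node type: {group_type!r}")


print(validate_resize("core", 5, 8))   # growing core nodes
print(validate_resize("core", 5, 3))   # shrinking core nodes
print(validate_resize("task", 5, 0))   # removing all task nodes
```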

Q: When would I want to use core nodes versus task nodes?

As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your cluster completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis.

Q: Why would I want to modify the number of slave nodes in my running cluster?

There are several scenarios where you may want to modify the number of slave nodes in a running cluster. If your cluster is running slower than expected, or timing requirements change, you can increase the number of core nodes to increase cluster performance. If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your cluster’s varying capacity requirements.
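One possible way to turn this into numbers is to treat the smallest per-phase requirement as the fixed core capacity and cover the rest with task nodes. The sketch below assumes you already know how many nodes each phase needs and that the core count also satisfies your HDFS storage needs; plan_capacity and the phase names are hypothetical.

```python
def plan_capacity(phase_requirements):
    """Split per-phase node requirements into fixed core and variable task counts."""
    if not phase_requirements:
        raise ValueError("at least one phase is required")
    # Capacity needed in every phase stays on core nodes for the whole run.
    core_nodes = min(phase_requirements.values())
    # Anything above that baseline is temporary and goes on task nodes.
    task_nodes = {
        phase: need - core_nodes for phase, need in phase_requirements.items()
    }
    return core_nodes, task_nodes


core, tasks = plan_capacity({"extract": 4, "transform": 10, "load": 6})
print(core, tasks)
```

Here the cluster would keep 4 core nodes throughout and add 6 task nodes for the "transform" phase, then drop to 2 for "load".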

Q: Can I automatically modify the number of slave nodes between cluster steps?

Yes. You may include a predefined step in your workflow that automatically resizes a cluster between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given cluster step.

Q: How can I allow other IAM users to access my cluster?

To create a new cluster that is visible to all IAM users with the EMR CLI, add the --visible-to-all-users flag when you create the cluster. For example: elastic-mapreduce --create --visible-to-all-users. Within the Management Console, simply select “Visible to all IAM Users” on the Advanced Options pane of the Create Cluster wizard.

To make an existing cluster visible to all IAM users you must use the EMR CLI. Use --set-visible-to-all-users and specify the cluster identifier. For example: elastic-mapreduce --set-visible-to-all-users true --jobflow j-xxxxxxx. This can only be done by the creator of the cluster.

To learn more, see the Configuring User Permissions section of the EMR Developer Guide.
