Submit Spark Jobs to Remote Amazon EMR Cluster

阿新 • Published: 2019-01-12

Prepare your local machine

Note: You can submit Spark jobs with deploy-mode set to either client or cluster.

1. Install all Spark client libraries on your local machine. For example, if you are using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), then download spark-2.2.0-bin-hadoop2.7.tgz, extract it, and add its bin directory to your local machine's PATH environment variable. To determine which version of Spark and Apache Hadoop you are using (and therefore which Spark binary you need to download), see Spark Release History and Hadoop Version History.

2. Create an environment variable called HADOOP_CONF_DIR, and then point it to a directory on your local machine. All files in /etc/hadoop/conf on the Amazon EMR cluster must be present in the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.

Note: When you submit a Spark job in cluster mode, the driver runs on cluster nodes that have all Hadoop binaries installed. When you submit a Spark job in client mode, all Hadoop binaries must be downloaded and installed on your local machine.
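The two preparation steps above can be spot-checked before you try to submit anything. The following is a minimal sketch, not part of the original article: the helper names (spark_tarball_name, missing_conf_files, preflight) are hypothetical, and the check only looks for the two configuration files the article names, even though the whole /etc/hadoop/conf directory should be copied.

```python
import os
import shutil
from pathlib import Path

# The two files the article mentions by name. This is only a spot check:
# every file from /etc/hadoop/conf on the cluster should be present.
EXPECTED_CONF_FILES = ("yarn-site.xml", "hdfs-site.xml")


def spark_tarball_name(spark_version, hadoop_profile):
    """Build the Spark binary archive name, e.g. spark-2.2.0-bin-hadoop2.7.tgz."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_profile}.tgz"


def missing_conf_files(conf_dir, expected=EXPECTED_CONF_FILES):
    """Return the expected configuration files that are absent from conf_dir."""
    base = Path(conf_dir)
    return [name for name in expected if not (base / name).is_file()]


def preflight(env=None):
    """Collect problems with the local setup; an empty list means none found."""
    env = os.environ if env is None else env
    problems = []
    # Step 1: spark-submit must be reachable through PATH.
    if shutil.which("spark-submit", path=env.get("PATH")) is None:
        problems.append("spark-submit not found on PATH")
    # Step 2: HADOOP_CONF_DIR must be set and hold the cluster's config files.
    conf_dir = env.get("HADOOP_CONF_DIR")
    if not conf_dir:
        problems.append("HADOOP_CONF_DIR is not set")
    else:
        for name in missing_conf_files(conf_dir):
            problems.append(f"{name} missing from {conf_dir}")
    return problems


if __name__ == "__main__":
    print("Expected archive:", spark_tarball_name("2.2.0", "2.7"))
    for problem in preflight():
        print("WARNING:", problem)
```

Run as a script, it prints the archive name for the emr-5.10.0 example and one warning per problem found in the current environment.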

Submit the Spark job

Your local machine is now ready to submit a Spark job to a remote Amazon EMR cluster, using a command similar to the following:
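A sketch of what such a submission might look like, assuming YARN as the master (the cluster's YARN settings are read from HADOOP_CONF_DIR) and one of the two deploy modes mentioned earlier. The helper build_spark_submit, the jar path, the class name, and the argument are all hypothetical placeholders, not values from the original article.

```python
import shlex
import subprocess


def build_spark_submit(app, deploy_mode="cluster", main_class=None, app_args=()):
    """Return a spark-submit argument list targeting YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if main_class:
        # --class is only needed for JVM applications (JAR files).
        cmd += ["--class", main_class]
    cmd.append(app)
    cmd.extend(app_args)
    return cmd


def submit(cmd):
    """Run spark-submit; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder application details, for illustration only.
    cmd = build_spark_submit(
        "/path/to/my-app.jar",
        deploy_mode="cluster",
        main_class="com.example.MyApp",
        app_args=["arg1"],
    )
    # Print the shell-equivalent command instead of running it;
    # call submit(cmd) once the local machine is set up.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when application arguments contain spaces.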
