Submit Spark Jobs to Remote Amazon EMR Cluster

Prepare your local machine

Note: Spark jobs can be submitted when deploy-mode is set to client or cluster.

1.    Install the Spark client libraries on your local machine. For example, if you are using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), download spark-2.2.0-bin-hadoop2.7.tgz, extract it, and add its bin directory to your local machine's PATH environment variable. To determine which version of Spark and Apache Hadoop you are using (and therefore which Spark binary you need to download), see Spark Release History and Hadoop Version History.

2.    Create an environment variable called HADOOP_CONF_DIR, and then point it to a directory on your local machine. All files in /etc/hadoop/conf on the Amazon EMR cluster must be present in the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.
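Steps 1 and 2 can be sketched as shell commands. This is a minimal sketch under assumptions, not a definitive setup: the install location, the local configuration directory, the key file name, and the master node host name are hypothetical placeholders, and the download and copy commands are left as comments so that nothing touches the network until you run them yourself.

```shell
# Assumed values: match these to your cluster's Spark and Hadoop versions.
SPARK_VERSION="2.2.0"
HADOOP_PROFILE="hadoop2.7"
SPARK_DIST="spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"
INSTALL_DIR="${HOME}/spark-client"                 # hypothetical install location

# Step 1: after downloading ${SPARK_DIST}.tgz (for example from
# https://archive.apache.org/dist/spark/spark-2.2.0/), extract it with:
#   mkdir -p "${INSTALL_DIR}" && tar -xzf "${SPARK_DIST}.tgz" -C "${INSTALL_DIR}"
# Then put the Spark bin directory on PATH.
export SPARK_HOME="${INSTALL_DIR}/${SPARK_DIST}"
export PATH="${SPARK_HOME}/bin:${PATH}"

# Step 2: point HADOOP_CONF_DIR at a local directory, then copy the cluster's
# configuration files into it (key file and host name are placeholders):
#   mkdir -p "${HADOOP_CONF_DIR}"
#   scp -i my-key.pem "hadoop@<master-public-dns>:/etc/hadoop/conf/*" "${HADOOP_CONF_DIR}/"
export HADOOP_CONF_DIR="${HOME}/emr-hadoop-conf"   # hypothetical directory
```

To keep these settings across sessions, the export lines can be added to your shell profile.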

Note: When you submit a Spark job in cluster mode, the driver runs on cluster nodes that have all Hadoop binaries installed. When you submit a Spark job in client mode, all Hadoop binaries must be downloaded and installed on your local machine.

Submit the Spark job

Your local machine is now ready to submit a Spark job to a remote Amazon EMR cluster with spark-submit.
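A minimal sketch of such a command, assuming the SparkPi example application and the examples jar that ships with the Spark 2.2.0 binary distribution. The function name `submit_spark_pi`, the jar path, and the argument `10` are illustrative assumptions; substitute your own class, jar, and arguments.

```shell
# Hypothetical example application: the SparkPi class from the examples jar
# bundled with the Spark 2.2.0 binary distribution.
APP_CLASS="org.apache.spark.examples.SparkPi"
APP_JAR="${SPARK_HOME:-$HOME/spark-client/spark-2.2.0-bin-hadoop2.7}/examples/jars/spark-examples_2.11-2.2.0.jar"
DEPLOY_MODE="${DEPLOY_MODE:-cluster}"   # "cluster" or "client"

# Wrap the call in a function so that nothing is submitted until it is invoked.
submit_spark_pi() {
  spark-submit \
    --master yarn \
    --deploy-mode "${DEPLOY_MODE}" \
    --class "${APP_CLASS}" \
    "${APP_JAR}" 10   # 10 = number of partitions SparkPi uses
}
```

Calling `submit_spark_pi` submits the job. With `--master yarn`, Spark finds the cluster's ResourceManager through the configuration files in the directory that HADOOP_CONF_DIR points to, so no host name appears on the command line.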


