Submit Spark Jobs to Remote Amazon EMR Cluster

Prepare your local machine

Note: Spark jobs can be submitted when deploy-mode is set to client or cluster.

1.    Install all Spark client libraries on your local machine. For example, if you are using an emr-5.10.0 cluster (which has Spark 2.2.0 installed), download spark-2.2.0-bin-hadoop2.7.tgz, extract it, and add its bin directory to your local machine's PATH environment variable. To determine which versions of Spark and Apache Hadoop you are using (and therefore which Spark binary you need to download), see Spark Release History and Hadoop Version History.

2.    Create an environment variable called HADOOP_CONF_DIR, and then point it to a directory on your local machine. All files in /etc/hadoop/conf on the Amazon EMR cluster must be present in the directory that HADOOP_CONF_DIR points to. Spark uses the configuration files, such as yarn-site.xml for YARN settings and hdfs-site.xml for HDFS settings, that are in the directory that HADOOP_CONF_DIR points to.
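The two preparation steps above can be checked with a short script before any job is submitted. The following is a minimal sketch, not an official tool: the function names missing_conf_files and preflight are hypothetical, and the list of expected files is an assumption (yarn-site.xml and hdfs-site.xml are named in the article; core-site.xml is added as another standard Hadoop configuration file). It only spot-checks a few files and does not verify that every file from /etc/hadoop/conf was copied.

```python
import os
import shutil
from pathlib import Path

# Assumption: a small spot-check list, not the full contents of /etc/hadoop/conf.
EXPECTED_FILES = ("yarn-site.xml", "hdfs-site.xml", "core-site.xml")


def missing_conf_files(conf_dir, expected=EXPECTED_FILES):
    """Return the expected config files that are absent from conf_dir."""
    conf_path = Path(conf_dir)
    return [name for name in expected if not (conf_path / name).is_file()]


def preflight(env=None):
    """Return a list of problems with the local setup; an empty list means none found."""
    env = os.environ if env is None else env
    problems = []

    # Step 1: the Spark client's bin directory should be on PATH.
    if shutil.which("spark-submit", path=env.get("PATH")) is None:
        problems.append("spark-submit not found on PATH")

    # Step 2: HADOOP_CONF_DIR should point to a directory holding the cluster's config.
    conf_dir = env.get("HADOOP_CONF_DIR")
    if not conf_dir:
        problems.append("HADOOP_CONF_DIR is not set")
    else:
        for name in missing_conf_files(conf_dir):
            problems.append(f"{name} missing from {conf_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Running the script with no output suggests that spark-submit is reachable and that the spot-checked configuration files are in place.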

Note: When you submit a Spark job in cluster mode, the driver runs on cluster nodes that have all Hadoop binaries installed. When you submit a Spark job in client mode, all Hadoop binaries must be downloaded and installed on your local machine.

Submit the Spark job

Your local machine is now ready to submit a Spark job to the remote Amazon EMR cluster by running spark-submit against YARN, with deploy-mode set to client or cluster.
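The submission step can be sketched as a small wrapper that assembles a spark-submit command line. This is an illustration under stated assumptions, not the exact command from the original article: the helper name build_spark_submit, the jar my-app.jar, and the class com.example.MyApp are hypothetical placeholders. The flags --master, --deploy-mode, and --class are standard spark-submit options, and the master is set to yarn because the cluster's YARN settings are read from HADOOP_CONF_DIR.

```python
import shlex
import subprocess


def build_spark_submit(app_jar, main_class, deploy_mode="cluster", app_args=()):
    """Build the argument list for a spark-submit call against YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",             # resource manager address comes from HADOOP_CONF_DIR
        "--deploy-mode", deploy_mode,   # where the driver runs: locally or on the cluster
        "--class", main_class,
        app_jar,
        *app_args,
    ]


def submit(cmd, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical jar and class names; replace them with your own application.
    submit(build_spark_submit("my-app.jar", "com.example.MyApp"))
```

The wrapper defaults to a dry run so that the command can be inspected first. In client mode the driver runs on the local machine, so the Hadoop binaries mentioned in the note above must also be installed locally.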
