Spark in Practice (1): Configuring AWS EMR and Zeppelin Notebook
SparkContext vs. SparkSession: what is the difference, and which one should you use?
- SparkContext:
    - Used before Spark 2.0.0
    - Connects to the cluster through a resource manager such as YARN
    - A SparkConf object must be passed in to create the SparkContext
    - To use the SQL, Hive, or Streaming APIs, a separate context has to be created for each (see the PySpark sketch after this list)
    - Example (Scala):
      val conf = new SparkConf().setAppName("RetailDataAnalysis").setMaster("spark://master:7077")
      val sc = new SparkContext(conf)
- SparkSession:
    - Introduced in Spark 2.0.0 and now the recommended entry point
    - Exposes all of Spark's functionality and adds the DataFrame and Dataset APIs
    - No separate contexts are needed for SQL, Hive, or Streaming
    - Configuration can still be set after the session has been initialized (see the PySpark sketch below)
    - Example (Scala):
      // Creating the Spark session
      val spark = SparkSession.builder().appName("RetailDataAnalysis").getOrCreate()
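Since the Zeppelin examples later in this post use PySpark, here is a minimal PySpark sketch of both patterns. The app name, master URL, shuffle setting, and batch interval are illustrative placeholders, not values required by the post.

# Pre-2.0 style: one context per API family
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("RetailDataAnalysis").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)       # Spark SQL / DataFrames
hiveContext = HiveContext(sc)     # Hive tables
ssc = StreamingContext(sc, 10)    # Streaming with a 10-second batch interval

# Spark 2.0+ style: a single SparkSession covers all of the above
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RetailDataAnalysis") \
    .getOrCreate()

# configuration can still be adjusted after the session is created
spark.conf.set("spark.sql.shuffle.partitions", "50")

# SQL, DataFrames, and the underlying SparkContext all hang off the same object
sc = spark.sparkContext
spark.range(10).createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS n FROM numbers").show()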
Configuring AWS EMR
# 1. Open the AWS console
# 2. Go to the EMR service
# 3. Create a cluster
# 4. Go to advanced options
# 5. Release: emr-5.11.1
# 6. Hadoop: 2.7.3
# 7. Zeppelin: 0.7.3
# 8. Spark: 2.2.1
# 9. Choose spot instances to stay within budget
# 10. Create your key pair, download it, and chmod 400 it
# 11. Add inbound security group rules: port 22 for SSH, port 8890 for Zeppelin
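The console walkthrough above can also be scripted. Below is a minimal boto3 sketch of the same cluster, assuming the default EMR service roles already exist in the account; the region, key-pair name, instance types, and bid price are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-zeppelin-demo",
    ReleaseLabel="emr-5.11.1",                      # matches step 5 above
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m4.large",
             "InstanceCount": 1, "Market": "SPOT", "BidPrice": "0.10"},
            {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m4.large",
             "InstanceCount": 2, "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "Ec2KeyName": "my-key-pair",                # the key pair from step 10
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])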
Creating a Zeppelin Notebook
# 1. Open the master node's public DNS on port 8890 in a browser
# 2. Create a new note
# 3. Default Interpreter: spark
%pyspark
# 4. start the paragraph with the %pyspark interpreter directive
# after that, you can run Python code in Zeppelin
for i in [1, 2, 3]:
    print(i)
# the SparkContext is already created for you as sc
sc
# the SparkSession is already created for you as spark
spark
# read a file from AWS S3, embedding the access key and secret key in the URL (placeholders below)
df = spark.read.csv("s3n://MyAccessKey:MySecretKey@my-bucket/file.csv")
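Embedding keys in the URL works, but it leaks credentials into notebooks and logs. On EMR the cluster's EC2 instance profile usually already grants S3 access, so a plain s3:// path can be read directly; the sketch below assumes such a role exists, and the bucket name and options are placeholders.

%pyspark
# assumes the EMR instance profile can read the bucket; bucket and file names are placeholders
df = spark.read.csv("s3://my-bucket/file.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)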