VI. Introduction to the Spark API

Spark machine learning: an overview of the API
Official Spark API documentation:
http://spark.apache.org/docs/1.6.2/api/java/index.html
http://spark.apache.org/docs/2.2.0/api/java/index.html

1. RDDs are the foundation of Spark. 2. Consult the API documentation as your needs require.

I. Spark's functional modules
Spark SQL
Spark GraphX
Spark Streaming
Spark ML
Spark MLlib

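II. Commonly used machine-learning APIs
ml: takes a DataFrame as input (typically produced by Spark SQL)
mllib: takes a plain RDD as input (for example, data read from HDFS)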
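
To make the difference in input types concrete, here is a minimal sketch that puts the same rating rows into both shapes. It assumes PySpark and an existing SparkSession passed in as `spark` (Spark 2.x; on 1.6 a SQLContext plays that role). The helper names `as_dataframe` and `as_rating_rdd`, the sample rows, and the column names are illustrative assumptions, not part of the Spark API.

```python
# Illustrative rows: (userId, productId, rating). The values are made up.
ROWS = [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.0)]

def as_dataframe(spark, rows):
    """Shape expected by the ml package: a DataFrame with named columns."""
    return spark.createDataFrame(rows, ["userId", "productId", "rating"])

def as_rating_rdd(spark, rows):
    """Shape expected by the mllib package: a plain RDD of records."""
    # Imported here so this module can be loaded on a machine without Spark.
    from pyspark.mllib.recommendation import Rating
    return spark.sparkContext.parallelize(rows).map(
        lambda r: Rating(r[0], r[1], r[2])
    )
```

With a running session, `as_dataframe(spark, ROWS)` would feed an `org.apache.spark.ml` estimator and `as_rating_rdd(spark, ROWS)` an `org.apache.spark.mllib` one.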


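Example: given userId (user ID), productId (product ID), and a rating, recommend products to users.

Collaborative filtering finds other products that a user is likely to be interested in.
Common algorithm: ALS (alternating least squares)
org.apache.spark.ml.recommendation.ALS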
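
The idea behind alternating least squares can be sketched in plain Python. This is not Spark's implementation: it is a single-machine toy with one latent factor per user and per product, and the function name `als_rank1`, its parameters, and the sample ratings are assumptions made for illustration.

```python
def als_rank1(ratings, n_users, n_items, iterations=20, reg=0.0):
    """Fit rating(u, i) ~= user_f[u] * item_f[i] by alternating least squares."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold the item factors fixed; each user factor then has a
        # closed-form least-squares solution over that user's ratings.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i, r) in ratings if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i, r) in ratings if uu == u) + reg
            if den > 0:
                user_f[u] = num / den
        # Hold the user factors fixed; solve each item factor the same way.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii, r) in ratings if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii, r) in ratings if ii == i) + reg
            if den > 0:
                item_f[i] = num / den
    return user_f, item_f

# (userId, productId, rating) triples; user 1 has not rated product 2.
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
user_f, item_f = als_rank1(ratings, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]  # estimated rating of product 2 by user 1
```

In this sample, user 1 rates each product twice as high as user 0, and user 0 gave product 2 a rating of 3, so the estimate for the missing entry moves toward 6 as the iterations proceed. Spark's ALS applies the same alternation with higher-rank factors, distributed across the cluster.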

Supervised classification: org.apache.spark.mllib.classification
The training data is labeled in advance (for example, each user is tagged with a label).

Unsupervised learning (clustering): org.apache.spark.mllib.clustering, which is used in much the same way
KMeans
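
As a rough illustration of what KMeans computes, the following plain-Python sketch runs the basic assign-then-average loop on a handful of points. It is not the Spark implementation; the function names, the deterministic choice of the first k points as initial centers, and the sample data are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Return k cluster centers found by repeated assign-and-average steps."""
    centers = [points[i] for i in range(k)]  # simple deterministic start
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers

# Two visibly separate groups, no labels attached.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
centers = kmeans(points, k=2)
```

No labels are supplied anywhere, which is the difference from the supervised classification case: the grouping is derived from the distances between the points alone.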

Decision trees: org.apache.spark.mllib.tree

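Graph computation: org.apache.spark.graphx
org.apache.spark.sql: for data stored in MySQL, load it into Spark, run machine learning on it for prediction and statistical analysis, then write the results to HDFS.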
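
The MySQL-to-Spark-to-HDFS flow described above might look roughly like this in PySpark. This is a sketch under assumptions: `spark` is an existing SparkSession, a MySQL JDBC driver jar is on the classpath, and the helper names, host, database, table, credentials, and output path are all hypothetical.

```python
def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source (hypothetical helper)."""
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
    }

def load_from_mysql(spark, options):
    """Read one MySQL table into a DataFrame through the JDBC data source."""
    return spark.read.format("jdbc").options(**options).load()

def save_to_hdfs(df, path):
    """Write a DataFrame (for example, model predictions) to HDFS as Parquet."""
    df.write.mode("overwrite").parquet(path)

# Placeholder connection details, for illustration only.
options = mysql_jdbc_options("db.example.com", 3306, "shop", "ratings", "reader", "secret")
```

A typical run would call `load_from_mysql(spark, options)`, train and apply a model on the resulting DataFrame, and then pass the result to `save_to_hdfs` with an `hdfs://` path of your choosing.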

IV. API extensions
Data can be read from MySQL or Oracle:
org.apache.spark.sql
org.apache.spark.sql.api.java
org.apache.spark.sql.expressions
org.apache.spark.sql.hive
org.apache.spark.sql.hive.execution
org.apache.spark.sql.jdbc
org.apache.spark.sql.sources
org.apache.spark.sql.types
org.apache.spark.sql.util

org.apache.spark.streaming provides stream processing:
org.apache.spark.streaming.flume
org.apache.spark.streaming.kafka
org.apache.spark.streaming.kinesis
org.apache.spark.streaming.mqtt
org.apache.spark.streaming.receiver
org.apache.spark.streaming.scheduler
org.apache.spark.streaming.twitter
org.apache.spark.streaming.util
org.apache.spark.streaming.zeromq

ml: takes a DataFrame as input (typically produced by Spark SQL)
org.apache.spark.ml
org.apache.spark.ml.attribute
org.apache.spark.ml.classification
org.apache.spark.ml.clustering
org.apache.spark.ml.evaluation
org.apache.spark.ml.feature
org.apache.spark.ml.param
org.apache.spark.ml.recommendation
org.apache.spark.ml.regression
org.apache.spark.ml.source.libsvm
org.apache.spark.ml.tree
org.apache.spark.ml.tuning
org.apache.spark.ml.util

mllib: takes a plain RDD as input (for example, data read from HDFS)
org.apache.spark.mllib.classification
org.apache.spark.mllib.clustering
org.apache.spark.mllib.evaluation
org.apache.spark.mllib.feature
org.apache.spark.mllib.fpm
org.apache.spark.mllib.linalg
org.apache.spark.mllib.linalg.distributed
org.apache.spark.mllib.optimization
org.apache.spark.mllib.pmml
org.apache.spark.mllib.random
org.apache.spark.mllib.rdd
org.apache.spark.mllib.recommendation
org.apache.spark.mllib.regression
org.apache.spark.mllib.stat
org.apache.spark.mllib.stat.distribution
org.apache.spark.mllib.stat.test
org.apache.spark.mllib.tree
org.apache.spark.mllib.tree.configuration
org.apache.spark.mllib.tree.impurity
org.apache.spark.mllib.tree.loss
org.apache.spark.mllib.tree.model
org.apache.spark.mllib.util