Data Analysis with Apache Spark and MySQL
Reading data from a MySQL table with spark-shell
Step 1: Launch spark-shell, passing the MySQL JDBC driver JAR with --jars so Spark can load com.mysql.jdbc.Driver:
bigdata@ubuntu1:~/run/spark/bin$ ./spark-shell --master spark://ubuntu1:7077 --jars /home/bigdata/run/spark/mysql-connector-java-5.1.30-bin.jar
The output looks like this:
bigdata@ubuntu1:~/run/spark/bin$ ./spark-shell --master spark://ubuntu1:7077 --jars /home/bigdata/run/spark/mysql-connector-java-5.1.30-bin.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/08 01:40:28 WARN spark.SparkConf: The configuration key 'spark.history.updateInterval' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.history.fs.update.interval' instead.
17/05/08 01:40:46 WARN spark.SparkConf: The configuration key 'spark.history.updateInterval' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.history.fs.update.interval' instead.
17/05/08 01:40:46 WARN spark.SparkConf: The configuration key 'spark.history.updateInterval' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.history.fs.update.interval' instead.
17/05/08 01:40:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/08 01:40:57 WARN spark.SparkConf: The configuration key 'spark.history.updateInterval' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.history.fs.update.interval' instead.
17/05/08 01:41:01 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/bigdata/run/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/bigdata/run/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar."
17/05/08 01:41:01 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/bigdata/run/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/bigdata/run/spark/jars/datanucleus-rdbms-3.2.9.jar."
17/05/08 01:41:01 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/bigdata/run/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/bigdata/run/spark/jars/datanucleus-core-3.2.10.jar."
17/05/08 01:41:10 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.3.19.171:4040
Spark context available as 'sc' (master = spark://ubuntu1:7077, app id = app-20170508014050-0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Step 2: Create the variable sqlContext
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...
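The deprecation warning is expected: since Spark 2.0, SQLContext has been superseded by SparkSession, and spark-shell already provides one as spark (note "Spark session available as 'spark'" in the startup log above). A minimal sketch of the modern equivalent, reusing that session:

// Spark 2.x style: derive the SQLContext from the SparkSession that
// spark-shell already created, instead of constructing a new one.
val sqlContext = spark.sqlContext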
Step 3: Load the data from MySQL
scala> val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://127.0.0.1:3306/mydatabase").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "mytable").option("user", "myname").option("password", "mypassword").load()
dataframe_mysql: org.apache.spark.sql.DataFrame = [id: string, grouptype: int ... 16 more fields]
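The same load can also be written with the jdbc() shorthand, which takes the connection settings as java.util.Properties. A sketch, assuming the same database, table, and placeholder credentials as above; the partitioned overload matters for large tables, because without it Spark reads the whole table over a single connection:

import java.util.Properties

// Connection settings: the same placeholder credentials as above.
val props = new Properties()
props.put("user", "myname")
props.put("password", "mypassword")
props.put("driver", "com.mysql.jdbc.Driver")

// Single-connection read, equivalent to the load() call above.
val df = sqlContext.read.jdbc(
  "jdbc:mysql://127.0.0.1:3306/mydatabase", "mytable", props)

// Parallel read: Spark issues one query per partition, splitting on the
// numeric column grouptype (id is a string here, so it cannot be used).
// The bounds 0/10 and 4 partitions are hypothetical; pick values that
// span the real data.
val dfParallel = sqlContext.read.jdbc(
  "jdbc:mysql://127.0.0.1:3306/mydatabase", "mytable",
  "grouptype", 0L, 10L, 4, props)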
Step 4: Display the contents of the DataFrame
scala> dataframe_mysql.show
+---+---------+-------+---------+------+------+---+--------------------+---+-----------+-----+----+--------------------+--------------------+-----+------+----------+---+
| id|grouptype|groupid|loginname|  name|   pwd|sex|            birthday|tel|mobilephone|email|isOk|       lastLoginTime|             addtime|intro|credit|experience|img|
+---+---------+-------+---------+------+------+---+--------------------+---+-----------+-----+----+--------------------+--------------------+-----+------+----------+---+
|  1|        1|      1|    admin| admin| admin|  1|2016-05-05 14:51:...|  1|          1|    1|   1|2016-05-10 14:52:...|2016-05-08 14:52:...|    1|     1|         1|  1|
|  2|        2|      2|   wanghb|wanghb|wanghb|  2|2016-05-10 14:56:...|  2|          2|    2|   2|2016-05-11 14:57:...|2016-05-10 14:57:...|    2|     2|        22|  2|
+---+---------+-------+---------+------+------+---+--------------------+---+-----------+-----+----+--------------------+--------------------+-----+------+----------+---+
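The same data can also be explored with the DataFrame API before dropping into SQL. A small sketch against the columns shown above (output omitted):

// Project a few columns and keep only rows whose grouptype is 1.
dataframe_mysql
  .select("id", "loginname", "grouptype", "lastLoginTime")
  .filter(dataframe_mysql("grouptype") === 1)
  .show()

// Quick sanity checks on the loaded table.
dataframe_mysql.printSchema()
println(dataframe_mysql.count())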
Step 5: Register the DataFrame as a temporary table so it can be queried later
scala> dataframe_mysql.registerTempTable("tmp_tablename")
warning: there was one deprecation warning; re-run with -deprecation for details
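As with SQLContext in step 2, this warning is expected: Spark 2.x replaces registerTempTable with createOrReplaceTempView. A one-line equivalent for the same session:

// Spark 2.x replacement for registerTempTable; the view lives for the
// lifetime of the current SparkSession.
dataframe_mysql.createOrReplaceTempView("tmp_tablename")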
Step 6: The data can now be queried from the temporary table "tmp_tablename":
scala> dataframe_mysql.sqlContext.sql("select * from tmp_tablename").collect.foreach(println)
[1,1,1,admin,admin,admin,1,2016-05-05 14:51:58.0,1,1,1,1,2016-05-10 14:52:07.0,2016-05-08 14:52:12.0,1,1,1,1]
[2,2,2,wanghb,wanghb,wanghb,2,2016-05-10 14:56:58.0,2,2,2,2,2016-05-11 14:57:05.0,2016-05-10 14:57:08.0,2,2,22,2]
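Any SQL that Spark supports can now be run against the view, which is what makes the temp-table step useful for analysis. As a sketch, a per-group row count over the same view (output omitted):

// Aggregate over the temporary table and print the result rows
// on the driver.
dataframe_mysql.sqlContext
  .sql("select grouptype, count(*) as cnt from tmp_tablename group by grouptype")
  .collect()
  .foreach(println)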