Spark SQL Notes (1): Hive
阿新 · Published: 2018-11-06
1 Getting Started with Big Data
- Learn how to use Hadoop and Hive
- Learn Spark
- The central role of DataFrame and Dataset in the Spark framework
2 Hive
2.1 Why Hive emerged
- Writing MapReduce programs directly is inconvenient;
- Files on HDFS carry no schema;
2.2 What Hive is
- Typically used for offline (batch) data processing, executed as MapReduce jobs
- The underlying layer supports several execution engines (MapReduce, Tez, Spark)
- Supports a variety of compression formats, storage formats, and user-defined functions (see the sketch after this list)
- Compression: GZIP, LZO, Snappy, BZIP2, ...
- UDF: user-defined functions
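These choices surface as ordinary session settings. A minimal sketch, assuming the Spark engine and the Snappy codec are actually installed on the cluster (otherwise keep the defaults):
-- pick the execution engine used by subsequent queries (mr, tez, or spark)
SET hive.execution.engine=spark;
-- compress job output; the codec class must be present on every node
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;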
2.3 Hive architecture
2.4 Hive test environment
2.5 Hive production environment
3 Installing Hive
hive-1.1.0-cdh5.7.0.tar.gz
3.1 Extract the archive
tar -zxvf hive-1.1.0-cdh5.7.0.tar.gz -C /home/hadoop/apps
3.2 Configure Hive
Reference: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
3.2.1 Configure environment variables
export HIVE_HOME=/home/hadoop/apps/hive-1.1.0-cdh5.7.0
export PATH=$PATH:$HIVE_HOME/bin
3.2.2 hive-env.sh
In /home/hadoop/apps/hive-1.1.0-cdh5.7.0/conf, edit hive-env.sh and set:
HADOOP_HOME=/home/hadoop/apps/hadoop-2.6.0-cdh5.7.0
3.2.3 Create a new hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/sparksql?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
3.2.4 Put the MySQL JDBC driver into the lib directory
Download: https://dev.mysql.com/downloads/connector/j/5.1.html
4 Testing
4.1 Log in to MySQL
[hadoop@node1 ~]$ mysql -uroot -proot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.7.10 MySQL Community Server (GPL)
Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sparksql |
| sys |
+--------------------+
5 rows in set (0.03 sec)
mysql> use sparksql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sparksql |
| sys |
+--------------------+
5 rows in set (0.00 sec)
mysql> show tables;
+---------------------------+
| Tables_in_sparksql |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_STATS |
| ROLES |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| VERSION |
+---------------------------+
28 rows in set (0.00 sec)
mysql> select * from TBLS;
Empty set (0.00 sec)
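TBLS is empty because no Hive table has been created yet. The DBS table, on the other hand, should already list Hive's default database once the metastore has been initialized; a quick check (column names follow the usual Hive 1.x metastore schema, and the exact output depends on the installation):
mysql> SELECT DB_ID, NAME, DB_LOCATION_URI FROM DBS;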
5 Exercises
Start the Hadoop cluster first.
Start Hive by typing hive on the command line.
5.1 Create a table
hive> create table hive_wordcount(context string);
OK
Time taken: 0.593 seconds
hive> show tables;
OK
hive_wordcount
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive>
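The same metadata can also be checked from inside Hive before going to MySQL, for example with the standard DESCRIBE FORMATTED command:
hive> DESCRIBE FORMATTED hive_wordcount;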
Switch to the MySQL database to check the metastore:
[hadoop@node1 ~]$ mysql -uroot -proot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 54
Server version: 5.7.10 MySQL Community Server (GPL)
Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sparksql |
| sys |
+--------------------+
5 rows in set (0.00 sec)
mysql> use sparksql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_sparksql |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_STATS |
| ROLES |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| VERSION |
+---------------------------+
29 rows in set (0.00 sec)
mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+----------------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+----------------+---------------+--------------------+--------------------+
| 1 | 1540994772 | 1 | 0 | hadoop | 0 | 1 | hive_wordcount | MANAGED_TABLE | NULL | NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+----------------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
mysql>
mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
| 1 | NULL | context | string | 0 |
+-------+---------+-------------+-----------+-------------+
1 row in set (0.00 sec)
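hive_wordcount keeps each input line in a single string column. When a file carries several delimited fields, the schema and the field delimiter can be declared at creation time; a sketch with a made-up table name and columns, just for illustration:
hive> CREATE TABLE emp_info(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';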
5.2 Load data into the table
hive> load data local inpath '/home/hadoop/words.txt' into table hive_wordcount;
Loading data to table default.hive_wordcount
Table default.hive_wordcount stats: [numFiles=1, totalSize=46]
OK
Time taken: 1.223 seconds
Check the loaded data:
hive> select * from hive_wordcount;
OK
hello world tom hello world
tom jerry
hello
Time taken: 0.293 seconds, Fetched: 3 row(s)
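LOAD DATA ... LOCAL copies a file from the local filesystem into the table's warehouse directory. Dropping LOCAL makes Hive move a file that already sits on HDFS, and OVERWRITE replaces the current contents; a sketch assuming the file had first been uploaded to a hypothetical HDFS path:
hive> LOAD DATA INPATH '/user/hadoop/words.txt' OVERWRITE INTO TABLE hive_wordcount;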
5.3 Word count
lateral view explode() splits each record into multiple rows on the given delimiter, so that every word ends up in its own row:
hive> select word,count(1) from hive_wordcount lateral view explode(split(context,' ')) wc as word group by word;
Query ID = hadoop_20181031220606_a8bd43d1-8706-408f-a293-8d65428fcd43
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1540991507430_0001, Tracking URL = http://node1:8088/proxy/application_1540991507430_0001/
Kill Command = /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1540991507430_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-10-31 22:21:58,691 Stage-1 map = 0%, reduce = 0%
2018-10-31 22:22:08,281 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.79 sec
2018-10-31 22:22:15,572 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.94 sec
MapReduce Total cumulative CPU time: 2 seconds 940 msec
Ended Job = job_1540991507430_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.94 sec HDFS Read: 8788 HDFS Write: 33 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
OK
1
hello 3
jerry 1
tom 2
world 2
Time taken: 30.297 seconds, Fetched: 5 row(s)
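To see what lateral view explode contributes before the aggregation, the expansion can also be run on its own; every line of hive_wordcount becomes one row per word. (The lone count of 1 with an empty word in the result above most likely comes from an extra space in words.txt.)
hive> SELECT explode(split(context, ' ')) AS word FROM hive_wordcount;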