Importing data from HDFS into a Hive table

阿新 • Published: 2020-12-07

After a file has been put into HDFS, a table must be created and mapped onto it before it appears in show tables.

Assume the file has been placed in this HDFS directory: /apps/hive/warehouse/db_name.db/tb_name (it could equally be some other file, such as a csv or txt, e.g. /username/test/test.txt)

Method 1: create an external partitioned table

1. First create an external partitioned table whose columns match the fields of the file in HDFS:

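create external table if not exists db_name.tb_name_2(id string, name string) partitioned by (day string) row format delimited fields terminated by '\1' STORED AS textfile LOCATION '/apps/hive/warehouse/db_name.db/tb_name/day=xxxxxxxx';

Note: location here points all the way down to the partition (assuming the partition column is day). If the table being pointed at is not partitioned, point at the table directory instead. Either way, the directory must directly contain the data files.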

2. Add the partition

alter table db_name.tb_name_2 add partition (day=xxxxxxxx) location '/apps/hive/warehouse/db_name.db/tb_name/day=xxxxxxxx';
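
When many partitions need to be registered, the two statements above can be generated rather than typed by hand. The following is a minimal Python sketch, not something from the original workflow: the helper names create_external_table_sql and add_partition_sql are hypothetical, the day values in the usage loop are made up, and it assumes each partition lives in a `day=<value>` subdirectory of a common root. It only builds HiveQL strings; executing them (hive CLI, beeline, or spark.sql) is left to the caller.

```python
def create_external_table_sql(table, columns, partition_col, location):
    """Build a CREATE EXTERNAL TABLE statement for a partitioned text table.

    columns is a list of (name, type) pairs; location is passed through as-is.
    """
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"create external table if not exists {table}({cols}) "
        f"partitioned by ({partition_col} string) "
        r"row format delimited fields terminated by '\1' "
        f"stored as textfile location '{location}'"
    )


def add_partition_sql(table, partition_col, value, root):
    """Build an ALTER TABLE ... ADD PARTITION statement for one partition value."""
    location = f"{root}/{partition_col}={value}"
    return (
        f"alter table {table} add if not exists "
        f"partition ({partition_col}='{value}') location '{location}'"
    )


# Usage: one statement per (hypothetical) day value.
root = "/apps/hive/warehouse/db_name.db/tb_name"
for day in ["20201201", "20201202"]:
    print(add_partition_sql("db_name.tb_name_2", "day", day, root))
```

Quoting the partition value keeps it a string, which matches the `day string` partition column declared in the table.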

Method 2: create an external non-partitioned table, with location pointing straight at the data:

If the data location holds the data of a non-partitioned table, point directly at that table:

create external table if not exists db_name.tb_name_2(id string, name string) row format delimited fields terminated by '\1' STORED AS textfile LOCATION '/apps/hive/warehouse/db_name.db/tb_name';

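If the data location is just one partition of a table, point directly at the partition data (even though the external table being created is not itself partitioned):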

create external table if not exists db_name.tb_name_2(id string, name string) row format delimited fields terminated by '\1' STORED AS textfile LOCATION '/apps/hive/warehouse/db_name.db/tb_name/day=xxxxxxxx';
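
For an external table to return sensible rows, the files under location have to match the declared row format. As an illustration only, here is a small Python sketch of that file layout. It assumes the '\1' delimiter in the DDL resolves to the Ctrl-A byte (0x01, which is also Hive's default field delimiter) and that each row is one newline-terminated line. The function names and sample rows are made up for the example.

```python
FIELD_DELIM = "\x01"  # assumed meaning of '\1' in the DDL (Ctrl-A)


def encode_row(fields):
    """Join one row's field values into a single delimited text line."""
    return FIELD_DELIM.join(fields)


def decode_line(line, column_names):
    """Split one text line back into a {column: value} dict."""
    values = line.rstrip("\n").split(FIELD_DELIM)
    return dict(zip(column_names, values))


# Usage: two sample rows for the (id string, name string) schema.
columns = ["id", "name"]
lines = [encode_row(["1", "alice"]), encode_row(["2", "bob"])]
for line in lines:
    print(decode_line(line, columns))
```

If the files use a different separator (a comma in a csv, for instance), the `fields terminated by` clause has to be changed to match, otherwise every row ends up in the first column.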

Method 3: create a managed (internal) table: create the table first, then load

Create the table: create table if not exists db_name.tb_name_2(id string, name string) row format delimited fields terminated by '\1' STORED AS textfile;

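Load the data: load data inpath '/apps/hive/warehouse/db_name.db/tb_name' into table db_name.tb_name_2;

Note that load data inpath moves the files from that HDFS path into the table's directory rather than copying them.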
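
If the same load has to be repeated for several paths, the statement can be assembled in Python. This is a sketch under stated assumptions: the helper name load_data_sql is hypothetical, and the /tmp/test.txt path in the usage is made up. The local and overwrite keywords it can emit are standard parts of Hive's LOAD DATA syntax (local reads from the local filesystem instead of HDFS, overwrite replaces the table's existing contents).

```python
def load_data_sql(path, table, local=False, overwrite=False):
    """Build a Hive LOAD DATA statement as a string."""
    parts = ["load data"]
    if local:
        parts.append("local")        # read from the local filesystem, not HDFS
    parts.append(f"inpath '{path}'")
    if overwrite:
        parts.append("overwrite")    # replace existing data instead of appending
    parts.append(f"into table {table}")
    return " ".join(parts)


# Usage: the statement from Method 3, then a local-file variant.
print(load_data_sql("/apps/hive/warehouse/db_name.db/tb_name", "db_name.tb_name_2"))
print(load_data_sql("/tmp/test.txt", "db_name.tb_name_2", local=True, overwrite=True))
```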

Method 4: if the command line feels tedious, the table creation and insert statements can be run from Python code (spark-sql)

import os
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import StructField, StructType, StringType
import warnings
warnings.filterwarnings("ignore")

os.environ["PYSPARK_PYTHON"]="/root/anaconda3/envs/my_env/bin/python3.7"  # not needed if there is only one Python environment

# local mode
# sc = SparkContext('local', 'test')

# cluster mode
conf = SparkConf().setAppName('test')
conf.set(" "," ")    # placeholder: set whatever Spark configuration key/value pairs are needed
sc = SparkContext(conf=conf)
# ------
spark = SparkSession.builder.appName("SparkOnHive").enableHiveSupport().getOrCreate()    # Hive support, so that Hive statements can be run
hive_text = HiveContext(sc)   # HiveContext wraps the SparkContext; spark.sql(...) on the session above works as well
sql_text = SQLContext(sc)

import pandas as pd
dt = pd.read_csv('xxxxxx')
# Clean the column names: strip spaces, newlines and tabs
df = dt.rename(columns=lambda x: x.replace(" ","").replace("\n","").replace("\t",""))
#schema=StructType(                        # an RDD needs a schema to become a DataFrame
#        [StructField("id", StringType(), True),
#         StructField("name", StringType(), True),
#         StructField("sex", StringType(), True)]
#        )
data = hive_text.createDataFrame(df)    # convert the pandas DataFrame into a pyspark DataFrame
hive_text.registerDataFrameAsTable(data, tableName="test_table")   # this registers a temporary (virtual) table
hive_text.sql("use db_name")
hive_text.sql("create table if not exists tb_name(id string, name string, sex string) row format delimited fields terminated by '\1' STORED AS textfile")  # run the SQL statement that creates the table
hive_text.sql("insert overwrite table tb_name select * from test_table")  # the virtual table test_table can be queried directly
# hive_text.sql("alter table tb_name add if not exists partition(day=xxxxxxxx)")  # for a partitioned table, add the partition first
data_select = hive_text.sql("select * from xxxx")   # note: the result data_select is a pyspark DataFrame

sc.stop()
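
On Spark 2.x and later, the same flow can be written against SparkSession alone, without HiveContext or SQLContext. The following is a hedged sketch rather than a drop-in replacement. The names clean_column and csv_to_hive are hypothetical, the function expects a Hive-enabled SparkSession to be passed in, and it reads the csv with Spark directly instead of going through pandas. It also assumes the target table already exists and that its columns line up with those of the csv.

```python
def clean_column(name):
    """Strip spaces, newlines and tabs from a column name."""
    return name.replace(" ", "").replace("\n", "").replace("\t", "")


def csv_to_hive(spark, csv_path, table, view="test_table"):
    """Load a csv with a header row and overwrite an existing Hive table with it.

    spark is assumed to be a SparkSession built with enableHiveSupport().
    """
    df = spark.read.csv(csv_path, header=True)             # every column is read as a string
    df = df.toDF(*[clean_column(c) for c in df.columns])   # rename all columns in one call
    df.createOrReplaceTempView(view)                       # temporary view visible to spark.sql
    spark.sql(f"insert overwrite table {table} select * from {view}")
    return df
```

A typical call would build the session with SparkSession.builder.appName("SparkOnHive").enableHiveSupport().getOrCreate() and then run csv_to_hive(spark, 'xxxxxx', 'db_name.tb_name'). Reading with Spark avoids collecting the whole file into a pandas DataFrame on the driver, which may matter for large inputs.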

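Managed (internal) tables and external tables:

A table created without the external keyword is a managed (internal) table; a table created with external is an external table.

Differences:
For a managed table, Hive manages both the metadata and the data; for an external table, Hive manages only the metadata, and the data simply stays wherever it sits on HDFS.
Managed table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); the storage location of external table data is specified by the user.
Dropping a managed table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left in place.
Changes made to a managed table through Hive are reflected in the metadata directly, whereas partitions of an external table that are added or changed directly on HDFS are not visible to the metastore until it is repaired (MSCK REPAIR TABLE table_name;)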
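
The drop behaviour described above can be pictured with a toy model. This is plain Python written only for illustration, not Hive code: the class ToyWarehouse and the table names are invented, and the "metastore" and "filesystem" are just an in-memory dict and set.

```python
class ToyWarehouse:
    """In-memory stand-in for a metastore plus the directories holding data."""

    def __init__(self):
        self.metastore = {}   # table name -> (location, is_external)
        self.files = set()    # directories that currently hold data

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)
        self.files.add(location)

    def drop_table(self, name):
        location, external = self.metastore.pop(name)   # metadata always goes
        if not external:
            self.files.discard(location)                # data goes only for managed tables


wh = ToyWarehouse()
wh.create_table("managed_tb", "/user/hive/warehouse/managed_tb")
wh.create_table("external_tb", "/username/test", external=True)
wh.drop_table("managed_tb")
wh.drop_table("external_tb")
print(sorted(wh.files))  # ['/username/test']
```

After both drops the metastore is empty, but the external table's directory is still there, which is why external tables are the safer choice when the files are shared with other jobs.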

References:

https://blog.csdn.net/qq_44449767/article/details/99716613

https://www.jianshu.com/p/1a4dfd654786
