
Hive Tutorial: Reading Notes

# Hive Tutorial

Original article: [https://cwiki.apache.org/confluence/display/Hive/Tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial)

[TOC]

## 1、Concepts

### 1.1、What Is Hive

> Hive is a data warehousing infrastructure based on [Apache Hadoop](http://hadoop.apache.org/). Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.
>
> Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).

### 1.2、What Hive Is NOT

> Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

### 1.3、Getting Started

> For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.
>
> Books about Hive lists some books that may also be helpful for getting started with Hive.
>
> In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.
### 1.4、Data Units

> In the order of granularity - Hive data is organized into:
>
> **Databases**: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.
>
> **Tables**: Homogeneous units of data which have the same schema. An example of a table could be page_views table, where each row could comprise of the following columns (schema):
>
> - timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
> - userid—which is of BIGINT type that identifies the user who viewed the page.
> - page_url—which is of STRING type that captures the location of the page.
> - referer_url—which is of STRING that captures the location of the page from where the user arrived at the current page.
> - IP—which is of STRING type that captures the IP address from where the page request was made.
>
> **Partitions**: Each Table can have one or more partition Keys which determines how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy a specified criteria; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
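As a sketch, the page_views table with such partitioning might be declared and queried as below. The column names follow the schema described above; the partition column names (`dt`, `country`) and everything else are illustrative assumptions, not the tutorial's exact DDL.

```sql
-- Illustrative sketch: page_views partitioned by date and country.
-- Column names are from the text; partition names dt/country are assumptions.
CREATE TABLE page_views (
    `timestamp` INT,
    userid      BIGINT,
    page_url    STRING,
    referer_url STRING,
    ip          STRING
)
PARTITIONED BY (dt STRING, country STRING);

-- A query restricted by the partition columns reads only the
-- relevant partition's data, not the whole table:
SELECT page_url
FROM page_views
WHERE dt = '2009-12-23' AND country = 'US';
```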
> Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns, they are not part of the data itself but are derived on load.
>
> **Buckets** (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the data.
>
> Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.

### 1.5、Type System

> Hive supports primitive and complex data types, as described below. See [Hive Data Types](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) for additional information.

#### 1.5.1、Primitive Types

> Types are associated with the columns in the tables.
> The following Primitive types are supported:
>
> - Integers
>   - TINYINT—1 byte integer
>   - SMALLINT—2 byte integer
>   - INT—4 byte integer
>   - BIGINT—8 byte integer
> - Boolean type
>   - BOOLEAN—TRUE/FALSE
> - Floating point numbers
>   - FLOAT—single precision
>   - DOUBLE—double precision
> - Fixed point numbers
>   - DECIMAL—a fixed point value of user defined scale and precision
> - String types
>   - STRING—sequence of characters in a specified character set
>   - VARCHAR—sequence of characters in a specified character set with a maximum length
>   - CHAR—sequence of characters in a specified character set with a defined length
> - Date and time types
>   - TIMESTAMP—a date and time without a timezone ("LocalDateTime" semantics)
>   - TIMESTAMP WITH LOCAL TIME ZONE—a point in time measured down to nanoseconds ("Instant" semantics)
>   - DATE—a date
> - Binary types
>   - BINARY—a sequence of bytes
>
> The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):
>
> - Type
>   - Primitive Type
>     - Number
>       - DOUBLE
>         - FLOAT
>         - BIGINT
>           - INT
>             - SMALLINT
>               - TINYINT
>         - STRING
>     - BOOLEAN
>
> This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
>
> Explicit type conversion can be done using the cast operator as shown in the [#Built In Functions](https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInFunctions) section below.
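As a small illustrative sketch of both conversion paths (the `sales` table and its columns are hypothetical, not from the tutorial):

```sql
-- Hypothetical table for illustration: amount is stored as STRING.
CREATE TABLE sales (item STRING, amount STRING);

-- SUM expects a numeric type; STRING is implicitly converted to DOUBLE,
-- which the hierarchy above permits (STRING is a child of DOUBLE).
SELECT SUM(amount) FROM sales;

-- Explicit conversion uses the cast operator:
SELECT CAST(amount AS DOUBLE) FROM sales;
```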
#### 1.5.2、Complex Types

> Complex Types can be built up from primitive types and other composite types using:
>
> - Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
> - Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed using M['group']
> - Arrays (indexable lists): The elements in the array have to be in the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
>
> Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise of the following fields:
>
> - gender—which is a STRING.
> - active—which is a BOOLEAN.

#### 1.5.3、Timestamp

> Timestamps have been the source of much confusion, so we try to document the intended semantics of Hive.

**Timestamp ("LocalDateTime" semantics)**

> Java's "LocalDateTime" timestamps record a date and time as year, month, date, hour, minute, and seconds without a timezone. These timestamps always have those same values regardless of the local time zone.
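Returning to the complex types of section 1.5.2, the three constructs can be sketched in DDL as follows (a hypothetical table; the names are illustrative, not from the tutorial):

```sql
-- Hypothetical table using the three complex types from 1.5.2.
CREATE TABLE users (
    name   STRUCT<first: STRING, last: STRING>,  -- struct fields
    groups MAP<STRING, BIGINT>,                  -- 'group' -> gid mapping
    tags   ARRAY<STRING>                         -- indexable list
);

-- DOT notation, ['key'] notation, and zero-based [n] indexing:
SELECT name.first, groups['group'], tags[1] FROM users;
```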
> For example, the timestamp value of "2014-12-12 12:34:56" is decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available. It does not correspond to any specific instant. It will always be the same value regardless of the local time zone. Unless your application uses UTC consistently, timestamp with local time zone is strongly preferred over timestamp for most applications. When users say an event is at 10:00, it is always in reference to a certain timezone and means a point in time, rather than 10:00 in an arbitrary time zone.

**Timestamp with local time zone ("Instant" semantics)**

> Java's "Instant" timestamps define a point in time that remains constant regardless of where the data is read. Thus, the timestamp will be adjusted by the local time zone to match the original point in time.

Type | Value in America/Los_Angeles | Value in America/New_York
---|:---|:---
timestamp | 2014-12-12 12:34:56 | 2014-12-12 12:34:56
timestamp with local time zone | 2014-12-12 12:34:56 | 2014-12-12 15:34:56

**Comparisons with other tools**

See the original article: [https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions](https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions)

### 1.6、Built In Operators and Functions

> The operators and functions listed below are not necessarily up to date. ([Hive Operators and UDFs](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) has more current information.)
> In [Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93NewCommandLineShell) or the Hive [CLI](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93NewCommandLineShell), use these commands to show the latest documentation:

```sql
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
```