Apache Hive integration with Elasticsearch

阿新 • Published: 2019-01-02

Configuration

When declaring a Hive external table backed by Elasticsearch, you can pass configuration properties through TBLPROPERTIES, as an alternative to setting them on the Hadoop Configuration object:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');

Each key listed in TBLPROPERTIES is an elasticsearch-hadoop setting.
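As a sketch of a slightly fuller configuration, the statement below also sets es.nodes and es.port, the elasticsearch-hadoop settings for the cluster address. The host name es-host.example.com and the column list are placeholders chosen for illustration, and 9200 is simply the default port, so adjust all of them to your environment.

```sql
-- Sketch only: host name and columns are hypothetical placeholders.
CREATE EXTERNAL TABLE artists (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.nodes' = 'es-host.example.com',  -- assumed cluster address
              'es.port' = '9200',                  -- default Elasticsearch HTTP port
              'es.index.auto.create' = 'false');
```

Keeping these values on the table means every query against it picks them up without any per-job setup.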

Mapping

By default, elasticsearch-hadoop uses the Hive table schema to map the data in Elasticsearch, using both the field names and types in the process. In some cases, however, the names in Hive cannot be used with Elasticsearch, because a field name can contain characters that Elasticsearch accepts but Hive does not. For such cases, use the es.mapping.names setting, which accepts a comma-separated list of name mappings in the following format: Hive field name:Elasticsearch field name

For example:

CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.names' = 'date:@timestamp, url:url_123');

In the statement above:

	es.mapping.names holds the name mapping for two fields, date and url, separated by a comma
	Hive column date is mapped in Elasticsearch to @timestamp
	Hive column url is mapped in Elasticsearch to url_123

Hive is case insensitive while Elasticsearch is not. This loss of information can produce invalid queries, since a column in Hive might not match the field in Elasticsearch. To avoid this, elasticsearch-hadoop always converts Hive column names to lower-case. That said, it is recommended to follow the default Hive style: use upper-case only for Hive commands and avoid mixed-case names.
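When the Elasticsearch field itself is mixed-case, one way to bridge the gap is the es.mapping.names setting described earlier. The sketch below assumes a hypothetical Elasticsearch field called artistName; the lower-case Hive column artistname is an illustrative name as well.

```sql
-- Sketch only: artistName is a hypothetical mixed-case Elasticsearch field.
CREATE EXTERNAL TABLE artists (
    id         BIGINT,
    artistname STRING)   -- lower-case on the Hive side
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              -- format is Hive field name:Elasticsearch field name
              'es.mapping.names' = 'artistname:artistName');
```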

Hive represents missing values with the special value NULL. This means that a query with incorrect or non-existent field names does not throw an exception; the Hive table is simply populated with NULL. Validate your data and keep a close eye on your schema, since schema changes will otherwise go unnoticed because of this lenient behavior.
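A simple sanity check for this situation is to compare the total row count with the number of non-NULL values in a column you expect to be populated. The sketch below assumes an Elasticsearch-backed table named artists with a name column, as used elsewhere in this article.

```sql
-- COUNT(*) counts every row; COUNT(name) counts only rows where name is not NULL.
SELECT COUNT(*)    AS total_rows,
       COUNT(name) AS rows_with_name
FROM artists;
```

If rows_with_name is zero while total_rows is not, the column name most likely does not match any field in Elasticsearch.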

Writing data to Elasticsearch

With elasticsearch-hadoop, Elasticsearch becomes just an external table into which data can be loaded and from which it can be read:

CREATE EXTERNAL TABLE artists (
    id    BIGINT,
    name  STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
FROM source s;

In the statements above:

	STORED BY names the Elasticsearch Hive StorageHandler
	es.resource is the Elasticsearch resource (index and type) associated with the given storage
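Reading goes through the same table. As a minimal sketch, assuming the artists table defined above and an index that already holds documents, an ordinary query is enough:

```sql
-- Read a few documents back; links.url accesses a member of the STRUCT column.
SELECT name, links.url
FROM artists
LIMIT 10;
```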

When the id of the document (or another metadata field such as ttl or timestamp) needs to be specified, set the appropriate mapping, namely es.mapping.id. Following the previous example, to tell Elasticsearch to use the field id as the document id, update the table properties:

CREATE EXTERNAL TABLE artists (
    id BIGINT,
    ...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.mapping.id' = 'id'...);
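With es.mapping.id in place, the insert should supply a real value for id rather than NULL. The sketch below assumes the full artists column list from the earlier example and a source table that has an id column of its own; that column is a hypothetical addition for illustration.

```sql
-- Sketch only: assumes source has an id column (not shown in the original example).
INSERT OVERWRITE TABLE artists
SELECT s.id, s.name, named_struct('url', s.url, 'picture', s.picture)
FROM source s;
```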

Writing existing JSON to Elasticsearch

When the job input data is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as is and sent directly to Elasticsearch. In such cases, indicate the JSON input by setting the es.input.json parameter. elasticsearch-hadoop then expects the output table to contain only one field, whose content is used as the JSON document. The library recognizes specific textual types (such as string or binary) and otherwise simply calls toString on the value.
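A minimal sketch of such a table follows. The table name artists_json, the source table json_source, and its column json_doc are hypothetical names chosen for illustration; only es.input.json and the single-field requirement come from the text above.

```sql
-- Sketch only: table and column names are hypothetical.
CREATE EXTERNAL TABLE artists_json (
    data STRING)   -- the single field holding one JSON document per row
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.input.json' = 'true');

-- Each row of json_source.json_doc is sent to Elasticsearch as is.
INSERT OVERWRITE TABLE artists_json
SELECT s.json_doc
FROM json_source s;
```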
