HDPCD-Java Review Notes (19)
Hive
Apache Hive maintains metadata information about tables in a metastore. A Hive table consists of:
· A schema stored in the metastore
· Data stored on HDFS
HiveQL
Hive converts HiveQL commands into MapReduce jobs.
Hive and Pig
Pig is designed to move and restructure data.
Pig -- is a good choice for ETL jobs, where unstructured data is reformatted so that it is easier to define a structure for it.
Hive -- is a good choice when you want to query data that has a certain known structure.
Comparing Hive to SQL
SQL Datatypes           | SQL Semantics
INT                     | SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT | Expressions in WHERE and HAVING
BOOLEAN                 | GROUP BY, ORDER BY, SORT BY
FLOAT                   | CLUSTER BY, DISTRIBUTE BY
DOUBLE                  | Sub-queries in FROM clause
STRING                  | ROLLUP
BINARY                  | CUBE
TIMESTAMP               | UNION
ARRAY, MAP, STRUCT      | LEFT, RIGHT and FULL INNER/OUTER JOIN
DECIMAL                 | CROSS JOIN, LEFT SEMI JOIN
CHAR                    | Windowing functions (OVER, RANK, etc.)
VARCHAR                 | Sub-queries for IN/NOT IN, HAVING
DATE                    | EXISTS / NOT EXISTS
                        | INTERSECT, EXCEPT
Hive Architecture
Issuing Commands | Using the Hive CLI, a web interface, or a Hive JDBC/ODBC client, a Hive query is submitted to the HiveServer.
Hive Query Plan | The Hive query is compiled, optimized, and planned as a MapReduce job.
MapReduce Job Executes | The corresponding MapReduce job is executed on the Hadoop cluster.
HiveQL
Hive queries are written in HiveQL, a SQL-like scripting language that simplifies the creation of MapReduce jobs. With HiveQL, data analysts can focus on answering questions about the data, and let the Hive framework convert the HiveQL into a MapReduce job.
Hive User-Defined Functions
Hive has three different types of User-Defined Functions:
UDF | A single row is input, and a single row is output.
UDAF (User-Defined Aggregate Function) | Multiple rows are input, and a single row is output.
UDTF (User-Defined Table-generating Function) | A single row is input, and multiple rows (i.e. a table) are output.
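The three cardinalities can be illustrated in plain Java. This is only a conceptual sketch (real Hive UDFs extend the Hive API classes described below); the class and method names here are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class UdfCardinality {
    // UDF-style: one input row -> one output row
    static String upper(String row) {
        return row.toUpperCase();
    }

    // UDAF-style: many input rows -> one output row (an aggregate)
    static int sum(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // UDTF-style: one input row -> many output rows (a "table")
    static List<String> explode(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));               // HIVE
        System.out.println(sum(Arrays.asList(1, 2, 3))); // 6
        System.out.println(explode("a,b,c"));            // [a, b, c]
    }
}
```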
The Hive API contains parent classes for writing each type of User-Defined Function: the UDF class for UDF functions, the UDAF class for UDAF functions, and the GenericUDTF class for writing UDTF functions.
Writing a Hive UDF
Extend the org.apache.hadoop.hive.ql.exec.UDF class:
package hiveudfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

public class ComputeShipping extends UDF {
    public static int originZip = 11344;
    public static double multiplier = 0.00045;
    DoubleWritable shippingAmt = new DoubleWritable();

    public DoubleWritable evaluate(int zip, double weight) {
        long distance = Math.abs(originZip - zip);
        double amt = (distance * multiplier) + weight;
        shippingAmt.set(amt);
        return shippingAmt;
    }
}
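The arithmetic inside evaluate can be sanity-checked outside of Hive with plain Java. The class below is a hypothetical stand-in that mirrors the formula without the Hadoop DoubleWritable wrapper, for illustration only:

```java
public class ShippingCheck {
    static final int ORIGIN_ZIP = 11344;
    static final double MULTIPLIER = 0.00045;

    // Mirrors ComputeShipping.evaluate: distance-scaled surcharge plus the weight
    static double computeShipping(int zip, double weight) {
        long distance = Math.abs(ORIGIN_ZIP - zip);
        return (distance * MULTIPLIER) + weight;
    }

    public static void main(String[] args) {
        // zip 11346 is 2 away from the origin, so the result is 2 * 0.00045 + 1.0
        System.out.println(computeShipping(11346, 1.0));
    }
}
```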
Invoking a Hive UDF
To invoke a UDF from within a Hive script:
· Register the JAR file that contains the UDF class, and
· Define an alias for the function using the CREATE TEMPORARY FUNCTION command.

ADD JAR /myapp/lib/myhiveudfs.jar;
CREATE TEMPORARY FUNCTION ComputeShipping
  AS 'hiveudfs.ComputeShipping';

FROM orders
SELECT address,
       description,
       ComputeShipping(zip, weight);
Overview of GenericUDF
The GenericUDF class provides more features and benefits than UDF, including:
· The arguments passed in to a GenericUDF can be complex types, including non-writable types like struct, map and array.
· The return value can also be a complex type.
· A variable number of arguments can be passed in.
· A GenericUDF can perform operations that a UDF cannot support.
· Better performance, due to lazy evaluation and short-circuiting.
The GenericUDF class declares three abstract methods:
ObjectInspector initialize(ObjectInspector[] arguments) | The ObjectInspector (OI) instances represent the arguments for the function. The initialize method is invoked once per GenericUDF instance and allows you to validate the arguments passed in.
Object evaluate(DeferredObject[] arguments) | Similar to UDF, this method gets passed the arguments and returns the result of the function call.
String getDisplayString(String[] children) | Returns a string that gets displayed by the EXPLAIN command.
Example of a GenericUDF

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] arg0) {
        return "arrayContainsExample()";
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException(
                "method takes 2 arguments: List<T>, T");
        }
        // Verify we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) ||
            !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // Verify that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list of strings");
        }
        // the return type of our function is a boolean
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // get the list and string from the deferred objects
        // using the object inspectors
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // check for nulls
        if ((list == null) || (arg == null)) {
            return null;
        }
        // see if our list contains the value we need
        for (String s : list) {
            if (arg.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
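Stripped of the ObjectInspector plumbing, the evaluate logic above is a null-safe "does the list contain the element" check. The hypothetical class below mirrors just that semantics in plain Java, for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class ArrayContainsCheck {
    // Mirrors ComplexUDFExample.evaluate: null in, null out; otherwise membership test
    static Boolean arrayContains(List<String> list, String arg) {
        if (list == null || arg == null) {
            return null; // SQL-style semantics: a NULL input yields NULL
        }
        return list.contains(arg); // equivalent to the explicit loop in evaluate
    }

    public static void main(String[] args) {
        System.out.println(arrayContains(Arrays.asList("a", "b"), "b")); // true
        System.out.println(arrayContains(null, "b"));                    // null
    }
}
```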
Overview of HCatalog
HCatalog, a metadata and table management system, helps enable schema-on-read in Hadoop. HCatalog has the following features:
· Makes the Hive metastore available to users of other tools on Hadoop.
· Provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive's warehouse.
· Allows users to share data and metadata across Hive, Pig, and MapReduce.
· Provides a relational view, through a SQL-like language (HiveQL), to data within Hadoop.
· Allows users to write their applications without being concerned how or where the data is stored.
· Insulates users from schema and storage format changes.
HCatalog in the Ecosystem
HCatalog provides table abstraction, which hides details about the data such as:
· How the data is stored
· Where the data resides on the filesystem
· What format the data is in
· What the schema of the data is
HCatInputFormat and HCatOutputFormat
For Java MapReduce applications, the HCatInputFormat and HCatOutputFormat classes can be used to read and write data using HCatalog schemas and data types.
Here is an example of a Mapper that uses HCatInputFormat:

public static class Map
        extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    String name;
    int age;
    double gpa;

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        name = (String) value.get(0);
        age = (Integer) value.get(1);
        gpa = (Double) value.get(2);
        context.write(new Text(name), new IntWritable(age));
    }
}
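The field extraction in the Mapper can be exercised without a cluster: HCatRecord.get(i) returns the i-th column as an Object in schema order, so a plain List<Object> stands in for the record here. This is a hypothetical illustration, not HCatalog API:

```java
import java.util.Arrays;
import java.util.List;

public class RecordExtraction {
    // Stand-in for HCatRecord.get(i): positional, schema-ordered column access
    static String describe(List<Object> record) {
        String name = (String) record.get(0); // column 0: name (string)
        int age = (Integer) record.get(1);    // column 1: age (int)
        double gpa = (Double) record.get(2);  // column 2: gpa (double)
        return name + "/" + age + "/" + gpa;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("alice", 21, 3.8);
        System.out.println(describe(row)); // alice/21/3.8
    }
}
```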
The Job configuration looks like:
String principalID = System.getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
if (principalID != null) {
    conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);
}
Job job = Job.getInstance(conf, "SimpleRead");
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, tableName, null));
job.setInputFormatClass(HCatInputFormat.class);