HDPCD-Java Review Notes (19)
Hive
Apache Hive maintains metadata information about tables in a metastore. A Hive table consists of:
· A schema stored in the metastore
· Data stored on HDFS
HiveQL
Hive converts HiveQL commands into MapReduce jobs.
Hive and Pig
Pig is designed to move and restructure data.
Pig -- is a good choice for ETL jobs, where unstructured data is reformatted so that it is easier to define a structure for it.
Hive -- is a good choice when you want to query data that has a certain known structure.
Comparing Hive to SQL
SQL Datatypes           | SQL Semantics
INT                     | SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT | Expressions in WHERE and HAVING
BOOLEAN                 | GROUP BY, ORDER BY, SORT BY
FLOAT                   | CLUSTER BY, DISTRIBUTE BY
DOUBLE                  | Sub-queries in FROM clause
STRING                  | ROLLUP
BINARY                  | CUBE
TIMESTAMP               | UNION
ARRAY, MAP, STRUCT      | LEFT, RIGHT and FULL INNER/OUTER JOIN
DECIMAL                 | CROSS JOIN, LEFT SEMI JOIN
CHAR                    | Windowing functions (OVER, RANK, etc.)
VARCHAR                 | Sub-queries for IN/NOT IN, HAVING
DATE                    | EXISTS / NOT EXISTS
                        | INTERSECT, EXCEPT
Hive Architecture
Issuing Commands | Using the Hive CLI, a web interface, or a Hive JDBC/ODBC client, a Hive query is submitted to the HiveServer.
Hive Query Plan | The Hive query is compiled, optimized, and planned as a MapReduce job.
MapReduce Job Executes | The corresponding MapReduce job is executed on the Hadoop cluster.
HiveQL
Hive queries are written in HiveQL, a SQL-like scripting language that simplifies the creation of MapReduce jobs. With HiveQL, data analysts can focus on answering questions about the data, and let the Hive framework convert the HiveQL into a MapReduce job.
Hive User-Defined Functions
Hive has three different types of User-Defined Functions:
UDF | A single row is input, and a single row is output.
UDAF (User-Defined Aggregate Function) | Multiple rows are input, and a single row is output.
UDTF (User-Defined Table-generating Function) | A single row is input, and multiple rows (i.e. a table) are output.
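The three cardinalities can be illustrated in plain Java. This is only a conceptual sketch (real Hive UDFs extend the Hive API classes described below); the class and method names here are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class UdfCardinality {
    // UDF-style: one input row -> one output row
    static String upper(String row) {
        return row.toUpperCase();
    }

    // UDAF-style: many input rows -> one output row (an aggregate)
    static int sum(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // UDTF-style: one input row -> many output rows (a "table")
    static List<String> explode(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));               // HIVE
        System.out.println(sum(Arrays.asList(1, 2, 3))); // 6
        System.out.println(explode("a,b,c"));            // [a, b, c]
    }
}
```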
The Hive API contains parent classes for writing each type of User-Defined Function: the UDF class for UDF functions, the UDAF class for UDAF functions, and the GenericUDTF class for writing UDTF functions.
Writing a Hive UDF
Extend the org.apache.hadoop.hive.ql.exec.UDF class:
package hiveudfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

public class ComputeShipping extends UDF {
    public static int originZip = 11344;
    public static double multiplier = 0.00045;
    DoubleWritable shippingAmt = new DoubleWritable();

    public DoubleWritable evaluate(int zip, double weight) {
        long distance = Math.abs(originZip - zip);
        double amt = (distance * multiplier) + weight;
        shippingAmt.set(amt);
        return shippingAmt;
    }
}
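The arithmetic inside evaluate can be sanity-checked outside of Hive with plain Java. The class below is a hypothetical stand-in that mirrors the formula without the Hadoop DoubleWritable wrapper, for illustration only:

```java
public class ShippingCheck {
    static final int ORIGIN_ZIP = 11344;
    static final double MULTIPLIER = 0.00045;

    // Mirrors ComputeShipping.evaluate: distance-scaled surcharge plus the weight
    static double computeShipping(int zip, double weight) {
        long distance = Math.abs(ORIGIN_ZIP - zip);
        return (distance * MULTIPLIER) + weight;
    }

    public static void main(String[] args) {
        // zip 11346 is 2 away from the origin, so the result is 2 * 0.00045 + 1.0
        System.out.println(computeShipping(11346, 1.0));
    }
}
```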
Invoking a Hive UDF
To invoke a UDF from within a Hive script:
· Register the JAR file that contains the UDF class, and
· Define an alias for the function using the CREATE TEMPORARY FUNCTION command.

ADD JAR /myapp/lib/myhiveudfs.jar;
CREATE TEMPORARY FUNCTION ComputeShipping
  AS 'hiveudfs.ComputeShipping';

FROM orders
SELECT address,
       description,
       ComputeShipping(zip, weight);
Overview of GenericUDF
The GenericUDF class provides more features and benefits than UDF, including:
· The arguments passed in to a GenericUDF can be complex types, including non-writable types like struct, map and array.
· The return value can also be a complex type.
· A variable number of arguments can be passed in.
· A GenericUDF can perform operations that a UDF cannot support.
· Better performance, due to lazy evaluation and short-circuiting.
The GenericUDF class declares three abstract methods:
ObjectInspector initialize(ObjectInspector[] arguments) | The ObjectInspector (OI) instances represent the arguments for the function. The initialize method is invoked once per GenericUDF instance and allows you to validate the arguments passed in.
Object evaluate(DeferredObject[] arguments) | Similar to UDF, this method gets passed the arguments and returns the result of the function call.
String getDisplayString(String[] children) | Returns a string that gets displayed by the EXPLAIN command.
Example of a GenericUDF

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] arg0) {
        return "arrayContainsExample()";
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException(
                "method takes 2 arguments: List<T>, T");
        }
        // Verify we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) ||
            !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // Verify that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException(
                "first argument must be a list of strings");
        }
        // the return type of our function is a boolean
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // get the list and string from the deferred objects
        // using the object inspectors
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // check for nulls
        if ((list == null) || (arg == null)) {
            return null;
        }
        // see if our list contains the value we need
        for (String s : list) {
            if (arg.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
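Stripped of the ObjectInspector plumbing, the evaluate logic above is a null-safe "does the list contain the element" check. The hypothetical class below mirrors just that semantics in plain Java, for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class ArrayContainsCheck {
    // Mirrors ComplexUDFExample.evaluate: null in, null out; otherwise membership test
    static Boolean arrayContains(List<String> list, String arg) {
        if (list == null || arg == null) {
            return null; // SQL-style semantics: a NULL input yields NULL
        }
        return list.contains(arg); // equivalent to the explicit loop in evaluate
    }

    public static void main(String[] args) {
        System.out.println(arrayContains(Arrays.asList("a", "b"), "b")); // true
        System.out.println(arrayContains(null, "b"));                    // null
    }
}
```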
Overview of HCatalog
HCatalog, a metadata and table management system, helps enable schema-on-read in Hadoop. HCatalog has the following features:
· Makes the Hive metastore available to users of other tools on Hadoop.
· Provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive's warehouse.
· Allows users to share data and metadata across Hive, Pig, and MapReduce.
· Provides a relational view, through a SQL-like language (HiveQL), to data within Hadoop.
· Allows users to write their applications without being concerned how or where the data is stored.
· Insulates users from schema and storage format changes.
HCatalog in the Ecosystem
HCatalog provides table abstraction, which hides details about the data such as:
· How the data is stored
· Where the data resides on the filesystem
· What format the data is in
· What the schema of the data is
HCatInputFormat and HCatOutputFormat
For Java MapReduce applications, the HCatInputFormat and HCatOutputFormat classes can be used to read and write data using HCatalog schemas and data types.
Here is an example of a Mapper that uses HCatInputFormat:

public static class Map
        extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    String name;
    int age;
    double gpa;

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        name = (String) value.get(0);
        age = (Integer) value.get(1);
        gpa = (Double) value.get(2);
        context.write(new Text(name), new IntWritable(age));
    }
}
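The field extraction in the Mapper can be exercised without a cluster: HCatRecord.get(i) returns the i-th column as an Object in schema order, so a plain List<Object> stands in for the record here. This is a hypothetical illustration, not HCatalog API:

```java
import java.util.Arrays;
import java.util.List;

public class RecordExtraction {
    // Stand-in for HCatRecord.get(i): positional, schema-ordered column access
    static String describe(List<Object> record) {
        String name = (String) record.get(0); // column 0: name (string)
        int age = (Integer) record.get(1);    // column 1: age (int)
        double gpa = (Double) record.get(2);  // column 2: gpa (double)
        return name + "/" + age + "/" + gpa;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("alice", 21, 3.8);
        System.out.println(describe(row)); // alice/21/3.8
    }
}
```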
The Job configuration looks like:
String principalID = System.getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
if (principalID != null) {
    conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);
}
Job job = Job.getInstance(conf, "SimpleRead");
HCatInputFormat.setInput(job, InputJobInfo.create(dbName, tableName, null));
job.setInputFormatClass(HCatInputFormat.class);