[Hive] Basic usage of the CLI

Hive can manage resources added to a session when those resources need to be available at query execution time. Any locally accessible file can be added to the session. Once a file is added, a Hive query can refer to it by name (in map/reduce/transform clauses), and the file is available locally at execution time on the entire Hadoop cluster. Hive uses Hadoop's Distributed Cache to distribute the added files to all machines in the cluster at query execution time.

ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath1> [<filepath2>]*
LIST { FILE[S] | JAR[S] | ARCHIVE[S] } [<filepath1> <filepath2> ..]
DELETE { FILE[S] | JAR[S] | ARCHIVE[S] } [<filepath1> <filepath2> ..]

  • FILE resources are just added to the distributed cache. Typically, this might be something like a transform script to be executed.
  • JAR resources are also added to the Java classpath. This is required in order to reference the objects they contain, such as UDFs.
  • ARCHIVE resources are automatically unarchived as part of distributing them.


hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> from networks a MAP a.networkid USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
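
The contents of tt.py are not shown in the source. As a rough sketch only: a transform script receives each input row on standard input as one line of tab-separated columns and writes its output rows to standard output in the same form. The `transform` function and the `net_` prefix below are hypothetical and purely illustrative; a real script would apply its own logic.

```python
#!/usr/bin/env python
# Hypothetical tt.py: a minimal transform script sketch.
# Hive streams each input row to stdin as one tab-separated line
# and reads output rows from stdout, one row per line.
import sys


def transform(line):
    """Map one input row to one output row (illustrative logic only)."""
    fields = line.rstrip("\n").split("\t")
    networkid = fields[0]  # the only column passed by MAP a.networkid
    # Assumption: prefix the id so the effect of the script is visible.
    return "net_" + networkid


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Process rows one at a time so memory use stays constant.
    for line in stdin:
        stdout.write(transform(line) + "\n")


if __name__ == "__main__":
    main()
```

The script can be tried locally before adding it to a session, for example by piping a few sample ids into `python tt.py`, since it depends only on standard input and output.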

It is not necessary to add files to the session if the files used in a transform script are already available on all machines in the Hadoop cluster under the same path name. For example:

  • ... MAP a.networkid USING 'wc -l' ...: here wc is an executable available on all machines
  • ... MAP a.networkid USING '/home/nfsserv1/hadoopscripts/tt.py' ...: here tt.py may be accessible via an NFS mount point that is configured identically on all the cluster nodes.


