A Hive interview question
阿新 • Published: 2019-02-02
Original post for this interview question: http://blog.csdn.net/zolalad/article/details/10819749#
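Approach: number each user's visits by user ID, then use the visit number to work out fromurl and tourl.
The main difficulty is computing each user's visit number. The method in the original post looked somewhat complicated, so I wrote a simpler one: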
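package udf; // matches the class name "udf.UdfTest" used in CREATE TEMPORARY FUNCTION

import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;

public class UdfTest extends UDF {
    // user ID -> number of rows seen so far by this UDF instance
    HashMap<Integer, Integer> hm = new HashMap<Integer, Integer>();

    public int evaluate(int id) {
        Integer count = hm.get(id);
        if (count == null) {
            count = 0;
        }
        count++;
        hm.put(id, count);
        return count;
    }
}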
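The UDF stores each user's ID and visit count in the map and returns the updated count. Because the map lives inside a single UDF instance, the numbering is only meaningful when all of a user's rows pass through the same instance in click order.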
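To make the counting behaviour concrete, here is a minimal sketch of the same bookkeeping in plain Java, without the Hive dependency. The class name VisitCounterDemo and the sample IDs are hypothetical and only serve as an illustration; this is not the UDF itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VisitCounterDemo {
    // Same bookkeeping as the UDF: user ID -> number of rows seen so far.
    private final Map<Integer, Integer> counts = new HashMap<>();

    // Returns the 1-based visit number of this row for the given user.
    public int next(int userId) {
        return counts.merge(userId, 1, Integer::sum);
    }

    // Numbers a whole sequence of user IDs in arrival order.
    public static List<Integer> number(int... userIds) {
        VisitCounterDemo counter = new VisitCounterDemo();
        List<Integer> result = new ArrayList<>();
        for (int id : userIds) {
            result.add(counter.next(id));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical arrival order of user IDs.
        System.out.println(number(1, 2, 1, 1, 2)); // prints [1, 1, 2, 3, 2]
    }
}
```

Each user gets an independent running count, which is what the query later relies on as column n.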
Package the class into a jar and upload it, then run the following in Hive: add jar /usr/local/udf.jar , CREATE TEMPORARY FUNCTION num AS "udf.UdfTest";
SELECT t1.platform, t1.user_id, t1.n, t2.click_url FROM_URL, t1.click_url TO_URL
FROM (select *, num(USER_ID) n from trlog) t1
LEFT OUTER JOIN (select *, num(USER_ID) n from trlog) t2
ON t1.user_id = t2.user_id AND t1.n = t2.n+1;
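Note: the join pairs visit n in t1 with visit n-1 in t2, hence the condition t1.n = t2.n+1. For a user's first visit (t1.n = 1) no row in t2 satisfies the condition, so fromurl is null.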
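The effect of the join condition can be sketched in plain Java. This is only an illustration under stated assumptions: the class name FromToJoinDemo, the "user#n" lookup key and the sample rows are hypothetical, and the rows are assumed to arrive already in click order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FromToJoinDemo {
    // Each input row is {user_id, click_url}, already in click order.
    // Each output row is {user_id, n, from_url, to_url}.
    public static List<String[]> pair(String[][] rows) {
        Map<String, Integer> counts = new HashMap<>();    // user_id -> visits so far
        Map<String, String> urlByVisit = new HashMap<>(); // "user_id#n" -> click_url (plays the role of t2)
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String user = row[0];
            String url = row[1];
            int n = counts.merge(user, 1, Integer::sum);
            urlByVisit.put(user + "#" + n, url);
            // Join condition t1.n = t2.n + 1: look up the same user's visit n - 1.
            // For n = 1 nothing matches, so from_url stays null, like the LEFT OUTER JOIN.
            String fromUrl = urlByVisit.get(user + "#" + (n - 1));
            out.add(new String[] {user, String.valueOf(n), fromUrl, url});
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample clicks.
        String[][] rows = { {"u1", "/a"}, {"u2", "/x"}, {"u1", "/b"}, {"u1", "/c"} };
        for (String[] r : pair(rows)) {
            System.out.println(String.join(" | ", r[0], r[1], String.valueOf(r[2]), r[3]));
        }
    }
}
```

With these sample rows, the first click of each user has no from_url, and every later click takes the previous click of the same user as its from_url.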
Running the join query initially failed with java.io.FileNotFoundException(File does not exist/usr/local/......). After I manually uploaded the jar to HDFS, the query ran successfully.
Note:
I recently found that Hive's analytic functions alone are enough for this: ROW_NUMBER + LAG
select
platform,
user_id,
CLICK_TIME,
ROW_NUMBER() OVER(PARTITION BY platform,user_id order by CLICK_TIME) AS rn,
lag(click_url,1) over(partition by platform,user_id order by CLICK_TIME) as from_url,
click_url from trlog;
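For readers less familiar with window functions, the following plain-Java sketch mimics what ROW_NUMBER and LAG compute over each (platform, user_id) partition. It is a rough illustration, not how Hive executes the query. The class name LagDemo and the sample rows are hypothetical, and click_time is assumed to be a string that sorts chronologically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LagDemo {
    // Each input row is {platform, user_id, click_time, click_url}.
    // Each output row is {platform, user_id, click_time, rn, from_url, click_url}.
    public static List<String[]> withLag(String[][] rows) {
        String[][] sorted = rows.clone();
        // Bring each partition together and order its rows by click_time.
        Arrays.sort(sorted, (a, b) -> {
            int c = a[0].compareTo(b[0]);
            if (c == 0) c = a[1].compareTo(b[1]);
            if (c == 0) c = a[2].compareTo(b[2]);
            return c;
        });
        List<String[]> out = new ArrayList<>();
        String prevKey = null;
        String prevUrl = null;
        int rn = 0;
        for (String[] r : sorted) {
            String key = r[0] + "\t" + r[1];
            if (!key.equals(prevKey)) {
                // New partition: restart numbering, and there is no previous row.
                rn = 0;
                prevUrl = null;
                prevKey = key;
            }
            rn++;
            out.add(new String[] {r[0], r[1], r[2], String.valueOf(rn), prevUrl, r[3]});
            prevUrl = r[3]; // becomes from_url of the next row in this partition
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample clicks, deliberately out of order.
        String[][] rows = {
            {"web", "u1", "2019-01-01 10:05", "/b"},
            {"web", "u1", "2019-01-01 10:00", "/a"},
            {"app", "u1", "2019-01-01 09:00", "/x"}
        };
        for (String[] r : withLag(rows)) {
            System.out.println(String.join(" | ", r[0], r[1], r[2], r[3], String.valueOf(r[4]), r[5]));
        }
    }
}
```

Unlike the UDF approach, the ordering here is explicit, so the result does not depend on the order in which rows happen to be read.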