Notes on the same SQL producing different results in Hive and Spark SQL
阿新 • Published: 2018-12-17
Table schemas
hive> desc gdm.dim_category;
name string -- category name
org_code string -- category code
hive> select name, org_code from gdm.dim_category limit 2;
OK
鞋 _8_
鞋/男 _8_21_
hive> desc gdm.dim_product_brand;
brand_id bigint -- brand ID
ch_name string -- brand name (Chinese)
hive> select brand_id, ch_name from gdm.dim_product_brand limit 2;
OK
1 nb
2 np
The SQL to run
select
t1.keyword,
t3.name,
t4.ch_name
from
(
select "categoryIds:_8_" as keyword
union all
select "categoryIds:_8_21_" as keyword
union all
select "brandId:1" as keyword
) t1
left join gdm.dim_category t3
on split(t1.keyword, ":")[1] = t3.org_code and split(t1.keyword, ":")[0] = "categoryIds"
left join gdm.dim_product_brand t4
on split(t1.keyword, ":")[1] = t4.brand_id and split(t1.keyword, ":")[0] = "brandId"
Result in Hive (wrong)
categoryIds:_8_ NULL NULL
categoryIds:_8_21_ NULL NULL
brandId:1 NULL nb
Result in Spark SQL (correct)
categoryIds:_8_ 鞋 NULL
categoryIds:_8_21_ 鞋/男 NULL
brandId:1 NULL nb
Cause
The brand_id column of gdm.dim_product_brand is of type bigint, so in the brand join condition `split(t1.keyword, ":")[1] = t4.brand_id` Hive resolves the string-vs-bigint comparison by implicitly casting both sides to double. Because both left joins use the same join key expression `split(t1.keyword, ":")[1]`, Hive evaluates the key as double for the category join as well. A value like "_8_" cannot be parsed as a double (the cast yields NULL), so `split(t1.keyword, ":")[1] = t3.org_code` never matches and name comes back NULL.
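This coercion is easy to confirm on its own (a quick sanity check in a Hive shell; behavior follows Hive's implicit type-conversion rules, where string compared with bigint is resolved as double):

```sql
-- Both sides of a string/bigint comparison are coerced to double:
select '1' = 1;               -- matches: cast('1' as double) = cast(1 as double)
select cast('_8_' as double); -- NULL: '_8_' is not a valid double literal
select '_8_' = 8;             -- does not match, because the left side casts to NULL
```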
Fix
Cast brand_id to string so the comparison stays string-to-string and no double coercion is triggered:
split(t1.keyword, ":")[1] = t4.brand_id --> split(t1.keyword, ":")[1] = cast(t4.brand_id as string)
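For reference, the original query with this one change applied (only the brand join condition differs) should then return the same, correct results in both Hive and Spark SQL:

```sql
select
  t1.keyword,
  t3.name,
  t4.ch_name
from
(
  select "categoryIds:_8_" as keyword
  union all
  select "categoryIds:_8_21_" as keyword
  union all
  select "brandId:1" as keyword
) t1
left join gdm.dim_category t3
  on split(t1.keyword, ":")[1] = t3.org_code
  and split(t1.keyword, ":")[0] = "categoryIds"
left join gdm.dim_product_brand t4
  on split(t1.keyword, ":")[1] = cast(t4.brand_id as string)
  and split(t1.keyword, ":")[0] = "brandId"
```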