Notes on the same SQL returning different results in Hive and spark-sql

Table schemas

hive> desc gdm.dim_category;                                
name                    string         category name
org_code                string         category code

hive> select name, org_code from gdm.dim_category limit 2;
OK
鞋      _8_
鞋/男    _8_21_

hive> desc gdm.dim_product_brand;
brand_id                bigint                  brand ID
ch_name                 string                  brand name (Chinese)

hive> select brand_id, ch_name from gdm.dim_product_brand limit 2;
OK
1       nb
2       np               

SQL to run

select
  t1.keyword,
  t3.name,
  t4.ch_name
from
(
  select "categoryIds:_8_" as keyword
  union all
  select "categoryIds:_8_21_" as keyword
  union all
  select "brandId:1" as keyword
) t1
left join gdm.dim_category t3
on split(t1.keyword, ":")[1] = t3.org_code and split(t1.keyword, ":")[0] = "categoryIds"
left join gdm.dim_product_brand t4
on split(t1.keyword, ":")[1] = t4.brand_id and split(t1.keyword, ":")[0] = "brandId"

Result in Hive (wrong)

categoryIds:_8_	NULL	NULL
categoryIds:_8_21_	NULL	NULL
brandId:1	NULL	nb

Result in spark-sql (correct)

categoryIds:_8_	鞋	NULL
categoryIds:_8_21_	鞋/男	NULL
brandId:1	NULL	nb

Cause

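The brand_id column of gdm.dim_product_brand is of type bigint, so to evaluate the join condition split(t1.keyword, ":")[1] = t4.brand_id, Hive converts the keyword part to double.

The condition split(t1.keyword, ":")[1] = t3.org_code is then matched on that converted value as well. A code such as "_8_" is not a valid number, so it becomes NULL and matches nothing, which is why name comes back as NULL in the Hive result.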
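
One way to check this explanation is to look at what the conversion does to each kind of keyword. The query below is a small sketch that is not part of the original run; it only assumes that the Hive version in use accepts a select without a from clause, as the query above already does.

```sql
-- Cast the part after ":" to double, which is the conversion Hive is
-- believed to apply when a string is compared with a bigint column.
select
  cast(split("categoryIds:_8_", ":")[1] as double) as category_key,  -- "_8_" is not numeric
  cast(split("brandId:1", ":")[1] as double)       as brand_key;     -- "1" is numeric
```

If category_key comes back as NULL while brand_key comes back as a number, that is consistent with the cause described here.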

Solution

split(t1.keyword, ":")[1] = t4.brand_id  -->  split(t1.keyword, ":")[1] = cast(t4.brand_id as string)
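
Applied to the full statement, the fix might look like the sketch below. It is the original query with only the brand join condition changed; it has not been re-run against the tables in this write-up, so treat it as an illustration of the change rather than a verified result.

```sql
select
  t1.keyword,
  t3.name,
  t4.ch_name
from
(
  select "categoryIds:_8_" as keyword
  union all
  select "categoryIds:_8_21_" as keyword
  union all
  select "brandId:1" as keyword
) t1
left join gdm.dim_category t3
  on split(t1.keyword, ":")[1] = t3.org_code
  and split(t1.keyword, ":")[0] = "categoryIds"
left join gdm.dim_product_brand t4
  -- compare as strings so the join key is not converted to a number
  on split(t1.keyword, ":")[1] = cast(t4.brand_id as string)
  and split(t1.keyword, ":")[0] = "brandId"
```

With both sides of the brand comparison being strings, the category codes should no longer go through a numeric conversion, and the explicit cast makes the intended comparison type visible in the query itself.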