A Hive example: renumbering rows within consecutive blocks

Requirement

Source data

year tag
2014 1
2015 1
2016 0
2017 0
2018 0
2020 1
2021 1
2022 1

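Expected result

year tag js_id
2014 1 1
2015 1 2
2016 0 1
2017 0 2
2018 0 3
2020 1 1
2021 1 2
2022 1 3

Note: the rows are ordered by column 1 (year). Column 3 counts the rows within each consecutive run of equal values in column 2 (tag): whenever tag changes from one row to the next, the count restarts at 1.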

Data preparation

create table temp_block_sort_0309
(
    year int,
    tag  int
) stored as orc
    tblproperties ('orc.compress' = 'snappy');
insert into temp_block_sort_0309
values (2014, 1),
       (2015, 1),
       (2016, 0),
       (2017, 0),
       (2018, 0),
       (2020, 1),
       (2021, 1),
       (2022, 1);
select * from
temp_block_sort_0309;

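Solution

The problem can be solved in three steps.

1. Compare tag in the current row with tag in the previous row (ordered by year). If they differ, set flag to 1; if they are equal, set it to 0.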

select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
from temp_block_sort_0309;

year tag flag
2014 1 0
2015 1 0
2016 0 1
2017 0 0
2018 0 0
2020 1 1
2021 1 0
2022 1 0
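
One detail worth spelling out: the first row has no previous row, so lag(tag, 1) returns NULL there. The comparison NULL != tag is not true, and the case expression falls through to else 0. The sketch below makes that branch explicit. The prev_tag column is a hypothetical helper added for illustration and is not part of the original query.

```sql
-- Sketch only: same flag as step 1, with the NULL case written out.
-- prev_tag is an illustrative helper column (an assumption, not in the original).
select year,
       tag,
       prev_tag,
       case
           when prev_tag is null then 0  -- first row: there is no previous row
           when prev_tag != tag then 1   -- tag changed, so a new block starts here
           else 0                        -- same tag as the previous row
       end as flag
from (select year,
             tag,
             lag(tag, 1) over (order by year) as prev_tag
      from temp_block_sort_0309) t;
```

The flag values should be identical to those in the table above; only the intent is easier to read.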

2. Compute a running sum of the flag column; its value serves as the bucket id (tong_id).

select *, sum(flag) over (order by year) tong_id
from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
      from temp_block_sort_0309) t;

year tag flag tong_id
2014 1 0 0
2015 1 0 0
2016 0 1 1
2017 0 0 1
2018 0 0 1
2020 1 1 2
2021 1 0 2
2022 1 0 2
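
The running sum relies on the default window frame. In Hive, a window with order by and no frame clause defaults to range between unbounded preceding and current row, which behaves like a cumulative sum here because year is unique. A hedged variant that states the frame explicitly, assuming you want row-by-row accumulation even if the ordering column had ties:

```sql
-- Sketch only: step 2 with the window frame spelled out.
-- With unique year values this should return the same tong_id as the query above.
select year,
       tag,
       flag,
       sum(flag) over (order by year
                       rows between unbounded preceding and current row) as tong_id
from (select year,
             tag,
             case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
      from temp_block_sort_0309) t;
```

With a rows frame, each row adds only the flags up to and including itself; with the default range frame, rows that tie on the ordering column would all receive the same total.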

3. Partition by the running sum (the bucket id) and renumber the rows within each bucket.

select year, tag, row_number() over (partition by tong_id order by year) js_id, flag
from (select *, sum(flag) over (order by year) tong_id
      from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
            from temp_block_sort_0309) t) t
order by year;

year tag js_id flag
2014 1 1 0
2015 1 2 0
2016 0 1 1
2017 0 2 0
2018 0 3 0
2020 1 1 1
2021 1 2 0
2022 1 3 0
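
The three steps can also be written as common table expressions, which some readers find easier to follow than nested subqueries. This is a sketch of the same logic, not a different method. The CTE names flagged and bucketed are illustrative assumptions, and the final select keeps only the three columns of the expected result.

```sql
-- Sketch only: the same three steps expressed with CTEs.
-- The names "flagged" and "bucketed" are hypothetical.
with flagged as (
    -- Step 1: mark the rows where tag differs from the previous row
    select year,
           tag,
           case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
    from temp_block_sort_0309
),
bucketed as (
    -- Step 2: the running sum of flag gives one id per consecutive block
    select year,
           tag,
           sum(flag) over (order by year) as tong_id
    from flagged
)
-- Step 3: restart the counter inside each block
select year,
       tag,
       row_number() over (partition by tong_id order by year) as js_id
from bucketed
order by year;
```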