Tianchi Newcomer Competition [Offline Round]: An Attempt (Part 1)

阿新 • Published: 2019-01-12

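I won't paste the problem statement here (https://tianchi.aliyun.com/getStart). After some searching on Baidu, the problem can be reduced to: does a given U-I (user-item) pair have a purchase on the observation day? (A binary classification problem.)

The whole process breaks down into the following steps: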

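I. Quick analysis

Load the two data tables, tianchi_fresh_comp_train_item and tianchi_fresh_comp_train_user, into the database,

under the table names vipfin.tianchi_fresh_comp_train_item and vipfin.tianchi_fresh_comp_train_user.

Then check how strongly a user's actions on one day (browse, favorite, add to cart) affect purchases on the following day.

The reference blog post https://blog.csdn.net/snoopy_yuan/article/details/72850601 submitted the U-I pairs that were added to the cart on the previous day and had not been purchased by the following day. Let's do a quick check of how viable that is.

First, cart additions that were not purchased, taking 11.18 as an example: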

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      -- behavior_type = 3: added to cart on 11.18
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
left join
     (select * from vipfin.tianchi_fresh_comp_train_user
      -- behavior_type = 4: purchased on or after 11.18
      where substr(time,1,10)>='2014-11-18' and behavior_type=4) b
on a.user_id=b.user_id
and a.item_id=b.item_id
where b.user_id is null

Result: 14998

Next, pairs that were added to the cart on 11.18 and purchased on 11.19:

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      -- added to cart on 11.18
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
inner join
     (select * from vipfin.tianchi_fresh_comp_train_user
      -- purchased on 11.19
      where substr(time,1,10)='2014-11-19' and behavior_type=4) b
on a.user_id=b.user_id
and a.item_id=b.item_id

Result: 614

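So the probability that a cart addition on one day turns into a purchase the next day is about 4% (614 converted versus 14998 not purchased).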
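The arithmetic behind that figure can be checked in a couple of lines. This is only a rough sketch: it assumes the two query results above can be treated as "converted" and "not converted" halves of the 11.18 cart additions, which is an approximation, since the two queries use different purchase date ranges and count joined rows rather than distinct pairs.

```python
# Counts taken from the two queries above.
not_purchased = 14998     # cart additions on 11.18 with no matching purchase
bought_next_day = 614     # cart additions on 11.18 matched to a purchase on 11.19

# Assumption: treat the two counts as a partition of the 11.18 cart additions.
rate = bought_next_day / (bought_next_day + not_purchased)
print(f"next-day conversion rate: {rate:.1%}")
```

Dividing by 14998 alone instead gives a slightly higher value, but both land at roughly 4%.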

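So directly submitting the pairs added to the cart on 12.18, as the blog post mentions, will clearly have a precision no higher than 5%.

If even previous-day cart additions give a precision of only about 5%, browsing and favoriting will convert at an even lower rate.

So predicting next-day purchases purely from the previous day's actions is not going to work.

Further statistics lead to the same conclusion: SQL alone cannot reach a high precision. Records that were added to the cart on one day and purchased the next account for less than 10% of all purchase records on that next day. So even if predictions based on previous-day cart additions were 100% precise, they would cover less than 10% of the next day's purchases.

All of this makes it even clearer that machine learning is needed.

The plan, then, is to extract from the daily records in tianchi_fresh_comp_train_user a set of features that measure user behavior, purchase behavior, and item properties, and use them as input to a machine learning model.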

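II. Data preprocessing

A few ideas:

1. The influence of a user's actions on purchasing fades over time. According to the analysis, actions from more than a week before the observation day have very little effect on whether a purchase happens that day, so only feature data from within one week of the observation day (the prediction day) is considered.

2. Purchase behavior is somewhat periodic. The training, validation, and prediction datasets are chosen as follows (excluding the Double 12 data):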

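Dataset	Input	Output
Training data	U-I behavior data for 11.22~11.27	U-I purchase records for 11.28
Validation data	U-I behavior data for 11.29~12.04	U-I purchase records for 12.05
Prediction data	U-I behavior data for 12.13~12.18	U-I purchase records for 12.19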

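Train the model on the training data and tune its parameters so that the loss function is minimized and the precision is reasonably high.

Then feed in the validation data and compare the predictions against the actual 12.05 data to check how well the model generalizes. If the validation results look good, use the prediction data directly to make the prediction.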
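The three date windows in the table follow one rule: six days of behavior data ending the day before the label day. A minimal sketch of that rule, where the helper name `behavior_window` is made up for illustration:

```python
from datetime import date, timedelta

def behavior_window(label_day, days=6):
    """Return (first_day, last_day) of the `days`-day window ending the day before label_day."""
    last_day = label_day - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    return first_day, last_day

# Label days taken from the table above.
splits = {
    "training": date(2014, 11, 28),
    "validation": date(2014, 12, 5),
    "prediction": date(2014, 12, 19),
}

for name, label_day in splits.items():
    first_day, last_day = behavior_window(label_day)
    print(name, first_day.isoformat(), "~", last_day.isoformat(), "->", label_day.isoformat())
```

Keeping the rule in one place makes it harder for the three datasets to drift apart when the window length is changed later.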

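3. For this business scenario, combine the user and item data to build features along various dimensions.

4. The problem has been framed as classifying whether a U-I pair has a purchase (label in {0, 1}), so every feature set has to be built at the U-I level. That raises the question of which U-I pairs to predict on. A Cartesian product (all users × all items) would be far too much data, so the first choice here is the set of U-I pairs that had some action within one period before the prediction day.

(This has its own problem: the input set is too small. It could be widened to U-I pairs whose item shares a category with an item the user acted on, or, more strictly, to pairs formed with the most frequently acted-on item in that category (the item most popular across all users). To be explored later.)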
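To make the size difference concrete, here is a small sketch of picking candidate U-I pairs from a behavior log instead of taking the Cartesian product. The rows, IDs, and the `candidate_pairs` helper are invented for illustration only.

```python
# Hypothetical toy log rows: (user_id, item_id, item_category, behavior_type, day)
log = [
    ("u1", "i1", "c1", 1, "2014-11-22"),
    ("u1", "i1", "c1", 3, "2014-11-27"),
    ("u2", "i2", "c1", 1, "2014-11-25"),
    ("u2", "i3", "c2", 1, "2014-11-10"),  # outside the window, so ignored
]

def candidate_pairs(rows, first_day, last_day):
    """Distinct (user_id, item_id) pairs with any action inside the window.

    Days are ISO strings (YYYY-MM-DD), so string comparison orders them correctly.
    """
    return {(u, i) for u, i, _c, _b, day in rows if first_day <= day <= last_day}

pairs = candidate_pairs(log, "2014-11-22", "2014-11-27")
users = {row[0] for row in log}
items = {row[1] for row in log}
print(len(pairs), "candidate pairs vs", len(users) * len(items), "in the Cartesian product")
```

On real data the gap is far larger, which is why the Cartesian product is ruled out.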

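Following https://blog.csdn.net/snoopy_yuan/article/details/75105724, a few simple feature dimensions are extracted.

5. The scope of the dataset is not fixed. Depending on the prediction target and the distribution of the training data, the data may need filtering or similar operations.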

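Feature name	Category	Meaning	Purpose	Count
u_b_count	U	Total number of actions by the user in the period before the observation day	User activity	1
u_bi_count (i=1/2/3/4)	U	Count of each type of action by the user in the previous period	User activity (per action type)	4
u_b4_rate	U	User's purchase conversion rate	User's purchasing habits	1
i_u_count	I	Number of actions on the item in the period	Item popularity	1
i_b4_rate	I	Item's click-to-purchase conversion rate	Reflects how purchase decisions are made for the item	1
c_u_count	C	Number of actions on the category in the period	Reflects the popularity of the item_category	1
c_b4_rate	C	Category's click-to-purchase conversion rate	Reflects how purchase decisions are made for the item_category	1
ui_b_count	UI	Total number of actions for the user-item pair in the period	Reflects how active the U-I pair is	1
uc_b_count	UC	Total number of actions for the user-category pair in the period	Reflects how active the U-C pair is	1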

These features can be extracted in python pandas (the original blog post seems to have done the statistics in Excel) or with SQL. I use SQL here because I'm more familiar with it.
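For readers who prefer the Python route, the sketch below computes three of the features from the table (u_b_count, u_b4_rate, ui_b_count) on a toy log using only the standard library. The rows are invented, and this is an illustration of the definitions rather than the code actually used in this post.

```python
from collections import Counter

# Hypothetical toy rows: (user_id, item_id, behavior_type); behavior_type 4 = purchase
rows = [
    ("u1", "i1", 1), ("u1", "i1", 3), ("u1", "i1", 4),
    ("u1", "i2", 1),
    ("u2", "i1", 1),
]

# u_b_count: all actions per user in the window
u_b_count = Counter(u for u, _i, _b in rows)
# purchases per user, used as the numerator of u_b4_rate
u_b4_count = Counter(u for u, _i, b in rows if b == 4)
# ui_b_count: all actions per (user, item) pair
ui_b_count = Counter((u, i) for u, i, _b in rows)
# u_b4_rate: purchases divided by all actions; users with no purchase get 0.0
u_b4_rate = {u: u_b4_count[u] / total for u, total in u_b_count.items()}

print(u_b_count["u1"], u_b4_rate["u1"], u_b4_rate["u2"], ui_b_count[("u1", "i1")])
```

The item-level and category-level features follow the same pattern with a different grouping key.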

SQL steps

create table temp_fin.temp_tianchi_train1 as 
select a.user_id, a.item_id,a.item_category,1  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
inner join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
union all
select a.user_id, a.item_id,a.item_category,0  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
left join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
where b.user_id is null;

create table temp_fin.temp_tianchi_train1_dist as
select   distinct  * from  temp_fin.temp_tianchi_train1;
-- Feature extraction
create table temp_fin.temp_tianchi_train1_u_b_count as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b1_count  as 
select  distinct  a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=1
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b2_count  as 
select distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=2
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b3_count  as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=3
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_count  as 
select   distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_rate as 
select distinct a.user_id,  d.rate u_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.user_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by user_id
)  b 
left join 
(select   user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by user_id
)  c
 on b.user_id=c.user_id
 )  d 
 on a.user_id =d.user_id;

create table temp_fin.temp_tianchi_train1_i_u_count	 as 
select  distinct a.item_id,  b.l_count   i_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_id
)  b 
on a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_i_b4_rate as 
select  distinct a.item_id,  d.rate i_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.item_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_id
)  b 
left join 
(select   item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_id
)  c
 on b.item_id=c.item_id
 )  d 
 on a.item_id =d.item_id;

create table temp_fin.temp_tianchi_train1_c_u_count	 as 
select  distinct a.item_category,  b.l_count   c_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner  join
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_category
)  b 
on a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_c_b4_rate as 
select    distinct a.item_category,  d.rate c_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
left join 
(select   b.item_category , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_category
)  b 
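-- left join (as in u_b4_rate and i_b4_rate) so categories with no purchases get rate 0 instead of NULL
left join 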
(select   item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_category
)  c
 on b.item_category=c.item_category
 )  d 
 on a.item_category =d.item_category;
 
 create table temp_fin.temp_tianchi_train1_ui_b_count	 as 
select   distinct a.user_id, a.item_id,  b.l_count   ui_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_id
)  b 
on a.user_id=b.user_id
and a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_uc_b_count  as 
select distinct  a.user_id,a.item_category  ,b.l_count   uc_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_category
)  b 
on a.user_id=b.user_id
and a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_data as 
select a.user_id, a.item_id,a.item_category
,u_b_count_table.u_b_count
,u_b1_count.u_b_count u_b1_count
,u_b2_count.u_b_count u_b2_count
,u_b3_count.u_b_count u_b3_count
,u_b4_count.u_b_count u_b4_count
,u_b4_rate.u_b4_rate
,i_u_count.i_u_count
,i_b4_rate.i_b4_rate
,c_u_count.c_u_count
,c_b4_rate.c_b4_rate
,ui_b_count.ui_b_count
,uc_b_count.uc_b_count
,a.flag
from temp_fin.temp_tianchi_train1_dist a 
left join temp_fin.temp_tianchi_train1_u_b_count u_b_count_table
on a.user_id =u_b_count_table.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b1_count  u_b1_count 
on a.user_id =u_b1_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b2_count  u_b2_count 
on a.user_id =u_b2_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b3_count  u_b3_count 
on a.user_id =u_b3_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_count  u_b4_count 
on a.user_id =u_b4_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_rate  u_b4_rate 
on a.user_id =u_b4_rate.user_Id 
left join  temp_fin.temp_tianchi_train1_i_u_count i_u_count
on a.item_id =i_u_count.item_id
left join  temp_fin.temp_tianchi_train1_i_b4_rate  i_b4_rate 
on a.item_id =i_b4_rate.item_id
left join  temp_fin.temp_tianchi_train1_c_u_count c_u_count
on a.item_category=c_u_count.item_category
left join  temp_fin.temp_tianchi_train1_c_b4_rate c_b4_rate
on a.item_category=c_b4_rate.item_category
left join  temp_fin.temp_tianchi_train1_ui_b_count ui_b_count
on a.user_id =ui_b_count.user_Id and a.item_id=ui_b_count.item_id
left join  temp_fin.temp_tianchi_train1_uc_b_count uc_b_count
on a.user_id =uc_b_count.user_Id and a.item_category=uc_b_count.item_category;

The other two datasets are computed the same way, with the date ranges swapped for the corresponding windows.
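One way to avoid copying the SQL by hand for each dataset is to render it from a template. The sketch below does this for the u_b_count query only. The table suffixes `valid1` and `test1` are hypothetical names chosen here (only `train1` appears in the SQL above), and the date windows come from the dataset table.

```python
# Template mirroring the u_b_count query above, with the table suffix and dates as placeholders.
U_B_COUNT_SQL = """\
create table temp_fin.temp_tianchi_{name}_u_b_count as
select distinct a.user_id, b.l_count u_b_count from
temp_fin.temp_tianchi_{name}_dist a
inner join
(select user_id, count(1) l_count
from vipfin.tianchi_fresh_comp_train_user
where substr(time,1,10) >= '{first_day}' and substr(time,1,10) <= '{last_day}'
group by user_id
) b
on a.user_id = b.user_id;"""

WINDOWS = {
    "train1": ("2014-11-22", "2014-11-27"),
    "valid1": ("2014-11-29", "2014-12-04"),  # hypothetical table suffix
    "test1": ("2014-12-13", "2014-12-18"),   # hypothetical table suffix
}

def render(name):
    """Fill the template with the window that belongs to the given dataset."""
    first_day, last_day = WINDOWS[name]
    return U_B_COUNT_SQL.format(name=name, first_day=first_day, last_day=last_day)

for name in WINDOWS:
    print(render(name))
    print()
```

The other feature queries could be templated the same way. Note that the prediction set has no purchase records for 12.19, so its flag column cannot be built like the training one.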

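III. Feature processing

The processed data is still split into three sets, each with roughly these columns:

user_id, item_id, category, feature values (u_b_count ... uc_b_count), label (whether a purchase happened on the observation day)

With this data in hand, feature processing is done with the pyspark.ml.feature package. It has methods for turning multiple feature columns into a single multi-dimensional vector,

such as VectorAssembler, as well as methods for scaling feature values and handling zero values, such as MaxAbsScaler and MinMaxScaler.

Feature processing takes two steps:

multiple feature columns => one multi-dimensional vector column => scaled vector values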
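The two steps can be illustrated without Spark. The sketch below is plain Python, not the pyspark.ml.feature API: `assemble` and `min_max_scale` are made-up helpers that mimic the idea of assembling columns into a vector and then rescaling each dimension to [0, 1]. Mapping a constant column to 0.5 is a choice made here, and the sample values are invented.

```python
def assemble(rows, feature_names):
    """Collect the named columns of each row (a dict) into one list per row."""
    return [[float(row[name]) for name in feature_names] for row in rows]

def min_max_scale(vectors):
    """Rescale each dimension to [0, 1]; a constant dimension maps to 0.5."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.5 for x, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

# Hypothetical feature rows for three U-I pairs.
rows = [
    {"u_b_count": 10, "i_u_count": 200, "ui_b_count": 1},
    {"u_b_count": 30, "i_u_count": 100, "ui_b_count": 1},
    {"u_b_count": 20, "i_u_count": 150, "ui_b_count": 1},
]

vectors = assemble(rows, ["u_b_count", "i_u_count", "ui_b_count"])
scaled = min_max_scale(vectors)
print(scaled)
```

Scaling matters here because the raw counts live on very different ranges (an item's action count can be orders of magnitude larger than a single U-I pair's).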

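(Something to think about: could the first step incorporate feature weights? Among all the feature dimensions above, some matter more than others; for example, user activity matters more than item activity. A user has to be active to be likely to buy, and a best-selling item shown to a user who hardly does anything is wasted.)

Note: if the sklearn API is used for model training, the features are expected as an array, so all feature values can simply be combined and processed together. Details omitted.

Code to be added...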

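IV. Model building

With the features processed into vectors the model can consume, pick different algorithms from pyspark.ml and plug the data in. Tune the hyperparameters based on precision, and use the validation data to verify the model's reliability.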
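Whatever model is chosen, checking it against the validation day comes down to comparing a predicted set of U-I pairs with the set that was actually purchased. A minimal sketch follows. The F1 score is included on the assumption that it is the metric of interest, and the `evaluate` helper and sample pairs are invented for illustration.

```python
def evaluate(predicted, actual):
    """Precision, recall and F1 of a predicted set of (user_id, item_id) pairs."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical model output and ground truth for the validation day.
predicted = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i1")}
actual = {("u1", "i1"), ("u2", "i3"), ("u4", "i9")}

precision, recall, f1 = evaluate(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Looking at recall alongside precision matters for the reason given in the analysis section: a rule can be precise and still cover only a small share of the day's purchases.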

Code to be added...

Closing note: the reference blog is at https://blog.csdn.net/snoopy_yuan
