[Python] Applying a Clustering Algorithm: Offline Evaluation of Ad Campaign Performance
Brief overview
This was also something I did during my internship. The company spent as much as one billion RMB on advertising last year (=_=!), and SEM took up a sizeable share of that, spread across four engines: Baidu, 360, Sogou and Shenma. The first three were bid on for both WAP and PC; Shenma was WAP only. So I wanted to dig into the historical campaign performance data and see whether anything valuable could be mined out of it to bring campaign costs down. A speaker at a data analytics summit I attended yesterday made the same point: what matters most in data mining and analysis is not how pretty your reports and charts are, but how much value the patterns and results you uncover bring to the business.
Data – mining and analysis – value!
Fetching the required data with SQL
Here I wrote a class to fetch the data we need.
- city: name of the city to query
- start_time: start date of the data set
- end_time: end date of the data set
- platform: platform id (1 = PC, 2 = WAP); default None, i.e. the data is aggregated over all platforms
- engine_type: engine id (1 = Baidu, 2 = Sogou, 3 = 360, 4 = Shenma); default None, i.e. the data is aggregated over all engines
OK, talk is cheap, show me the code! Here it is:
# coding:utf-8
import pandas as pd
import pymysql, datetime
from pandas.io.sql import read_sql


class AD_Statistics_Select:

    def __init__(self, city, start_time, end_time, platform=None, engine_type=None):
        self.city = city
        self.start_time = start_time
        self.end_time = end_time
        self.platform = platform
        self.engine_type = engine_type
        self.df = self.SqlToDataframe()
        self.keyword_list = self.df['keyword']

    def SqlToDataframe(self):
        ## Connect to the database
        # The actual host, user, password and so on are omitted here; I hope you understand
        into_db = ['host', 'user', 'password', 'database', 'port', 'charset']
        cnxn = pymysql.connect(host=into_db[0], user=into_db[1],
                               passwd=into_db[2], db=into_db[3],
                               port=int(into_db[4]), charset=into_db[5])
        # The table name below is hidden as well, please bear with me~
        ## Note: because of the multi-line strings, the SQL queries may display with odd ##
        ## line breaks in this article, but they are fine in an editor!                  ##
        if self.platform is None and self.engine_type is None:
            sql = "SELECT keyword, COUNT(DISTINCT log_date) AS date_count, SUM(present) AS present_total, \
                   SUM(click) AS click_total, SUM(click)/SUM(present) AS CTR, SUM(uv) AS uv_total, \
                   AVG(IF(price>0, price, NULL)) AS price_avg, AVG(REPLACE(position,'\r','')) AS position_avg, \
                   COUNT(DISTINCT IF(click>0,log_date,NULL)) AS clicked_num \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   GROUP BY 1 \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city)
        elif self.platform is not None and self.engine_type is None:
            sql = "SELECT keyword, COUNT(DISTINCT log_date) AS date_count, SUM(present) AS present_total, \
                   SUM(click) AS click_total, SUM(click)/SUM(present) AS CTR, SUM(uv) AS uv_total, \
                   AVG(IF(price>0, price, NULL)) AS price_avg, AVG(REPLACE(position,'\r','')) AS position_avg, \
                   COUNT(DISTINCT IF(click>0,log_date,NULL)) AS clicked_num \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND platform = '{}' \
                   GROUP BY 1 \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.platform)
        elif self.platform is None and self.engine_type is not None:
            sql = "SELECT keyword, COUNT(DISTINCT log_date) AS date_count, SUM(present) AS present_total, \
                   SUM(click) AS click_total, SUM(click)/SUM(present) AS CTR, SUM(uv) AS uv_total, \
                   AVG(IF(price>0, price, NULL)) AS price_avg, AVG(REPLACE(position,'\r','')) AS position_avg, \
                   COUNT(DISTINCT IF(click>0,log_date,NULL)) AS clicked_num \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND engine_type = '{}' \
                   GROUP BY 1 \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.engine_type)
        else:
            sql = "SELECT keyword, COUNT(DISTINCT log_date) AS date_count, SUM(present) AS present_total, \
                   SUM(click) AS click_total, SUM(click)/SUM(present) AS CTR, SUM(uv) AS uv_total, \
                   AVG(IF(price>0, price, NULL)) AS price_avg, AVG(REPLACE(position,'\r','')) AS position_avg, \
                   COUNT(DISTINCT IF(click>0,log_date,NULL)) AS clicked_num \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND platform = '{}' \
                   AND engine_type = '{}' \
                   GROUP BY 1 \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.platform, self.engine_type)
        ## The pandas interface gets us there in one step!
        try:
            frame = read_sql(sql, cnxn)
        except Exception:
            frame = pd.DataFrame([])
        return frame
Just one thing worth pointing out here:
Since we will run a lot of operations on this data later, the result of the SQL query is loaded straight into a DataFrame with pandas' read_sql (at first I naively fetched the rows myself and wrote extra code just to turn them into a DataFrame). Moral: make good use of the wheels the experts have already built!
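As a minimal illustration of that shortcut (the connection details, table name and columns below are placeholders, not the real ones hidden above):

```python
# Minimal sketch: pandas' read_sql turns a query result into a DataFrame in one call,
# instead of fetchall() plus manual conversion. All connection details are placeholders.
import pymysql
from pandas.io.sql import read_sql   # pd.read_sql works the same way

cnxn = pymysql.connect(host='host', user='user', passwd='password',
                       db='database', port=3306, charset='utf8')
df = read_sql("SELECT keyword, SUM(click) AS click_total "
              "FROM some_table GROUP BY 1", cnxn)
print(df.head())
```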
Simple descriptive and display functions
This step is just basic descriptive statistics and visualization, but some of these features already carry a lot of valuable information, so it is an indispensable step!
Here I wrote another class for this processing. It inherits from AD_Statistics_Select, so city, engine_type, platform and the rest all come from the parent class.
- Rank keywords by CTR:
# Extra imports used by the methods of this class (in addition to those above)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn import metrics
from ggplot import ggplot, aes, geom_point
from matplotlib.font_manager import FontProperties
font = FontProperties(fname='/path/to/a/chinese/font.ttf')  # placeholder path; any font that renders Chinese labels


class AD_Statistics(AD_Statistics_Select):

    def keyword_SortBy_CTR(self, keyword, detail=False, **kwargs):
        """
        Given a keyword, return its row in the ranking of all keywords sorted by CTR,
        together with its CTR, average price and average position;
        set `detail=True` to return every column instead.
        param
        ========
        keyword: required; the keyword to look up
        detail: bool, default False; whether to return all columns for this keyword
        df: optional DataFrame to rank within; defaults to the full, unfiltered data
        return
        ========
        the keyword's ranking by CTR
        """
        if 'df' in kwargs.keys():
            df = kwargs['df'].sort_values('CTR', ascending=False)
        else:
            df = self.df.sort_values('CTR', ascending=False)
        df.index = range(1, len(df) + 1)
        if detail:
            return df.loc[df['keyword'] == keyword, df.columns]
        return df.loc[df['keyword'] == keyword, ['CTR', 'price_avg', 'position_avg']]
- Rank keywords by impressions (present):
    def keyword_SortBy_present(self, keyword, detail=False, **kwargs):
        """
        Given a keyword, return its row in the ranking of all keywords sorted by impressions,
        together with its CTR, average price and average position;
        set `detail=True` to return every column instead.
        param
        ========
        keyword: required; the keyword to look up
        detail: bool, default False; whether to return all columns for this keyword
        df: optional DataFrame to rank within; defaults to the full, unfiltered data
        return
        ========
        the keyword's ranking by impressions
        """
        if 'df' in kwargs.keys():
            df = kwargs['df'].sort_values('present_total', ascending=False)
        else:
            df = self.df.sort_values('present_total', ascending=False)
        df.index = range(1, len(df) + 1)
        if detail:
            return df.loc[df['keyword'] == keyword, df.columns]
        return df.loc[df['keyword'] == keyword, ['CTR', 'price_avg', 'position_avg']]
- Rank keywords by clicks:
    def keyword_SortBy_click(self, keyword, detail=False, **kwargs):
        """
        Given a keyword, return its row in the ranking of all keywords sorted by clicks,
        together with its CTR, average price and average position;
        set `detail=True` to return every column instead.
        param
        ========
        keyword: required; the keyword to look up
        detail: bool, default False; whether to return all columns for this keyword
        df: optional DataFrame to rank within; defaults to the full, unfiltered data
        return
        ========
        the keyword's ranking by clicks
        """
        if 'df' in kwargs.keys():
            df = kwargs['df'].sort_values('click_total', ascending=False)
        else:
            df = self.df.sort_values('click_total', ascending=False)
        df.index = range(1, len(df) + 1)
        if detail:
            return df.loc[df['keyword'] == keyword, df.columns]
        return df.loc[df['keyword'] == keyword, ['CTR', 'price_avg', 'position_avg']]
- The top-N functions below need no explanation; they simply take the head of the corresponding sorted frame:
    def present_top(self, n=100):
        """
        Returns the top n keywords by impressions; n defaults to 100
        """
        return self.df.sort_values('present_total', ascending=False).head(n)

    def click_top(self, n=100):
        """
        Returns the top n keywords by clicks; n defaults to 100
        """
        return self.df.sort_values('click_total', ascending=False).head(n)

    def price_top(self, n=100):
        """
        Returns the top n keywords by price_avg; n defaults to 100
        """
        return self.df.sort_values('price_avg', ascending=False).head(n)

    def date_count_top(self, n=100):
        """
        Returns the top n keywords by number of days in the data (date_count); n defaults to 100
        """
        return self.df.sort_values('date_count', ascending=False).head(n)

    def CTR_top(self, n=100):
        """
        Returns the top n keywords by CTR; n defaults to 100
        """
        return self.df.sort_values('CTR', ascending=False).head(n)
- Since the data is very sparse, I also wrote a couple of filters based on whether a keyword ever got a click (feel free to skip these =_=!):
    def click_none(self):
        # Returns the keywords that got no clicks at all
        return self.df[self.df['click_total'] == 0]

    def click_not_none(self):
        # Returns the keywords that got at least one click
        return self.df[self.df['click_total'] != 0]

    ### A quick visualization ###
    def show_click_none(self):
        show_df = self.click_none().copy()   # copy so we don't write into a view of self.df
        show_df['present_avg'] = show_df['present_total'] / show_df['date_count']
        p = ggplot(aes(x='date_count', y='present_avg'), data=show_df) + geom_point()
        p.show()

    def show_click_not_none(self):
        # show_df = self.click_top()
        show_df = self.click_not_none().copy()
        show_df['present_avg'] = show_df['present_total'] / show_df['date_count']
        p = ggplot(aes(x='date_count', y='present_avg'), data=show_df) + geom_point()
        p.show()
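To make the interface concrete, here is a hypothetical usage sketch (the city, dates and keyword string are made up, not taken from the real campaign data):

```python
# Hypothetical usage of the helpers above; city, dates and keyword are placeholders.
ad = AD_Statistics('深圳', '2017-05-01', '2017-07-20')       # aggregated over platforms and engines
print(ad.keyword_SortBy_CTR(u'some keyword'))                 # CTR / price_avg / position_avg only
print(ad.keyword_SortBy_CTR(u'some keyword', detail=True))    # every column
clicked = ad.click_not_none()                                 # drop keywords that were never clicked
print(ad.keyword_SortBy_click(u'some keyword', df=clicked))   # rank within a pre-filtered frame
print(ad.present_top(20))                                     # top 20 keywords by impressions
```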
Let me also give the exact parameters of the training data I pulled:
- start_time: 2017-05-01
- end_time: 2017-07-20
- platform: aggregated (all platforms)
- engine_type: aggregated (all engines)
- city: the top ten cities by transaction volume
Roughly speaking, the click data alone shows that each city has around eight or nine thousand keywords (some over ten thousand), yet in each city only a small number of keywords (a hundred or so, sometimes only a few dozen) account for more than 80% of the cumulative clicks. Don't believe it? Let the data speak:
    def click_part(self, before=0.81):
        # Returns a 3-tuple of DataFrames:
        #   keywords whose cumulative click share is below `before`,
        #   keywords between `before` and 1.0,
        #   keywords at exactly 1.0 (i.e. the ones with no clicks)
        click_df = self.df.sort_values('click_total', ascending=False)
        add_rate = []
        s = float(click_df['click_total'].sum())
        for i in range(1, len(click_df) + 1):
            add_rate.append(click_df['click_total'][0:i].sum() / s)
        click_df['c_add_rate'] = add_rate
        # click_df['rate'] = click_df['click_total'] / click_df['click_total'].sum()
        # click_df['rank'] = range(1, len(click_df)+1)
        # click_df.head(200).plot('rank', 'add_rate', kind='line')
        return (click_df[click_df['c_add_rate'] < before],
                click_df[(click_df['c_add_rate'] >= before) & (click_df['c_add_rate'] < 1.0)],
                click_df[click_df['c_add_rate'] == 1.0])
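A quick check of that 80/20 claim might look like this (a sketch; AD_STA stands for an AD_Statistics instance like the one built in the __main__ block further down):

```python
# Sketch: how concentrated are the clicks? AD_STA is an AD_Statistics instance.
head_df, tail_df, zero_df = AD_STA.click_part(before=0.81)
print('keywords in total:              %d' % len(AD_STA.df))
print('contributing ~80%% of clicks:    %d' % len(head_df))
print('contributing the remaining 20%%: %d' % len(tail_df))
print('never clicked:                  %d' % len(zero_df))
```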
Since what I care about right now are the words that contribute the most clicks (the marketing team gave me the same answer: they focus on those head keywords), and SEM is billed per click, the more clicks a word gets, the more the company pays in bidding fees for it. So the analysis below is restricted to the words that cumulatively contribute 80% of the clicks.
Building the indicators
To analyse the relationship between the bid price and the final click-through rate CTR, I introduced an intermediate variable, position. Intuitively, the higher the price, the smaller the position value (a larger position value means the ad sits further down the page), and the larger the position value, the lower the CTR. So I constructed two indicators: the price-position Pearson coefficient and the position-CTR Pearson coefficient.
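As a toy illustration of the two indicators (the numbers below are made up, not real campaign data): when a higher price buys a better position and a worse position loses clicks, both coefficients come out negative.

```python
# Toy illustration of the two Pearson indicators; the numbers are made up.
import numpy as np
from scipy.stats import pearsonr

position = np.array([1.2, 1.8, 2.5, 3.1, 4.0])       # avg position (larger = further down the page)
price    = np.array([5.6, 4.9, 4.1, 3.5, 2.8])       # avg bid price
ctr      = np.array([0.08, 0.06, 0.05, 0.03, 0.02])  # click-through rate

print(pearsonr(position, price))   # (r, p-value); r < 0: higher price -> better (smaller) position
print(pearsonr(position, ctr))     # r < 0: lower position -> lower CTR
```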
- To compute the Pearson coefficients for a keyword, I added a function keyword_detail() to the AD_Statistics_Select class that fetches the keyword's detailed records:
    def keyword_detail(self, keyword):
        # As above, the connection details are hidden
        into_db = ['host', 'user', 'password', 'database', 'port', 'charset']
        cnxn = pymysql.connect(host=into_db[0], user=into_db[1],
                               passwd=into_db[2], db=into_db[3],
                               port=int(into_db[4]), charset=into_db[5])
        ## Again, the multi-line SQL may display oddly in the article but is fine in an editor ##
        if self.platform is None and self.engine_type is None:
            sql = "SELECT keyword, CAST(REPLACE(position, '\r', '') AS DECIMAL(4,2)) AS position, click/present AS CTR, price \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND click <> 0 \
                   AND present <> 0 \
                   AND keyword = '{}' \
                   AND click < present \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, keyword)
        elif self.platform is not None and self.engine_type is None:
            sql = "SELECT keyword, CAST(REPLACE(position, '\r', '') AS DECIMAL(4,2)) AS position, click/present AS CTR, price \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND click <> 0 \
                   AND present <> 0 \
                   AND platform = '{}' \
                   AND keyword = '{}' \
                   AND click < present \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.platform, keyword)
        elif self.platform is None and self.engine_type is not None:
            sql = "SELECT keyword, CAST(REPLACE(position, '\r', '') AS DECIMAL(4,2)) AS position, click/present AS CTR, price \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND click <> 0 \
                   AND present <> 0 \
                   AND engine_type = '{}' \
                   AND keyword = '{}' \
                   AND click < present \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.engine_type, keyword)
        else:
            sql = "SELECT keyword, CAST(REPLACE(position, '\r', '') AS DECIMAL(4,2)) AS position, click/present AS CTR, price \
                   FROM *table_name* \
                   WHERE log_date BETWEEN '{}' AND '{}' \
                   AND city = '{}' \
                   AND click <> 0 \
                   AND present <> 0 \
                   AND platform = '{}' \
                   AND engine_type = '{}' \
                   AND keyword = '{}' \
                   AND click < present \
                   ORDER BY 1".format(self.start_time, self.end_time, self.city, self.platform, self.engine_type, keyword)
        try:
            keyword_detail_frame = read_sql(sql, cnxn)
        except Exception:
            keyword_detail_frame = pd.DataFrame([])
        return keyword_detail_frame
- Once we have the keyword's detailed records, and since abnormal click, present and position values were already filtered out in the SQL query, we can construct the Pearson correlation coefficients directly:
    def pearson_points(self, keyword_df):
        # Returns, for one keyword: the position-price Pearson coefficient and its p-value,
        # the position-CTR coefficient and p-value, and the price-CTR coefficient and p-value
        grouped_by_position = keyword_df.groupby('position')                               # group by position
        avg_grouped_by_position = grouped_by_position[['CTR', 'price']].agg([np.mean])     # mean per position
        avg_grouped_by_position['position'] = avg_grouped_by_position.index                # add position back as a column
        array_position = np.array(avg_grouped_by_position['position'])
        array_price = np.array(avg_grouped_by_position['price']).reshape(len(array_position), )
        array_CTR = np.array(avg_grouped_by_position['CTR']).reshape(len(array_position), )
        p_p = pearsonr(array_position, array_price)
        po_C = pearsonr(array_position, array_CTR)
        pr_C = pearsonr(array_price, array_CTR)
        return [keyword_df['keyword'][0], p_p[0], p_p[1], po_C[0], po_C[1], pr_C[0], pr_C[1]]
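For a single keyword the two pieces fit together like this (a sketch; the keyword string is a placeholder and AD_STA is the same hypothetical instance as in the earlier sketch):

```python
# Sketch: coefficients for one keyword; the keyword string is a placeholder.
detail_df = AD_STA.keyword_detail(u'some keyword')
row = AD_STA.pearson_points(detail_df)
# row = [keyword,
#        position-price r, its p-value,
#        position-CTR  r, its p-value,
#        price-CTR     r, its p-value]
print(row)
```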
- We can also add a display function that plots, for a keyword, the price-position relationship, the position-CTR relationship, and the price-CTR relationship we ultimately care about:
    def keyword_show(self, keyword_df):
        """
        Plot the price-position, CTR-position and price-CTR relationships for one keyword
        """
        grouped_by_position = keyword_df.groupby('position')                               # group by position
        avg_grouped_by_position = grouped_by_position[['CTR', 'price']].agg([np.mean])     # mean per position
        avg_grouped_by_position['position'] = avg_grouped_by_position.index
        array_position = np.array(avg_grouped_by_position['position'])
        array_price = np.array(avg_grouped_by_position['price']).reshape(len(array_position), )
        array_CTR = np.array(avg_grouped_by_position['CTR']).reshape(len(array_position), )
        fig = plt.figure()
        fig.suptitle(u'%s' % keyword_df['keyword'][0], fontproperties=font)
        ax1 = fig.add_subplot(2, 2, 1)
        p_p = pearsonr(array_position, array_price)
        ax1.set_title(u'Pearson r: %s, p-value: %s' % (p_p[0], p_p[1]), fontproperties=font)
        ax2 = fig.add_subplot(2, 2, 3)
        po_C = pearsonr(array_position, array_CTR)
        ax2.set_title(u'Pearson r: %s, p-value: %s' % (po_C[0], po_C[1]), fontproperties=font)
        ax3 = fig.add_subplot(1, 2, 2)
        pr_C = pearsonr(array_price, array_CTR)
        ax3.set_title(u'Pearson r: %s, p-value: %s' % (pr_C[0], pr_C[1]), fontproperties=font)
        avg_grouped_by_position.plot(x='position', y='price', kind='scatter', ax=ax1, figsize=(12, 6),
                                     alpha=0.5, color='#33CC99', sharex=True)
        avg_grouped_by_position.plot(x='position', y='CTR', kind='scatter', ax=ax2, figsize=(12, 6),
                                     alpha=0.5, color='#FF3366')
        avg_grouped_by_position.plot(x='price', y='CTR', kind='scatter', ax=ax3, alpha=0.5, color='#FF4500')
        return plt
- Iterate and display:
    def keyword_list_show(self, keyword_list):
        # Yields a visualization for every keyword in the list
        for i in keyword_list:
            keyword_df = self.keyword_detail(i)
            yield self.keyword_show(keyword_df)
- We already have a function that computes the Pearson coefficients for a single keyword; the next step is to compute them for all the words that cumulatively contribute 80% of the clicks. That is simple: build a list of those keywords and iterate over it:
    def keyword_list_pearson(self, keyword_list):
        # Returns the Pearson coefficients and accompanying p-values for every keyword in the list
        result = []
        for i in keyword_list:
            keyword_df = self.keyword_detail(str(i.encode('utf-8')))
            result.append(self.pearson_points(keyword_df))
        pp_df = pd.DataFrame(result)
        pp_df.columns = [u'keyword', u'position-price_pearson', u'position-price_pvalue',
                         u'position-CTR_pearson', u'position-CTR_pvalue',
                         u'price-CTR_pearson', u'price-CTR_pvalue']
        # pp_df.to_csv('keywords_pearson_point/%s-pearson.csv' % self.city)
        return pp_df
Training the K-Means model
With the Pearson coefficients for the required keywords in hand, the most important step is to build and train the model. Since the keywords carry no labels, I chose K-Means, the classic unsupervised clustering algorithm, to train on them and produce the final grouping.
The clustering is done on two dimensions: the price-position Pearson coefficient and the position-CTR Pearson coefficient.
My expectation is that the keywords fall into four classes:
- Class 1: both the price-position and the position-CTR Pearson coefficients are negative. This matches our intuition, and most keywords should end up here.
- Class 2: both coefficients are positive. This runs completely against intuition, and there should be very few such keywords.
- Classes 3 and 4: one coefficient is positive and the other negative. These are the target words of the analysis, the ones whose bids we should try to adjust.
The code:
    def keyword_pearson_kmeans(self, keyword_list):
        ## Train the clustering model
        ## and visualize the result
        pp_df = self.keyword_list_pearson(keyword_list)
        df_train = np.array(pp_df[['position-price_pearson', 'position-CTR_pearson']])
        kmeans_model = KMeans(n_clusters=4).fit(df_train)
        # kmeans_model.labels_ holds the cluster label assigned to each point
        colors = ['#33CC99', '#FF3366', '#FF4500', '#00BFFF']   # one color per cluster
        markers = ['o', 's', 'D', 'v']                          # one marker per cluster
        plt.figure(figsize=(16, 8))
        ## Plot every point, labelled with its keyword
        for i, l in enumerate(kmeans_model.labels_):
            plt.plot(df_train[i][0], df_train[i][1], color=colors[l],
                     marker=markers[l], ls='None', alpha=0.5)
            plt.text(df_train[i][0], df_train[i][1], '%s' % pp_df['keyword'][i], fontproperties=font, fontsize=6)
        plt.title(u'K = 4, silhouette coefficient = %.03f' %
                  metrics.silhouette_score(df_train, kmeans_model.labels_, metric='euclidean'),
                  fontproperties=font)
        plt.xlabel(u'position-price Pearson coefficient', fontproperties=font)
        plt.ylabel(u'position-CTR Pearson coefficient', fontproperties=font)
        plt.savefig('ad_kmeans1.png', dpi=500)
        plt.show()
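K = 4 here comes directly from the four expected classes above. If you want to sanity-check that choice, a quick sweep of the silhouette score over a few values of K is enough (a sketch, assuming df_train is the same two-column coefficient array as inside the function):

```python
# Sketch: sanity-check the choice of K with silhouette scores.
# df_train is assumed to be the (n_keywords, 2) array of Pearson coefficients used above.
from sklearn.cluster import KMeans
from sklearn import metrics

for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit(df_train).labels_
    score = metrics.silhouette_score(df_train, labels, metric='euclidean')
    print('K = %d, silhouette = %.3f' % (k, score))
```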
Training results
OK, how did the training turn out? Let the charts do the talking:
if __name__ == "__main__":
    '''
    ## Interactive input ##
    city = raw_input(u'Enter the city to query: ')
    start_time = raw_input(u'Enter the start date: ')
    end_time = raw_input(u'Enter the end date: ')
    f_p = lambda a: None if a == '' else a
    platform = f_p(raw_input(u"""Enter the platform id
                   (1 = PC, 2 = WAP; just press Enter for all platforms): """))
    f_e = lambda a: None if a == '' else a
    engine_type = f_e(raw_input(u"""Enter the search engine id
                   (1 = baidu, 2 = sougou, 3 = 360, 4 = shenma; just press Enter for all engines): """))
    '''
Training results for Shenzhen
Brace yourself, lots of figures ahead!
    AD_STA = AD_Statistics('深圳', '2017-05-01', '2017-07-20')
    df = AD_STA.SqlToDataframe()
    ret = AD_STA.click_part()
    AD_STA.keyword_pearson_kmeans(ret[0]['keyword'])
The images I uploaded may not be very sharp, but that does not stop us from reading off the key information: the keywords were clustered into four classes, which matches the earlier guess quite well.
Class 1: both the price-position and the position-CTR Pearson coefficients are negative; it contains most of the keywords.
Class 2: both coefficients are positive; only a handful of keywords fall here.
Classes 3 and 4: one coefficient is positive and the other negative; these are the target words whose bids we should try adjusting.
Now let's see which words actually make up those last two classes:
- price-position Pearson coefficient positive, position-CTR Pearson coefficient negative
# A function defined earlier:
'''
def keyword_list_show(self, keyword_list):
    # yields a visualization for every keyword in the list
    for i in keyword_list:
        keyword_df = self.keyword_detail(i)
        yield self.keyword_show(keyword_df)
'''
# All we need is a list of those keywords
# keyword_list = ['', '', '', ...]
pictures = AD_STA.keyword_list_show(keyword_list)
for i in pictures:
    i.show()
Then look at how each of these words performs in detail:
As you can see, for this class of words price and CTR are not positively correlated, and for some words the relationship is even negative! So we can lower the bids on these words and then verify whether that has a significant effect on CTR.
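One simple way to run that verification is a two-proportion z-test on clicks versus impressions before and after the bid change (a sketch with made-up counts; statsmodels is not used elsewhere in this post):

```python
# Sketch: did CTR change significantly after lowering the bid? Counts are made up.
from statsmodels.stats.proportion import proportions_ztest

clicks      = [120, 105]     # clicks before / after the bid change
impressions = [4000, 3800]   # impressions before / after
stat, pvalue = proportions_ztest(clicks, impressions)
print('z = %.3f, p = %.4f' % (stat, pvalue))  # p < 0.05 would indicate a significant CTR change
```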
- price-position Pearson coefficient negative, position-CTR Pearson coefficient positive
Again, how each of these words performs in detail:
Training results for Chongqing
For brevity I won't go through the details again; for each case below I show just one figure as an illustration.
- price-position Pearson coefficient positive, position-CTR Pearson coefficient negative
- price-position Pearson coefficient negative, position-CTR Pearson coefficient positive
Training results for Chengdu
- price-position Pearson coefficient positive, position-CTR Pearson coefficient negative
- price-position Pearson coefficient negative, position-CTR Pearson coefficient positive