
Sogou Log Analysis

MapReduce code: https://github.com/pickLXJ/analysisSogou.git

Log data: https://pan.baidu.com/s/112P_hR9FlQq7htyTVjxgwg

 

1. Log Format

The Sogou data format is documented at https://www.sogou.com/labs/resource/q.php

Raw data (fields: timestamp, user ID, query keyword, rank of the clicked URL in the results, order of the user's click, clicked URL):

20111230000418  e686beaf83faa9a106b1a023923edd74        黑鏡頭  9       2       http://bbs.tiexue.net/post_4161367_1.html
20111230000418  5467c699d1ae4a61b6d53bb2fe83c04a        搜尋 WWW.MMPPTV.COM     6       3       http://9bc947d.he.artseducation.com.cn/
20111230000418  55623d0852a5161063c6d01f0856a814        百裡挑一主題歌是什麼    5       1       http://zhidao.baidu.com/question/169708995
20111230000418  8d737be3a9c125181bdd422287bee65f        鑽石價格查詢    4       2       http://tool.wozuan.com/
20111230000419  bbe344592ade912de81595d2ec140c0d        眉山電信        9       1       http://www.aibang.com/detail/1232487017-414995109
20111230000419  df79cc0c9a4c9faa1656023c5c12265e        好看的高幹文    8       2       http://www.tianya.cn/publicforum/content/funinfo/1/1643841.shtml
20111230000419  ec0363079f36254b12a5e30bdc070125        AQVOX   8       7       http://www.erji.net/simple/index.php?t122047.html

 

2. Data Cleaning

Shell scripts remove blank records and transform part of the data.

 

Extension script (appends year, month, day fields)

vim log-extend.sh

[root@bigdata000 ~]# log-extend.sh /home/samba/sample/file/sogou.500w.utf8  /home/samba/sample/file/sogou_log.txt
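The body of log-extend.sh is not shown in the post. A minimal sketch of what such a script could look like, assuming the extra columns are parsed from the leading yyyymmddhhmmss timestamp (the author's actual output shows different year/month/day values, so the real script evidently differs; treat this as illustrative only, and the `extend_log` helper name as hypothetical):

```shell
#!/bin/bash
# Hypothetical sketch of log-extend.sh (the original body is not shown).
# Appends year, month, day and hour columns parsed from the yyyymmddhhmmss
# timestamp in column 1 of each tab-separated record.
extend_log() {
    local infile=$1 outfile=$2
    awk -F "\t" 'BEGIN { OFS = "\t" } {
        print $0, substr($1,1,4), substr($1,5,2), substr($1,7,2), substr($1,9,2)
    }' "$infile" > "$outfile"
}

# Example: extend_log sogou.500w.utf8 sogou_log.txt
```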

Filter script (drops records with an empty query)

vim log-filter.sh

#!/bin/bash
# Drop records whose uid (column 2) or keyword (column 3) is empty or blank.
#infile=/home/sogou_log.txt
infile=$1
#outfile=/home/sogou_log.txt.flt
outfile=$2
awk -F "\t" '{ if ($2 != "" && $3 != "" && $2 != " " && $3 != " ") print $0 }' "$infile" > "$outfile"

[root@bigdata000 ~]# log-filter.sh /home/samba/sample/file/sogou_log.txt  /home/samba/sample/file/sogou_log.txt.flt

 

Building a Data Warehouse for the Log Data on Hive

  • Create the database

hive> create database sogou;

  • Use the database

hive> use sogou;

  • Create an external table with 4 extended fields (year, month, day, hour):
hive> CREATE EXTERNAL TABLE sogou_data(

ts string,

uid string,

keyword string,

rank int,

sorder int,

url string,

year int,

month int,

day int,

hour int)

    > ROW FORMAT DELIMITED

    > FIELDS TERMINATED BY '\t'

    > STORED AS TEXTFILE;

OK

Time taken: 0.412 seconds
  • Load the local data into the Hive table

load data local inpath '/home/samba/sample/file/sogou_log.txt.flt' into table sogou_data;

  • Create a partitioned table:
hive> CREATE EXTERNAL TABLE sogou_partitioned_data(

ts string,

uid string,

keyword string,

rank int,

sorder int,

url string)

    > PARTITIONED BY(year int,month int,day int,hour int)

    > ROW FORMAT DELIMITED

    > FIELDS TERMINATED BY '\t'

    > STORED AS TEXTFILE;
  • Enable dynamic partitioning

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> INSERT OVERWRITE TABLE sogou_partitioned_data partition(year,month,day,hour) SELECT * FROM sogou_data;
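Depending on the Hive version, the INSERT above may also need dynamic partitioning switched on explicitly and the partition limits raised; a hedged configuration sketch (the limit values below are illustrative, not from the original post):

```sql
-- Allow dynamic partitions at all (on by default in recent Hive releases,
-- but harmless to set explicitly).
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Raise these if the insert fails with a "too many dynamic partitions"
-- error; the values here are illustrative.
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=1000;
```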

 

 

 

Query Tests

  • Query the first 10 rows:
    > select * from sogou_data limit 10;
OK
20111230000005  57375476989eea12893c0c3811607bcf        奇藝高清        1       1       http://www.qiyi.com/       2011    11      23      0
20111230000005  66c5bb7774e31d0a22278249b26bc83a        凡人修仙傳      3       1       http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1      2011    11      23      0
20111230000007  b97920521c78de70ac38e3713f524b50        本本聯盟        1       1       http://www.bblianmeng.com/ 2011    11      23      0
20111230000008  6961d0c97fe93701fc9c0d861d096cd9        華南師範大學圖書館      1       1       http://lib.scnu.edu.cn/    2011    11      23      0
20111230000008  f2f5a21c764aebde1e8afcc2871e086f        線上代理        2       1       http://proxyie.cn/ 2011    11      23      0
20111230000009  96994a0480e7e1edcaef67b20d8816b7        偉大導演        1       1       http://movie.douban.com/review/1128960/    2011    11      23      0
20111230000009  698956eb07815439fe5f46e9a4503997        youku   1       1       http://www.youku.com/      2011    11      23      0
20111230000009  599cd26984f72ee68b2b6ebefccf6aed        安徽合肥365房產網       1       1       http://hf.house365.com/    2011    11      23      0
20111230000010  f577230df7b6c532837cd16ab731f874        哈薩克網址大全  1       1       http://www.kz321.com/      2011    11      23      0
20111230000010  285f88780dd0659f5fc8acc7cc4949f2        IQ數碼  1       1       http://www.iqshuma.com/    2011    11      23      0
Time taken: 2.522 seconds, Fetched: 10 row(s)

 

  • Query the search records of a specific user
hive>  select * from sogou_data where uid='6961d0c97fe93701fc9c0d861d096cd9';
OK
20111230000008  6961d0c97fe93701fc9c0d861d096cd9        華南師範大學圖書館      1       1       http://lib.scnu.edu.cn/    2011    11      23      0
20111230065007  6961d0c97fe93701fc9c0d861d096cd9        華南師範大學圖書館      1       1       http://lib.scnu.edu.cn/    2011    11      23      0
Time taken: 0.653 seconds, Fetched: 2 row(s)

 

  • Total number of rows
hive> select count(*) from sogou_partitioned_data;
Query ID = root_20181214010000_020e4437-b637-4861-bac3-21be3a0754b5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0001, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0001/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job  -kill job_1544683093139_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
Ended Job = job_1544683093139_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 70.68 sec   HDFS Read: 573691364 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 10 seconds 680 msec
OK
5000000
Time taken: 236.402 seconds, Fetched: 1 row(s)

  • Number of rows with a non-empty keyword
    > select count(*) from sogou_partitioned_data where keyword is not null and keyword!='';
Query ID = root_20181214010606_d8a11bd2-3cbc-482b-ba0d-27bf65d1589c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0002, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0002/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job  -kill job_1544683093139_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

MapReduce Total cumulative CPU time: 1 minutes 12 seconds 720 msec
Ended Job = job_1544683093139_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 72.72 sec   HDFS Read: 573693021 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 12 seconds 720 msec
OK
5000000
Time taken: 90.678 seconds, Fetched: 1 row(s)


  • Number of non-duplicated rows
hive> select count(*) from(select count(*) as no_repeat_count from sogou_partitioned_data group by ts,uid,keyword,url having no_repeat_count=1) a;
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 3   Cumulative CPU: 383.06 sec   HDFS Read: 573702274 HDFS Write: 351 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 12.22 sec   HDFS Read: 5186 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 35 seconds 280 msec
OK
4999272
Time taken: 448.265 seconds, Fetched: 1 row(s)

  • Total number of distinct UIDs
hive> select count(distinct(uid)) from sogou_partitioned_data;
Ended Job = job_1544683093139_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 88.13 sec   HDFS Read: 573691789 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 28 seconds 130 msec
OK
1352664
Time taken: 91.419 seconds, Fetched: 1 row(s)


 

Data Analysis Requirement 2: Keyword Analysis

(1) Query frequency ranking (top 50 keywords)

    > select keyword,count(*) query_count from sogou_partitioned_data group by keyword order by query_count desc limit 50;
Total MapReduce CPU Time Spent: 3 minutes 10 seconds 30 msec
OK
百度    38441
baidu   18312
人體藝術        14475
4399小遊戲      11438
qq空間  10317
優酷    10158
新亮劍  9654
館陶縣縣長閆寧的父親    9127
公安賣萌        8192
百度一下 你就知道       7505
百度一下        7104
4399    7041
魏特琳  6665
qq網名  6149
7k7k小遊戲      5985
黑狐    5610
兒子與母親不正當關係    5496
新浪微博        5369
李宇春體        5310
新疆暴徒被擊斃圖片      4997
hao123  4834
123     4829
4399洛克王國    4112
qq頭像  4085
nba     4027
龍門飛甲        3917
qq個性簽名      3880
張去死  3848
cf官網  3729
凰圖騰  3632
快播    3423
金陵十三釵      3349
吞噬星空        3330
dnf官網 3303
武動乾坤        3232
新亮劍全集      3210
電影    3155
優酷網  3115
兩次才處決美女罪犯      3106
電影天堂        3028
土豆網  2969
qq分組  2940
全國各省最低工資標準    2872
清代姚明        2784
youku   2783
爭產案  2755
dnf     2686
12306   2682
身份證號碼大全  2680
火影忍者        2604
Time taken: 240.291 seconds, Fetched: 50 row(s)


 

Data Analysis Requirement 3: UID Analysis

  • Number of users who queried more than 2 times
hive> select count(*) from( select count(*) as query_count from sogou_partitioned_data group by uid having query_count > 2) a;
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 19 seconds 420 msec
OK
546353
Time taken: 249.635 seconds, Fetched: 1 row(s)


  • Proportion of users who queried more than 2 times

A:

hive> select count(*) from(select count(*) as query_count from sogou_partitioned_data group by uid having query_count > 2) a;
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 13 seconds 250 msec
OK
546353
Time taken: 239.699 seconds, Fetched: 1 row(s)


B:

    > select count(distinct(uid)) from sogou_partitioned_data;

Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 106.46 sec   HDFS Read: 573691789 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 46 seconds 460 msec
OK
1352664
Time taken: 109.001 seconds, Fetched: 1 row(s)


 

A/B

hive> select 546353/1352664;

OK

0.40390887907122536

Time taken: 0.255 seconds, Fetched: 1 row(s)


 

 

  • Proportion of clicks whose rank is within the top 10 (rank is the fourth column)
    A:
    hive> select count(*) from sogou_partitioned_data where rank < 11;
    4999869
    Time taken: 29.653 seconds, Fetched: 1 row(s)
    B:
    hive> select count(*) from sogou_partitioned_data;
    5000000
    A/B
    hive> select 4999869/5000000;
    OK
    0.9999738 
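The same ratio can also be cross-checked outside Hive with awk on the cleaned flat file, where column 4 is rank. A minimal sketch (the `rank_ratio` helper is this editor's illustration, not part of the original workflow):

```shell
#!/bin/bash
# Illustrative cross-check of the "rank within top 10" proportion computed
# in Hive above. Column 4 of the cleaned, tab-separated file is rank.
rank_ratio() {
    awk -F "\t" '$4 < 11 { hits++ } END { printf "%.7f\n", hits / NR }' "$1"
}

# Example: rank_ratio /home/samba/sample/file/sogou_log.txt.flt
```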

 

  • Proportion of queries typed as a direct URL (approximated here by keywords containing 'www')
    A:
    hive> select count(*) from sogou_partitioned_data where keyword like '%www%';
    OK
    73979
    B:
    hive> select count(*) from sogou_partitioned_data;
    OK
    5000000
    A/B
    hive> select 73979/5000000;
    OK
    0.0147958
     
     

Data Analysis Requirement 4: Individual User Behavior Analysis

(1) Find the uids that searched for "仙劍奇俠傳" more than 3 times

    >  select uid,count(*) as cnt from sogou_partitioned_data where keyword='仙劍奇俠傳' group by uid having cnt > 3;
Query ID = root_20181214020303_dbf96d64-9f8e-4ed5-844d-711de957e8b8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0015, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0015/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job  -kill job_1544683093139_0015
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3
MapReduce Total cumulative CPU time: 1 minutes 37 seconds 730 msec
Ended Job = job_1544683093139_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 3   Cumulative CPU: 97.73 sec   HDFS Read: 573703160 HDFS Write: 70 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 37 seconds 730 msec
OK
653d48aa356d5111ac0e59f9fe736429        6
e11c6273e337c1d1032229f1b2321a75        5
Time taken: 106.129 seconds, Fetched: 2 row(s)
