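Hadoop Project in Practice: Log Analysis

阿新 • Published: 2019-02-15

1. Project description
This project analyzes web server access logs in the Apache common log format.
2. Data
Each record consists of 5 parts:
1. Client IP
2. Access time
3. Requested resource (the quoted request line: method, URL, and protocol version)
4. Response status
5. Bytes transferred for this request
A sample of the data: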

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shy.gif HTTP/1.1" 200 2663
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/nv_a.png HTTP/1.1" 200 2076
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/titter.gif HTTP/1.1" 200 1398
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sweat.gif HTTP/1.1" 200 1879
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/mad.gif HTTP/1.1" 200 2423
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/hug.gif HTTP/1.1" 200 1054
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/lol.gif HTTP/1.1" 200 1443
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/victory.gif HTTP/1.1" 200 1275
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/time.gif HTTP/1.1" 200 687
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/kiss.gif HTTP/1.1" 200 987
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/handshake.gif HTTP/1.1" 200 1322
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/loveliness.gif HTTP/1.1" 200 1579
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/call.gif HTTP/1.1" 200 603
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/funk.gif HTTP/1.1" 200 2928
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/curse.gif HTTP/1.1" 200 1543
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/dizzy.gif HTTP/1.1" 200 1859
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shutup.gif HTTP/1.1" 200 2500
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sleepy.gif HTTP/1.1" 200 2375
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=56212&size=middle HTTP/1.1
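
To make the five-part record layout concrete, here is a minimal sketch that splits one log line with a regular expression. The class name LogLineParser and the exact pattern are assumptions made for illustration; they are not part of the original project, which uses string splitting instead.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one common-log line into the five fields listed above. */
class LogLineParser {
    // ip, two identity fields (ignored), [time], "request line", status, bytes ("-" if none)
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    /** Returns {ip, time, request, status, bytes}, or empty if the line is malformed or truncated. */
    static Optional<String[]> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
    }

    public static void main(String[] args) {
        String complete = "27.19.74.143 - - [30/May/2013:17:38:21 +0800] "
            + "\"GET /static/image/smiley/default/shy.gif HTTP/1.1\" 200 2663";
        parse(complete).ifPresent(f -> System.out.println(String.join(" | ", f)));

        // A record that is cut off before the closing quote does not match.
        String truncated = "8.35.201.165 - - [30/May/2013:17:38:21 +0800] \"GET /uc_server/avatar.php";
        System.out.println(parse(truncated).isPresent());
    }
}
```

Returning an empty Optional for lines that do not match keeps truncated records from breaking later steps.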

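3. Key metrics
⊙ Page views (PV)

Definition: PV (Page View) is the total number of pages viewed by all users; every page opened by a user is counted once.

Analysis: Total page views indicate how interested users are in the site, much as ratings do for a TV series. For site operators, however, the page views of each individual section matter more.

Formula: count the records

⊙ Registered users

Formula: count the requests whose URL contains member.php?mod=register

⊙ IP count

Definition: the number of distinct IPs that visit the site within one day. An IP counts once no matter how many pages it views.

Analysis: This is the most familiar concept. Regardless of how many computers or users share one IP, the number of distinct IPs is, to some extent, the most direct measure of how well a site promotion campaign performed.

Formula: count the distinct IPs

⊙ Bounce rate

Definition: the percentage of visits that leave the site after viewing only one page, i.e. visits with a single page view / total visits.

Analysis: Bounce rate is a very important indicator of visitor stickiness. It shows how interested visitors are in the site: the lower the bounce rate, the better the traffic quality, the more interested visitors are in the content, and the more likely they are to be effective, loyal users.

The metric also measures the effect of online marketing. It shows how many visitors were drawn to a product page or to the site by a campaign and were then lost, that is, visitors who were all but won and then got away. For example, if the site advertises on some media outlet, the bounce rate of visitors arriving from that source reflects whether the outlet was a good choice, whether the ad copy was well written, and whether the landing page offers a good user experience.

Formula: (1) count the IPs that appear in only one record during the day; this is the bounce count
(2) bounce count / PV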
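
The PV, IP count, and bounce formulas above can be sketched in plain Java over one day of records held in memory. This is only an illustration under assumptions: the class name DailyMetrics and the sample addresses are made up, and the bounce rate follows the formula given here (bounce count / PV) rather than a session-based definition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the PV, IP count, and bounce formulas for one day of records. */
class DailyMetrics {
    final long pv;       // total number of records
    final long ipCount;  // number of distinct IPs
    final long jumpers;  // IPs that appear in exactly one record

    /** Takes the IP of every record of the day, one list element per record. */
    DailyMetrics(List<String> ips) {
        Map<String, Integer> hits = new HashMap<>();
        for (String ip : ips) {
            hits.merge(ip, 1, Integer::sum);
        }
        pv = ips.size();
        ipCount = hits.size();
        jumpers = hits.values().stream().filter(n -> n == 1).count();
    }

    /** Bounce rate as in the formula above: bounce count / PV. */
    double bounceRate() {
        return pv == 0 ? 0.0 : (double) jumpers / pv;
    }

    public static void main(String[] args) {
        // Made-up sample: 10.0.0.1 views three pages, the other two IPs view one page each.
        DailyMetrics m = new DailyMetrics(
            List.of("10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"));
        System.out.printf("pv=%d ip=%d jumpers=%d bounce=%.2f%n",
            m.pv, m.ipCount, m.jumpers, m.bounceRate());
    }
}
```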

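⊙ Section popularity ranking

Definition: a ranking of the sections by number of visits.

Analysis: Use it to consolidate the results of popular sections and to strengthen quiet ones. It also informs how the subject areas are developed.

Formula: count the visits per section and sort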
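
A minimal sketch of "count and sort" follows. The article does not say how a record is mapped to a section, so the sketch assumes, purely for illustration, that the section key is the URL path with the query string removed; the class name SectionRanking is likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: rank sections by visit count. */
class SectionRanking {
    /** Assumed section key: the URL with any query string removed. */
    static String sectionOf(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    /** Returns (section, visits) entries, most visited first; ties are ordered by name. */
    static List<Map.Entry<String, Integer>> rank(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(sectionOf(url), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "forum.php",
            "home.php?mod=misc&ac=sendmail&rand=1369922680",
            "home.php?mod=misc&ac=sendmail&rand=1369922680",
            "api/connect/like.php");
        for (Map.Entry<String, Integer> e : rank(urls)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```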

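4. Development steps
1. Use Flume to import the log data into HDFS
2. Clean the data so that it is easier to work with
3. Store the detailed logs in HBase so that they can be queried by IP and time
4. Use Hive for multidimensional analysis of the data
5. Export the Hive results to MySQL with Sqoop
6. Provide a viewing tool for users

That covers the basics of the project. The development process is described in detail below:
1. Upload the log data to HDFS for processing
If the log server holds little data and is under light load, upload the data to HDFS directly with shell commands;
If the log server holds more data and is under heavier load, upload the data from another server via NFS;
If there are many log servers and the data volume is large, use Flume to collect the data.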

[root@hadoop11 mydata]# hadoop fs -put access_2013_05_30.log  /hmbbs_logs/
[root@hadoop11 mydata]# hadoop fs -lsr /hmbbs_logs/
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r--   3 root supergroup   61084192 2016-07-22 14:42 /hmbbs_logs/access_2013_05_30.log

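2. Clean the data with MapReduce so that it is easier to work with
Cleaning rules: convert the Apache log into a simpler record format that is easy to work with.
① Drop the response status and bytes-transferred fields.
② Filter out static requests and keep only dynamic ones: lines whose request starts with GET /static or GET /uc_server are discarded.
③ Strip the leading and trailing markers from the request:
leading: GET / and POST /
trailing: HTTP/1.1 and HTTP/1.0
MapReduce source code for cleaning the logs: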

package Hmbbs;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// This program uses MapReduce to clean the Hmbbs log data.
public class HmbbsCleaner  
{
        public static String path1="";// input path of the raw log file
        public static String path2="";// output path for the cleaned log
        public static void main(String[] args) throws Exception
        {
            path1 = args[0];
            path2 = args[1];

            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoop11:9000/");
            FileSystem fileSystem = FileSystem.get(conf);
            if(fileSystem.exists(new Path(path2)))
            {
                fileSystem.delete(new Path(path2), true);
            }
            Job job = Job.getInstance(conf, "HmbbsCleaner");
            job.setJarByClass(HmbbsCleaner.class);// locate the job jar through this class
            // configure the job
            FileInputFormat.setInputPaths(job, new Path(path1));
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setNumReduceTasks(1);// use a single reduce task
            job.setPartitionerClass(HashPartitioner.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(path2));
            job.setOutputFormatClass(TextOutputFormat.class);

            // submit the job and wait for it to finish
            job.waitForCompletion(true);
        }
      public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable>
      {
            // reused across records to avoid allocating new objects for every line
            private final Parselogs parselogs = new Parselogs();
            private final Text k2 = new Text();

            @Override
            protected void map(LongWritable k1, Text v1,Context context)throws IOException, InterruptedException
            {
                 String string = v1.toString();// the raw log record
                try
                {
                     String[] sub = parselogs.parseString(string);
                     if(sub[2].startsWith("GET /static")||sub[2].startsWith("GET /uc_server"))
                            return ;// static requests are filtered out without further processing

                     // strip the leading request method
                     if(sub[2].startsWith("GET /"))
                     {
                            sub[2] = sub[2].substring("GET /".length());
                     }
                     if(sub[2].startsWith("POST /"))
                     {
                            sub[2] = sub[2].substring("POST /".length());
                     }
                     // strip the trailing protocol version
                     if(sub[2].endsWith(" HTTP/1.1"))
                     {
                            sub[2] = sub[2].substring(0, sub[2].length()-" HTTP/1.1".length());
                     }
                     if(sub[2].endsWith(" HTTP/1.0"))
                     {
                            sub[2] = sub[2].substring(0, sub[2].length()-" HTTP/1.0".length());
                     }
                     k2.set(sub[0]+"\t"+sub[1]+"\t"+sub[2]);// the three fields are separated by tabs
                     context.write(k2, NullWritable.get()); 
                } 
                catch (ParseException | IndexOutOfBoundsException e)
                {
                    // skip records that cannot be parsed (bad date, truncated or malformed line)
                    e.printStackTrace();
                }
            }
      }
      public static class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable>
      {
                protected void reduce(Text k2, Iterable<NullWritable> v2s,Context context)throws IOException, InterruptedException
                {
                     for (NullWritable v2 : v2s)
                    {
                         Text k3 = k2;
                         context.write(k3, NullWritable.get());
                    }
                }
      }
}
class Parselogs // parses one raw log line into its fields
{
     public String[]  parseString(String str) throws ParseException
     {
        String  str1 = parseIp(str);
        String  str2 = parseDate(str);
        String  str3 = parseUrl(str);
        //String  str4 = parseStatus(str);
        //String  str5 = parseFlow(str);
        //String[] str66 = new String[]{str1,str2,str3,str4,str5};
        String[] str66 = new String[]{str1,str2,str3};// keep only the fields this project needs
        return str66;
     }
     public String parseIp(String str)// extracts the client IP
     {
          String[] splited = str.split(" - - ");// split on the " - - " separator; the IP is the first part
          return splited[0];
     }
     public String parseDate(String str) throws ParseException
     {
         String[] splited = str.split(" - - ");// split on the " - - " separator; the time is in the second part
         int index1 = splited[1].indexOf("[");
         int index2 = splited[1].indexOf("]");
         String substring = splited[1].substring(index1+1, index2);// the time field, e.g. 30/May/2013:17:38:20 +0800
         SimpleDateFormat simple1 = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);// input pattern; the trailing " +0800" is left unparsed
         SimpleDateFormat simple2 = new SimpleDateFormat("yyyyMMddHHmmss");// output pattern
         Date parse = simple1.parse(substring);
         String format = simple2.format(parse);
         return format;
     }
     public String parseUrl(String str)// extracts the quoted request line (method, URL, protocol)
     {
          int index1 = str.indexOf("]");
          int index2= str.lastIndexOf("\"");
          String substring = str.substring(index1+3, index2);
          return substring;
     }
     public String parseStatus(String str)// extracts the response status
     {
         int index1= str.lastIndexOf("\"");
         String str2 = str.substring(index1+1).trim();
         String[] splited = str2.split(" ");
         return splited[0];
     }
     public String parseFlow(String str)// extracts the bytes transferred
     {
         int index1= str.lastIndexOf("\"");
         String str2 = str.substring(index1+1).trim();
         String[] splited = str2.split(" ");
         return splited[1];
     }
}
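
The cleaning rules can also be checked locally, without a cluster, before the job is submitted. The sketch below is a stand-alone approximation and not the original job: the class name CleanerCheck is hypothetical, it uses java.time instead of SimpleDateFormat, and the dynamic sample line (including its status and byte count) is made up in the same format as the real data.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical stand-alone check of the cleaning rules; not part of the original job. */
class CleanerCheck {
    private static final DateTimeFormatter IN =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns "ip<TAB>time<TAB>url" for a dynamic request, or empty for a static one. */
    static Optional<String> clean(String line) {
        String ip = line.substring(0, line.indexOf(" - - "));
        String time = line.substring(line.indexOf('[') + 1, line.indexOf(']'));
        String request = line.substring(line.indexOf("] \"") + 3, line.lastIndexOf('"'));
        if (request.startsWith("GET /static") || request.startsWith("GET /uc_server")) {
            return Optional.empty();
        }
        String url = request
            .replaceFirst("^(GET|POST) /", "")      // strip the leading method
            .replaceFirst(" HTTP/1\\.[01]$", "");   // strip the trailing protocol version
        // Keep the wall-clock time exactly as written in the log.
        String stamp = OffsetDateTime.parse(time, IN).format(OUT);
        return Optional.of(ip + "\t" + stamp + "\t" + url);
    }

    public static void main(String[] args) {
        String dynamicLine = "60.10.5.65 - - [30/May/2013:22:07:42 +0800] "
            + "\"GET /forum.php HTTP/1.1\" 200 1024";
        String staticLine = "27.19.74.143 - - [30/May/2013:17:38:21 +0800] "
            + "\"GET /static/image/smiley/default/shy.gif HTTP/1.1\" 200 2663";
        System.out.println(clean(dynamicLine).orElse("(filtered)"));
        System.out.println(clean(staticLine).orElse("(filtered)"));
    }
}
```

The first line should come out as three tab-separated fields, and the static request should be reported as filtered.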

A sample of the data after MapReduce cleaning:

60.10.5.65  20130530220737  source/plugin/wmff_wxyun/img/wmff_zk.css
60.10.5.65  20130530220738  source/plugin/study_nge/js/HoverLi.js
60.10.5.65  20130530220741  home.php?mod=misc&ac=sendmail&rand=1369922680
60.10.5.65  20130530220742  favicon.ico
60.10.5.65  20130530220742  forum.php
60.10.5.65  20130530220742  source/plugin/wmff_wxyun/img/wx_jqr.gif
60.10.5.65  20130530220742  template/newdefault/style/t5/bgimg.jpg
60.10.5.65  20130530220744  data/attachment/common/cf/104854ejrssrbbfsfv6cn5.jpg
60.10.5.65  20130530220744  source/plugin/wmff_wxyun/img/wx_jqr.gif
60.10.5.65  20130530220744  template/newdefault/style/t5/bgimg.jpg
60.10.5.65  20130530220744  template/newdefault/style/t5/nv.png
60.10.5.65  20130530220744  template/newdefault/style/t5/nv_a.png
60.10.5.65  20130530220745  data/attachment/common/cf/104950hio3tgww8tgpqtcz.jpg
60.10.5.65  20130530220745  data/attachment/common/cf/105041vvvi7pgez0w1mvxv.jpg
60.10.5.65  20130530220745  data/attachment/common/cf/180036e72352fq3reerq13.jpg
60.10.5.65  20130530220745  home.php?mod=misc&ac=sendmail&rand=1369922680
60.10.5.65  20130530220745  source/plugin/study_nge/images/list10.gif
60.10.5.65  20130530220746  source/plugin/study_nge/images/listbg.gif
60.10.5.65  20130530220747  api/connect/like.php

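3. Use Hive for multidimensional analysis of the cleaned data
The queries below run against hmbbs_table, the Hive table that holds the cleaned records (its columns include iplog and urllog).
(1) Daily PV (page views)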

hive> create table hmbbs_pv
    > as select count(1) as pv from hmbbs_table;

(2) Daily register count (number of registered users)

hive> create table hmbbs_register                       
    > as select count(1) as register                    
    > from hmbbs_table                                  
    > where instr(urllog,'member.php?mod=register') > 0;

(3) Daily distinct IP count

hive> create table hmbbs_ip as   
    > select count(distinct iplog)  as ip 
    > from hmbbs_table;

(4) Daily bounce count (IPs that appear in only one record), used to compute the bounce rate

hive> CREATE TABLE hmbbs_jumper AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(iplog) AS times FROM   hmbbs_table  GROUP BY iplog  HAVING times=1) e ;

At this point we have the result for each metric:

hive> show tables;
OK
hmbbs_ip
hmbbs_jumper
hmbbs_pv
hmbbs_register
hmbbs_table
Time taken: 0.081 seconds
hive> select * from hmbbs_ip;
OK
10411
Time taken: 0.111 seconds
hive> select * from hmbbs_jumper;
OK
3749
Time taken: 0.107 seconds
hive> select * from hmbbs_pv;    
OK
169857
Time taken: 0.108 seconds
hive> select * from hmbbs_register;
OK
28
Time taken: 0.107 seconds
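
With these results, the bounce rate follows from the formula in the key metrics section (bounce count / PV). A small sketch of the arithmetic; the class name BounceRateFromHive is hypothetical and the numbers are the ones printed by the queries above.

```java
import java.util.Locale;

/** Sketch: bounce rate from the Hive results, using bounce count / PV. */
class BounceRateFromHive {
    static double ratio(long part, long total) {
        return (double) part / total;
    }

    public static void main(String[] args) {
        long pv = 169857;     // from hmbbs_pv
        long jumper = 3749;   // from hmbbs_jumper
        System.out.println(String.format(Locale.ROOT,
            "bounce rate = %d / %d = %.2f%%", jumper, pv, 100 * ratio(jumper, pv)));
    }
}
```

For this day the ratio works out to about 2.21%.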

4. Export the Hive results to MySQL with Sqoop

[root@hadoop11 mydata]# sqoop export --connect jdbc:mysql://hadoop11:3306/mydata  --table hmresult  --username root  --password admin    --export-dir  /hmbbs_dir/ --fields-terminated-by '\t'  -m 1

Next, we can check the data in MySQL.
This concludes the overview of Hadoop-based log analysis.
