pyspark practice: extract IPs from a log and print the top five IPs by access count
阿新 · Published: 2018-12-11
Obtain the test log file and inspect it:
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331
Each log entry begins with the client IP, so splitting each line on a space (" ") is all we need to extract it.
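As a quick local sanity check (plain Python, no Spark needed; the sample line is taken from the log above), splitting on a space and taking the first field yields the IP:

```python
# Extract the client IP from one access-log line by splitting on spaces.
line = '27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127'

fields = line.split(' ')  # the IP is the first space-separated field
ip = fields[0]
print(ip)  # → 27.19.74.143
```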
Develop a wordcount-style job in a local IDE. The code is as follows:

import os
import sys

from operator import add
from pyspark import SparkConf, SparkContext

os.environ['PYSPARK_PYTHON'] = '/home/hadoop/app/python3/bin/python3'

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: TopN', file=sys.stderr)
        sys.exit(-1)
    # initialize Spark
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    # read the log and split each line on spaces
    data = sc.textFile(sys.argv[1]).map(lambda x: x.split(' '))
    # map each IP (the first field) to a count of 1
    ip = data.map(lambda x: (x[0], 1))
    # sum the counts for each IP
    count_ip = ip.reduceByKey(add)
    # sort by count in descending order
    sort = count_ip.map(lambda x: (x[1], x[0])).sortByKey(False).map(lambda x: (x[1], x[0]))
    # print the top five to the console
    print(sort.take(5))
    sc.stop()
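The same count-and-sort pipeline can be checked locally with plain Python before submitting to Spark. This is a minimal sketch using `collections.Counter`, with a hypothetical in-memory sample in place of the HDFS file:

```python
from collections import Counter

# Hypothetical in-memory stand-in for the log file on HDFS.
lines = [
    '27.19.74.143 - - [ts] "GET /a HTTP/1.1" 200 1127',
    '110.52.250.126 - - [ts] "GET /b HTTP/1.1" 200 1292',
    '27.19.74.143 - - [ts] "GET /c HTTP/1.1" 200 680',
]

# split -> take first field -> count -> sort descending,
# mirroring the map / reduceByKey / sortByKey steps above
counts = Counter(line.split(' ')[0] for line in lines)
top5 = counts.most_common(5)
print(top5)  # → [('27.19.74.143', 2), ('110.52.250.126', 1)]
```

`Counter.most_common(n)` already returns the `n` highest counts in descending order, which is exactly what the `sortByKey(False)` plus `take(5)` combination does on the RDD.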
Run the spark-submit command:
./spark-submit --master local[2] --name loganglice /home/hadoop/data/5/log.py hdfs:///test/access_2013_05_30.log
Adjust the input path for your own environment; here it points to a file on my personal HDFS. Wait for the job to finish; the output shows that IP 222.133.189.179 has the most accesses at 29948, followed by 61.50.141.7 with 22836, then 123.147.245.79 with 9999, 49.72.74.77 with 8879, and 60.10.5.65 with 6341.