[Big Data] Analyzing nginx Logs with Hive

1. Collecting the nginx logs

# Check the nginx configuration
nginx -t

# Check the logging configuration
less /etc/nginx/nginx.conf

# Inspect the logs
cd /var/log/nginx;
ll

# Merge the current and rotated logs, then package them
cat access.log > nginx.log;
gunzip -c access.log*gz >> nginx.log;
gzip nginx.log;
sz nginx.log.gz;
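
For reference, the table definition and regex in the next section assume nginx's default combined log format: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent". A typical line (a made-up example, not taken from the real logs) looks like this:

203.0.113.7 - - [02/Jan/2021:13:05:22 +0800] "GET /comics/123 HTTP/1.1" 200 5316 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"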

2. Creating the Hive table and loading the data

-- Parse with the regex SerDe: each () capture group maps to one column; mind the escaping
drop table if exists spider.nginx_log;
create table spider.nginx_log(
remote_addr STRING,
remote_user STRING,
time_local STRING,
request STRING,
status STRING,
body_bytes_sent STRING,
http_referer STRING,
http_user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '(.*?) - (.*?) \\[(.*?)\\] "(.*?)" (\\d+) (\\d+) "(.*?)" "(.*?)"',
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"
);

-- Load the data: 3,236,712 rows in total
load data local inpath '/home/getway/tmp/way/nginx.log' into table spider.nginx_log;
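
After loading, it is worth checking that the regex actually matched the data: the built-in RegexSerDe returns NULL for every column of a line that does not match the pattern. A quick sanity check (a sketch against the table above):

-- Total rows vs. rows the regex failed to parse (all columns NULL on a mismatch)
select count(1) as total_rows,
       sum(case when remote_addr is null then 1 else 0 end) as unparsed_rows
from spider.nginx_log;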

3. Analyzing the data

-- Preview a few rows
select * from spider.nginx_log limit 10;

-- Requests per IP
select remote_addr, count(1) as cnt
from spider.nginx_log
group by remote_addr
order by cnt desc;
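
If the full per-IP list is too long, a top-N variant that also shows each IP's share of total traffic can be more readable (a sketch using a window function; the limit of 20 is arbitrary):

-- Top 20 IPs by request count, with their share of all requests
select remote_addr,
       count(1) as cnt,
       round(count(1) / sum(count(1)) over () * 100, 2) as pct
from spider.nginx_log
group by remote_addr
order by cnt desc
limit 20;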

-- Pages with the highest PV
select request, count(1) as pv
from spider.nginx_log
where request rlike 'comics'
group by request
order by pv desc;
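
Note that request holds the whole request line ("GET /comics/123 HTTP/1.1"), so grouping by it splits the same page across HTTP methods and protocol versions. A rough variant that groups by the request path only (a sketch; split(request, ' ')[1] assumes well-formed "METHOD path HTTP/x.x" lines):

-- PV per URL path, ignoring method and protocol version
select split(request, ' ')[1] as req_path, count(1) as pv
from spider.nginx_log
where request rlike 'comics'
group by split(request, ' ')[1]
order by pv desc;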

-- Requests per day
select substring(time_local, 1, 11) as dt, count(1) as cnt
from spider.nginx_log
group by substring(time_local, 1, 11)
order by dt;
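
The fixed-width prefix works because time_local looks like "02/Jan/2021:13:05:22 +0800", but ordering that string sorts by day-of-month first, so days from different months interleave. To get a proper yyyy-MM-dd value that sorts and plots correctly, the timestamp can be parsed explicitly (a sketch using Hive's unix_timestamp/from_unixtime; the 'MMM' pattern assumes English month abbreviations):

-- Requests per calendar day, with time_local parsed into a real date
select from_unixtime(unix_timestamp(time_local, 'dd/MMM/yyyy:HH:mm:ss Z'), 'yyyy-MM-dd') as dt,
       count(1) as cnt
from spider.nginx_log
group by from_unixtime(unix_timestamp(time_local, 'dd/MMM/yyyy:HH:mm:ss Z'), 'yyyy-MM-dd')
order by dt;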

-- Requests per hour of the day
select substring(time_local, 13, 2) as hr, count(1) as cnt
from spider.nginx_log
group by substring(time_local, 13, 2)
order by hr;

4. Data visualization

To be covered in a follow-up post.