Big Data Project in Practice --- Mobile App Log Analysis System for an App Management Platform (Part 3)
阿新 · Published: 2018-11-20
I. Create the Hive partitioned tables
----------------------------------------------------
1. Create the database
    $hive> create database applogsdb;

2. Create the partitioned tables
    Write the DDL script.
    [applogs_create_table.sql]
    use applogsdb;

    --startup
    CREATE EXTERNAL TABLE ext_startup_logs(
        createdAtMs bigint, appId string, tenantId string, deviceId string,
        appVersion string, appChannel string, appPlatform string, osType string,
        deviceStyle string, country string, province string, ipAddress string,
        network string, carrier string, brand string, screenSize string)
    PARTITIONED BY (ym string, day string, hm string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    --error
    CREATE EXTERNAL TABLE ext_error_logs(
        createdAtMs bigint, appId string, tenantId string, deviceId string,
        appVersion string, appChannel string, appPlatform string, osType string,
        deviceStyle string, errorBrief string, errorDetail string)
    PARTITIONED BY (ym string, day string, hm string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    --event
    CREATE EXTERNAL TABLE ext_event_logs(
        createdAtMs bigint, appId string, tenantId string, deviceId string,
        appVersion string, appChannel string, appPlatform string, osType string,
        deviceStyle string, eventId string, eventDurationSecs bigint,
        paramKeyValueMap Map<string,string>)
    PARTITIONED BY (ym string, day string, hm string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    --page
    CREATE EXTERNAL TABLE ext_page_logs(
        createdAtMs bigint, appId string, tenantId string, deviceId string,
        appVersion string, appChannel string, appPlatform string, osType string,
        deviceStyle string, pageViewCntInSession int, pageId string,
        visitIndex int, nextPage string, stayDurationSecs bigint)
    PARTITIONED BY (ym string, day string, hm string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    --usage
    CREATE EXTERNAL TABLE ext_usage_logs(
        createdAtMs bigint, appId string, tenantId string, deviceId string,
        appVersion string, appChannel string, appPlatform string, osType string,
        deviceStyle string, singleUseDurationSecs bigint,
        singleUploadTraffic bigint, singleDownloadTraffic bigint)
    PARTITIONED BY (ym string, day string, hm string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

3. Run the applogs_create_table.sql script
    $> hive -f /share/umeng/applogs_create_table.sql
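The DDL depends on the third-party org.openx.data.jsonserde.JsonSerDe, so the json-serde jar must already be on Hive's classpath (for example via ADD JAR or Hive's auxlib directory; the exact jar path depends on your installation). A quick sanity check after the script runs, as a hedged sketch:

    hive> use applogsdb;
    hive> show tables;                            -- expect the five ext_*_logs tables
    hive> describe formatted ext_startup_logs;    -- confirm the SerDe and the ym/day/hm partition columns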
II. Use Linux cron to periodically load the HDFS data into the Hive partitioned tables
----------------------------------------------------------------
1. What scheduling means
    Scheduling simply means running a given task at a fixed interval.

2. Install cron on Ubuntu
    $> apt-get install cron

3. Service commands [Ubuntu]
    $> /usr/sbin/service cron start
    $> /usr/sbin/service cron status
    $> /usr/sbin/service cron restart
    $> /usr/sbin/service cron stop

4. Service commands [CentOS]
    //check status
    $> service crond status
    //stop
    $> service crond stop
    //start
    $> service crond start

5. Configure a scheduled task
    a. In /etc/crontab:
        //fields: minute(0-59) hour(0-23) day(1-31) month(1-12) weekday(0-6) user command
        * * * * * ubuntu source /etc/profile;echo `date` >> ~/1.log
        //the five *'s are wildcards: the command runs every minute
        //one minute is the smallest interval cron supports

6. date arithmetic
    date -d "-3 minute" +%Y%m-%d-%H%M    //the time 3 minutes ago
    date -d "3 minute" +%Y%m-%d-%H%M     //the time 3 minutes from now
    date -d "3 hour" +%Y%m-%d-%H%M       //the time 3 hours from now

7. Editing files with sed
    //delete the first line
    $> sed '1d' 1.log
    //delete the last line
    $> sed '$d' 1.log
    //delete a range of lines
    $> sed '1,3d' 1.log
    //delete all lines
    $> sed '1,$d' 1.log
    //p: print -- echoes each line in addition to the normal output, so every line prints twice
    $> sed '1,$p' 1.log
    //-n: quiet mode, show only the lines the p command selects -- every line prints once
    $> sed -n '1,$p' 1.log
    //-i: apply the edit to the file in place [1,$p] (here this duplicates every line in the file)
    $> sed -i '1,$p' 1.log
    //print only the lines containing "hello" [/.../p]
    $> sed -n '/hello/p' 1.log
    //append a new line after line 1 [1a]
    $> sed -i '1ahello' 1.log
    //append a new line whose leading whitespace is preserved [1a\]
    $> sed -i '1a\ hello' 1.log
    //append the new line "hello" after each of lines 1-3 [1,3a] --- append
    $> sed -i '1,3ahello' 1.log
    //change, replaces whole lines [1,2c] -- change
    $> sed -i '1,2ckkk' 1.log
    //substitute a specific string: replace hello with how [s/../../g]
    $> sed -i 's/hello/how/g' 1.log

8. Write the load script that imports the HDFS files into the Hive partitioned tables
    [~/Downloads/.exportData.sql]
    load data inpath '/data/applogs/startup/${ym}/${day}/${hm}' into table applogsdb.ext_startup_logs partition(ym='${ym}',day='${day}',hm='${hm}');
    load data inpath '/data/applogs/error/${ym}/${day}/${hm}' into table applogsdb.ext_error_logs partition(ym='${ym}',day='${day}',hm='${hm}');
    load data inpath '/data/applogs/event/${ym}/${day}/${hm}' into table applogsdb.ext_event_logs partition(ym='${ym}',day='${day}',hm='${hm}');
    load data inpath '/data/applogs/page/${ym}/${day}/${hm}' into table applogsdb.ext_page_logs partition(ym='${ym}',day='${day}',hm='${hm}');
    load data inpath '/data/applogs/usage/${ym}/${day}/${hm}' into table applogsdb.ext_usage_logs partition(ym='${ym}',day='${day}',hm='${hm}');

9. Write the driver script -- each run loads exactly one 1-minute slice of data, namely the minute from 3 minutes ago (see the worked example after this section).
    [~/Downloads/exec.sh]
    #!/bin/bash
    #timestamp of 3 minutes ago, e.g. 201811-20-1030
    systime=`date -d "-3 minute" +%Y%m-%d-%H%M`
    ym=`echo ${systime} | awk -F '-' '{print $1}'`
    day=`echo ${systime} | awk -F '-' '{print $2}'`
    hm=`echo ${systime} | awk -F '-' '{print $3}'`
    cp ~/Downloads/.exportData.sql ~/Downloads/exportData.sql
    #fill in the ${ym}/${day}/${hm} placeholders in the copied template
    sed -i 's/${ym}/'${ym}'/g' ~/Downloads/exportData.sql
    sed -i 's/${day}/'${day}'/g' ~/Downloads/exportData.sql
    sed -i 's/${hm}/'${hm}'/g' ~/Downloads/exportData.sql
    #run hive; use the full path to the hive binary, or cron will not find it
    /soft/hive/bin/hive -f ~/Downloads/exportData.sql
    rm ~/Downloads/exportData.sql

10. Schedule exec.sh to run once a minute [in production this would typically run once a day, at 2 a.m.]
    $> sudo nano /etc/crontab
    * * * * * ubuntu source /etc/profile;~/Downloads/exec.sh
    //start the service
    $> /usr/sbin/service cron start
    $> /usr/sbin/service cron status
    $> /usr/sbin/service cron stop
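To make the sed templating concrete, suppose cron fires exec.sh at 10:33 on 2018-11-20 (a hypothetical run time): systime becomes 201811-20-1030, so ym=201811, day=20, hm=1030, and the first line of the generated exportData.sql reads

    load data inpath '/data/applogs/startup/201811/20/1030' into table applogsdb.ext_startup_logs partition(ym='201811',day='20',hm='1030');

Note that load data inpath moves the files out of /data/applogs/... into the table's partition directory, so each 1-minute slice is consumed exactly once.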
III. Export the web project as a war and deploy it to Tomcat on Ubuntu
---------------------------------------------------------------------
1. Install Tomcat
    a. Download apache-tomcat-7.0.72.tar.gz
    b. Untar it
        $> tar -xzvf ~/Downloads/apache-tomcat-7.0.72.tar.gz -C /soft
    c. Create a symlink
        $> ln -s /soft/apache-tomcat-7.0.72 /soft/tomcat

2. Export the web project as a war
    a. In the web project's pom.xml, add the plugins and the dependency on the common module
        <?xml version="1.0" encoding="UTF-8"?>
        <project xmlns="http://maven.apache.org/POM/4.0.0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
            <modelVersion>4.0.0</modelVersion>
            <groupId>com.test</groupId>
            <artifactId>app-logs-collect-web</artifactId>
            <version>1.0-SNAPSHOT</version>
            <packaging>war</packaging>

            <build>
                <plugins>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-surefire-plugin</artifactId>
                        <version>2.12.4</version>
                        <configuration>
                            <skipTests>true</skipTests>
                        </configuration>
                    </plugin>
                    <plugin>
                        <artifactId>maven-war-plugin</artifactId>
                        <version>2.6</version>
                        <configuration>
                            <warSourceDirectory>web</warSourceDirectory>
                            <failOnMissingWebXml>false</failOnMissingWebXml>
                            <excludes>css/*,images/*,js/*,png/*,phone/*</excludes>
                        </configuration>
                    </plugin>
                </plugins>
            </build>

            <dependencies>
                <dependency>
                    <groupId>junit</groupId>
                    <artifactId>junit</artifactId>
                    <version>4.11</version>
                </dependency>
                <dependency>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-core</artifactId>
                    <version>2.8.8</version>
                </dependency>
                <dependency>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-databind</artifactId>
                    <version>2.8.3</version>
                </dependency>
                <dependency>
                    <groupId>com.maxmind.db</groupId>
                    <artifactId>maxmind-db</artifactId>
                    <version>1.0.0</version>
                </dependency>
                <dependency>
                    <groupId>org.springframework</groupId>
                    <artifactId>spring-webmvc</artifactId>
                    <version>4.3.5.RELEASE</version>
                </dependency>
                <dependency>
                    <groupId>javax.servlet</groupId>
                    <artifactId>servlet-api</artifactId>
                    <version>2.5</version>
                </dependency>
                <dependency>
                    <groupId>com.alibaba</groupId>
                    <artifactId>fastjson</artifactId>
                    <version>1.2.24</version>
                </dependency>
                <dependency>
                    <groupId>org.apache.kafka</groupId>
                    <artifactId>kafka_2.11</artifactId>
                    <version>0.10.0.1</version>
                </dependency>
                <dependency>
                    <groupId>com.test</groupId>
                    <artifactId>app-analyze-common</artifactId>
                    <version>1.0-SNAPSHOT</version>
                </dependency>
            </dependencies>
        </project>
    b. Because the web project depends on the shared common module, install the common module first so a fresh copy of its jar lands in the .m2 repository.
        maven --> install common module ...
    c. Then package the web application as app-web.war.

3. Copy the war file to ${tomcat}/webapps on the server
4. Start Tomcat
    $> tomcat/bin/startup.sh
5. Verify that it is listening
    $> netstat -anop | grep 8080
6. Start flume
    $> flume-ng agent -f applog.conf -n a1
7. Change the server address the phone app connects to.
    UploadUtil.java, line 21:
    URL url = new URL("http://s100:8080/app-web/coll/index");
8. At this point, the logs are being collected and uploaded into Hive.

IV. Hive queries
-----------------------------------------------------------
1. Count a given app's users [deduplicated by deviceid]
    hive> select count(distinct deviceid) from ext_startup_logs where appid = 'sdk34734';
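Because all five tables are partitioned by ym/day/hm, restricting a query to partition values lets Hive prune the directories it scans instead of reading the whole table. A hedged sketch of the same count scoped to a single day (the partition values are placeholders for whichever slice you loaded):

    hive> select count(distinct deviceid)
        > from ext_startup_logs
        > where appid = 'sdk34734' and ym = '201811' and day = '20';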