A Method for Analyzing HDFS File Changes and Small-File Distribution

阿新 • Published: 2021-01-30

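Purpose of this document

Most enterprises now run a Hadoop big-data platform, and daily ETL jobs write large numbers of files to HDFS. We need to monitor whether that data is still useful and keep unbounded growth from outpacing the physical resources, so that costs stay under control and limited resources deliver as much value as possible. This article describes one way to analyze how files on HDFS change over time and to monitor the perennial small-file problem.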

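Implementation overview

There are two approaches:

The first approach uses the HDFS API: calling listStatus on a FileSystem instance recursively walks every file and directory on HDFS and collects key attributes such as path, owner, and size. The data is written to a local file, uploaded to HDFS, and mapped by a Hive external table, after which the analyses are done in SQL.

The second approach differs from the first only in how the source data is obtained. It uses the fsimage tool that ships with HDFS, hdfs oiv -i <fsimage file> -o <output file> -p Delimited, which parses the fsimage file into a readable delimited text file (tab-separated by default, comma-separated when a comma is passed as the delimiter). The remaining steps are the same as in the first approach: upload the file to HDFS, create an external table, and analyze the metrics in SQL.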

Code walkthrough

Approach 1: collecting the full data set in Java through the HDFS API

The source code is as follows:

package com.mljr.hdfs;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStatus {

    // HA nameservice (or NameNode host:port) of the cluster; adjust to your environment.
    private static final String FS_URI = "hdfs://nameservice1";

    public static void main(String[] args) {
        Configuration config = new Configuration();
        // fs.defaultFS replaces the deprecated fs.default.name key.
        config.set("fs.defaultFS", FS_URI);
        String contentCsv = "/tmp/content.csv";
        long startTime = System.currentTimeMillis(); // start time
        // try-with-resources closes the file system handle and the output stream.
        try (FileSystem hdfs = FileSystem.get(new URI(FS_URI), config, "hdfs");
             OutputStream out = new BufferedOutputStream(new FileOutputStream(contentCsv))) {
            // Start the walk from the HDFS root.
            iteratorShowFiles(hdfs, new Path("/"), out);
        } catch (Exception e) {
            e.printStackTrace();
        }
        long runTime = (System.currentTimeMillis() - startTime) / 1000 / 60;
        System.out.println("Run time: " + runTime + " min");
    }

    /**
     * Recursively writes one CSV line per file or directory under the given path.
     * Columns: path, owner, type (0 = directory, 1 = file), size, block size,
     * permission, access time, modification time, replication.
     * Paths that contain commas would break this simple CSV layout.
     *
     * @param hdfs FileSystem object
     * @param path path to list
     * @param out  stream that receives the CSV lines
     */
    public static void iteratorShowFiles(FileSystem hdfs, Path path, OutputStream out) {
        if (hdfs == null || path == null) {
            return;
        }
        FileStatus[] files;
        try {
            // List the entries directly under this path.
            files = hdfs.listStatus(path);
        } catch (IOException e) {
            // For example, permission denied: report it and skip this subtree.
            e.printStackTrace();
            return;
        }
        for (FileStatus file : files) {
            try {
                if (!file.isDirectory() && !file.isFile()) {
                    continue; // symbolic links are skipped
                }
                boolean isDir = file.isDirectory();
                String text = file.getPath().toString().replace(FS_URI, "")
                        + "," + file.getOwner()
                        + "," + (isDir ? "0" : "1")
                        + "," + (isDir ? 0 : file.getLen())
                        + "," + file.getBlockSize()
                        + "," + file.getPermission()
                        + "," + file.getAccessTime()
                        + "," + file.getModificationTime()
                        + "," + file.getReplication() + "\n";
                out.write(text.getBytes(StandardCharsets.UTF_8));
                if (isDir) {
                    // Recurse into the subdirectory.
                    iteratorShowFiles(hdfs, file.getPath(), out);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Upload the local file to HDFS, then create the Hive external table

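#!/bin/bash
source /etc/profile
cd /home/dmp/hdfs || exit 1

# Generate the HDFS file and directory listing (/tmp/content.csv)
java -cp ./HdfsStatus-1.0-SNAPSHOT.jar com.mljr.hdfs.HdfsStatus
# Upload the file to HDFS (the target HDFS directory must be created beforehand)
# -f keeps the command from failing when a file from an earlier run is absent
hadoop fs -rm -r -f /tmp/dfs/content/content.csv /tmp/dfs/nodes/nodes.csv
hadoop fs -put /tmp/content.csv /tmp/dfs/content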

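Create the external table in Hive

Note that the third column, is_dir, holds 0 for directories and 1 for files, matching the values written by the Java code above.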

CREATE EXTERNAL TABLE `default.hdfs_info`(
  `path` string, 
  `owner` string, 
  `is_dir` string, 
  `filesize` string, 
  `blocksize` string, 
  `permisson` string, 
  `acctime` string, 
  `modificatetime` string, 
  `replication` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://nameservice1/tmp/dfs/content'

SQL analysis

-- Size of each first-level directory, in GB
select joinedpath, sumsize
from
(
select joinedpath, round(sum(filesize)/1024/1024/1024,2) as sumsize
from
(select concat('/',split(path,'/')[1]) as joinedpath, filesize
from default.hdfs_info
)t
group by joinedpath
)h
order by sumsize desc;

-- Size of each second-level directory, in GB
select joinedpath, sumsize
from
(
select joinedpath, round(sum(filesize)/1024/1024/1024,2) as sumsize
from
(select concat('/',split(path,'/')[1],'/',split(path,'/')[2]) as joinedpath, filesize
from default.hdfs_info
)t
group by joinedpath
)h
order by sumsize desc;

Deeper directory levels follow the same pattern and are not repeated here. The next query counts small files per directory level.

-- Number of files smaller than 100 KB under each third-level directory
-- (is_dir = '1' marks regular files in this table; '0' marks directories)
SELECT joinedpath, count(*) AS small_file_num
FROM
  (SELECT concat('/',split(path,'/')[1],'/',split(path,'/')[2],'/',split(path,'/')[3]) AS joinedpath
   FROM default.hdfs_info
   WHERE is_dir = '1'
     AND cast(filesize AS bigint) < 100*1024) t
GROUP BY joinedpath
ORDER BY small_file_num DESC;
  
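Small-file counts for other directory levels work the same way. The next query reports the size, earliest file access time, and latest modification time of each table in one Hive database (here default.db).

-- Directories report an access time of 0, so only file rows (is_dir = '1')
-- contribute to the earliest access time; the numeric casts avoid comparing
-- the string columns lexicographically.
SELECT joinedpath,
       from_unixtime(ceil(acctime/1000),'yyyy-MM-dd HH:mm:ss') AS acctime,
       from_unixtime(ceil(modificatetime/1000),'yyyy-MM-dd HH:mm:ss') AS modificatetime,
       sumsize
FROM
  (SELECT joinedpath,
          min(CASE WHEN is_dir = '1' THEN cast(acctime AS bigint) END) AS acctime,
          max(cast(modificatetime AS bigint)) AS modificatetime,
          round(sum(filesize)/1024/1024/1024,2) AS sumsize
   FROM
     (SELECT concat('/',split(path,'/')[1],'/',split(path,'/')[2],'/',split(path,'/')[3],'/',split(path,'/')[4],'/',split(path,'/')[5]) AS joinedpath,
             is_dir,
             acctime,
             modificatetime,
             filesize
      FROM default.hdfs_info
      WHERE concat('/',split(path,'/')[1],'/',split(path,'/')[2],'/',split(path,'/')[3],'/',split(path,'/')[4])='/user/hive/warehouse/default.db')t
   WHERE joinedpath IS NOT NULL
   GROUP BY joinedpath)h
ORDER BY sumsize DESC;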

HDFS metadata supports far more analyses than these. The queries above are only basic starting points.
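To make the grouping logic in the queries more concrete, here is a minimal Java sketch that reproduces the prefix-based aggregation outside Hive. It assumes rows shaped like the content.csv lines produced above (path, owner, type, size, ...). The class name, method names, and sample rows are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mirrors the SQL grouping on rows shaped like content.csv lines. */
final class DirUsageSketch {

    /** Returns the first `depth` path components, or null if the path is shallower. */
    static String prefix(String path, int depth) {
        String[] parts = path.split("/"); // "/a/b" -> ["", "a", "b"]
        if (parts.length <= depth) {
            return null; // same effect as concat() returning NULL in Hive
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= depth; i++) {
            sb.append('/').append(parts[i]);
        }
        return sb.toString();
    }

    /** Sums the size column (bytes) per directory prefix. */
    static Map<String, Long> bytesByPrefix(List<String> csvLines, int depth) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            if (f.length < 4) {
                continue; // malformed row
            }
            String key = prefix(f[0], depth);
            if (key != null) {
                totals.merge(key, Long.parseLong(f[3]), Long::sum);
            }
        }
        return totals;
    }

    /** Counts files (type column "1") smaller than thresholdBytes per directory prefix. */
    static Map<String, Integer> smallFilesByPrefix(List<String> csvLines, int depth, long thresholdBytes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            if (f.length < 4 || !"1".equals(f[2])) {
                continue; // malformed row or directory
            }
            String key = prefix(f[0], depth);
            if (key != null && Long.parseLong(f[3]) < thresholdBytes) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows in the content.csv column order.
        List<String> rows = Arrays.asList(
            "/user,hdfs,0,0,0,rwxr-xr-x,0,1580000000000,0",
            "/user/hive/a.orc,hive,1,2048,134217728,rw-r--r--,1580000000000,1580000000000,3",
            "/user/hive/b.orc,hive,1,500000,134217728,rw-r--r--,1580000000000,1580000000000,3",
            "/tmp/logs/x.log,dmp,1,100,134217728,rw-r--r--,1580000000000,1580000000000,3");
        System.out.println(bytesByPrefix(rows, 1));                  // {/tmp=100, /user=502048}
        System.out.println(smallFilesByPrefix(rows, 2, 100 * 1024)); // {/tmp/logs=1, /user/hive=1}
    }
}
```

Like the SQL, the sketch groups by the first N path components, so a path with fewer than N components does not contribute to that level.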

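Approach 2: fetching the HDFS metadata image (FSImage) with a shell script

First, let us look at which fields the HDFS metadata image file, FSImage, contains. The following command converts it into a readable CSV file.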

nohup bin/hdfs oiv -i ./fsimage_XXXXX -o ./fsimage_0127.csv -p Delimited -delimiter ',' --temp /data02/tmp &

The first line names each field. Print the first few lines to take a look:

# head -n 3 fsimage_0127.csv
Path,Replication,ModificationTime,AccessTime,PreferredBlockSize,BlocksCount,FileSize,NSQUOTA,DSQUOTA,Permission,UserName,GroupName
/,0,2020-03-26,16:00,1970-01-01,08:00,0,0,0,9223372036854775807,-1,drwxr-xr-x,hdfs,hdfs
/tmp,0,2020-01-08,14:40,1970-01-01,08:00,0,0,0,-1,-1,drwxrwxrwx,hdfs,hdfs

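The field names are self-explanatory, so they are not described one by one here. Note that in this sample the date and time of each timestamp are separated by a comma, so a data row has 14 comma-separated values against 12 header names. Keep this in mind when mapping columns.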
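As a rough illustration of how the twelve fields map onto a record, the following Java sketch parses one line of Delimited output. It assumes the default tab delimiter (that is, no -delimiter option), which keeps the space inside each timestamp unambiguous. It also assumes that directory permission strings start with d, as in the sample rows. The class name FsImageRow is hypothetical.

```java
/** Illustrative only: one row of `hdfs oiv -p Delimited` output, assuming the default tab delimiter. */
final class FsImageRow {
    final String path;
    final int replication;
    final String modificationTime;
    final String accessTime;
    final long preferredBlockSize;
    final int blocksCount;
    final long fileSize;
    final String permission;
    final String userName;
    final String groupName;

    private FsImageRow(String[] f) {
        path = f[0];
        replication = Integer.parseInt(f[1]);
        modificationTime = f[2];
        accessTime = f[3];
        preferredBlockSize = Long.parseLong(f[4]);
        blocksCount = Integer.parseInt(f[5]);
        fileSize = Long.parseLong(f[6]);
        // f[7] and f[8] are NSQUOTA and DSQUOTA; this sketch does not use them.
        permission = f[9];
        userName = f[10];
        groupName = f[11];
    }

    /** Parses one tab-separated data line (not the header line). */
    static FsImageRow parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 12) {
            throw new IllegalArgumentException("expected 12 fields, got " + f.length);
        }
        return new FsImageRow(f);
    }

    /** Assumption: directory permission strings start with 'd', as in the sample rows. */
    boolean isDirectory() {
        return permission.startsWith("d");
    }

    public static void main(String[] args) {
        // The /tmp row from the sample above, rewritten with tab separators.
        FsImageRow row = parse(
            "/tmp\t0\t2020-01-08 14:40\t1970-01-01 08:00\t0\t0\t0\t-1\t-1\tdrwxrwxrwx\thdfs\thdfs");
        System.out.println(row.path + " directory=" + row.isDirectory() + " owner=" + row.userName);
    }
}
```

Rejecting rows that do not have exactly twelve fields is a deliberate choice here: it surfaces delimiter problems early instead of silently shifting columns.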

#!/bin/bash
prepare_operation()
{
    # get parameters
    t_save_fsimage_path=$1
    # create the local working directory if it does not exist yet
    mkdir -p "${t_save_fsimage_path}"
    # delete fsimage files left over from earlier runs
    find "${t_save_fsimage_path}" -type f -name "fsimage*" -exec rm -f {} +
    # Note: with set -e, a command that returns non-zero aborts the script,
    # so $? can no longer be inspected afterwards; use || or ! to handle it
}
get_hdfs_fsimage()
{
    # get parameters
    t_save_fsimage_path=$1
    # download the fsimage from the NameNode
    hdfs dfsadmin -fetchImage "${t_save_fsimage_path}"
    # resolve the path of the downloaded fsimage file
    t_fsimage_file=$(ls "${t_save_fsimage_path}"/fsimage*)
    # convert the fsimage into a readable delimited text file
    # (tab-separated here, because no -delimiter option is given)
    hdfs oiv -i "${t_fsimage_file}" -o "${t_save_fsimage_path}/fsimage.csv" -p Delimited
    # delete the header line of fsimage.csv
    sed -i -e "1d" "${t_save_fsimage_path}/fsimage.csv"
    # create the data directory on HDFS
    hadoop fs -test -e "${t_save_fsimage_path}/fsimage" || hdfs dfs -mkdir -p "${t_save_fsimage_path}/fsimage"
    # copy fsimage.csv to the target HDFS path
    hdfs dfs -copyFromLocal -f "${t_save_fsimage_path}/fsimage.csv" "${t_save_fsimage_path}/fsimage/"
}

main()
{
    # start time
    begin_time=$(date +%s)
    # temporary directory path, used both locally and on HDFS
    t_save_fsimage_path=/tmp/dfs
    # create the temporary directory and delete data from earlier runs
    prepare_operation "${t_save_fsimage_path}"
    # fetch the HDFS FSImage
    hdfs_fsimage_update_time=$(date "+%Y-%m-%d %H:%M:%S")
    get_hdfs_fsimage "${t_save_fsimage_path}"
    # end time
    end_time=$(date +%s)
    # elapsed time in seconds
    result_time=$((end_time-begin_time))
    echo "******************************************************************"
    echo "The script has taken ${result_time} seconds..."
    echo "Result Table: default.hdfs_meta"
    echo "HDFS FSImage update-time before: ${hdfs_fsimage_update_time}"
    echo "******************************************************************"
}
# run the main function
main "$@"

After that, create the external table and run the SQL analyses as before.

Besides the two approaches above, HDFS metadata can also be obtained through the WebHDFS REST API. Python has a package that wraps this API, pywebhdfs, which makes it very convenient to use (see the pywebhdfs documentation).
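For readers staying in Java, the sketch below shows the shape of a WebHDFS directory-listing request. It only builds the URL for the LISTSTATUS operation and sends nothing. The host name is a placeholder, and the port is an assumption: 9870 is the default NameNode HTTP port in Hadoop 3, while Hadoop 2 uses 50070. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative only: builds a WebHDFS LISTSTATUS URL without sending a request. */
final class WebHdfsUrlSketch {

    /** Returns http://host:port/webhdfs/v1<hdfsPath>?op=LISTSTATUS with the path percent-encoded. */
    static String listStatusUrl(String host, int port, String hdfsPath) {
        try {
            // The multi-argument URI constructor quotes characters that are illegal in a path.
            return new URI("http", null, host, port, "/webhdfs/v1" + hdfsPath, "op=LISTSTATUS", null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("invalid host or path", e);
        }
    }

    public static void main(String[] args) {
        // namenode.example.com is a placeholder host, not a real endpoint.
        System.out.println(listStatusUrl("namenode.example.com", 9870, "/tmp/dfs"));
    }
}
```

A GET request to such a URL returns a JSON document with one FileStatus entry per child of the directory, which could feed the same CSV layout used by the first approach.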

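Summary

There is much more analysis that can be built on HDFS files and directories. For example, tracking the daily growth of each directory level shows where most of the cluster's growth comes from. Analyzing the life cycle of files shows which data is hot and which is cold: files that have not been accessed for a long time count as cold data, and a file that has not changed for a long time raises the question of whether it still has value. Using HDFS storage sensibly can save a company substantial cost.
Likewise, in a multi-tenant Hadoop cluster, analyzing each tenant's HDFS directory quotas and usage makes it possible to generate per-tenant bills.
Hive tables are also just files on HDFS, so analyzing small files reveals which Hive tables were written without suitable settings and produced large numbers of small files. The access frequency of a table's HDFS directory shows which Hive tables users query most, which in turn points to hot business data and hot topics. This metadata is well worth mining.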
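The cold-data idea can be sketched in a few lines of Java. The example assumes rows in the content.csv column order from the first approach (access time in column 7, modification time in column 8, both in epoch milliseconds) and treats a file as cold when neither timestamp falls within the last N days. The class name, method name, and threshold are hypothetical. Access times are only as precise as the cluster's dfs.namenode.accesstime.precision setting allows, so results should be read as approximate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative only: flags files that were neither read nor modified within idleDays. */
final class ColdFileSketch {

    /** Returns the paths of file rows whose latest access or modification is older than the cutoff. */
    static List<String> coldFiles(List<String> csvLines, long nowMillis, int idleDays) {
        long cutoff = nowMillis - TimeUnit.DAYS.toMillis(idleDays);
        List<String> cold = new ArrayList<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            if (f.length < 8 || !"1".equals(f[2])) {
                continue; // malformed row or directory (type column "0")
            }
            long accessTime = Long.parseLong(f[6]);
            long modificationTime = Long.parseLong(f[7]);
            if (Math.max(accessTime, modificationTime) < cutoff) {
                cold.add(f[0]);
            }
        }
        return cold;
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        long now = 400 * day; // fixed "now" so the example is reproducible
        // Invented sample rows in the content.csv column order.
        List<String> rows = Arrays.asList(
            "/data/old.orc,etl,1,1024,134217728,rw-r--r--," + (10 * day) + "," + (5 * day) + ",3",
            "/data/new.orc,etl,1,1024,134217728,rw-r--r--," + (395 * day) + "," + (390 * day) + ",3");
        System.out.println(coldFiles(rows, now, 180)); // [/data/old.orc]
    }
}
```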

Most of this article comes from https://www.jianshu.com/p/c1c32c4def6f and is reposted with the author's permission.
