Optimizing HDFS small-file merging: an improved copyMerge

阿新 • Published: 2019-02-08

1. Problem analysis

Use the fsck command to check the total size, the block count, and the average block size of one day's logs on HDFS:

$ hadoop fsck /wcc/da/kafka/report/2015-01-11
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/01/13 18:57:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://da-master:50070
FSCK started by hduser (auth:SIMPLE) from /172.21.101.30 for path /wcc/da/kafka/report/2015-01-11 at Tue Jan 13 18:57:24 CST 2015
....................................................................................................
....................................................................................................
........................................Status: HEALTHY
 Total size:	9562516137 B
 Total dirs:	1
 Total files:	240
 Total symlinks:		0
 Total blocks (validated):	264 (avg. block size 36221652 B)
 Minimally replicated blocks:	264 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Tue Jan 13 18:57:24 CST 2015 in 14 milliseconds


The filesystem under path '/wcc/da/kafka/report/2015-01-11' is HEALTHY

Summarized as a table:

Date	Total size (GB)	Total blocks	Avg. block size (MB)
2014/12/21	9.39	268	36
2014/12/20	9.5	268	36
2014/12/19	8.89	268	34
2014/11/5	8.6	266	33
2014/10/1	9.31	268	36

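Is there really a problem? The table shows that each day's logs occupy roughly 268 blocks with an average block size of about 36 MB, far below 128 MB. This points to an underlying issue: the logs are written as many small files, most of them smaller than 128 MB, which seriously hurts the scalability and performance of the cluster.

First, in HDFS every block, file, and directory is held in NameNode memory as an object of roughly 150 bytes. With 10,000,000 small files, each occupying one block, there are 20 million objects (one per file plus one per block), so the NameNode needs about 3 GB of memory. Storing 100 million such files would take about 30 GB. NameNode memory therefore severely limits how far the cluster can scale.

Second, reading a large number of small files is much slower than reading a few large files. HDFS was originally designed for streaming access to large files; reading many small files means constantly hopping from one DataNode to another, which seriously degrades performance.

Finally, processing a large number of small files is much slower than processing the same amount of data stored in large files. Each small file occupies a task slot, and starting and releasing tasks can take a large share of the time, sometimes most of it, so the total running time inevitably grows.

Our strategy is to merge the small files first, for example consolidating the logs into user_report.tsv, client_report.tsv, and AppLog_UserDevice.tsv, and then run the job.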
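The arithmetic behind these numbers can be checked with a short sketch. It assumes the rule of thumb of about 150 bytes per NameNode object and counts one object per file plus one per block; the class and method names are hypothetical, and the first figure is taken from the fsck output above.

```java
final class SmallFileEstimate {
    /** Assumed in-memory size of one NameNode object (file, block, or directory), in bytes. */
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimated NameNode memory for `files` files that each occupy `blocksPerFile` blocks. */
    static long nameNodeBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // one object per file, one per block
        return objects * BYTES_PER_OBJECT;
    }

    /** Average block size as fsck reports it: total bytes divided by block count. */
    static long averageBlockSize(long totalBytes, long blocks) {
        return totalBytes / blocks;
    }

    public static void main(String[] args) {
        // Figures from the fsck run on /wcc/da/kafka/report/2015-01-11.
        System.out.println(averageBlockSize(9_562_516_137L, 264)); // 36221652
        // 10 million and 100 million single-block files.
        System.out.println(nameNodeBytes(10_000_000L, 1));         // 3000000000
        System.out.println(nameNodeBytes(100_000_000L, 1));        // 30000000000
    }
}
```

The estimate is only a rough guide: the real per-object cost depends on the Hadoop version and on path lengths, so treat 150 bytes as an assumption rather than a measured value.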

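2. Solution

One option is the copyMerge method of the FileUtil utility class in the Hadoop API: copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString).

However, this method does not fit our case, because one day's logs contain three types of files (those whose names start with user_report, client_report, and AppLog_UserDevice), and they have to be merged into three separate files: user_report.tsv, client_report.tsv, and AppLog_UserDevice.tsv. copyMerge therefore has to be reimplemented. First, look at its source code: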

  /** Copy all files in a directory to one output file (merge). */
  public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                  FileSystem dstFS, Path dstFile,
                                  boolean deleteSource,
                                  Configuration conf, String addString) throws IOException {
    // Resolve the destination path. If dstFile is an existing directory, the merged
    // file is created inside it and named srcDir.getName() (the source directory's
    // name), so in that case the output file name cannot be chosen freely.
    dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);
    // Give up if the source path is not a directory.
    if (!srcFS.getFileStatus(srcDir).isDirectory())
      return false;
    // Create the output stream for the merged file.
    OutputStream out = dstFS.create(dstFile);

    try {
      // List every entry in the source directory.
      FileStatus contents[] = srcFS.listStatus(srcDir);
      // Sort the entries so the merge order is deterministic.
      Arrays.sort(contents);
      for (int i = 0; i < contents.length; i++) {
        if (contents[i].isFile()) {
          // Open an input stream on the source file.
          InputStream in = srcFS.open(contents[i].getPath());
          try {
            // Copy the file's bytes into the destination file; out stays open.
            IOUtils.copyBytes(in, out, conf, false);
            if (addString!=null)
              out.write(addString.getBytes("UTF-8"));

          } finally {
            in.close();
          }
        }
      }
    } finally {
      out.close();
    }

    // If deleteSource is true, delete the source directory recursively.
    if (deleteSource) {
      return srcFS.delete(srcDir, true);
    } else {
      return true;
    }
  }

Improved version (this approach opens and closes the output stream out only three times):

	// Imports needed by this method:
	// import java.io.IOException;
	// import java.io.InputStream;
	// import java.io.OutputStream;
	// import java.util.ArrayList;
	// import java.util.Collections;
	// import org.apache.hadoop.conf.Configuration;
	// import org.apache.hadoop.fs.FileStatus;
	// import org.apache.hadoop.fs.FileSystem;
	// import org.apache.hadoop.fs.Path;
	// import org.apache.hadoop.io.IOUtils;

	/** Copy corresponding files in a directory to related output file (merge). */
	@SuppressWarnings("unchecked")
	public static boolean merge(FileSystem hdfs, Path srcDir, Path dstFile,
			boolean deleteSource, Configuration conf) throws IOException {
		if (!hdfs.getFileStatus(srcDir).isDirectory())
			return false;
		// List every entry in the source directory.
		FileStatus[] fileStatus = hdfs.listStatus(srcDir);
		// Collect the paths of the three file types in separate lists.
		ArrayList<Path> userPaths = new ArrayList<Path>();
		ArrayList<Path> clientPaths = new ArrayList<Path>();
		ArrayList<Path> appPaths = new ArrayList<Path>();
		for (FileStatus fileStatu : fileStatus) {
			// Skip subdirectories; only regular files can be opened and copied.
			if (!fileStatu.isFile()) {
				continue;
			}
			Path filePath = fileStatu.getPath();
			if (filePath.getName().startsWith("user_report")) {
				userPaths.add(filePath);
			} else if (filePath.getName().startsWith("client_report")) {
				clientPaths.add(filePath);
			} else if (filePath.getName().startsWith("AppLog_UserDevice")) {
				appPaths.add(filePath);
			}
		}
		// Merge the user_report files into <dstFile>/user_report.tsv.
		// Note that dstFile is used as the destination directory.
		if (userPaths.size() > 0) {
			Path userDstFile = new Path(dstFile, "user_report.tsv");
			OutputStream out = hdfs.create(userDstFile);
			Collections.sort(userPaths);
			try {
				for (Path userPath : userPaths) {
					InputStream in = hdfs.open(userPath);
					try {
						IOUtils.copyBytes(in, out, conf, false);
					} finally {
						in.close();
					}
				}
			} finally {
				out.close();
			}
		}
		// Merge the client_report files into <dstFile>/client_report.tsv.
		if (clientPaths.size() > 0) {
			Path clientDstFile = new Path(dstFile, "client_report.tsv");
			OutputStream out = hdfs.create(clientDstFile);
			Collections.sort(clientPaths);
			try {
				for (Path clientPath : clientPaths) {
					InputStream in = hdfs.open(clientPath);
					try {
						IOUtils.copyBytes(in, out, conf, false);
					} finally {
						in.close();
					}
				}
			} finally {
				out.close();
			}
		}
		// Merge the AppLog_UserDevice files into <dstFile>/AppLog_UserDevice.tsv.
		if (appPaths.size() > 0) {
			Path appDstFile = new Path(dstFile, "AppLog_UserDevice.tsv");
			OutputStream out = hdfs.create(appDstFile);
			Collections.sort(appPaths);
			try {
				for (Path appPath : appPaths) {
					InputStream in = hdfs.open(appPath);
					try {
						IOUtils.copyBytes(in, out, conf, false);
					} finally {
						in.close();
					}
				}
			} finally {
				out.close();
			}
		}
		// If deleteSource is true, delete the source directory recursively.
		if (deleteSource) {
			return hdfs.delete(srcDir, true);
		}
		return true;
	}
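The three merge blocks above differ only in the file-name prefix and the output name, so the classification step can be written once. The following is a minimal sketch of that grouping on plain file names, independent of the Hadoop API; the class name, the method name, and the sample file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PrefixGrouping {
    /**
     * Returns one entry per prefix, in the order given, holding the sorted
     * names that start with that prefix. Names matching no prefix are dropped.
     */
    static Map<String, List<String>> groupByPrefix(List<String> names, List<String> prefixes) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String prefix : prefixes) {
            groups.put(prefix, new ArrayList<>());
        }
        for (String name : names) {
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    groups.get(prefix).add(name);
                    break; // a name belongs to the first prefix it matches
                }
            }
        }
        for (List<String> group : groups.values()) {
            Collections.sort(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        List<String> names = Arrays.asList(
                "user_report_02", "client_report_01", "user_report_01", "unrelated.log");
        List<String> prefixes = Arrays.asList(
                "user_report", "client_report", "AppLog_UserDevice");
        System.out.println(groupByPrefix(names, prefixes));
    }
}
```

With the groups in a map, a single loop over its entries could replace the three copies of the create/copy/close block, and supporting a fourth log type would only mean adding one prefix.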

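Of course, you can also do it this way (an input stream and an output stream are opened once for every small file, but the code looks more concise and needs no ArrayList for intermediate storage):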

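	// Imports needed by this method:
	// import java.io.IOException;
	// import java.io.InputStream;
	// import java.io.OutputStream;
	// import java.util.Arrays;
	// import java.util.HashSet;
	// import java.util.Set;
	// import org.apache.hadoop.conf.Configuration;
	// import org.apache.hadoop.fs.FileStatus;
	// import org.apache.hadoop.fs.FileSystem;
	// import org.apache.hadoop.fs.Path;
	// import org.apache.hadoop.io.IOUtils;

	public static boolean mergeFiles(FileSystem hdfs, Path srcDir,
			Path dstFile, boolean deleteSource, Configuration conf)
			throws IOException {
		if (!hdfs.getFileStatus(srcDir).isDirectory())
			return false;
		// List every entry in the source directory, sorted for a stable merge order.
		FileStatus[] fileStatus = hdfs.listStatus(srcDir);
		Arrays.sort(fileStatus);
		// Destinations already created during this run.
		Set<Path> started = new HashSet<Path>();
		// Merge each of the three file types into its own destination file.
		for (FileStatus fileStatu : fileStatus) {
			// Skip subdirectories; only regular files can be opened and copied.
			if (!fileStatu.isFile()) {
				continue;
			}
			Path filePath = fileStatu.getPath();
			// Assigned in every branch; note that new Path("") would throw.
			Path dstPath;
			if (filePath.getName().startsWith("user_report")) {
				dstPath = new Path(dstFile, "user_report.tsv");
			} else if (filePath.getName().startsWith("client_report")) {
				dstPath = new Path(dstFile, "client_report.tsv");
			} else if (filePath.getName().startsWith("AppLog_UserDevice")) {
				dstPath = new Path(dstFile, "AppLog_UserDevice.tsv");
			} else {
				dstPath = new Path("/error.tsv");
			}

			// create() overwrites an existing file, so calling it for every source
			// file would keep only the last one. Create the destination on first
			// use and append afterwards (the file system must support append()).
			OutputStream out = started.add(dstPath)
					? hdfs.create(dstPath)
					: hdfs.append(dstPath);
			try {
				InputStream in = hdfs.open(filePath);
				try {
					IOUtils.copyBytes(in, out, conf, false);
				} finally {
					in.close();
				}
			} finally {
				out.close();
			}
		}
		// If deleteSource is true, delete the source directory recursively.
		if (deleteSource) {
			return hdfs.delete(srcDir, true);
		}
		return true;
	}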
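A possible middle ground between the two variants is a single pass over the directory that opens each destination lazily and keeps it open until the end, so every output is still opened only once. The sketch below illustrates the idea on the local file system with java.nio.file rather than on HDFS, so that it runs without a cluster; it is not the Hadoop API, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LocalPrefixMerge {
    static final String[] PREFIXES = {"user_report", "client_report", "AppLog_UserDevice"};

    /** Returns the first known prefix that `name` starts with, or null if none matches. */
    static String matchingPrefix(String name) {
        for (String prefix : PREFIXES) {
            if (name.startsWith(prefix)) {
                return prefix;
            }
        }
        return null;
    }

    /** Merges the files in srcDir into dstDir/<prefix>.tsv and returns how many were merged. */
    static int merge(Path srcDir, Path dstDir) throws IOException {
        // Take the listing up front and sort it for a deterministic merge order.
        List<Path> sources = new ArrayList<>();
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(srcDir)) {
            for (Path entry : listing) {
                if (Files.isRegularFile(entry)) {
                    sources.add(entry);
                }
            }
        }
        Collections.sort(sources);
        Files.createDirectories(dstDir);

        // One output stream per category, opened on first use.
        Map<String, OutputStream> outputs = new LinkedHashMap<>();
        int merged = 0;
        try {
            for (Path source : sources) {
                String prefix = matchingPrefix(source.getFileName().toString());
                if (prefix == null) {
                    continue; // not one of the three log types
                }
                OutputStream out = outputs.get(prefix);
                if (out == null) {
                    out = Files.newOutputStream(dstDir.resolve(prefix + ".tsv"));
                    outputs.put(prefix, out);
                }
                Files.copy(source, out); // appends to the stream that is already open
                merged++;
            }
        } finally {
            for (OutputStream out : outputs.values()) {
                out.close();
            }
        }
        return merged;
    }
}
```

Compared with the ArrayList version, this holds at most three streams open at the same time instead of three lists of paths. Whether that trade-off pays off on HDFS depends on the workload and has not been measured here.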

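3. Summary

Depending on the requirements of your business logic, you can write your own implementation of an API function. If you have a better strategy for merging small files, feel free to share it!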
