HDFS API for Getting a Directory's Size

阿新 • Published: 2019-01-25

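On the command line, hadoop fs -du reports file and directory sizes, but how do you get the same information through the Java API?
My first idea was to recurse over the directory tree, which turned out to be very slow. I later found the FileSystem.getContentSummary method.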

The slowest approach: recursion

There are many examples of this approach online. I do not recommend it.

Using the FileSystem.getContentSummary method

The following passage describes the API:

The getSpaceConsumed() function in the ContentSummary class will return the actual space the file/directory occupies in the cluster i.e. it takes into account the replication factor set for the cluster.

For instance, if the replication factor in the hadoop cluster is set to 3 and the directory size is 1.5GB, the getSpaceConsumed() function will return the value as 4.5GB.

Using getLength() function in the ContentSummary class will return you the actual file/directory size.

Sample code:

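import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The class name is illustrative; the original showed only the main method.
public class HdfsDirSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode as the given user. Letting a failure propagate
        // avoids a NullPointerException on an uninitialized FileSystem later on.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.xxx.xx.xx:9000"), conf, "username");

        Path filenamePath = new Path("/test/input");
        // Fetch the summary once and read both values from it.
        ContentSummary summary = hdfs.getContentSummary(filenamePath);
        // Space consumed including replicas; depends on the cluster's replication
        // setting. On my cluster this prints about 3 GB.
        System.out.println("SIZE OF THE HDFS DIRECTORY : " + summary.getSpaceConsumed());
        // Actual size of the data; here it prints about 1 GB.
        System.out.println("SIZE OF THE HDFS DIRECTORY : " + summary.getLength());
    }
}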
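
The relationship between the two values in the quoted passage can be checked with a little arithmetic. The sketch below is not Hadoop code: the class and method names are hypothetical, and it assumes every file under the directory uses the same replication factor, which a real cluster does not guarantee because replication is set per file.

```java
// Minimal sketch relating ContentSummary.getLength() to getSpaceConsumed().
// Assumption: one uniform replication factor for everything in the directory.
final class ReplicationMath {
    /** Raw bytes expected on disk when `length` logical bytes are stored `replication` times. */
    static long expectedSpaceConsumed(long length, short replication) {
        return Math.multiplyExact(length, (long) replication);
    }

    /** Average replication implied by the two values; 0 for an empty directory. */
    static double effectiveReplication(long length, long spaceConsumed) {
        return length == 0 ? 0.0 : (double) spaceConsumed / length;
    }

    public static void main(String[] args) {
        long length = 1_610_612_736L; // 1.5 GiB, the figure used in the quoted passage
        long consumed = expectedSpaceConsumed(length, (short) 3);
        System.out.println("space consumed (bytes): " + consumed);
        System.out.println("effective replication: " + effectiveReplication(length, consumed));
    }
}
```

Dividing getSpaceConsumed() by getLength() in this way is a quick sanity check on which of the two numbers a report is showing.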

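Problem notes

Here is a very strange problem whose cause I have not yet found.
The native API is fast. My own recursion is more than 10 times slower, and a copy of the native API's source code is also more than 10 times slower.
Yet when I read the source, the native API turned out to use recursion as well.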

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class TestString{
    public static FileSystem fs = null;
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.xx.xx:9000");
        fs = FileSystem.get(conf);
        String rootPath = "/sparktest";

        // Test 1: the built-in getContentSummary
        long t1 = System.currentTimeMillis();
        ContentSummary contentSummary = fs.getContentSummary(new Path(rootPath));
        long t2 = System.currentTimeMillis();
        System.out.println("contentSummary count: " + contentSummary.getFileCount()
                + " contentSummary length: " + contentSummary.getLength());
        System.out.println("Built-in getContentSummary took: " + (t2 - t1) + " ms");

        // Test 2: the hand-written recursion
        TestString ts = new TestString();
        FileModel fileModel = ts.new FileModel();
        long t3 = System.currentTimeMillis();
        ts.getFileLength(fileModel, new Path(rootPath));
        long t4 = System.currentTimeMillis();
        System.out.println("count: " + fileModel.count + " length " + fileModel.length);
        System.out.println("Hand-written recursion took: " + (t4 - t3) + " ms");

        // Test 3: getContentSummary copied from the FileSystem source.
        // Take a fresh timestamp so the printing above is not counted.
        long t5 = System.currentTimeMillis();
        ContentSummary contentSummary1 = ts.getContentSummary(new Path(rootPath));
        long t6 = System.currentTimeMillis();
        System.out.println("contentSummary1 count: " + contentSummary1.getFileCount()
                + " contentSummary1 length: " + contentSummary1.getLength());
        System.out.println("Copied getContentSummary took: " + (t6 - t5) + " ms");
    }

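    /**
     * Hand-written recursive traversal; very slow.
     * @param fileModel accumulator for the file count and total length
     * @param path file or directory to measure
     */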
    public void getFileLength(FileModel fileModel,Path path){
        try {
            FileStatus file = fs.getFileStatus(path);
            if(file.isFile()){
                fileModel.count+=1;
                fileModel.length+=file.getLen();
            }else {
                // Not a file, so this is a directory
                for (FileStatus curFile : fs.listStatus(path)) {
                    if(curFile.isFile()){
                        fileModel.count+=1;
                        fileModel.length+=curFile.getLen();
                    }else {
                        getFileLength(fileModel,curFile.getPath());
                    }
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

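    /**
     * Copied verbatim from the FileSystem source; also very slow.
     * @param f file or directory to summarize
     * @return the content summary for f
     * @throws IOException if a file system call fails
     */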
    public ContentSummary getContentSummary(Path f) throws IOException {
        FileStatus status = fs.getFileStatus(f);
        if (status.isFile()) {
            // f is a file
            long length = status.getLen();
            return new ContentSummary.Builder().length(length).
                    fileCount(1).directoryCount(0).spaceConsumed(length).build();
        }
        // f is a directory
        long[] summary = {0, 0, 1};
        for(FileStatus s : fs.listStatus(f)) {
            long length = s.getLen();
            ContentSummary c = s.isDirectory() ? getContentSummary(s.getPath()) :
                    new ContentSummary.Builder().length(length).
                            fileCount(1).directoryCount(0).spaceConsumed(length).build();
            summary[0] += c.getLength();
            summary[1] += c.getFileCount();
            summary[2] += c.getDirectoryCount();
        }
        return new ContentSummary.Builder().length(summary[0]).
                fileCount(summary[1]).directoryCount(summary[2]).
                spaceConsumed(summary[0]).build();
    }


    public class FileModel{
        int count = 0;
        // long, not int: a directory total can exceed Integer.MAX_VALUE bytes (about 2 GB)
        long length = 0;
    }

}

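Possible explanation

The test above actually measured the wrong class.
I copied the wrong code: what I copied was the implementation in FileSystem, but when reading from HDFS the method that runs is DistributedFileSystem.getContentSummary, so that is the one I should have copied.
Reading that source carefully, the call goes through its doCall method (the next method is not invoked).
Following doCall leads to org.apache.hadoop.hdfs.DFSClient.getContentSummary, which in turn calls namenode.getContentSummary(src).
Here namenode is an instance of org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB, so the trail continues into that class's getContentSummary method.
Beyond that point I could not follow the code any further.
What the trace does establish is that the client makes a single remote call to the NameNode for the whole directory, whereas the recursive versions make at least one remote call per directory.
My guess is that DFSClient.getContentSummary either uses multiple threads,
or sends a request to the NameNode, which then runs a distributed query across the cluster, aggregates the results, and returns them directly. Since the work would happen in parallel, it would be much faster.
I think the second possibility is more likely.
The source is really hard to read, especially the parts that involve protocolPB.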
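
The call chain traced above ends in one namenode.getContentSummary(src) request, while the recursive versions call listStatus once per directory. The toy model below illustrates how that difference alone changes the number of round trips. It is not Hadoop code and does not settle what the NameNode does internally: every class and method name is hypothetical, and the "remote" calls are plain in-memory method calls that only increment a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Hadoop code: counts simulated client-to-server round trips.
final class RoundTripModel {
    /** A node in a hypothetical in-memory namespace tree. */
    static final class Node {
        final boolean isFile;
        final long length;
        final List<Node> children = new ArrayList<>();

        Node(boolean isFile, long length) {
            this.isFile = isFile;
            this.length = length;
        }
    }

    static int roundTrips = 0;

    /** Simulated remote call: list one directory (costs one round trip). */
    static List<Node> remoteList(Node dir) {
        roundTrips++;
        return dir.children;
    }

    /** Simulated remote call: the server walks the whole subtree itself (one round trip). */
    static long remoteSummary(Node root) {
        roundTrips++;
        return localSum(root);
    }

    private static long localSum(Node n) {
        if (n.isFile) return n.length;
        long total = 0;
        for (Node c : n.children) total += localSum(c);
        return total;
    }

    /** Client-side recursion: one listing call per directory visited. */
    static long clientRecursiveSum(Node dir) {
        long total = 0;
        for (Node c : remoteList(dir)) {
            total += c.isFile ? c.length : clientRecursiveSum(c);
        }
        return total;
    }

    /** Builds a root with `dirs` subdirectories, each holding `filesPerDir` files of `size` bytes. */
    static Node buildTree(int dirs, int filesPerDir, long size) {
        Node root = new Node(false, 0);
        for (int d = 0; d < dirs; d++) {
            Node dir = new Node(false, 0);
            for (int f = 0; f < filesPerDir; f++) dir.children.add(new Node(true, size));
            root.children.add(dir);
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = buildTree(100, 10, 1024);

        roundTrips = 0;
        long viaClient = clientRecursiveSum(root);
        int clientTrips = roundTrips;

        roundTrips = 0;
        long viaServer = remoteSummary(root);
        int serverTrips = roundTrips;

        System.out.println("client recursion: " + viaClient + " bytes, " + clientTrips + " round trips");
        System.out.println("single summary call: " + viaServer + " bytes, " + serverTrips + " round trips");
    }
}
```

Both paths compute the same total; only the number of simulated round trips differs, and it grows with the number of directories for the client-side recursion. On a real cluster each round trip carries network latency, which would be consistent with the slowdown measured earlier.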
