Hadoop SequenceFile

阿新 • Published: 2019-01-18

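SequenceFile was designed for MapReduce. It is a splittable binary format that stores data as key/value pairs. Log files are usually stored as plain text, one log record per line, but plain text is a poor fit for binary data, and SequenceFile covers that case. A SequenceFile can also serve as a container for many small files.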

write

First, let's look at how to write a SequenceFile in Hadoop.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // The class name is chosen for this example only.
    public class SequenceFileWriteDemo {

        private static final String[] DATA = {
                "One, two, buckle my shoe",
                "Three, four, shut the door",
                "Five, six, pick up sticks",
                "Seven, eight, lay them straight",
                "Nine, ten, a big fat hen"
        };

        public static void main(String[] args) throws IOException {
            Configuration configuration = new Configuration();
            configuration.set("fs.defaultFS", "hdfs://hadoop:9000");

            Path path = new Path("hdfs://hadoop:9000/hadoop/seq/numbers.seq");

            // Reusable key and value objects; their classes are recorded in the file header.
            IntWritable key = new IntWritable();
            Text value = new Text();

            SequenceFile.Writer.Option keyOption = SequenceFile.Writer.keyClass(key.getClass());
            SequenceFile.Writer.Option valueOption = SequenceFile.Writer.valueClass(value.getClass());
            SequenceFile.Writer.Option file = SequenceFile.Writer.file(path);
            // The file option is given, so the stream option must not be given as well.
            // Alternative: SequenceFile.Writer.stream(FileSystem.get(configuration).create(path))

            SequenceFile.Writer writer = null;
            try {
                writer = SequenceFile.createWriter(configuration, file, keyOption, valueOption);
                for (int i = 0; i < 100; i++) {
                    key.set(100 - i);
                    value.set(DATA[i % DATA.length]);
                    // getLength() is the current file position, i.e. where this record will start.
                    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                    writer.append(key, value);
                }
            } finally {
                // Close the writer even if an append fails.
                IOUtils.closeStream(writer);
            }
        }
    }

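The code above targets Hadoop 2.6.4. Exactly one of the file option and the stream option must be supplied: giving both, or neither, throws an IllegalArgumentException. The reason is this check in the writer: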

      // check consistency of options
      if ((fileOption == null) == (streamOption == null)) {
        throw new IllegalArgumentException("file or stream must be specified");
      }
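
To make the condition easier to read, here is a minimal standalone sketch of the same "exactly one of two" test. It does not use Hadoop; the class OptionCheck and its methods are hypothetical names introduced only for illustration, and plain objects stand in for the real option types.

```java
// Hypothetical stand-in for the writer's option validation (not Hadoop code).
class OptionCheck {

    // Passes only when exactly one of the two arguments is non-null.
    // (a == null) == (b == null) is true when both are null or both are set.
    static void checkExactlyOne(Object fileOption, Object streamOption) {
        if ((fileOption == null) == (streamOption == null)) {
            throw new IllegalArgumentException("file or stream must be specified");
        }
    }

    // Convenience wrapper that reports the outcome as a boolean.
    static boolean accepts(Object fileOption, Object streamOption) {
        try {
            checkExactlyOne(fileOption, streamOption);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("file only:   " + accepts("file", null));
        System.out.println("stream only: " + accepts(null, "stream"));
        System.out.println("both:        " + accepts("file", "stream"));
        System.out.println("neither:     " + accepts(null, null));
    }
}
```

Comparing the two null tests for equality is a compact way to reject both invalid combinations with a single condition.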

read

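    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // The class name is chosen for this example only.
    public class SequenceFileReadDemo {

        public static void main(String[] args) throws IOException {
            Configuration configuration = new Configuration();
            configuration.set("fs.defaultFS", "hdfs://hadoop:9000");

            Path path = new Path("hdfs://hadoop:9000/hadoop/seq/numbers.seq");
            SequenceFile.Reader.Option file = SequenceFile.Reader.file(path);

            SequenceFile.Reader reader = null;
            try {
                reader = new SequenceFile.Reader(configuration, file);
                // The key and value classes are read from the file header.
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), configuration);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), configuration);
                long position = reader.getPosition();

                while (reader.next(key, value)) {
                    // Mark records that were preceded by a sync marker.
                    String syncSeen = reader.syncSeen() ? "*" : "";
                    System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                    position = reader.getPosition(); // beginning of next record
                }
            } finally {
                IOUtils.closeStream(reader);
            }
        }
    }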

The output starts as follows (only the first 20 of the 100 records are shown):

[128]   100 One, two, buckle my shoe
[173]   99  Three, four, shut the door
[220]   98  Five, six, pick up sticks
[264]   97  Seven, eight, lay them straight
[314]   96  Nine, ten, a big fat hen
[359]   95  One, two, buckle my shoe
[404]   94  Three, four, shut the door
[451]   93  Five, six, pick up sticks
[495]   92  Seven, eight, lay them straight
[545]   91  Nine, ten, a big fat hen
[590]   90  One, two, buckle my shoe
[635]   89  Three, four, shut the door
[682]   88  Five, six, pick up sticks
[726]   87  Seven, eight, lay them straight
[776]   86  Nine, ten, a big fat hen
[821]   85  One, two, buckle my shoe
[866]   84  Three, four, shut the door
[913]   83  Five, six, pick up sticks
[957]   82  Seven, eight, lay them straight
[1007]  81  Nine, ten, a big fat hen

The SequenceFile format

[Figure: internal storage structure of a SequenceFile]

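A SequenceFile consists of a header followed by one or more records.
The first three bytes of the file are SEQ, which act as a magic number, followed by one byte holding the version number. The header also contains other fields, such as the key and value class names, compression details, and a sync marker. The sync marker lets a reader synchronize to a record boundary from any position in the file.
[Figure: SequenceFile header]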
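
As a small illustration of the first four header bytes, the sketch below checks a byte array for the SEQ magic and returns the version byte. It is plain Java without Hadoop; the class SeqHeaderPeek and the sample bytes are made up for this example, and the rest of the header is not parsed.

```java
// Hypothetical helper for illustration; it only inspects the first four bytes.
class SeqHeaderPeek {

    // Returns the version byte if the data starts with "SEQ", otherwise -1.
    static int versionOf(byte[] head) {
        if (head.length < 4) {
            return -1; // too short to hold the magic plus the version byte
        }
        if (head[0] != 'S' || head[1] != 'E' || head[2] != 'Q') {
            return -1; // magic number does not match
        }
        return head[3] & 0xFF; // treat the version byte as unsigned
    }

    public static void main(String[] args) {
        // Sample bytes invented for this sketch: magic, a version byte, then filler.
        byte[] sample = {'S', 'E', 'Q', 6, 0, 0};
        byte[] notSeq = {'P', 'K', 3, 4};

        System.out.println("sample version: " + versionOf(sample));
        System.out.println("notSeq version: " + versionOf(notSeq));
    }
}
```

A check like this is a cheap way to tell whether a file is a SequenceFile at all before handing it to a reader.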

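The layout of a record depends on whether compression is enabled and, if it is, whether it is record compression or block compression. With record compression only the values are compressed; the keys are not.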
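
The following is a simplified sketch of that idea, not Hadoop's actual serialization code. It assumes a record laid out as record length, key length, key bytes, value bytes, and uses java.util.zip.Deflater as a stand-in codec. The class RecordLayoutSketch and its methods are hypothetical names for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Simplified illustration of a record: [record length][key length][key][value].
class RecordLayoutSketch {

    // Compresses a byte array with zlib (stand-in for a Hadoop codec).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Encodes one record; with compressValue set, only the value is compressed.
    static byte[] encode(byte[] key, byte[] value, boolean compressValue) {
        byte[] storedValue = compressValue ? deflate(value) : value;
        ByteBuffer buffer = ByteBuffer.allocate(8 + key.length + storedValue.length);
        buffer.putInt(key.length + storedValue.length); // record length
        buffer.putInt(key.length);                      // key length
        buffer.put(key);                                // key is stored as-is
        buffer.put(storedValue);                        // value, raw or compressed
        return buffer.array();
    }

    // Reads the key back without touching the (possibly compressed) value.
    static String keyOf(byte[] record) {
        ByteBuffer buffer = ByteBuffer.wrap(record);
        buffer.getInt();                 // skip record length
        byte[] key = new byte[buffer.getInt()];
        buffer.get(key);
        return new String(key, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = "k1".getBytes(StandardCharsets.UTF_8);
        byte[] value = "One, two, buckle my shoe".getBytes(StandardCharsets.UTF_8);
        byte[] record = encode(key, value, true);
        System.out.println("key read without decompressing: " + keyOf(record));
    }
}
```

Because the key stays uncompressed, a reader can inspect or compare keys without paying the cost of decompressing each value.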

block compression

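The advantage of block compression is efficiency: a whole block of records is compressed at once instead of each record being compressed on its own. Records are buffered until they reach the block size, and at that point the whole block is compressed in one pass. The block size is set with io.seqfile.compress.blocksize; its default is 1,000,000 bytes and is independent of the HDFS block size.
[Figure: block compression]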
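
The buffering behaviour can be sketched in plain Java. This is an illustration of the technique only, not Hadoop's implementation: records are collected until a size threshold is reached and then compressed as one unit with java.util.zip.Deflater. The class BlockBufferSketch, its methods, and the 1000-byte threshold are all assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Illustrative buffer that compresses records block by block (not Hadoop code).
class BlockBufferSketch {
    private final int blockSizeThreshold;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final List<byte[]> compressedBlocks = new ArrayList<>();

    BlockBufferSketch(int blockSizeThreshold) {
        this.blockSizeThreshold = blockSizeThreshold;
    }

    // Buffers a record; compresses the buffer once it reaches the threshold.
    void append(byte[] record) {
        pending.write(record, 0, record.length);
        if (pending.size() >= blockSizeThreshold) {
            flush();
        }
    }

    // Compresses whatever is buffered as one block.
    void flush() {
        if (pending.size() == 0) {
            return;
        }
        compressedBlocks.add(deflate(pending.toByteArray()));
        pending.reset();
    }

    int blockCount() {
        return compressedBlocks.size();
    }

    long compressedBytes() {
        long total = 0;
        for (byte[] block : compressedBlocks) {
            total += block.length;
        }
        return total;
    }

    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Baseline for comparison: every record compressed on its own.
    static long perRecordCompressedBytes(List<byte[]> records) {
        long total = 0;
        for (byte[] record : records) {
            total += deflate(record).length;
        }
        return total;
    }

    // Builds n records from the rhyme lines used in the write example.
    static List<byte[]> sampleRecords(int n) {
        String[] data = {
            "One, two, buckle my shoe", "Three, four, shut the door",
            "Five, six, pick up sticks", "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add(data[i % data.length].getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(100);
        BlockBufferSketch writer = new BlockBufferSketch(1000);
        for (byte[] record : records) {
            writer.append(record);
        }
        writer.flush(); // compress the final, partially filled block
        System.out.println("blocks: " + writer.blockCount());
        System.out.println("block-compressed bytes: " + writer.compressedBytes());
        System.out.println("per-record compressed bytes: " + perRecordCompressedBytes(records));
    }
}
```

Compressing many similar records together lets the compressor exploit repetition across records, which is why the block-level total comes out smaller than the per-record total for data like this.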

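To enable compression, create one additional option in the write code and pass it to createWriter (DefaultCodec is in org.apache.hadoop.io.compress). The read code does not need to change.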

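    SequenceFile.Writer.Option compression = SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    SequenceFile.Writer writer = SequenceFile.createWriter(configuration, file, keyOption, valueOption, compression);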
