Hadoop SequenceFile Explained

阿新 • Published: 2019-01-14

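Hadoop can process binary data as well as text, and SequenceFile is its file format for binary key-value records. It has the following characteristics:

Splittable
Supports compression
Every record is a key-value pair
Sync points can be set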

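A SequenceFile consists of a header followed by one or more records. The first three bytes are the magic bytes SEQ, followed by one byte holding the version number. The header also contains the key class name, the value class name, compression details, metadata, and the sync marker. Sync markers let a reader that starts at an arbitrary position in the file resynchronize to the next record boundary.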
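As a minimal sketch of the header prefix described above, the following JDK-only snippet checks the three magic bytes and extracts the version byte from a byte array. The class name `SeqHeaderSketch`, the method `versionOf`, and the sample bytes are hypothetical names chosen for illustration; this is not Hadoop's own parsing code, and it ignores everything in the header after the fourth byte.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch: inspect the first four bytes of a SequenceFile header. */
final class SeqHeaderSketch {

    /**
     * Returns the format version stored in the fourth header byte,
     * or -1 if the data does not start with the magic bytes "SEQ".
     */
    static int versionOf(byte[] header) {
        if (header == null || header.length < 4) {
            return -1; // too short to hold magic bytes plus version
        }
        String magic = new String(header, 0, 3, StandardCharsets.US_ASCII);
        if (!magic.equals("SEQ")) {
            return -1; // not a SequenceFile
        }
        return header[3] & 0xFF; // the version is a single unsigned byte
    }

    public static void main(String[] args) {
        // Hypothetical header prefix; a real one would be read from the file.
        byte[] sample = {'S', 'E', 'Q', 6};
        System.out.println("version = " + versionOf(sample));
    }
}
```

To try it on a real file, one could read the first four bytes with `java.nio.file.Files` and pass them to `versionOf`.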

The following method generates a SequenceFile whose keys are years and whose values are temperatures.

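    // Imports needed by this snippet (the test methods in this article
    // are assumed to live in one JUnit test class):
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.junit.Test;

    @Test
    public void testWrite() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///"); // write to the local file system
        Path p = new Path("/home/hadoop/seq/1.seq");
        Random random = new Random();

        // try-with-resources closes the writer even if append() throws
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(p),                       // output file path
                SequenceFile.Writer.keyClass(IntWritable.class),   // key type
                SequenceFile.Writer.valueClass(IntWritable.class)  // value type
        )) {
            IntWritable year = new IntWritable();        // year
            IntWritable temperature = new IntWritable(); // temperature
            for (int i = 0; i < 8000; i++) {
                year.set(1970 + random.nextInt(100));      // 1970..2069
                temperature.set(random.nextInt(100) - 30); // -30..69
                writer.append(year, temperature);
            }
        }
    }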

Use the following command to view the contents of the generated file:

hdfs dfs -text 1.seq

To produce a compressed file, pass a compression option when creating the writer:

    /**
     * Writes a compressed SequenceFile.
     * Additionally requires: import org.apache.hadoop.io.compress.DefaultCodec;
     */
    @Test
    public void testWriteZipSeqFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        Path p = new Path("/home/hadoop/seq/2.seq");
        Random random = new Random();

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(p),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                // block compression, using DefaultCodec as the codec
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec())
        )) {
            IntWritable year = new IntWritable();        // year
            IntWritable temperature = new IntWritable(); // temperature
            for (int i = 0; i < 8000; i++) {
                year.set(1970 + random.nextInt(100));
                temperature.set(random.nextInt(100) - 30);
                writer.append(year, temperature);
                if (i % 400 == 0) {
                    writer.sync(); // add a sync point every 400 records
                }
            }
        }
    }

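There are three compression types: block compression ("BLOCK"), record compression ("RECORD", which compresses only the value), and no compression. The available codecs include DeflateCodec, GzipCodec, BZip2Codec, Lz4Codec, LZO, and SnappyCodec. LZO and SnappyCodec require the corresponding native libraries to be installed on your machine.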
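To give a rough feel for why compression granularity matters, the sketch below compares compressing each small value on its own (similar in spirit to RECORD) with compressing all values together as one batch (similar in spirit to BLOCK). It uses the JDK's `java.util.zip.Deflater` as a stand-in and does not reproduce Hadoop's actual on-disk layout; the class and method names are hypothetical, and the data is generated with a fixed seed purely for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.zip.Deflater;

/** Illustrative sketch: per-value versus batched DEFLATE compression. */
final class CompressionGranularitySketch {

    /** Compresses the input with DEFLATE and returns the compressed size in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // count bytes, discard content
        }
        deflater.end();
        return total;
    }

    /** Total size when every 4-byte value is compressed on its own. */
    static int perRecordSize(int[] values) {
        int total = 0;
        for (int v : values) {
            total += deflatedSize(ByteBuffer.allocate(4).putInt(v).array());
        }
        return total;
    }

    /** Size when all values are compressed together as a single batch. */
    static int batchedSize(int[] values) {
        ByteBuffer all = ByteBuffer.allocate(4 * values.length);
        for (int v : values) {
            all.putInt(v);
        }
        return deflatedSize(all.array());
    }

    /** Generates temperatures in the range -30..69, as in the article's writer. */
    static int[] sampleTemperatures(int count) {
        Random random = new Random(42); // fixed seed so runs are repeatable
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(100) - 30;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = sampleTemperatures(8000);
        System.out.println("raw bytes:        " + 4 * values.length);
        System.out.println("per-record bytes: " + perRecordSize(values));
        System.out.println("batched bytes:    " + batchedSize(values));
    }
}
```

For values this small, compressing each one separately costs more than storing it raw because every compressed stream carries its own framing overhead, which is one reason block compression tends to suit files with many small records.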

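Reading: the following reads every record. Compressed and uncompressed files are read in the same way, with no extra configuration.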

    /**
     * Read: loop over the file and print every key-value pair.
     */
    @Test
    public void readSeqFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        Path p = new Path("/home/hadoop/seq/1.seq");

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(p) // input file path
        )) {
            IntWritable key = new IntWritable();
            IntWritable value = new IntWritable();
            // next() returns false once the end of the file is reached
            while (reader.next(key, value)) {
                System.out.println(key.get() + " : " + value.get());
            }
        }
    }

Seeking to an offset in a SequenceFile and reading from there:

    /**
     * Read: seek to a byte offset, then print every key-value pair
     * from that point to the end of the file.
     */
    @Test
    public void readSeqFileFromSeek() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        Path p = new Path("/home/hadoop/seq/2.seq");

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(p) // input file path
        )) {
            IntWritable key = new IntWritable();
            IntWritable value = new IntWritable();
            // seek() jumps to exactly this offset, which must be a record
            // boundary, otherwise next() fails. The value is specific to the
            // author's file.
            reader.seek(25130);
            while (reader.next(key, value)) {
                System.out.println(reader.getPosition() + "   " + key.get() + " : " + value.get());
            }
        }
    }

    /**
     * Read: move to the next sync point after a byte offset, then print
     * every key-value pair from that point to the end of the file.
     */
    @Test
    public void readSeqFileFromSync() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        Path p = new Path("/home/hadoop/seq/2.seq");

        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(p) // input file path
        )) {
            IntWritable key = new IntWritable();
            IntWritable value = new IntWritable();
            // sync() accepts any offset and starts reading at the next sync
            // point after it, so the offset need not be a record boundary
            reader.sync(23810);
            while (reader.next(key, value)) {
                System.out.println(reader.getPosition() + "   " + key.get() + " : " + value.get());
            }
        }
    }
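The difference between seeking to an exact offset and syncing from an arbitrary one can be pictured as a forward scan for a known marker. The sketch below is a simplified, JDK-only model of that idea: it searches a byte array for the next occurrence of a marker at or after a given position. The class `SyncScanSketch`, its methods, and the 4-byte marker are hypothetical; Hadoop's real marker is a longer per-file value stored in the header, and its reader does more work than this.

```java
/** Illustrative sketch: find the next sync marker at or after a position. */
final class SyncScanSketch {

    /** Hypothetical short marker, used only for this demonstration. */
    static final byte[] MARKER = {1, 2, 3, 4};

    /**
     * Returns the index of the first occurrence of marker at or after
     * position, or -1 if the marker does not occur again.
     */
    static int nextSync(byte[] data, byte[] marker, int position) {
        for (int i = Math.max(position, 0); i + marker.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Builds 64 zero bytes with the marker placed at offsets 10 and 40. */
    static byte[] sampleData() {
        byte[] data = new byte[64];
        System.arraycopy(MARKER, 0, data, 10, MARKER.length);
        System.arraycopy(MARKER, 0, data, 40, MARKER.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        // Starting from an arbitrary offset still lands on a marker.
        System.out.println("from 0:  " + nextSync(data, MARKER, 0));
        System.out.println("from 11: " + nextSync(data, MARKER, 11));
        System.out.println("from 41: " + nextSync(data, MARKER, 41));
    }
}
```

Because any starting offset leads to a well-defined marker, a scan like this is what allows a large file to be split at arbitrary byte positions and still be read record by record.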
