MapReduce: Writing Output Files to Multiple Directories
1: Custom OutputFormat class
By default, MapReduce's OutputFormat places all result files in the single directory we specify. If you instead want records that satisfy different conditions written to different directories, you need to implement your own OutputFormat and override its getRecordWriter method to return a custom RecordWriter.
In the driver class, register that implementation with job.setOutputFormatClass.
The example below processes a set of shopping-review text data and writes the good reviews and the bad reviews into a good-review directory and a bad-review directory respectively.
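The RecordWriter in part 3 splits each line on tabs and reads the tenth field (index 9) as the review status (0 = good, 1 = neutral, 2 = bad), so the input is assumed to be tab-separated text. A hypothetical input line (every field value here is invented purely for illustration) might look like:

1	2018-11-01	user001	order001	phone	599.0	5	shipped	paid	0

where the trailing 0 marks a good review.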
2: Implementing the custom OutputFormat class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

/**
 * Custom OutputFormat implementation.
 */
public class MyOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        // The task context gives us the job Configuration
        Configuration configuration = context.getConfiguration();
        // Get a FileSystem object
        FileSystem fileSystem = FileSystem.get(configuration);
        // Output path for the good-review file
        Path goodComment = new Path("file:///F:\\goodComment\\1.txt");
        // Output path for the bad-review file
        Path badComment = new Path("file:///F:\\badComment\\1.txt");
        // Open the two output streams
        FSDataOutputStream fsDataOutputStream = fileSystem.create(goodComment);
        FSDataOutputStream fsDataOutputStream1 = fileSystem.create(badComment);
        MyRecordWriter myRecordWriter = new MyRecordWriter(fsDataOutputStream, fsDataOutputStream1);
        return myRecordWriter;
    }
}
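One design note on the code above: the two output paths are hard-coded, which works for a single local task, but every task attempt running this OutputFormat opens the same two files. Below is a minimal alternative sketch of getRecordWriter, not part of the original example, that derives the files from the output directory configured in the driver plus the task ID; the good/ and bad/ subdirectory names are invented for illustration:

@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Base directory passed to setOutputPath in the driver
    Path outDir = FileOutputFormat.getOutputPath(context);
    FileSystem fs = outDir.getFileSystem(conf);
    // One file per task, so parallel tasks do not overwrite each other
    String taskId = context.getTaskAttemptID().getTaskID().toString();
    FSDataOutputStream good = fs.create(new Path(outDir, "good/" + taskId));
    FSDataOutputStream bad = fs.create(new Path(outDir, "bad/" + taskId));
    return new MyRecordWriter(good, bad);
}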
3: Implementing the custom RecordWriter class
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;

public class MyRecordWriter extends RecordWriter<Text, NullWritable> {
    private FSDataOutputStream goodStream;
    private FSDataOutputStream badStream;

    public MyRecordWriter() {
    }

    public MyRecordWriter(FSDataOutputStream goodStream, FSDataOutputStream badStream) {
        this.goodStream = goodStream;
        this.badStream = badStream;
    }

    /**
     * write() is called once per record and writes the data out; the key decides
     * which file the record goes to.
     * goodStream: output file for good reviews
     * badStream: output file for bad reviews
     * @param key the record key (k3), i.e. one full input line
     * @param value NullWritable placeholder
     */
    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        String[] split = key.toString().split("\t");
        // Field at index 9 holds the review status: 0 = good, 1 = neutral, 2 = bad.
        // Treat anything <= 1 (good and neutral) as a good review.
        if (Integer.parseInt(split[9]) <= 1) {
            // good review; write only the valid bytes of the Text
            // (getBytes() may return a backing array longer than the content)
            goodStream.write(key.getBytes(), 0, key.getLength());
            goodStream.write("\r\n".getBytes());
        } else {
            // bad review
            badStream.write(key.getBytes(), 0, key.getLength());
            badStream.write("\r\n".getBytes());
        }
    }

    /**
     * Release resources.
     * @param context the task attempt context
     */
    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStream(badStream);
        IOUtils.closeStream(goodStream);
    }
}
4: The custom Mapper class
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MyOutputMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit each input line unchanged as the key, with a NullWritable value
        context.write(value, NullWritable.get());
    }
}
5: The driver program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyOutputMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), "ownOutputFormat");
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("file:///F:\\input"));
        job.setMapperClass(MyOutputMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(MyOutputFormat.class);
        // Because the output files are created directly in MyOutputFormat, no data
        // files appear in the directory specified below; it still has to be set,
        // since FileOutputFormat checks that an output path is configured.
        MyOutputFormat.setOutputPath(job, new Path("file:///F:\\output"));
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int run = ToolRunner.run(new Configuration(), new MyOutputMain(), args);
        System.exit(run);
    }
}
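To try the job, the classes can be packaged into a jar and started with the hadoop launcher; the jar name below is only a placeholder, and with a default (non-cluster) configuration the file:/// paths mean the job runs through the local job runner:

hadoop jar mapreduce-demo.jar MyOutputMain

After a successful run, lines whose status field is 0 or 1 end up in F:\goodComment\1.txt, the remaining lines in F:\badComment\1.txt, and F:\output contains no data files, as noted in the driver comment.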