
Source-code analysis of the Mapper phase in Hadoop

A simple WordCount example

Three simple classes, WordCount, MyMapper and MyReducer, implement a basic word-count job.

The WordCount class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Submit the job as the "root" user.
        System.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration(true);

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("myjob");
        // Key and value types of the mapper output.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        Path input = new Path("/temp/wc/input");
        FileInputFormat.addInputPath(job, input);

        Path output = new Path("/temp/wc/output");
        if (output.getFileSystem(conf).exists(output)) {
            output.getFileSystem(conf).delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);

        job.waitForCompletion(true);
    }
}

The MyMapper class:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * @param key   the byte offset of the current line within the split
     * @param value the content of the current line
     */
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

The MyReducer class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Preparing the map task

Inside the MapTask class, the run() method hands off (for the new MapReduce API) to runNewMapper():

@SuppressWarnings("unchecked")
  private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    /* make a mapper: obtain the Mapper via reflection. If the user configured a Mapper
       class it is used; otherwise the default Mapper is used. Step into
       taskContext.getMapperClass() to see how it is read from the configuration. */
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);

    /* make the input format: obtain the InputFormat via reflection, again the user-defined
       class if configured and the default otherwise. Step into
       taskContext.getInputFormatClass() to see how it is read from the configuration. */
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

    /* rebuild the input split: the split describes the file, the start offset,
       the length, the hosts, and so on. */
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    /* With the split, inputFormat, reporter and taskContext in hand, build the input.
       The record reader created here wraps a LineRecordReader for text input. */
    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);

    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;

    // get an output object (the map output writer)
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
      // initialize the input
      input.initialize(split, mapperContext);
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }
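
For reference, the two getXxxClass() lookups mentioned in the comments above live in JobContextImpl. In Hadoop 2.x they are essentially just conf.getClass() calls with a default, roughly:

  public Class<? extends Mapper<?,?,?,?>> getMapperClass()
     throws ClassNotFoundException {
    // the user-configured mapper if set, otherwise the identity Mapper
    return (Class<? extends Mapper<?,?,?,?>>)
      conf.getClass(MAP_CLASS_ATTR, Mapper.class);
  }

  public Class<? extends InputFormat<?,?>> getInputFormatClass()
     throws ClassNotFoundException {
    // the user-configured input format if set, otherwise TextInputFormat
    return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }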

Analyzing the map input

From the source above we know that the record reader behind input is a LineRecordReader (the NewTrackingRecordReader delegates to it), and that input.initialize(split, mapperContext) performs the input initialization. The implementation lives in LineRecordReader; let's take a look inside:

public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    // the next three lines read the split's start offset, end position and file information
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    // open an input stream on the split's file
    fileIn = fs.open(file);

    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true; 
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      // uncompressed input: seek fileIn to the split's start offset ...
      fileIn.seek(start);
      // ... and read it through an UncompressedSplitLineReader (a SplitLineReader subclass) built on fileIn
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }

    /* "If this is not the first split, we always throw away first record because we always
       (except the last split) read one extra line in next() method." In plain terms: if the
       start offset is not 0, i.e. this is not the first split, skip the first line and start
       reading from the second one, because a single line of data may have been cut across
       two blocks when the file was split. */
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }
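
The other half of this "skip the first partial line" trick lives in LineRecordReader.nextKeyValue(): each split keeps reading until it has consumed one line past its own end offset, which is exactly why every split except the first can drop its first (possibly partial) line in initialize(). A simplified sketch of that logic (not the verbatim Hadoop source):

  public boolean nextKeyValue() throws IOException {
    if (key == null) { key = new LongWritable(); }
    key.set(pos);
    if (value == null) { value = new Text(); }
    int newSize = 0;
    // read one extra line beyond the split boundary (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
      pos += newSize;
      if (newSize == 0 || newSize < maxLineLength) {
        break;                      // got a complete, acceptable line
      }
      // the line exceeded maxLineLength: drop it and keep reading
    }
    if (newSize == 0) {             // nothing left in this split
      key = null;
      value = null;
      return false;
    }
    return true;
  }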

Back in MapTask.runNewMapper(), once the input has been initialized the mapper itself is run:

      // initialize the input
      input.initialize(split, mapperContext);
      // with the input ready, the mapper phase starts
      mapper.run(mapperContext);
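
mapper.run(mapperContext) is the base Mapper class's template method: setup() once, map() for every key/value pair the record reader produces, then cleanup(). In Hadoop 2.x it looks essentially like this:

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        // nextKeyValue() ultimately delegates to LineRecordReader.nextKeyValue()
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }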

Analyzing the map output

Back in MapTask.runNewMapper(), look at the code that creates the map output:

    // get an output object (the map output writer)
    // if the number of reducers is 0, a NewDirectOutputCollector is created instead
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

Here we analyze the case where the number of reducers is greater than 0. Step into the NewOutputCollector constructor:

private class NewOutputCollector<K,V>
    extends org.apache.hadoop.mapreduce.RecordWriter<K,V> {
    private final MapOutputCollector<K,V> collector;
    private final org.apache.hadoop.mapreduce.Partitioner<K,V> partitioner;
    private final int partitions;

    @SuppressWarnings("unchecked")
    NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
      collector = createSortingCollector(job, reporter);
      // there are as many partitions as there are reducers
      partitions = jobContext.getNumReduceTasks();
      if (partitions > 1) {
        // reflection again: either the user-configured Partitioner or the default one
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
      } else {
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1;
          }
        };
      }
    }

@Override
    public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions));
    }

    @Override
    public void close(TaskAttemptContext context
                      ) throws IOException,InterruptedException {
      try {
        collector.flush();
      } catch (ClassNotFoundException cnf) {
        throw new IOException("can't find class ", cnf);
      }
      collector.close();
    }
  }

Creating the collector

The NewOutputCollector constructor creates the collector:

collector = createSortingCollector(job, reporter);

Let's step into createSortingCollector(job, reporter) to see how it is implemented:

@SuppressWarnings("unchecked")
  private <KEY, VALUE> MapOutputCollector<KEY, VALUE>
          createSortingCollector(JobConf job, TaskReporter reporter)
    throws IOException, ClassNotFoundException {
    MapOutputCollector.Context context =
      new MapOutputCollector.Context(this, job, reporter);

    /* Reflection again: if the user set MAP_OUTPUT_COLLECTOR_CLASS_ATTR, collectorClasses
       takes that value; otherwise it falls back to MapOutputBuffer.class. MapOutputBuffer
       is a fairly involved class that you would rarely replace yourself, so in practice
       the default is what gets used. */
    Class<?>[] collectorClasses = job.getClasses(
      JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, MapOutputBuffer.class);
    int remainingCollectors = collectorClasses.length;
    for (Class clazz : collectorClasses) {
      try {
        if (!MapOutputCollector.class.isAssignableFrom(clazz)) {
          throw new IOException("Invalid output collector class: " + clazz.getName() +
            " (does not implement MapOutputCollector)");
        }
        Class<? extends MapOutputCollector> subclazz =
          clazz.asSubclass(MapOutputCollector.class);
        LOG.debug("Trying map output collector class: " + subclazz.getName());
        MapOutputCollector<KEY, VALUE> collector =
          ReflectionUtils.newInstance(subclazz, job);
        /* Once the collector is instantiated, initialize it. In the usual case the
           collector is the default MapOutputBuffer. */
        collector.init(context);
        LOG.info("Map output collector class = " + collector.getClass().getName());
        return collector;
      } catch (Exception e) {
        String msg = "Unable to initialize MapOutputCollector " + clazz.getName();
        if (--remainingCollectors > 0) {
          msg += " (" + remainingCollectors + " more collector(s) to try)";
        }
        LOG.warn(msg, e);
      }
    }
    throw new IOException("Unable to initialize any output collector");
  }
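
As the comment says, replacing the collector is rare. If you ever need to, the property checked above is MRJobConfig.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, i.e. "mapreduce.job.map.output.collector.class" in Hadoop 2.x; a minimal sketch of overriding it from the driver (com.example.MyOutputCollector is hypothetical):

  // a comma-separated list is allowed; the loop above tries each class in order
  job.getConfiguration().set("mapreduce.job.map.output.collector.class",
      "com.example.MyOutputCollector,org.apache.hadoop.mapred.MapTask$MapOutputBuffer");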

From the analysis above, the collector is normally a MapOutputBuffer. Let's see how collector.init(context) initializes it:

@SuppressWarnings("unchecked")
    public void init(MapOutputCollector.Context context
                    ) throws IOException, ClassNotFoundException {
      job = context.getJobConf();
      reporter = context.getReporter();
      mapTask = context.getMapTask();
      mapOutputFile = mapTask.getMapOutputFile();
      sortPhase = mapTask.getSortPhase();
      spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);
      partitions = job.getNumReduceTasks();
      rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

      //sanity checks
      final float spillper =
        job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
      final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);
      indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                         INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
      if (spillper > (float)1.0 || spillper <= (float)0.0) {
        throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +
            "\": " + spillper);
      }
      if ((sortmb & 0x7FF) != sortmb) {
        throw new IOException(
            "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);
      }
      sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
            QuickSort.class, IndexedSorter.class), job);
      // buffers and accounting
      int maxMemUsage = sortmb << 20;
      maxMemUsage -= maxMemUsage % METASIZE;
      kvbuffer = new byte[maxMemUsage];
      bufvoid = kvbuffer.length;
      kvmeta = ByteBuffer.wrap(kvbuffer)
         .order(ByteOrder.nativeOrder())
         .asIntBuffer();
      setEquator(0);
      bufstart = bufend = bufindex = equator;
      kvstart = kvend = kvindex;

      maxRec = kvmeta.capacity() / NMETA;
      softLimit = (int)(kvbuffer.length * spillper);
      bufferRemaining = softLimit;
      LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);
      LOG.info("soft limit at " + softLimit);
      LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);
      LOG.info("kvstart = " + kvstart + "; length = " + maxRec);

      // k/v serialization
      /* Obtain the key comparator: the user-defined one if configured, otherwise the
         default comparator of the map output key class. The lookup in
         job.getOutputKeyComparator() is simple enough to read on your own. */
      comparator = job.getOutputKeyComparator();
      keyClass = (Class<K>)job.getMapOutputKeyClass();
      valClass = (Class<V>)job.getMapOutputValueClass();
      serializationFactory = new SerializationFactory(job);
      keySerializer = serializationFactory.getSerializer(keyClass);
      keySerializer.open(bb);
      valSerializer = serializationFactory.getSerializer(valClass);
      valSerializer.open(bb);

      // output counters
      mapOutputByteCounter = reporter.getCounter(TaskCounter.MAP_OUTPUT_BYTES);
      mapOutputRecordCounter =
        reporter.getCounter(TaskCounter.MAP_OUTPUT_RECORDS);
      fileOutputByteCounter = reporter
          .getCounter(TaskCounter.MAP_OUTPUT_MATERIALIZED_BYTES);

      // compression
      if (job.getCompressMapOutput()) {
        Class<? extends CompressionCodec> codecClass =
          job.getMapOutputCompressorClass(DefaultCodec.class);
        codec = ReflectionUtils.newInstance(codecClass, job);
      } else {
        codec = null;
      }

      // combiner
      final Counters.Counter combineInputCounter =
        reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
      combinerRunner = CombinerRunner.create(job, getTaskID(), 
                                             combineInputCounter,
                                             reporter, null);
      if (combinerRunner != null) {
        final Counters.Counter combineOutputCounter =
          reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
        combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
      } else {
        combineCollector = null;
      }
      spillInProgress = false;
      minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3);
      spillThread.setDaemon(true);
      spillThread.setName("SpillThread");
      spillLock.lock();
      try {
        // start the thread that spills buffered output to disk
        spillThread.start();
        while (!spillThreadRunning) {
          spillDone.await();
        }
      } catch (InterruptedException e) {
        throw new IOException("Spill thread failed to initialize", e);
      } finally {
        spillLock.unlock();
      }
      if (sortSpillException != null) {
        throw new IOException("Spill thread failed to initialize",
            sortSpillException);
      }
    }
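
Two of the values sanity-checked above are worth calling out: JobContext.IO_SORT_MB, the size of the in-memory kvbuffer (default 100 MB), and JobContext.MAP_SORT_SPILL_PERCENT, the soft limit at which spilling starts (default 0.8). In Hadoop 2.x they correspond to the configuration keys below, so a job could raise them like this (the values are only an example):

  Configuration conf = new Configuration(true);
  // JobContext.IO_SORT_MB: size of the map-side sort buffer, in MB
  conf.setInt("mapreduce.task.io.sort.mb", 200);
  // JobContext.MAP_SORT_SPILL_PERCENT: fraction of the buffer that triggers a spill
  conf.setFloat("mapreduce.map.sort.spill.percent", 0.9f);
  Job job = Job.getInstance(conf);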

The init() method starts the spill thread with spillThread.start(). To see how the map output is spilled, step into SpillThread's run() method:

protected class SpillThread extends Thread {

      @Override
      public void run() {
        spillLock.lock();
        spillThreadRunning = true;
        try {
          while (true) {
            spillDone.signal();
            while (!spillInProgress) {
              spillReady.await();
            }
            try {
              spillLock.unlock();
              // sort the buffered records and spill them to disk
              sortAndSpill();
            } catch (Throwable t) {
              sortSpillException = t;
            } finally {
              spillLock.lock();
              if (bufend < bufstart) {
                bufvoid = kvbuffer.length;
              }
              kvstart = kvend;
              bufstart = bufend;
              spillInProgress = false;
            }
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          spillLock.unlock();
          spillThreadRunning = false;
        }
      }
    }

Now let's see what SpillThread's sortAndSpill() actually does:

private void sortAndSpill() throws IOException, ClassNotFoundException,
                                       InterruptedException {
      //approximate the length of the output file to be the length of the
      //buffer + header lengths for the partitions
      final long size = distanceTo(bufstart, bufend, bufvoid) +
                  partitions * APPROX_HEADER_LENGTH;
      FSDataOutputStream out = null;
      try {
        // create spill file
        final SpillRecord spillRec = new SpillRecord(partitions);
        final Path filename =
            mapOutputFile.getSpillFileForWrite(numSpills, size);
        out = rfs.create(filename);

        final int mstart = kvend / NMETA;
        final int mend = 1 + // kvend is a valid record
          (kvstart >= kvend
          ? kvstart
          : kvmeta.capacity() + kvstart) / NMETA;

        /* The sorter sorts the map output, using the comparator obtained earlier during
           init() (see above). The details of the sort are not analyzed here; step into
           sorter.sort() if you want to follow them. */
        sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
        int spindex = mstart;
        final IndexRecord rec = new IndexRecord();
        final InMemValBytes value = new InMemValBytes();
        for (int i = 0; i < partitions; ++i) {
          IFile.Writer<K, V> writer = null;
          try {
            long segmentStart = out.getPos();
            FSDataOutputStream partitionOut = CryptoUtils.wrapIfNecessary(job, out);
            writer = new Writer<K, V>(job, partitionOut, keyClass, valClass, codec,
                                      spilledRecordsCounter);
            if (combinerRunner == null) {
              // spill directly
              DataInputBuffer key = new DataInputBuffer();
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
                final int kvoff = offsetFor(spindex % maxRec);
                int keystart = kvmeta.get(kvoff + KEYSTART);
                int valstart = kvmeta.get(kvoff + VALSTART);
                key.reset(kvbuffer, keystart, valstart - keystart);
                getVBytesForOffset(kvoff, value);
                writer.append(key, value);
                ++spindex;
              }
            } else {
              int spstart = spindex;
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec)
                            + PARTITION) == i) {
                ++spindex;
              }
              // Note: we would like to avoid the combiner if we've fewer
              // than some threshold of records for a partition
              if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }
            }

            // close the writer
            writer.close();

            // record offsets
            rec.startOffset = segmentStart;
            rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);
            rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);
            spillRec.putIndex(rec, i);

            writer = null;
          } finally {
            if (null != writer) writer.close();
          }
        }

        if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
          // create spill index file
          Path indexFilename =
              mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
                  * MAP_OUTPUT_INDEX_RECORD_LENGTH);
          spillRec.writeToFile(indexFilename, job);
        } else {
          indexCacheList.add(spillRec);
          totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
        }
        LOG.info("Finished spill " + numSpills);
        ++numSpills;
      } finally {
        if (out != null) out.close();
      }
    }

Why did we dig through all of this? The goal was to obtain the NewOutputCollector collector, and the analysis above is exactly how it gets created. So where is this collector actually used? Recall the line context.write(word, one) in the MyMapper class:

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // this write is where the map-side output (and eventual spill) begins
        context.write(word, one);
    }
}

Step into context.write(word, one); the Context the mapper sees is a WrappedMapper.Context:

    @Override
    public void write(KEYOUT key, VALUEOUT value) throws IOException,
        InterruptedException {
      mapContext.write(key, value);
    }

Step into mapContext.write(key, value), which is implemented in TaskInputOutputContextImpl:

public void write(KEYOUT key, VALUEOUT value
                    ) throws IOException, InterruptedException {
    output.write(key, value);
  }

The output here is exactly the NewOutputCollector collector we obtained above: its write() computes the partition for the key and hands the record to MapOutputBuffer.collect(). The full chain is therefore MyMapper.map() → context.write() → WrappedMapper.Context.write() → TaskInputOutputContextImpl.write() → NewOutputCollector.write() → MapOutputBuffer.collect(). That roughly completes the analysis of the map output.

Ensuring that identical map keys reach the same reducer

Let's now look at how the map side guarantees that records with the same key are sent to the same reducer.
Back in the NewOutputCollector class, when the number of reducers is greater than 1:

if (partitions > 1) {
        // reflection again: either the user-configured Partitioner or the default one
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
} 

Here jobContext is a JobContextImpl; step into its getPartitionerClass() implementation:

@SuppressWarnings("unchecked")
  public Class<? extends Partitioner<?,?>> getPartitionerClass() 
     throws ClassNotFoundException {
     /* If the user configured PARTITIONER_CLASS_ATTR, use that class; otherwise fall back
        to HashPartitioner.class. */
    return (Class<? extends Partitioner<?,?>>) 
      conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
  }

Let's look at the logic inside HashPartitioner:

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

As the code shows, taking the key's hash and then the modulo of the number of reducers guarantees that the same key always maps to the same partition; in plain terms, identical keys end up in the same reducer.
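
If the default hash-and-mod routing is not what you want, register your own Partitioner on the Job; that is what sets the PARTITIONER_CLASS_ATTR read above. A minimal sketch using the word-count types from MyMapper (the class and its routing rule are made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0;
        }
        // words starting with a-m go to partition 0, everything else to partition 1
        String word = key.toString();
        char first = word.isEmpty() ? 'z' : Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

It is wired in from the driver with job.setPartitionerClass(MyPartitioner.class).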

Back in the NewOutputCollector class:

@SuppressWarnings("unchecked")
    NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
      collector = createSortingCollector(job, reporter);
      // there are as many partitions as there are reducers
      partitions = jobContext.getNumReduceTasks();
      if (partitions > 1) {
        /* Reflection again: either the user-configured Partitioner or the default one.
           These lines decide which partition, and therefore which reducer, each key
           emitted by the map is routed to. */
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
      } else {
      // if there is at most one reducer, every key goes to that single partition
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1;
          }
        };
      }
    }