A Look at Flink's ParallelIteratorInputFormat

阿新 • Published: 2018-11-30

Preface

This article takes a look at Flink's ParallelIteratorInputFormat.

Example

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Long> dataSet = env.generateSequence(15,106)
                .setParallelism(3);
        dataSet.print();

Here the generateSequence method of ExecutionEnvironment creates a ParallelIteratorInputFormat backed by a NumberSequenceIterator.

ParallelIteratorInputFormat

flink-java-1.6.2-sources.jar!/org/apache/flink/api/java/io/ParallelIteratorInputFormat.java

/**
 * An input format that generates data in parallel through a {@link SplittableIterator}.
 */
@PublicEvolving
public class ParallelIteratorInputFormat<T> extends GenericInputFormat<T> {

	private static final long serialVersionUID = 1L;

	private final SplittableIterator<T> source;

	private transient Iterator<T> splitIterator;

	public ParallelIteratorInputFormat(SplittableIterator<T> iterator) {
		this.source = iterator;
	}

	@Override
	public void open(GenericInputSplit split) throws IOException {
		super.open(split);

		this.splitIterator = this.source.getSplit(split.getSplitNumber(), split.getTotalNumberOfSplits());
	}

	@Override
	public boolean reachedEnd() {
		return !this.splitIterator.hasNext();
	}

	@Override
	public T nextRecord(T reuse) {
		return this.splitIterator.next();
	}
}

ParallelIteratorInputFormat extends GenericInputFormat. GenericInputFormat has four other subclasses: CRowValuesInputFormat, CollectionInputFormat, IteratorInputFormat, and ValuesInputFormat. What those four have in common is that they all implement the NonParallelInput interface.
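To make the open / reachedEnd / nextRecord cycle of the class listed above concrete, here is a minimal sketch that runs without Flink. IteratorFormat, readSplit, and stride are hypothetical stand-ins invented for this illustration (they are not Flink classes), and the striding split function is only a toy; it is not how NumberSequenceIterator divides its range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

public class ReadLoopSketch {

    /** Hypothetical, simplified stand-in for ParallelIteratorInputFormat. */
    static final class IteratorFormat<T> {
        // Plays the role of SplittableIterator.getSplit(num, numPartitions).
        private final BiFunction<Integer, Integer, Iterator<T>> source;
        private Iterator<T> splitIterator;

        IteratorFormat(BiFunction<Integer, Integer, Iterator<T>> source) {
            this.source = source;
        }

        // Mirrors open(split): select the iterator that belongs to this split.
        void open(int splitNumber, int totalSplits) {
            this.splitIterator = source.apply(splitNumber, totalSplits);
        }

        boolean reachedEnd() {
            return !splitIterator.hasNext();
        }

        T nextRecord() {
            return splitIterator.next();
        }
    }

    /** Drains one split, roughly what a single parallel subtask would do. */
    static <T> List<T> readSplit(IteratorFormat<T> format, int splitNumber, int totalSplits) {
        format.open(splitNumber, totalSplits);
        List<T> records = new ArrayList<>();
        while (!format.reachedEnd()) {
            records.add(format.nextRecord());
        }
        return records;
    }

    /** Toy split function: split i of n gets every n-th element starting at index i. */
    static Iterator<Integer> stride(List<Integer> data, int splitNumber, int totalSplits) {
        List<Integer> part = new ArrayList<>();
        for (int i = splitNumber; i < data.size(); i += totalSplits) {
            part.add(data.get(i));
        }
        return part.iterator();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        IteratorFormat<Integer> format =
                new IteratorFormat<>((num, total) -> stride(data, num, total));
        for (int i = 0; i < 3; i++) {
            System.out.println("split " + i + " -> " + readSplit(format, i, 3));
        }
    }
}
```

Each split only ever sees its own iterator, which is why the format itself needs no coordination between parallel subtasks.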

NonParallelInput

flink-core-1.6.2-sources.jar!/org/apache/flink/api/common/io/NonParallelInput.java

/**
 * This interface acts as a marker for input formats for inputs which cannot be split.
 * Data sources with a non-parallel input formats are always executed with a parallelism
 * of one.
 * 
 * @see InputFormat
 */
@Public
public interface NonParallelInput {
}

This interface declares no methods. It is purely a marker indicating that the InputFormat cannot be split.

GenericInputFormat.createInputSplits

flink-core-1.6.2-sources.jar!/org/apache/flink/api/common/io/GenericInputFormat.java

	@Override
	public GenericInputSplit[] createInputSplits(int numSplits) throws IOException {
		if (numSplits < 1) {
			throw new IllegalArgumentException("Number of input splits has to be at least 1.");
		}

		numSplits = (this instanceof NonParallelInput) ? 1 : numSplits;
		GenericInputSplit[] splits = new GenericInputSplit[numSplits];
		for (int i = 0; i < splits.length; i++) {
			splits[i] = new GenericInputSplit(i, numSplits);
		}
		return splits;
	}

GenericInputFormat's createInputSplits method validates the numSplits argument: if it is less than 1, an IllegalArgumentException is thrown; if the current InputFormat implements the NonParallelInput interface, numSplits is reset to 1.
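The marker-interface check can be tried in isolation. The sketch below is a standalone illustration: NonParallel, ParallelFormat, SingleFormat, and effectiveSplits are hypothetical names made up here, and only the validation and the instanceof test are taken from the listing above.

```java
public class MarkerInterfaceSketch {

    /** Hypothetical marker with no methods, in the spirit of NonParallelInput. */
    interface NonParallel {
    }

    /** A format that may be split freely. */
    static class ParallelFormat {
    }

    /** A format tagged as non-splittable via the marker. */
    static class SingleFormat implements NonParallel {
    }

    /** Returns how many splits would actually be created for the requested count. */
    static int effectiveSplits(Object format, int requested) {
        if (requested < 1) {
            throw new IllegalArgumentException("Number of input splits has to be at least 1.");
        }
        // The marker overrides whatever parallelism was requested.
        return (format instanceof NonParallel) ? 1 : requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveSplits(new ParallelFormat(), 3)); // 3
        System.out.println(effectiveSplits(new SingleFormat(), 3));   // 1
    }
}
```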

ExecutionEnvironment.fromParallelCollection

flink-java-1.6.2-sources.jar!/org/apache/flink/api/java/ExecutionEnvironment.java

	/**
	 * Creates a new data set that contains elements in the iterator. The iterator is splittable, allowing the
	 * framework to create a parallel data source that returns the elements in the iterator.
	 *
	 * <p>Because the iterator will remain unmodified until the actual execution happens, the type of data
	 * returned by the iterator must be given explicitly in the form of the type class (this is due to the
	 * fact that the Java compiler erases the generic type information).
	 *
	 * @param iterator The iterator that produces the elements of the data set.
	 * @param type The class of the data produced by the iterator. Must not be a generic class.
	 * @return A DataSet representing the elements in the iterator.
	 *
	 * @see #fromParallelCollection(SplittableIterator, TypeInformation)
	 */
	public <X> DataSource<X> fromParallelCollection(SplittableIterator<X> iterator, Class<X> type) {
		return fromParallelCollection(iterator, TypeExtractor.getForClass(type));
	}

	/**
	 * Creates a new data set that contains elements in the iterator. The iterator is splittable, allowing the
	 * framework to create a parallel data source that returns the elements in the iterator.
	 *
	 * <p>Because the iterator will remain unmodified until the actual execution happens, the type of data
	 * returned by the iterator must be given explicitly in the form of the type information.
	 * This method is useful for cases where the type is generic. In that case, the type class
	 * (as given in {@link #fromParallelCollection(SplittableIterator, Class)} does not supply all type information.
	 *
	 * @param iterator The iterator that produces the elements of the data set.
	 * @param type The TypeInformation for the produced data set.
	 * @return A DataSet representing the elements in the iterator.
	 *
	 * @see #fromParallelCollection(SplittableIterator, Class)
	 */
	public <X> DataSource<X> fromParallelCollection(SplittableIterator<X> iterator, TypeInformation<X> type) {
		return fromParallelCollection(iterator, type, Utils.getCallLocationName());
	}

	// private helper for passing different call location names
	private <X> DataSource<X> fromParallelCollection(SplittableIterator<X> iterator, TypeInformation<X> type, String callLocationName) {
		return new DataSource<>(this, new ParallelIteratorInputFormat<>(iterator), type, callLocationName);
	}

	/**
	 * Creates a new data set that contains a sequence of numbers. The data set will be created in parallel,
	 * so there is no guarantee about the order of the elements.
	 *
	 * @param from The number to start at (inclusive).
	 * @param to The number to stop at (inclusive).
	 * @return A DataSet, containing all number in the {@code [from, to]} interval.
	 */
	public DataSource<Long> generateSequence(long from, long to) {
		return fromParallelCollection(new NumberSequenceIterator(from, to), BasicTypeInfo.LONG_TYPE_INFO, Utils.getCallLocationName());
	}

ExecutionEnvironment's fromParallelCollection methods accept a SplittableIterator and create a ParallelIteratorInputFormat for it. The generateSequence method also calls fromParallelCollection, passing in a NumberSequenceIterator (a subclass of SplittableIterator).

SplittableIterator

flink-core-1.6.2-sources.jar!/org/apache/flink/util/SplittableIterator.java

/**
 * Abstract base class for iterators that can split themselves into multiple disjoint
 * iterators. The union of these iterators returns the original iterator values.
 *
 * @param <T> The type of elements returned by the iterator.
 */
@Public
public abstract class SplittableIterator<T> implements Iterator<T>, Serializable {

	private static final long serialVersionUID = 200377674313072307L;

	/**
	 * Splits this iterator into a number disjoint iterators.
	 * The union of these iterators returns the original iterator values.
	 *
	 * @param numPartitions The number of iterators to split into.
	 * @return An array with the split iterators.
	 */
	public abstract Iterator<T>[] split(int numPartitions);

	/**
	 * Splits this iterator into <i>n</i> partitions and returns the <i>i-th</i> partition
	 * out of those.
	 *
	 * @param num The partition to return (<i>i</i>).
	 * @param numPartitions The number of partitions to split into (<i>n</i>).
	 * @return The iterator for the partition.
	 */
	public Iterator<T> getSplit(int num, int numPartitions) {
		if (numPartitions < 1 || num < 0 || num >= numPartitions) {
			throw new IllegalArgumentException();
		}

		return split(numPartitions)[num];
	}

	/**
	 * The maximum number of splits into which this iterator can be split up.
	 *
	 * @return The maximum number of splits into which this iterator can be split up.
	 */
	public abstract int getMaximumNumberOfSplits();
}

SplittableIterator is an abstract class that declares the abstract methods split and getMaximumNumberOfSplits. It has two implementations, LongValueSequenceIterator and NumberSequenceIterator. Here we look at NumberSequenceIterator.
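As a rough illustration of the contract (disjoint partitions whose union is the original sequence), the sketch below implements a list-backed splittable iterator. Splittable is a local stand-in that copies the shape of SplittableIterator without the Serializable part, and ListSplittable with its round-robin strategy is a hypothetical example; neither is a Flink class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ListSplitSketch {

    /** Local stand-in with the same shape as SplittableIterator. */
    abstract static class Splittable<T> implements Iterator<T> {
        abstract Iterator<T>[] split(int numPartitions);

        abstract int getMaximumNumberOfSplits();

        Iterator<T> getSplit(int num, int numPartitions) {
            if (numPartitions < 1 || num < 0 || num >= numPartitions) {
                throw new IllegalArgumentException();
            }
            return split(numPartitions)[num];
        }
    }

    /** Hypothetical implementation: deals the remaining elements out round-robin. */
    static final class ListSplittable<T> extends Splittable<T> {
        private final List<T> items;
        private int pos = 0;

        ListSplittable(List<T> items) {
            this.items = items;
        }

        @Override
        public boolean hasNext() {
            return pos < items.size();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return items.get(pos++);
        }

        @Override
        @SuppressWarnings("unchecked")
        Iterator<T>[] split(int numPartitions) {
            List<List<T>> parts = new ArrayList<>();
            for (int p = 0; p < numPartitions; p++) {
                parts.add(new ArrayList<>());
            }
            for (int i = pos; i < items.size(); i++) {
                parts.get((i - pos) % numPartitions).add(items.get(i));
            }
            Iterator<T>[] result = new Iterator[numPartitions];
            for (int p = 0; p < numPartitions; p++) {
                result[p] = parts.get(p).iterator();
            }
            return result;
        }

        @Override
        int getMaximumNumberOfSplits() {
            // Assumption of this sketch: at most one split per remaining element.
            return Math.max(1, items.size() - pos);
        }
    }

    /** Collects everything an iterator yields into a list. */
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        ListSplittable<String> it = new ListSplittable<>(Arrays.asList("a", "b", "c", "d", "e"));
        for (int i = 0; i < 2; i++) {
            System.out.println("partition " + i + ": " + drain(it.getSplit(i, 2)));
        }
    }
}
```

Note that getSplit recomputes the whole split array on every call, just as in the listing above, so an implementation's split method should be cheap and deterministic.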

NumberSequenceIterator

flink-core-1.6.2-sources.jar!/org/apache/flink/util/NumberSequenceIterator.java

/**
 * The {@code NumberSequenceIterator} is an iterator that returns a sequence of numbers (as {@code Long})s.
 * The iterator is splittable (as defined by {@link SplittableIterator}, i.e., it can be divided into multiple
 * iterators that each return a subsequence of the number sequence.
 */
@Public
public class NumberSequenceIterator extends SplittableIterator<Long> {

	private static final long serialVersionUID = 1L;

	/** The last number returned by the iterator. */
	private final long to;

	/** The next number to be returned. */
	private long current;


	/**
	 * Creates a new splittable iterator, returning the range [from, to].
	 * Both boundaries of the interval are inclusive.
	 *
	 * @param from The first number returned by the iterator.
	 * @param to The last number returned by the iterator.
	 */
	public NumberSequenceIterator(long from, long to) {
		if (from > to) {
			throw new IllegalArgumentException("The 'to' value must not be smaller than the 'from' value.");
		}

		this.current = from;
		this.to = to;
	}


	@Override
	public boolean hasNext() {
		return current <= to;
	}

	@Override
	public Long next() {
		if (current <= to) {
			return current++;
		} else {
			throw new NoSuchElementException();
		}
	}

	@Override
	public NumberSequenceIterator[] split(int numPartitions) {
		if (numPartitions < 1) {
			throw new IllegalArgumentException("The number of partitions must be at least 1.");
		}

		if (numPartitions == 1) {
			return new NumberSequenceIterator[] { new NumberSequenceIterator(current, to) };
		}

		// here, numPartitions >= 2 !!!

		long elementsPerSplit;

		if (to - current + 1 >= 0) {
			elementsPerSplit = (to - current + 1) / numPartitions;
		}
		else {
			// long overflow of the range.
			// we compute based on half the distance, to prevent the overflow.
			// in most cases it holds that: current < 0 and to > 0, except for: to == 0 and current == Long.MIN_VALUE
			// the later needs a special case
			final long halfDiff; // must be positive

			if (current == Long.MIN_VALUE) {
				// this means to >= 0
				halfDiff = (Long.MAX_VALUE / 2 + 1) + to / 2;
			} else {
				long posFrom = -current;
				if (posFrom > to) {
					halfDiff = to + ((posFrom - to) / 2);
				} else {
					halfDiff = posFrom + ((to - posFrom) / 2);
				}
			}
			elementsPerSplit = halfDiff / numPartitions * 2;
		}

		if (elementsPerSplit < Long.MAX_VALUE) {
			// figure out how many get one in addition
			long numWithExtra = -(elementsPerSplit * numPartitions) + to - current + 1;

			// based on rounding errors, we may have lost one)
			if (numWithExtra > numPartitions) {
				elementsPerSplit++;
				numWithExtra -= numPartitions;

				if (numWithExtra > numPartitions) {
					throw new RuntimeException("Bug in splitting logic. To much rounding loss.");
				}
			}

			NumberSequenceIterator[] iters = new NumberSequenceIterator[numPartitions];
			long curr = current;
			int i = 0;
			for (; i < numWithExtra; i++) {
				long next = curr + elementsPerSplit + 1;
				iters[i] = new NumberSequenceIterator(curr, next - 1);
				curr = next;
			}
			for (; i < numPartitions; i++) {
				long next = curr + elementsPerSplit;
				iters[i] = new NumberSequenceIterator(curr, next - 1, true);
				curr = next;
			}

			return iters;
		}
		else {
			// this can only be the case when there are two partitions
			if (numPartitions != 2) {
				throw new RuntimeException("Bug in splitting logic.");
			}

			return new NumberSequenceIterator[] {
				new NumberSequenceIterator(current, current + elementsPerSplit),
				new NumberSequenceIterator(current + elementsPerSplit, to)
			};
		}
	}

	@Override
	public int getMaximumNumberOfSplits() {
		if (to >= Integer.MAX_VALUE || current <= Integer.MIN_VALUE || to - current + 1 >= Integer.MAX_VALUE) {
			return Integer.MAX_VALUE;
		}
		else {
			return (int) (to - current + 1);
		}
	}

	//......
}

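The NumberSequenceIterator constructor takes two arguments, from and to. Internally it keeps a current value, which initially equals from.
The split method first computes elementsPerSplit from numPartitions. When to - current + 1 >= 0 (that is, the range size does not overflow a long), the formula is (to - current + 1) / numPartitions.
It then uses elementsPerSplit to compute numWithExtra. Because elementsPerSplit comes from integer division, handing every split exactly elementsPerSplit elements could leave some elements over, and numWithExtra is that leftover count. If it is greater than numPartitions (which the source comment attributes to rounding loss), elementsPerSplit is incremented by 1 and numPartitions is subtracted from numWithExtra.
Finally, the first numWithExtra splits each receive elementsPerSplit + 1 elements, so the leftover elements are handed out one apiece to those splits. The remaining splits, up to numPartitions, each receive elementsPerSplit elements, with their upper bound computed as from + elementsPerSplit - 1.
getMaximumNumberOfSplits returns the maximum number of splits: when (to >= Integer.MAX_VALUE || current <= Integer.MIN_VALUE || to - current + 1 >= Integer.MAX_VALUE) holds it returns Integer.MAX_VALUE, otherwise it returns (int) (to - current + 1).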
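The arithmetic described above can be worked through for the range from the opening example, 15 to 106 with a parallelism of 3. The sketch below is a simplified re-implementation that covers only the non-overflow path (it assumes to - from + 1 fits in a long); splitRange and maxSplits are names made up for this illustration.

```java
import java.util.Arrays;

public class SequenceSplitSketch {

    /**
     * Returns {from, to} (both inclusive) for each partition.
     * Assumes to - from + 1 does not overflow; the overflow branch is omitted.
     */
    static long[][] splitRange(long from, long to, int numPartitions) {
        if (numPartitions < 1 || from > to) {
            throw new IllegalArgumentException();
        }
        long total = to - from + 1;
        long elementsPerSplit = total / numPartitions;
        // Elements left over after integer division.
        long numWithExtra = total - elementsPerSplit * numPartitions;

        long[][] ranges = new long[numPartitions][];
        long curr = from;
        for (int i = 0; i < numPartitions; i++) {
            // The first numWithExtra partitions take one extra element each.
            long size = elementsPerSplit + (i < numWithExtra ? 1 : 0);
            ranges[i] = new long[] {curr, curr + size - 1};
            curr += size;
        }
        return ranges;
    }

    /** Mirrors the condition in getMaximumNumberOfSplits. */
    static int maxSplits(long from, long to) {
        if (to >= Integer.MAX_VALUE || from <= Integer.MIN_VALUE
                || to - from + 1 >= Integer.MAX_VALUE) {
            return Integer.MAX_VALUE;
        }
        return (int) (to - from + 1);
    }

    public static void main(String[] args) {
        // 92 elements over 3 partitions: elementsPerSplit = 30, numWithExtra = 2.
        System.out.println(Arrays.deepToString(splitRange(15, 106, 3)));
        // prints [[15, 45], [46, 76], [77, 106]]
        System.out.println(maxSplits(15, 106)); // prints 92
    }
}
```

In this sketch, a partition that receives no elements (more partitions than elements) comes out as a range whose upper bound is one less than its lower bound.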

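Summary

GenericInputFormat has five subclasses. Besides ParallelIteratorInputFormat, they are CRowValuesInputFormat, CollectionInputFormat, IteratorInputFormat, and ValuesInputFormat. What these latter four have in common is that they all implement the NonParallelInput interface.
GenericInputFormat's createInputSplits validates the numSplits argument and forces it to 1 when the format is a NonParallelInput.
NumberSequenceIterator is one implementation of SplittableIterator. ExecutionEnvironment's fromParallelCollection methods, and generateSequence (which passes in a NumberSequenceIterator), create a ParallelIteratorInputFormat for a SplittableIterator. NumberSequenceIterator's split method first computes elementsPerSplit, then numWithExtra, gives one of the numWithExtra leftover elements to each of the first splits, and finally assigns elementsPerSplit elements to each of the remaining splits.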

