A look at how Flink achieves compatibility with StormTopology

阿新 • Published: 2018-12-06

Preface

This article looks at how Flink achieves compatibility with a Storm topology (StormTopology).

Example

    @Test
    public void testStormWordCount() throws Exception {
        //NOTE 1 build Topology the Storm way
        final TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomWordSpout(), 1);
        builder.setBolt("count", new WordCountBolt(), 5)
                .fieldsGrouping("spout", new Fields("word"));
        builder.setBolt("print", new PrintBolt(), 1)
                .shuffleGrouping("count");

        //NOTE 2 convert StormTopology to FlinkTopology
        FlinkTopology flinkTopology = FlinkTopology.createTopology(builder);

        //NOTE 3 execute program locally using FlinkLocalCluster
        Config conf = new Config();
        // only required to stabilize integration test
        conf.put(FlinkLocalCluster.SUBMIT_BLOCKING, true);

        final FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();
        cluster.submitTopology("stormWordCount", conf, flinkTopology);
        cluster.shutdown();
    }

Here FlinkLocalCluster.getLocalCluster() creates (or obtains) a FlinkLocalCluster, FlinkLocalCluster.submitTopology submits the topology, and FlinkLocalCluster.shutdown shuts the cluster down at the end.

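The RandomWordSpout built here extends Storm's BaseRichSpout, WordCountBolt extends Storm's BaseBasicBolt, and PrintBolt extends Storm's BaseRichBolt. (Because Flink relies on its checkpoint mechanism and does not translate Storm's ack operations, it makes no particular difference here whether a bolt extends BaseBasicBolt or BaseRichBolt.)
The topology passed to FlinkLocalCluster.submitTopology is the FlinkTopology obtained by converting the StormTopology.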

LocalClusterFactory

flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkLocalCluster.java

	// ------------------------------------------------------------------------
	//  Access to default local cluster
	// ------------------------------------------------------------------------

	// A different {@link FlinkLocalCluster} to be used for execution of ITCases
	private static LocalClusterFactory currentFactory = new DefaultLocalClusterFactory();

	/**
	 * Returns a {@link FlinkLocalCluster} that should be used for execution. If no cluster was set by
	 * {@link #initialize(LocalClusterFactory)} in advance, a new {@link FlinkLocalCluster} is returned.
	 *
	 * @return a {@link FlinkLocalCluster} to be used for execution
	 */
	public static FlinkLocalCluster getLocalCluster() {
		return currentFactory.createLocalCluster();
	}

	/**
	 * Sets a different factory for FlinkLocalClusters to be used for execution.
	 *
	 * @param clusterFactory
	 * 		The LocalClusterFactory to create the local clusters for execution.
	 */
	public static void initialize(LocalClusterFactory clusterFactory) {
		currentFactory = Objects.requireNonNull(clusterFactory);
	}

	// ------------------------------------------------------------------------
	//  Cluster factory
	// ------------------------------------------------------------------------

	/**
	 * A factory that creates local clusters.
	 */
	public interface LocalClusterFactory {

		/**
		 * Creates a local Flink cluster.
		 * @return A local Flink cluster.
		 */
		FlinkLocalCluster createLocalCluster();
	}

	/**
	 * A factory that instantiates a FlinkLocalCluster.
	 */
	public static class DefaultLocalClusterFactory implements LocalClusterFactory {

		@Override
		public FlinkLocalCluster createLocalCluster() {
			return new FlinkLocalCluster();
		}
	}

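Flink provides a static method, getLocalCluster, on FlinkLocalCluster for obtaining a FlinkLocalCluster; the instance is created through a LocalClusterFactory.
The LocalClusterFactory used by default is the DefaultLocalClusterFactory implementation, whose createLocalCluster method simply instantiates a new FlinkLocalCluster.
With the current implementation, unless another factory has been installed through initialize, every call to FlinkLocalCluster.getLocalCluster creates a new FlinkLocalCluster. This is worth keeping in mind when calling it.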
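
The "new instance on every call" behaviour can be reproduced with a stripped-down sketch of the same factory pattern that uses only the standard library. The names LocalClusterFactorySketch, Cluster, ClusterFactory, DefaultFactory and CachingFactory are hypothetical stand-ins and not Flink classes; only the shape of getLocalCluster and initialize mirrors the source above, and CachingFactory is an assumed alternative that Flink does not ship.

```java
import java.util.Objects;

// Hypothetical stand-in for the factory pattern in FlinkLocalCluster; not Flink code.
class LocalClusterFactorySketch {

    /** Stand-in for FlinkLocalCluster. */
    static class Cluster {
    }

    /** Mirrors the shape of FlinkLocalCluster.LocalClusterFactory. */
    interface ClusterFactory {
        Cluster createLocalCluster();
    }

    /** Mirrors DefaultLocalClusterFactory: a fresh instance on every call. */
    static class DefaultFactory implements ClusterFactory {
        @Override
        public Cluster createLocalCluster() {
            return new Cluster();
        }
    }

    /** Assumed alternative (not part of Flink): always hands out the same instance. */
    static class CachingFactory implements ClusterFactory {
        private final Cluster shared = new Cluster();

        @Override
        public Cluster createLocalCluster() {
            return shared;
        }
    }

    private static ClusterFactory currentFactory = new DefaultFactory();

    static Cluster getLocalCluster() {
        return currentFactory.createLocalCluster();
    }

    static void initialize(ClusterFactory factory) {
        currentFactory = Objects.requireNonNull(factory);
    }

    public static void main(String[] args) {
        // Default factory: two calls return two distinct objects.
        System.out.println(getLocalCluster() == getLocalCluster()); // false

        // After swapping the factory, both calls return the same object.
        initialize(new CachingFactory());
        System.out.println(getLocalCluster() == getLocalCluster()); // true
    }
}
```

In practice this means a test that wants to submit and later shut down the same cluster should keep the reference returned by getLocalCluster rather than calling it a second time.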

FlinkTopology

flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkTopology.java

	/**
	 * Creates a Flink program that uses the specified spouts and bolts.
	 * @param stormBuilder The Storm topology builder to use for creating the Flink topology.
	 * @return A {@link FlinkTopology} which contains the translated Storm topology and may be executed.
	 */
	public static FlinkTopology createTopology(TopologyBuilder stormBuilder) {
		return new FlinkTopology(stormBuilder);
	}

	private FlinkTopology(TopologyBuilder builder) {
		this.builder = builder;
		this.stormTopology = builder.createTopology();
		// extract the spouts and bolts
		this.spouts = getPrivateField("_spouts");
		this.bolts = getPrivateField("_bolts");

		this.env = StreamExecutionEnvironment.getExecutionEnvironment();

		// Kick off the translation immediately
		translateTopology();
	}

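FlinkTopology provides a static factory method, createTopology, for creating a FlinkTopology.
The constructor first stores the TopologyBuilder, then uses getPrivateField (which reflectively calls getDeclaredField) to read the private _spouts and _bolts fields and keeps them for the later topology translation.
It then obtains the StreamExecutionEnvironment and finally calls translateTopology to translate the whole StormTopology.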
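
The reflective field access described above can be illustrated without Storm on the classpath. This is a minimal sketch: Builder is a hypothetical stand-in for Storm's TopologyBuilder, and the getPrivateField helper here is written for the example. It is not a copy of the Flink method, whose body is not shown in the excerpt.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Sketch of reading a private field reflectively. Builder is a hypothetical
// stand-in for Storm's TopologyBuilder, which keeps its spouts in a private map.
class PrivateFieldSketch {

    static class Builder {
        private final Map<String, String> _spouts = new HashMap<>();

        void setSpout(String id, String spout) {
            _spouts.put(id, spout);
        }
    }

    /** Reads the named private field from target, bypassing access checks. */
    @SuppressWarnings("unchecked")
    static <T> T getPrivateField(Object target, String name) {
        try {
            Field field = target.getClass().getDeclaredField(name);
            field.setAccessible(true);
            return (T) field.get(target);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            throw new RuntimeException("Cannot read field " + name, e);
        }
    }

    public static void main(String[] args) {
        Builder builder = new Builder();
        builder.setSpout("spout", "RandomWordSpout");

        Map<String, String> spouts = getPrivateField(builder, "_spouts");
        System.out.println(spouts); // {spout=RandomWordSpout}
    }
}
```

The trade-off of this approach is that it depends on the private field names of another library, so a rename on the Storm side would only surface at runtime.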

translateTopology

flink-release-1.6.2/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/api/FlinkTopology.java

	/**
	 * Creates a Flink program that uses the specified spouts and bolts.
	 */
	private void translateTopology() {

		unprocessdInputsPerBolt.clear();
		outputStreams.clear();
		declarers.clear();
		availableInputs.clear();

		// Storm defaults to parallelism 1
		env.setParallelism(1);

		/* Translation of topology */

		for (final Entry<String, IRichSpout> spout : spouts.entrySet()) {
			final String spoutId = spout.getKey();
			final IRichSpout userSpout = spout.getValue();

			final FlinkOutputFieldsDeclarer declarer = new FlinkOutputFieldsDeclarer();
			userSpout.declareOutputFields(declarer);
			final HashMap<String, Fields> sourceStreams = declarer.outputStreams;
			this.outputStreams.put(spoutId, sourceStreams);
			declarers.put(spoutId, declarer);

			final HashMap<String, DataStream<Tuple>> outputStreams = new HashMap<String, DataStream<Tuple>>();
			final DataStreamSource<?> source;

			if (sourceStreams.size() == 1) {
				final SpoutWrapper<Tuple> spoutWrapperSingleOutput = new SpoutWrapper<Tuple>(userSpout, spoutId, null, null);
				spoutWrapperSingleOutput.setStormTopology(stormTopology);

				final String outputStreamId = (String) sourceStreams.keySet().toArray()[0];

				DataStreamSource<Tuple> src = env.addSource(spoutWrapperSingleOutput, spoutId,
						declarer.getOutputType(outputStreamId));

				outputStreams.put(outputStreamId, src);
				source = src;
			} else {
				final SpoutWrapper<SplitStreamType<Tuple>> spoutWrapperMultipleOutputs = new SpoutWrapper<SplitStreamType<Tuple>>(
						userSpout, spoutId, null, null);
				spoutWrapperMultipleOutputs.setStormTopology(stormTopology);

				@SuppressWarnings({ "unchecked", "rawtypes" })
				DataStreamSource<SplitStreamType<Tuple>> multiSource = env.addSource(
						spoutWrapperMultipleOutputs, spoutId,
						(TypeInformation) TypeExtractor.getForClass(SplitStreamType.class));

				SplitStream<SplitStreamType<Tuple>> splitSource = multiSource
						.split(new StormStreamSelector<Tuple>());
				for (String streamId : sourceStreams.keySet()) {
					SingleOutputStreamOperator<Tuple> outStream = splitSource.select(streamId)
							.map(new SplitStreamMapper<Tuple>());
					outStream.getTransformation().setOutputType(declarer.getOutputType(streamId));
					outputStreams.put(streamId, outStream);
				}
				source = multiSource;
			}
			availableInputs.put(spoutId, outputStreams);

			final ComponentCommon common = stormTopology.get_spouts().get(spoutId).get_common();
			if (common.is_set_parallelism_hint()) {
				int dop = common.get_parallelism_hint();
				source.setParallelism(dop);
			} else {
				common.set_parallelism_hint(1);
			}
		}

		/**
		 * 1. Connect all spout streams with bolts streams
		 * 2. Then proceed with the bolts stream already connected
		 *
		 * <p>Because we do not know the order in which an iterator steps over a set, we might process a consumer before
		 * its producer
		 * ->thus, we might need to repeat multiple times
		 */
		boolean makeProgress = true;
		while (bolts.size() > 0) {
			if (!makeProgress) {
				StringBuilder strBld = new StringBuilder();
				strBld.append("Unable to build Topology. Could not connect the following bolts:");
				for (String boltId : bolts.keySet()) {
					strBld.append("\n  ");
					strBld.append(boltId);
					strBld.append(": missing input streams [");
					for (Entry<GlobalStreamId, Grouping> streams : unprocessdInputsPerBolt
							.get(boltId)) {
						strBld.append("'");
						strBld.append(streams.getKey().get_streamId());
						strBld.append("' from '");
						strBld.append(streams.getKey().get_componentId());
						strBld.append("'; ");
					}
					strBld.append("]");
				}

				throw new RuntimeException(strBld.toString());
			}
			makeProgress = false;

			final Iterator<Entry<String, IRichBolt>> boltsIterator = bolts.entrySet().iterator();
			while (boltsIterator.hasNext()) {

				final Entry<String, IRichBolt> bolt = boltsIterator.next();
				final String boltId = bolt.getKey();
				final IRichBolt userBolt = copyObject(bolt.getValue());

				final ComponentCommon common = stormTopology.get_bolts().get(boltId).get_common();

				Set<Entry<GlobalStreamId, Grouping>> unprocessedBoltInputs = unprocessdInputsPerBolt.get(boltId);
				if (unprocessedBoltInputs == null) {
					unprocessedBoltInputs = new HashSet<>();
					unprocessedBoltInputs.addAll(common.get_inputs().entrySet());
					unprocessdInputsPerBolt.put(boltId, unprocessedBoltInputs);
				}

				// check if all inputs are available
				final int numberOfInputs = unprocessedBoltInputs.size();
				int inputsAvailable = 0;
				for (Entry<GlobalStreamId, Grouping> entry : unprocessedBoltInputs) {
					final String producerId = entry.getKey().get_componentId();
					final String streamId = entry.getKey().get_streamId();
					final HashMap<String, DataStream<Tuple>> streams = availableInputs.get(producerId);
					if (streams != null && streams.get(streamId) != null) {
						inputsAvailable++;
					}
				}

				if (inputsAvailable != numberOfInputs) {
					// traverse other bolts first until inputs are available
					continue;
				} else {
					makeProgress = true;
					boltsIterator.remove();
				}

				final Map<GlobalStreamId, DataStream<Tuple>> inputStreams = new HashMap<>(numberOfInputs);

				for (Entry<GlobalStreamId, Grouping> input : unprocessedBoltInputs) {
					final GlobalStreamId streamId = input.getKey();
					final Grouping grouping = input.getValue();

					final String producerId = streamId.get_componentId();

					final Map<String, DataStream<Tuple>> producer = availableInputs.get(producerId);

					inputStreams.put(streamId, processInput(boltId, userBolt, streamId, grouping, producer));
				}

				final SingleOutputStreamOperator<?> outputStream = createOutput(boltId,
						userBolt, inputStreams);

				if (common.is_set_parallelism_hint()) {
					int dop = common.get_parallelism_hint();
					outputStream.setParallelism(dop);
				} else {
					common.set_parallelism_hint(1);
				}

			}
		}
	}

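The translation handles spouts first and then bolts; the spouts and bolts information it works from was obtained in the constructor by reflection on Storm's TopologyBuilder object.
Flink uses FlinkOutputFieldsDeclarer (which implements Storm's OutputFieldsDeclarer interface) to hold the declareOutputFields information configured in Storm's IRichSpout and IRichBolt; note, however, that Flink does not support direct emit. Here the userSpout.declareOutputFields call writes the original spout's declarations into the FlinkOutputFieldsDeclarer.
Flink wraps each spout in a SpoutWrapper, which turns it into a RichParallelSourceFunction; the code branches on whether the spout declares exactly one output stream or several. The RichParallelSourceFunction is then passed to StreamExecutionEnvironment.addSource to create a Flink DataStreamSource, which is added to availableInputs, and the DataStreamSource's parallelism is set from the spout's parallelism hint.
For bolts, the code maintains unprocessdInputsPerBolt, whose key is the boltId and whose value is the set of GlobalStreamId and Grouping entries the bolt consumes. Because the bolts are iterated from a map, they may be visited out of order. If every GlobalStreamId a bolt consumes is already available, the bolt is translated and removed from bolts; if any of them is not yet in availableInputs, the bolt is skipped and left in bolts while the next one is processed. Since the outer loop runs as long as bolts is non-empty, skipped bolts are retried on later passes, which is how the out-of-order case is handled. If a full pass makes no progress, a RuntimeException listing the unconnected bolts is thrown.
A key method in the bolt translation is processInput, which turns the bolt's grouping into the corresponding operation on the producer's DataStream (for example, shuffleGrouping becomes rebalance, fieldsGrouping becomes keyBy, globalGrouping becomes global, and allGrouping becomes broadcast). createOutput is then called to translate the bolt's execution logic: it uses BoltWrapper or MergedInputsBoltWrapper to turn the bolt into a Flink OneInputStreamOperator, passes that to a transform call on the stream, and returns a Flink SingleOutputStreamOperator. The resulting SingleOutputStreamOperator is also added to availableInputs, and its parallelism is then set from the bolt's parallelism hint.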
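
The retry loop that copes with bolts being visited before their producers can be sketched with plain collections. This is only an illustration under simplifying assumptions: strings stand in for spouts, bolts and streams, each bolt's inputs are reduced to a set of producer ids, and BoltOrderingSketch and connect are hypothetical names and not Flink or Storm APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the "repeat until every bolt is connected" loop from translateTopology.
// Plain strings stand in for components; nothing here is a Flink or Storm API.
class BoltOrderingSketch {

    /**
     * @param sources    ids of components whose output is available from the start (spouts)
     * @param boltInputs bolt id mapped to the ids of the components it consumes from
     * @return the order in which the bolts could be connected
     */
    static List<String> connect(Set<String> sources, Map<String, Set<String>> boltInputs) {
        Set<String> available = new HashSet<>(sources);
        Map<String, Set<String>> pending = new LinkedHashMap<>(boltInputs);
        List<String> order = new ArrayList<>();

        boolean makeProgress = true;
        while (!pending.isEmpty()) {
            if (!makeProgress) {
                // A full pass connected nothing, so the remaining bolts can never be wired.
                throw new IllegalStateException("Could not connect bolts: " + pending.keySet());
            }
            makeProgress = false;

            Iterator<Map.Entry<String, Set<String>>> it = pending.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Set<String>> bolt = it.next();
                if (!available.containsAll(bolt.getValue())) {
                    continue; // a producer is not translated yet, retry in a later pass
                }
                makeProgress = true;
                it.remove();
                available.add(bolt.getKey()); // its output can now feed other bolts
                order.add(bolt.getKey());
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> bolts = new LinkedHashMap<>();
        bolts.put("print", Collections.singleton("count")); // consumer listed before its producer
        bolts.put("count", Collections.singleton("spout"));

        System.out.println(connect(Collections.singleton("spout"), bolts)); // [count, print]
    }
}
```

In the usage above, "print" is skipped on the first pass because "count" is not yet available, then picked up on the second pass. A bolt whose producer never becomes available ends the loop with an exception instead of spinning forever, which matches the makeProgress check in the original method.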

FlinkLocalCluster

flink-storm_2.11-1.6.2-sources.jar!/org/apache/flink/storm/api/FlinkLocalCluster.java

/**
 * {@link FlinkLocalCluster} mimics a Storm {@link LocalCluster}.
 */
public class FlinkLocalCluster {

	/** The log used by this mini cluster. */
	private static final Logger LOG = LoggerFactory.getLogger(FlinkLocalCluster.class);

	/** The Flink mini cluster on which to execute the programs. */
	private FlinkMiniCluster flink;

	/** Configuration key to submit topology in blocking mode if flag is set to {@code true}. */
	public static final String SUBMIT_BLOCKING = "SUBMIT_STORM_TOPOLOGY_BLOCKING";

	public FlinkLocalCluster() {
	}

	public FlinkLocalCluster(FlinkMiniCluster flink) {
		this.flink = Objects.requireNonNull(flink);
	}

	@SuppressWarnings("rawtypes")
	public void submitTopology(final String topologyName, final Map conf, final FlinkTopology topology)
			throws Exception {
		this.submitTopologyWithOpts(topologyName, conf, topology, null);
	}

	@SuppressWarnings("rawtypes")
	public void submitTopologyWithOpts(final String topologyName, final Map conf, final FlinkTopology topology, final SubmitOptions submitOpts) throws Exception {
		LOG.info("Running Storm topology on FlinkLocalCluster");

		boolean submitBlocking = false;
		if (conf != null) {
			Object blockingFlag = conf.get(SUBMIT_BLOCKING);
			if (blockingFlag instanceof Boolean) {
				submitBlocking = ((Boolean) blockingFlag).booleanValue();
			}
		}

		FlinkClient.addStormConfigToTopology(topology, conf);

		StreamGraph streamGraph = topology.getExecutionEnvironment().getStreamGraph();
		streamGraph.setJobName(topologyName);

		JobGraph jobGraph = streamGraph.getJobGraph();

		if (this.flink == null) {
			Configuration configuration = new Configuration();
			configuration.addAll(jobGraph.getJobConfiguration());

			configuration.setString(TaskManagerOptions.MANAGED_MEMORY_SIZE, "0");
			configuration.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, jobGraph.getMaximumParallelism());

			this.flink = new LocalFlinkMiniCluster(configuration, true);
			this.flink.start();
		}

		if (submitBlocking) {
			this.flink.submitJobAndWait(jobGraph, false);
		} else {
			this.flink.submitJobDetached(jobGraph);
		}
	}

	public void killTopology(final String topologyName) {
		this.killTopologyWithOpts(topologyName, null);
	}

	public void killTopologyWithOpts(final String name, final KillOptions options) {
	}

	public void activate(final String topologyName) {
	}

	public void deactivate(final String topologyName) {
	}

	public void rebalance(final String name, final RebalanceOptions options) {
	}

	public void shutdown() {
		if (this.flink != null) {
			this.flink.stop();
			this.flink = null;
		}
	}

	//......
}

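FlinkLocalCluster's submitTopology method delegates to submitTopologyWithOpts. The latter mainly sets a few parameters, calls topology.getExecutionEnvironment().getStreamGraph() to generate a StreamGraph from the transformations, and obtains the JobGraph from it. If no FlinkMiniCluster was supplied, it creates and starts a LocalFlinkMiniCluster, and finally it submits the whole JobGraph with the mini cluster's submitJobAndWait or submitJobDetached, depending on the SUBMIT_BLOCKING flag.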
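
One detail of the flag handling is easy to miss: the value is only honoured when it is a java.lang.Boolean. The following sketch isolates that check using only the standard library. The key string is copied from FlinkLocalCluster.SUBMIT_BLOCKING in the source above, while SubmitBlockingSketch and isBlocking are hypothetical names introduced for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of how submitTopologyWithOpts interprets the blocking flag.
// The class and method names are hypothetical; only the key string comes from Flink.
class SubmitBlockingSketch {

    static final String SUBMIT_BLOCKING = "SUBMIT_STORM_TOPOLOGY_BLOCKING";

    /** Returns true only when the flag is present and is a Boolean with value true. */
    static boolean isBlocking(Map<?, ?> conf) {
        if (conf == null) {
            return false;
        }
        Object flag = conf.get(SUBMIT_BLOCKING);
        return flag instanceof Boolean && (Boolean) flag;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        System.out.println(isBlocking(conf)); // false: flag absent

        conf.put(SUBMIT_BLOCKING, "true");
        System.out.println(isBlocking(conf)); // false: a String is ignored

        conf.put(SUBMIT_BLOCKING, true);
        System.out.println(isBlocking(conf)); // true
    }
}
```

So a configuration that carries the flag as the string "true" would fall through to the detached submission path.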

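Summary

Through FlinkTopology, Flink offers a degree of compatibility with Storm, which is very helpful when migrating from Storm to Flink.
Running a Storm topology on Flink takes a few steps: build Storm's native TopologyBuilder, convert the StormTopology into a FlinkTopology with FlinkTopology.createTopology(builder), and finally submit the FlinkTopology through the submitTopology method of FlinkLocalCluster (local mode) or FlinkSubmitter (remote submission).
FlinkTopology is the core of Flink's Storm compatibility. It translates the StormTopology into the corresponding Flink structures: it uses SpoutWrapper to turn a spout into a RichParallelSourceFunction and adds it to the StreamExecutionEnvironment to create a DataStream; it turns each bolt's grouping into the corresponding operation on the producer's DataStream (for example, shuffleGrouping becomes rebalance, fieldsGrouping becomes keyBy, globalGrouping becomes global, and allGrouping becomes broadcast); and it uses BoltWrapper or MergedInputsBoltWrapper to turn the bolt into a Flink OneInputStreamOperator, which is passed to a transform call on the stream.
Once the FlinkTopology is built, it is submitted with FlinkLocalCluster to run locally or with FlinkSubmitter to run remotely.
FlinkLocalCluster's submitTopology method generates a StreamGraph from the StreamExecutionEnvironment that the FlinkTopology operated on, obtains the JobGraph from it, creates and starts a LocalFlinkMiniCluster, and finally submits the JobGraph through the LocalFlinkMiniCluster.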

doc

Storm Compatibility Beta
