
Spark and Java API (2): Word Count

This article shows how to implement a word count (Word Count) program with Spark and the Java API.

Creating the Project

Create a Maven project with the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.github.ralgond</groupId>
	<artifactId>spark-java-api</artifactId>
	<version>0.0.1-SNAPSHOT</version>

	<dependencies>
		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.12</artifactId>
			<version>3.1.1</version>
			<scope>provided</scope>
		</dependency>

	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.0</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>

Writing the WordCount Java Class

Create a package named com.github.ralgond.sparkjavaapi and, in that package, a class named WordCount with the following content:

package com.github.ralgond.sparkjavaapi;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
	public static void main(String[] args) {
		String fileName = args[0];
		SparkConf conf = new SparkConf().setAppName("WordCount Application");

		JavaSparkContext sc = new JavaSparkContext(conf);

		// Read the input file as an RDD of lines.
		JavaRDD<String> data1 = sc.textFile(fileName);

		// Split each line on whitespace and flatten into an RDD of words.
		JavaRDD<String> data2 = data1.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

		// Map each word to a (word, 1) pair.
		JavaPairRDD<String, Integer> data3 = data2.mapToPair(e -> new Tuple2<>(e, 1));

		// Sum the counts for every word.
		JavaPairRDD<String, Integer> data4 = data3.reduceByKey((a, b) -> a + b);

		// Collect the results to the driver and print them.
		System.out.println(data4.collect());

		sc.close();
	}
}
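If you would rather run the program straight from an IDE while developing, a master URL has to be set in code instead of on the spark-submit command line. Below is a minimal sketch assuming local mode with all cores (local[*]) and a hypothetical input file input.txt; the spark-core dependency scope may also need to be changed from provided so that Spark is on the runtime classpath.

package com.github.ralgond.sparkjavaapi;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hypothetical local-mode variant for quick testing; the master URL is
// hard-coded here rather than being supplied by spark-submit.
public class WordCountLocal {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf()
				.setAppName("WordCount Local")
				.setMaster("local[*]"); // use all local cores

		JavaSparkContext sc = new JavaSparkContext(conf);

		JavaPairRDD<String, Integer> counts = sc.textFile("input.txt") // hypothetical input file
				.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
				.mapToPair(w -> new Tuple2<>(w, 1))
				.reduceByKey((a, b) -> a + b);

		System.out.println(counts.collect());

		sc.close();
	}
}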

Compiling and Running

Build the jar spark-java-api-0.0.1-SNAPSHOT.jar with mvn clean package.

Then, from the Spark installation directory, run the following command:

bin\spark-submit --class com.github.ralgond.sparkjavaapi.WordCount {..}\spark-java-api-0.0.1-SNAPSHOT.jar README.md

The word counts are then printed to the console as a list of (word, count) pairs.
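Printing data4.collect() is convenient for a small file such as README.md, but it pulls every (word, count) pair back to the driver. As an alternative sketch, the result could instead be written out as text files; the output directory name output/wordcount below is hypothetical and must not already exist:

		// Write one part file per partition instead of collecting to the driver.
		data4.saveAsTextFile("output/wordcount");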