Spark和Java API(二)Word Count
阿新 • • 發佈:2021-06-10
本文介紹如何基於Spark和Java來實現一個單詞計數(Word Count)的程式。
建立工程
建立一個Maven工程,pom.xml檔案如下:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.github.ralgond</groupId> <artifactId>spark-java-api</artifactId> <version>0.0.1-SNAPSHOT</version> <dependencies> <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core --> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.12</artifactId> <version>3.1.1</version> <scope>provided</scope> </dependency> </dependencies> <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <version>3.0</version> <configuration> <source>1.8</source> <target>1.8</target> </configuration> </plugin> </plugins> </build> </project>
編寫java類WordCount
建立一個包com.github.ralgond.sparkjavaapi,在該包下建立一個名為WordCount的類,該類內容如下:
package com.github.ralgond.sparkjavaapi; import java.util.Arrays; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import scala.Tuple2; public class WordCount { public static void main(String args[]) { String fileName = args[0]; SparkConf conf = new SparkConf().setAppName("WordCount Application"); JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> data1 = sc.textFile(fileName); JavaRDD<String> data2 = data1.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()); JavaPairRDD<String, Integer> data3 = data2.mapToPair(e -> new Tuple2<String,Integer>(e, 1)); JavaPairRDD<String, Integer> data4 = data3.reduceByKey((a,b)-> a + b); System.out.println(data4.collect()); sc.close(); } }
編譯並執行
通過mvn clean package編譯出jar包spark-java-api-0.0.1-SNAPSHOT.jar。
到spark安裝目錄裡,執行如下命令:
bin\spark-submit --class com.github.ralgond.sparkjavaapi.WordCount {..}\spark-java-api-0.0.1-SNAPSHOT.jar README.md
便可以看到結果: