
Spark GraphX: Connected Components

1. The Connected Components Algorithm

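The Connected Components algorithm labels every connected component of a graph with an id, using the id of the lowest-numbered vertex in the component as the component's id. If a path exists between every pair of vertices in a graph G, then G is called a connected graph; otherwise G is a disconnected graph, and its maximal connected subgraphs are called connected components. The graph in the accompanying figure, for example, has two connected components.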
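To make the labeling rule concrete, here is a minimal plain-Scala sketch that does not use Spark. It treats edges as undirected and keeps pushing the smaller label across each edge until nothing changes. The object `MinIdLabeling` and its method are hypothetical names chosen for illustration, and the loop is a simplification of the idea rather than GraphX's actual implementation. The edge list is the one from the followers.txt example in this article.

```scala
object MinIdLabeling {
  // Hypothetical helper, for illustration only: labels each vertex with the
  // smallest vertex id in its connected component (edges treated as undirected).
  def connectedComponents(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Every vertex starts out labeled with its own id.
    var labels: Map[Long, Long] =
      edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct.map(v => v -> v).toMap

    // Repeat until a full pass over the edges changes no label.
    var changed = true
    while (changed) {
      changed = false
      for ((src, dst) <- edges) {
        val smallest = math.min(labels(src), labels(dst))
        if (labels(src) != smallest || labels(dst) != smallest) {
          labels = labels.updated(src, smallest).updated(dst, smallest)
          changed = true
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L))
    // Prints (1,1) (2,1) (3,3) (4,1) (6,3) (7,3), one pair per line.
    connectedComponents(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Vertices 1, 2 and 4 end up with label 1, and vertices 3, 6 and 7 with label 3, so the two distinct labels correspond to the two connected components.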

2. Example

followers.txt (source id, destination id)

4 1
1 2
6 3
7 3
7 6
6 7
3 7

users.txt (id, username, full name)

1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys

import org.apache.spark.graphx.{Graph, GraphLoader, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Connected_Components {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local")
    val sc: SparkContext = new SparkContext(conf)

    // Read followers.txt and build the graph
    val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "src/main/resources/connected/followers.txt")

    // Compute the connected components
    val components: Graph[VertexId, Int] = graph.connectedComponents()
    val vertices: VertexRDD[VertexId] = components.vertices
    /**
     * vertices:
     * (4,1)
     * (1,1)
     * (6,3)
     * (3,3)
     * (7,3)
     * (2,1)
     * Each element is a tuple: the key is a vertex id, and the value is the id of
     * the connected component containing that vertex (the smallest vertex id in the component)
     */

    // Read users.txt and convert each line to a (key, value) pair
    val users: RDD[(VertexId, String)] = sc.textFile("src/main/resources/connected/users.txt").map(line => {
      val fields: Array[String] = line.split(",")
      (fields(0).toLong, fields(1))
    })
    /**
     * users:
     * (1,BarackObama)
     * (2,ladygaga)
     * (3,jeresig)
     * (4,justinbieber)
     * (6,matei_zaharia)
     * (7,odersky)
     * (8,anonsys)
     */

    // Join usernames with component ids, then group the usernames by component.
    // User 8 has no edges in followers.txt, so the inner join drops it.
    users.join(vertices).map {
      case (id, (username, componentId)) => (componentId, username)
    }.groupByKey().map(t => {
      t._1 + "->" + t._2.mkString(",")
    }).foreach(println(_))
    /**
     * Result:
     * 1->justinbieber,BarackObama,ladygaga
     * 3->matei_zaharia,jeresig,odersky
     */
  }
}

The computation shows that this relationship network contains two communities.
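The community count can also be read off programmatically instead of by inspecting the printed groups. The following is a sketch under stated assumptions: the object `CountCommunities` and its method `componentSizes` are hypothetical names, and the graph is built from an in-memory copy of the followers.txt edges with `Graph.fromEdgeTuples` rather than loaded from the file, so that the example is self-contained.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.{SparkConf, SparkContext}

object CountCommunities {
  // Hypothetical helper: maps each component id to the number of vertices in it.
  def componentSizes(sc: SparkContext, edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Build the graph from (source id, destination id) pairs; 1 is a dummy vertex attribute.
    val graph = Graph.fromEdgeTuples(sc.parallelize(edges), 1)
    graph.connectedComponents().vertices
      .map { case (_, componentId) => (componentId, 1L) }
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CountCommunities").setMaster("local")
    val sc = new SparkContext(conf)
    // In-memory copy of the followers.txt edges (an assumption made for self-containment).
    val edges = Seq((4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L))
    val sizes = componentSizes(sc, edges)
    println(s"communities: ${sizes.size}")
    sizes.toSeq.sortBy(_._1).foreach { case (id, size) =>
      println(s"component $id has $size vertices")
    }
    sc.stop()
  }
}
```

Only vertices that appear in at least one edge are counted. A user such as id 8, who is listed in users.txt but not in followers.txt, is not part of the graph and therefore does not form a component of its own here.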