Large-scale Graph Mining with Spark

阿新 • Published: 2018-12-28

Graphs 101

A graph is a data structure for representing pairwise relationships between objects. Graphs consist of nodes (also called vertices) and edges, and they can be directed or undirected. For instance, Twitter follows form a directed graph: the relationship is one-way. Just because I follow another user doesn’t mean they follow me!

I focus on web graphs. Web graphs capture link relationships between different websites. Each webpage is a node. If there is an HTML link from one page to another, draw an edge between those two nodes.
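To make that concrete, here is a minimal sketch in plain Python (not Spark) of turning link pairs into a directed graph stored as an adjacency list. The page URLs and the `build_web_graph` helper are made up for illustration; real data would come from a crawl.

```python
from collections import defaultdict

# Hypothetical (source_page, target_page) link pairs; the URLs are invented.
links = [
    ("a.com/home", "a.com/news"),
    ("a.com/home", "a.com/about"),
    ("a.com/news", "a.com/home"),
    ("a.com/news", "b.com/story"),
]

def build_web_graph(link_pairs):
    """Return a directed adjacency list: page -> set of pages it links to."""
    graph = defaultdict(set)
    for source, target in link_pairs:
        graph[source].add(target)
        graph[target]  # touch the target so pages with no outgoing links are still nodes
    return dict(graph)

web_graph = build_web_graph(links)
```

Because the edges are directed, `a.com/home` linking to `a.com/about` does not imply a link back, just like the Twitter follow example.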

As you do this for more and more pages, you’ll notice substructures emerge. On real web data, these substructures can be quite large and complex!

Here’s a sample graph of all the pages under a single news site.

Each light blue dot represents a single webpage, or node.
Each dark blue line represents a link between 2 pages, an edge.

Subpage structure of a news site, generated by me using Gephi.

Even at this level, you can see dense clusters, or communities, of pages. You can spot nodes of higher degree centrality (pages with a large number of other pages linking to them).
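As a rough illustration of "pages with a large number of other pages linking to them", the sketch below counts inbound links per page in plain Python. The edge list and the `in_degree` helper are hypothetical, and this is the raw in-degree count rather than a normalized centrality score.

```python
from collections import Counter

# Hypothetical directed edges (source_page, target_page).
edges = [
    ("home", "news"),
    ("about", "news"),
    ("sports", "news"),
    ("news", "home"),
    ("sports", "home"),
]

def in_degree(edge_list):
    """Count, for every page, how many distinct pages link to it."""
    counts = Counter()
    for source, target in set(edge_list):  # de-duplicate repeated links
        counts[source] += 0  # make sure pages with no inbound links still appear
        counts[target] += 1
    return counts

degrees = in_degree(edges)
most_linked = degrees.most_common(1)[0]  # (page, inbound link count)
```

Here `news` ends up as the most linked-to page, with three other pages pointing at it.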

If a single site is so dense with connections, imagine what we can mine from tens of thousands of sites!

Why are graphs useful?

OK, that blue jellyfish thingy looks cool and all, but why even do all this?

-you

There are many machine learning problems where labels (information on whether a data point is of one class or another) are not available. Unsupervised learning problems rely on finding similarities between data points to classify data into groups or clusters. Contrast this with supervised approaches, where data is labelled with the appropriate class and your model learns to differentiate classes using these labels.

Unsupervised learning is especially useful when you can’t easily get more data and need to extract more value from the data you already have. Labels may be unavailable; even when they can be obtained, doing so may be too time-consuming or expensive. At the start of a machine learning problem, we also may not know exactly how many classes of objects we are looking for!

Here’s why we want graphs in our toolkit:

Graphs allow us to get more value from our data in an unsupervised setting. We can get clusters from graphs.

Unsupervised learning is not unlike how humans learn! How did you first learn to tell the difference between dogs and cats? I’d guess for most people, no one sat your young self down and defined in precise taxonomic terms what a dog or cat was. Nor did your parents give you a corpus of thousands of cat and dog photos, each labelled, and ask you to draw a decision boundary that accurately divides the two classes of animals.

If your childhood was anything like mine, you probably met a few cats and met a few dogs. All the while, your young mind identified the salient differences between the 2 animals, as well as relevant common traits within each type of animal. Our brains are incredible at soaking up information from our environment, synthesizing this data, and formulating commonalities between the vastly different things we come across over the course of our lives.

There are many exciting applications of clustering. A few examples that come up in my work:

Predict class labels for datasets where there were no labels to learn from.
Generate groupings for audience segmentation and classification.
Build a recommender for similar web sites.
Find anomalies.
Use clusters as part of a semi-supervised machine learning ensemble. Clusters can help you extend known labels to nearby data points to increase training data size, or they can be used outright if labels are needed immediately until a secondary system can classify them.

Here’s the kicker: in unsupervised learning, clusters are communities! And communities are clusters!

Graph communities are clusters too!

The only difference is, instead of using engineered features, you’re relying on the underlying network structure in your graph to derive clusters. Instead of a predefined distance metric, you use edges in your graph to measure similarity between data points.
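One way to see how edges alone can produce clusters is a toy label propagation pass, sketched below in plain Python. This is an illustrative assumption, not necessarily the method used at scale: the edge list, the `label_propagation` helper, and the largest-label tie-break are all chosen just to keep the example small and reproducible.

```python
from collections import Counter, defaultdict

# Hypothetical undirected edges: two triangles joined by one bridge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

def label_propagation(edge_list, max_iters=10):
    """Give each node the most common label among its neighbors until stable."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = {node: node for node in neighbors}  # every node starts as its own community
    for _ in range(max_iters):
        changed = False
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            # Deterministic tie-break (largest label) keeps this sketch reproducible;
            # ties are commonly broken at random instead.
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

labels = label_propagation(edges)
communities = defaultdict(set)
for node, label in labels.items():
    communities[label].add(node)
```

On this toy graph the two triangles come out as separate communities, even though no features or distance metric were defined: the only input is who is connected to whom.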
