Netflix at Spark+AI Summit 2018

阿新 • • 發佈：2018-12-29

Netflix at Spark+AI Summit 2018

A glimpse at Spark usage for Netflix Recommendations

Apache Spark has been an immensely popular big data platform for distributed computing. Netflix has been using Spark extensively for various batch and stream-processed workloads. A substantial list of use cases for Spark computation come from the various applications in the domain of content recommendations and personalization. A majority of the machine learning pipelines for member personalization run atop large managed Spark clusters. These models form the basis of the recommender system that backs the various personalized canvases you see on the Netflix app including, title relevance ranking, row selection & sorting, and artwork personalization among others.

Spark provides the computation infrastructure to help develop the models through data preparation, feature extraction, training, and model selection. The Personalization Infrastructure team has been helping scale Spark applications in this domain for the last several years. We believe strongly in sharing our learnings with the broader Spark community and at this year’s Spark +AI Summit in San Francisco, we had the opportunity to do so via three different talks on projects using Spark at Netflix scale. This post summarizes the three talks.

Fact Store for Netflix Recommendations (Nitin Sharma, Kedar Sadekar)

The first talk cataloged our journey building training data infrastructure for personalization models — how we built a fact store for extracting features in an ever-evolving landscape of new requirements. To improve the quality of our personalized recommendations, we try an idea offline using historical data. Ideas that improve our offline metrics are then pushed as A/B tests which are measured through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. The heart of such offline analyses are historical facts (for example, viewing history of a member, videos in ‘My List’ etc) that are used to generate features required by the machine learning model. Ensuring we capture enough fact data to cover all stratification needs of various experiments and guarantee that the data we serve is temporally accurate is an important requirement.

In the talk, we presented the key requirements, evolution of our fact store design, its push-based architecture, the scaling efforts, and our learnings.

We discussed how we use Spark extensively for data processing for this fact store and delved into the design tradeoffs of fast access versus efficient storage.

Near Real-time Recommendations with Spark Streaming (Elliot Chow, Nitin Sharma)

Many recommendations for the personalization use cases at Netflix are precomputed in a batch processing fashion, but that may not be quick enough for time sensitive use cases that need to take into account member interactions, trending popularity, and new show launch promotions. With an ever-growing Netflix catalog, finding the right content for our audience in near real-time is a necessary element to providing the best personalized experience.

Our second talk delved into the realtime Spark Streaming ecosystem we have built at Netflix to provide this near-line ML Infrastructure. This talk was contextualized by a couple of product use cases using this near-real-time (NRT) infrastructure, specifically how we select the personalized video to present on the Billboard (large canvas at the top of the page), and how we select the personalized artwork for any title given the right canvas. We also reflected upon the lessons learnt while building a high volume infrastructure on top of Spark Streaming.

With regards to the infrastructure, we talked about:

Scale challenges with Spark Streaming
State management that we had to build on top of Spark
Data persistence
Resiliency, Metrics, and Operational Auto-remediation

Spark-based Stratification library for ML use cases (Shiva Chaitanya)

Our last talk introduced a specific Spark based library that we built to help with stratification of the training sets used for offline machine learning workflows. This allows us to better model our users’ behaviors and provide them great personalized video recommendations.

This library was originally created to implement user selection algorithms in our training data snapshotting infrastructure, but it has evolved to cater to the general-purpose stratification use cases in ML pipelines. The main idea here is to be able to provide a mechanism for down-sampling the data set while still maintaining the desired constraints on the data distribution. We described the flexible stratification API on top of Spark Dataframes.

Choosing Spark+Scala gave us strong type safety in a distributed computing environment. We gave some examples of how, using the library’s DSL one can easily express complex sampling rules.

These talks presented a few glimpses of the Spark usage from the Personalization use cases at Netflix. Spark is also used for many other data processing, ETL, and analytical uses in many other different domains in Netflix. Each domain brings its unique sets of challenges. For the member-facing personalization domain, the infrastructure needs to scale at the level of member scale. That means, for our over 125 million members and each of their active profiles, we need to personalize our content and do so reasonably fast for it to be relevant and timely.

While Spark provides a great horizontally-scalable compute platform, we have found that using some of the advanced features, like code-gen for example, at our scale often poses interesting technical challenges. As Spark’s popularity grows, the project will need to continue to evolve to meet the growing hunger for truly big data sets and do a better job at providing transparency and ease of debugging for the workloads running on it.

This is where sharing lessons from one organization can help benefit the community-at-large. We are happy to share our experiences at such conferences and welcome the ongoing interchange of ideas on making Spark better for modern ML and big data infrastructure use cases.

Netflix at Spark+AI Summit 2018

Netflix at Spark+AI Summit 2018

A glimpse at Spark usage for Netflix Recommendations

Netflix at Spark+AI Summit 2018

Netflix at the Spinnaker Summit 2018

Six Secret Superpowers Of Top Entrepreneurs at Founders Network Summit 2018

Guest Speakers at Global Summit 2018 InterSystems

PreSeries Predicts! The 5 Most Promising Startups at South Summit 2018

AWS Presents at the Open Source Leadership Summit 2018

微軟技術大會2018（Microsoft Tech Summit 2018）

微軟智能雲再創新高，Tech Summit 2018三喜臨門

網易AI入選2018年杭州案例

AI資訊--2018年年初預測

Microsoft Tech Summit 2018 課程簡述：利用 Windows 新特性開發出更好的手繪視訊應用

【活動】上海apache spark+Ai第十五次聚會

AI Challenger 2018 農作物病害細粒度分類-----Pytorch 深度學習實戰

2017年中國人工智慧產業最全研究報告發布 | AI世界2018年八大趨勢

Summit Symposium InterSystems Global Summit 2018 | AITopics

Generali AI Summit, Baveno (VB), October 11

Impressions and Lessons from the O’Reilly AI Conf 2018

Anita Schjøll Brede, Author at Iris.ai

dev blog: Reflections on the Human-Level AI Conference 2018

Ai CC 2018詳細安裝教程

Netflix at Spark+AI Summit 2018

Netflix at Spark+AI Summit 2018

A glimpse at Spark usage for Netflix Recommendations

相關推薦