Hands on Big Data by Peter Norvig

阿新 • Published: 2019-01-12

When I’m asked about resources for big data, I typically recommend people watch Peter Norvig’s Big Data tech talk to Facebook Engineering from 2009.

It’s fantastic because he’s a great communicator and clearly presents the deceptively simple thesis of big data.

In this blog post I summarize the video into cliff notes you can review.

Essentially, all models are wrong, but some are useful.

[Screenshot from Peter Norvig on big data: More Data vs Better Algorithms]

Norvig starts by observing that theories (models) are created by smart people who have insight. The process is slow and not reproducible, and the resulting models have flaws. If the models are going to be wrong anyway, can we come up with a faster and simpler process to create them?

Big Data Case Studies

Three case studies are presented that demonstrate that simple models can be created from a large corpus of data. All three are difficult problems from the field of natural language processing (NLP):

Word Segmentation

The problem of separating unspaced characters into words so that the sentences have meaning. For example, written Chinese does not put spaces between words. Norvig uses a simple probabilistic model of what constitutes a word, and the Python program fits on one page.
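As a rough illustration of the idea, here is a minimal sketch of a unigram segmenter. It is not Norvig's program: the word counts are a made-up toy table, and the penalty for unknown words (shrinking by a factor of 10 per character) is an assumption chosen for illustration. A real model would take its counts from a large corpus.

```python
from functools import lru_cache
from math import log

# Toy unigram counts (hypothetical); a real model would count words in a large corpus.
COUNTS = {"choose": 50, "chooses": 5, "spain": 20, "pain": 30, "spa": 10, "in": 300}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log probability of a single word under the unigram model."""
    if word in COUNTS:
        return log(COUNTS[word] / TOTAL)
    # Assumed penalty for unknown words: the longer the string, the less likely.
    return log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    # Try every possible first word and segment the remainder recursively.
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_logprob, rest_words = segment(rest)
        candidates.append((word_logprob(first) + rest_logprob, (first,) + rest_words))
    return max(candidates)
```

With this toy table, segment("choosespain")[1] is ("choose", "spain"), because that split has a higher probability than ("chooses", "pain"). The cache keeps the recursion from re-solving the same suffix many times.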

Spelling Correction

The problem of determining whether a word is a typo and what the correction should be. Again, the solution is a simple probabilistic model of what constitutes a word and of whether a word is a typo of a candidate correction, judged by edit distance. It is a harder problem than segmentation.

Norvig compares his one-page Python program to an open source project that has sophisticated models. He comments on the maintainability of the hand-crafted models and the difficulty of adapting them to new languages. He contrasts this with the big data solution, which only requires a corpus to create the statistical model.

In addition to maintainability and adaptability, Norvig comments that the simpler statistical model can capture the detail that is hand-crafted into the complex, smarter models, because this detail is in the data. It is not necessary to split out and maintain smaller complex models.
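The approach can be sketched in a few lines, in the spirit of Norvig's spelling corrector tutorial. This is a simplified sketch: the word counts are a made-up toy table, and only candidates within one edit of the input are considered.

```python
from collections import Counter

# Toy word counts (hypothetical); a real corrector would count words in a large corpus.
WORDS = Counter({"spelling": 10, "spewing": 2, "correct": 8, "the": 100})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """The subset of words that appear in the count table."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Prefer the word itself if known, else the most frequent known word one edit away."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])
```

Here correct("speling") returns "spelling": both "spelling" and "spewing" are one edit away, and the more frequent word wins. Adapting this to another language only means swapping the count table and the alphabet.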

Machine Translation

The problem of translating one language into another. This is a more complex problem than segmentation and spelling correction. It requires a corpus of translated text, for example, newspapers that publish both an English and a Chinese edition. The problem is addressed as an alignment problem between the two languages. Many fancy models were tried, but they failed to add benefit over the simple statistical model.
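To make the alignment idea concrete, here is a minimal sketch of one classic statistical word-alignment technique, IBM Model 1 trained with expectation-maximization. The talk summary does not say which model was used, so treat this as an illustrative stand-in. The three-sentence German-English corpus is a made-up toy example.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): (foreign sentence, English sentence) pairs.
CORPUS = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]


def train_alignment(corpus, iterations=10):
    """Estimate t[f][e], the probability that foreign word f translates to English word e."""
    english_vocab = {e for _, english in corpus for e in english}
    # Start from a uniform distribution over the English vocabulary.
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(english_vocab)))
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        # E-step: share each English word among the foreign words in its sentence pair.
        for foreign, english in corpus:
            for e in english:
                norm = sum(t[f][e] for f in foreign)
                for f in foreign:
                    share = t[f][e] / norm
                    counts[f][e] += share
                    totals[f] += share
        # M-step: renormalize the expected counts into probabilities.
        t = {f: {e: c / totals[f] for e, c in row.items()} for f, row in counts.items()}
    return t
```

After training on the toy corpus, the most probable translation of "Haus" is "house" and of "Buch" is "book", even though no dictionary was supplied. The only input is the pairing of sentences.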

Big Data Principles

Big data promotes a different mode of thinking about machine learning algorithms and datasets. The data is the model.

More data versus Better Algorithms

The example is a Microsoft Research problem on sentence disambiguation. The worst algorithm, given dramatically more data, beats the best algorithm trained on the smaller dataset. The lesson is to max out the data for a model and find the plateau before moving on to the next model.
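One way to act on this lesson is to record accuracy at increasing dataset sizes and look for the point where more data stops helping. The helper below is a hypothetical sketch: the function name, the gain threshold, and the learning-curve numbers are all invented for illustration and are not results from the talk or the paper.

```python
def find_plateau(curve, min_gain=0.005):
    """Return the first dataset size after which the next step gains less than min_gain.

    curve is a list of (dataset_size, accuracy) pairs sorted by dataset_size.
    Returns None if accuracy is still improving at every step.
    """
    for (size, accuracy), (_, next_accuracy) in zip(curve, curve[1:]):
        if next_accuracy - accuracy < min_gain:
            return size
    return None


# Made-up learning curve for illustration only.
curve = [(1_000, 0.80), (10_000, 0.86), (100_000, 0.91), (1_000_000, 0.912)]
plateau = find_plateau(curve)
```

With these invented numbers the model stops improving meaningfully after 100,000 examples, which would be the signal to try the next model rather than collect more data.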

Parametric versus Non-parametric

When you are data poor, there is not much you can do unless you have a good theory. You essentially throw the data away and rely on your model. If you are data rich, you have something you can work from. Keep all the data, because the situation could change, which would change your model.
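The contrast can be sketched with two small predictors. The parametric one fits a straight line and then needs only two numbers; the non-parametric one keeps every data point and answers queries from the nearest neighbors. Both functions and the data are illustrative assumptions, not code from the talk.

```python
def fit_line(points):
    """Parametric: least-squares line. After fitting, only (slope, intercept) is kept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(points, x, k=2):
    """Non-parametric: average the y values of the k stored points nearest to x."""
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


# Toy data following y = x squared (illustrative only).
points = [(0, 0), (1, 1), (2, 4), (3, 9)]
slope, intercept = fit_line(points)    # the data could now be discarded
neighbor_estimate = knn_predict(points, 0.2)  # needs the full data set at query time
```

If new points arrive, the nearest-neighbor predictor uses them immediately, while the line has to be refit from data that must still be available. That is the practical reason to keep the data around.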

Norvig finishes the talk with comments on supervised and unsupervised learning and the opportunity for semi-supervised methods that strike a balance and reap the benefits of both.

This is a great video and is well worth the one hour to watch. Highly recommended if you are looking for insight into the big data movement.

You can get a good treatment of the same material by reading Norvig’s chapter contribution to the book Beautiful Data: The Stories Behind Elegant Data Solutions. You can download this chapter for free from Norvig’s webpage Natural Language Corpus Data.

Resources

Below is a list of resources if you are interested in learning or reading more about Norvig’s take on big data.

Peter Norvig on big data at Facebook Engineering (Video): The subject of this blog post.
How to Write a Spelling Corrector: Norvig’s tutorial on writing a spelling corrector in Python. I believe this is the example given in the talk.
Scaling to very very large corpora for natural language disambiguation (Banko and Brill 2001): I believe this is the Microsoft Research paper referenced in the talk as an argument for more data over more complex models.

Have you watched this video? Leave a comment and let me know what you thought.
