Hands on Big Data by Peter Norvig

When I’m asked about resources for big data, I typically recommend people watch Peter Norvig’s Big Data tech talk to Facebook Engineering from 2009.

It’s fantastic because he’s a great communicator and clearly presents the deceptively simple thesis of big data in this video.

In this blog post I summarize this video for you into cliff notes you can review.

Essentially, all models are wrong, but some are useful.

More Data vs Better Algorithms (screenshot from Peter Norvig’s talk on big data)

Norvig starts out by observing that theories (models) are created by smart people who have insight. The process is slow and not reproducible, and the resulting models have flaws. If the models are going to be wrong anyway, can we come up with a faster and simpler process to create them?

Big Data Case Studies

Three case studies are presented that demonstrate that simple models can be created from a large corpus of data. The three case studies are difficult problems from the field of natural language processing (NLP):

Word Segmentation

The problem of separating unspaced characters into words so that the sentences have meaning. For example, written Chinese does not put spaces between words. Norvig uses a simple probabilistic model of what constitutes a word, and the resulting Python program fits on one page.
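To make the idea concrete, here is a minimal sketch of a unigram segmenter in the spirit of the approach described above. It is not the program from the talk: the word counts in `COUNTS` are invented toy values, and the unknown-word penalty in `log_prob` is an assumption chosen only so that long unseen strings score poorly. A real model would estimate the counts from a large corpus.

```python
from functools import lru_cache
from math import log10

# Toy unigram counts (hypothetical); a real model would be estimated
# from a very large corpus of text.
COUNTS = {"choose": 50, "spain": 30, "chooses": 5, "pain": 40, "a": 500, "in": 400}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log10 probability of a word under the unigram model.

    Unseen words get a small probability that shrinks with their length
    (an assumed smoothing rule, used here for illustration only).
    """
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    # Highest total log probability wins.
    return max(candidates)


score, words = segment("choosespain")
print(words)
```

With these toy counts the split "choose spain" scores higher than "chooses pain" simply because its words are more frequent, which is the whole model: no linguistic rules, only counts.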

Spelling Correction

The problem of determining whether a word is a typo and what the correction should be. Again, Norvig uses a simple probabilistic model: it models what counts as a word and how likely a given word is to be a typo of a candidate correction, based on the edit distance between them. It is a harder problem than segmentation.
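The description above can be sketched as a small corrector that prefers known words within one edit of the input and picks the most frequent candidate. This is an illustrative sketch, not the code shown in the talk: `WORD_COUNTS` holds invented toy counts, and limiting the search to a single edit is a simplifying assumption.

```python
import string

# Toy word counts (hypothetical); a real corrector would count words
# in a large corpus.
WORD_COUNTS = {"spelling": 20, "the": 500, "correct": 30, "world": 40, "word": 60}


def edits1(word):
    """All strings that are one edit away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """The subset of words that appear in the corpus counts."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Most frequent known candidate: the word itself, else one edit away."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))


print(correct("speling"), correct("teh"))
```

Adapting this sketch to another language would mean swapping the counts and the alphabet rather than rewriting rules, which is the maintainability argument made in the talk.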

Norvig compares his one-page Python program to an open source project that uses sophisticated models. He comments on how hard the hand-crafted models are to maintain and to adapt to new languages. He contrasts this with the big data solution, which only requires a corpus to create the statistical model.

In addition to maintainability and adaptability, Norvig comments that the simpler statistical model can capture the detail that is hand-crafted into complex, smarter models, because this detail is already in the data. It is not necessary to split out and maintain smaller complex models.

Machine Translation

The problem of translating one language into another. This is a more complex problem than segmentation and spelling correction. It requires a corpus of translated text, for example newspapers that have an English and Chinese edition. The problem is addressed as an alignment problem between the two languages. Many fancy models were tried but failed to add benefit over the simple statistical model.

Big Data Principles

Big data promotes a different mode of thinking about machine learning algorithms and datasets. The data is the model.

More data versus Better Algorithms

Norvig cites an example problem from Microsoft Research on sentence disambiguation. The worst algorithm beats the best algorithm when the size of the dataset is dramatically increased. The lesson is to max out the data for the current model and find where its performance plateaus before moving on to the next model.

Parametric versus Non-parametric

When you are data poor, there is not much you can do unless you have a good theory. You essentially throw the data away and rely on your model. If you are data rich, you have something you can work from. Keep all the data, because the situation could change, which will change your model.
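The contrast can be illustrated with a toy example: a parametric model compresses the data into a fixed number of parameters and can then discard it, while a non-parametric model keeps every data point and answers queries from them. The functions `fit_line` and `nearest_neighbour` and the sample data below are hypothetical and chosen for illustration; they are not taken from the talk.

```python
def fit_line(xs, ys):
    """Parametric: summarise the data as two numbers (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour(xs, ys, query):
    """Non-parametric: keep all points and answer from the closest one."""
    _, y = min(zip(xs, ys), key=lambda pair: abs(pair[0] - query))
    return y


# Toy data that happens to lie on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)          # after this, xs and ys could be discarded
line_prediction = slope * 2.2 + intercept    # uses only the two parameters
nn_prediction = nearest_neighbour(xs, ys, 2.2)  # needs the full dataset at query time
print(line_prediction, nn_prediction)
```

If the underlying relationship later changes, the nearest-neighbour predictor adapts as soon as new points are added, whereas the line has to be refitted and may no longer be the right theory at all.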

Norvig finished the talk with comments on supervised and unsupervised learning and the opportunity for semi-supervised methods that strike a balance and reap the benefits from both methods.

This is a great video and is well worth the one hour to watch. Highly recommended if you are looking for insight into the big data movement.

You can get a good treatment of the same material by reading Norvig’s chapter contribution to the book Beautiful Data: The Stories Behind Elegant Data Solutions. You can download this chapter for free on Norvig’s webpage Natural Language Corpus Data.

Resources

Below is a list of resources if you are interested in learning or reading more about Norvig’s take on big data.

Have you watched this video? Leave a comment and let me know what you thought.
