
Lessons for Machine Learning from Econometrics

Hal Varian is the chief economist at Google. In November 2013 he gave a talk to the Electronic Support Group in the EECS Department at the University of California at Berkeley. The talk was titled Machine Learning and Econometrics and focused on the lessons the machine learning community can take away from the field of econometrics.

Hal started out by summarizing a recent paper of his titled “Big Data: New Tricks for Econometrics” (PDF), which comments on what the econometrics community can learn from the machine learning community, namely:

  • Train-test-validate to avoid overfitting (see the sketch after this list)
  • Cross validation
  • Nonlinear estimation (trees, forests, SVMs, neural nets, etc)
  • Bootstrap, bagging, boosting
  • Variable selection (lasso and friends)
  • Model averaging
  • Computational Bayesian methods (MCMC)
  • Tools for manipulating big data (SQL, NoSQL databases)
  • Textual analysis (not discussed)
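
A few of these items are easy to see in code. Below is a minimal sketch, assuming scikit-learn and a purely hypothetical synthetic dataset, of the first few points: a train/test split with a held-out test set, cross validation, and lasso-style variable selection.

```python
# A sketch of train/validation/test splitting, cross validation and
# lasso variable selection on a purely hypothetical synthetic dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical data: 20 candidate features, only 5 of them informative
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Hold out a test set; everything else is used for training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# LassoCV chooses the regularization strength by internal cross validation,
# shrinking uninformative coefficients to exactly zero (variable selection)
model = LassoCV(cv=5).fit(X_train, y_train)

print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
print("cross-validated R^2:", cross_val_score(model, X_train, y_train, cv=5).mean())
print("held-out test R^2:", model.score(X_test, y_test))
```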

He continued by talking about non-i.i.d. data such as time series and panel data, where cross validation typically does not perform well. He suggests decomposing the data into trend plus seasonal components and looking at deviations from expected behavior. An example is given from Google Correlate showing that auto dealer sales data best correlates with searches for Indian restaurants (madness!).

Figure (Google Correlate example): NSA auto sales and Google Correlate to 2012.
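
As a concrete version of the trend-plus-seasonal idea above, here is a minimal sketch assuming statsmodels and a hypothetical monthly sales series; the residual component is where the deviations from expected behavior show up.

```python
# A sketch of decomposing a series into trend + seasonal components and
# looking at the residual for deviations from expected behavior.
# The monthly "sales" series below is hypothetical.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2008-01-01", periods=60, freq="MS")  # 5 years, monthly
t = np.arange(60)
sales = pd.Series(100 + 0.5 * t                      # trend
                  + 10 * np.sin(2 * np.pi * t / 12)  # yearly seasonality
                  + rng.normal(0, 2, 60),            # noise
                  index=idx)

# Additive decomposition with a 12-month seasonal period
result = seasonal_decompose(sales, model="additive", period=12)

# The largest residuals are the deviations from expected behavior
print(result.resid.abs().nlargest(5))
```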

The focus of the talk is causal inference, a big subject in econometrics. He covers:

  • Counterfactuals: What would have happened to the treated if they weren’t treated? Would they look like the control on average? Read more about counterfactuals within empirical testing.
  • Confounding Variables: Unobserved variables that correlate with both x and y (the other stuff). Commonly an issue when human choice is involved. Read more about confounding variables.
  • Natural Experiments: May or may not be randomized. An example is the draft lottery. Read more about natural experiments.
  • Regression Discontinuity: A cut-off or threshold above or below which the treatment is applied. You can compare cases close to the (arbitrary) threshold to estimate the average treatment effect when randomization is not possible. Tune the threshold once you can model the causal relationship and play what-ifs (don’t leave randomization to chance). Read more on regression discontinuity design (RDD).
  • Difference in Differences (DiD): It’s not enough to look at before and after the treatment; you need to adjust the treated by the control, because the treatment may not be randomly assigned (see the sketch after this list). Read more about difference in differences.
  • Instrumental Variables: Variation in X that is independent of error. Something that changes X (correlates with X) but does not change the error. Provides a control lever. Randomization is an instrumental variable. Read more about instrumental variables.
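
As a concrete version of the difference-in-differences point above, here is a minimal sketch assuming statsmodels and a hypothetical two-group, two-period dataset; the coefficient on the treated-by-post interaction is the DiD estimate of the treatment effect.

```python
# Difference in differences as a regression: the treated:post interaction
# coefficient is the DiD estimate of the treatment effect. Data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
treated = rng.integers(0, 2, n)   # 1 = treated group, 0 = control group
post = rng.integers(0, 2, n)      # 1 = after the treatment date, 0 = before

# Simulated outcome: group difference + common time trend + a true effect of 3
y = 10 + 2 * treated + 1.5 * post + 3.0 * treated * post + rng.normal(0, 1, n)
df = pd.DataFrame({"y": y, "treated": treated, "post": post})

# "Adjust the treated by the control" is exactly the interaction term
model = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])  # recovers something close to 3.0
```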

He summarized the lessons for the machine learning community from econometrics as follows:

  • Observational data (usually) can’t determine causality, no matter how big it is (big data is not enough)
  • Causal inference is what you want for policy
  • Treatment-control with random assignment is the gold standard
  • Sometimes you can find natural experiments, discontinuities, etc.
  • Prediction is critical to causal inference, for both the selection issue and the counterfactual (see the sketch after this list)
  • Very interesting research in systems for continual testing
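
On the prediction-and-counterfactual point above, one simple pattern is to fit a predictive model on pre-treatment data only, predict the post-treatment period as if nothing had changed, and take the gap between prediction and observation as the estimated effect. The sketch below uses a hypothetical series and a plain linear trend as a stand-in for a more serious forecaster (e.g. the Bayesian structural time series behind Google's CausalImpact).

```python
# Counterfactual prediction: fit a model on pre-treatment data only, predict
# the post-treatment period as if untreated, and compare with what happened.
# The series and the linear trend model are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t = np.arange(100)
pre, post = t < 70, t >= 70

# Hypothetical outcome: a linear trend, plus a lift of +5 after t = 70
y = 2.0 + 0.3 * t + rng.normal(0, 1, 100)
y[post] += 5.0

# Train only on the pre-treatment window
model = LinearRegression().fit(t[pre].reshape(-1, 1), y[pre])

# Predicted counterfactual: what the post period would have looked like untreated
counterfactual = model.predict(t[post].reshape(-1, 1))
print("estimated effect:", (y[post] - counterfactual).mean())  # close to 5.0
```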

Hal finished with two book recommendations.

The talk was also given to the Stanford University Department of Electrical Engineering in 2014, titled What Machine Learning Can Learn from Econometrics and Vice Versa. You can see the PDF slides from this second talk; it is pretty much the same.
