
Machine Learning Done Wrong (Seven Common Mistakes in Machine Learning)

The author summarizes seven common mistakes in machine learning:

1. Taking the default loss function for granted;

2. Using plain linear models for non-linear relationships;

3. Forgetting about outliers;

4. Using high-variance models when samples are few;

5. Applying L1/L2 regularization without standardization;

6. Using linear models without considering collinearity;

7. Judging feature importance by the absolute value of parameters in an LR (logistic regression) model.

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data.

In this post, I would like to share some common mistakes (the don'ts). I'll save some of the best practices (the dos) for a future post.

1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, the off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare class (e.g., through up/down sampling or sample weighting).
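As a minimal sketch of this idea (assuming scikit-learn; the synthetic features and amounts below are hypothetical stand-ins for real transaction data), per-sample weights can fold the dollar amount of each fraud case into the loss:

```python
# Hedged sketch: weight each fraudulent transaction by its dollar amount so
# that missing an expensive fraud costs more than missing a cheap one.
# All data here is synthetic; real features and labels would come from logs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))               # stand-in features
amount = rng.lognormal(mean=3.0, size=n)  # transaction dollar amounts
y = (rng.random(n) < 0.02).astype(int)    # ~2% fraud: highly imbalanced labels

# Penalize each missed fraud in proportion to its amount; legit cases weigh 1.
sample_weight = np.where(y == 1, amount, 1.0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```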

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to logistic regression because it's simple. But many forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address AND transaction amount < $50" are required for good model performance. So one should either encode the interactions explicitly or prefer non-linear models such as SVMs with non-linear kernels or tree-based classifiers that bake in higher-order interaction features, as sketched below.
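A small sketch of both remedies (scikit-learn assumed; the feature names mirror the fraud example and the data is synthetic): hand-encode the interaction for the linear model, or let a kernel SVM pick it up automatically.

```python
# Sketch: the label is driven by the interaction "same address AND amount < $50".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500
same_address = rng.integers(0, 2, size=n)                      # billing == shipping
small_amount = (rng.uniform(0, 100, size=n) < 50).astype(int)  # amount < $50
y = same_address & small_amount                                # interaction drives the label
y = np.where(rng.random(n) < 0.05, 1 - y, y)                   # 5% label noise

X = np.column_stack([same_address, small_amount])

# Remedy 1: encode the interaction manually for the linear model.
X_inter = np.column_stack([X, same_address * small_amount])
LogisticRegression().fit(X_inter, y)

# Remedy 2: an RBF-kernel SVM captures the interaction without manual encoding.
SVC(kernel="rbf").fit(X, y)
```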

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm that is robust to outliers or filter the outliers out.
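Both options are easy to sketch (scikit-learn assumed, synthetic data): filter with a simple IQR rule, or switch to a loss that caps the influence of large residuals, such as the Huber loss.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
y[:5] += 50                                   # inject a few non-generalizable outliers

# Option 1: drop points outside 1.5 * IQR of the target, then fit as usual.
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
mask = (y > q1 - 1.5 * iqr) & (y < q3 + 1.5 * iqr)
LinearRegression().fit(x[mask].reshape(-1, 1), y[mask])

# Option 2: the Huber loss is robust, so outliers stay in but barely matter.
HuberRegressor().fit(x.reshape(-1, 1), y)
```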

4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when n << p (number of samples << number of features) -- common in fields like medical data -- the richer feature space implies a much higher risk of overfitting. In fact, high variance models should be avoided entirely when n << p.
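The post doesn't prescribe a specific alternative; one common low-variance choice in the n << p regime is a heavily regularized (e.g., L1-penalized) linear model, sketched here on synthetic data with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 50, 5000                      # n << p, as in many medical data sets
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

# The L1 penalty drives most of the 5,000 coefficients to exactly zero,
# keeping the effective model complexity far below p.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print((clf.coef_ != 0).sum(), "non-zero coefficients out of", p)
```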

5. L1/L2/... regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. Hence, the regularization is biased and tends to penalize features measured on smaller numeric scales. To mitigate the problem, standardize all features and put them on an equal footing as a preprocessing step.
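A minimal sketch of the fix (scikit-learn assumed, synthetic amounts): scale inside a pipeline so the penalty treats dollar- and cent-denominated versions of the feature identically.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
amount_dollars = rng.uniform(1, 500, size=300).reshape(-1, 1)
y = 0.05 * amount_dollars.ravel() + rng.normal(scale=1.0, size=300)

model_dollars = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(amount_dollars, y)
model_cents = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(amount_dollars * 100, y)

# After standardization the unit no longer matters: both coefficients match.
print(model_dollars.named_steps["ridge"].coef_, model_cents.named_steps["ridge"].coef_)
```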

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground truth is Y = X1 + X2. Ideally, if the data is observed with a small amount of noise, linear regression would recover the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2, and Y = 100*X1 - 99*X2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation. However, it does make the problem ill-conditioned and the coefficient weights uninterpretable.
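The instability is easy to reproduce (scikit-learn, synthetic data): with X2 nearly identical to X1, refitting on bootstrap resamples yields wildly different coefficient pairs, even though each fit predicts about equally well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)      # almost perfectly collinear with x1
y = x1 + x2 + rng.normal(scale=0.1, size=n)   # ground truth: Y = X1 + X2

for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, n, size=n)  # bootstrap resample
    X = np.column_stack([x1[idx], x2[idx]])
    coef = LinearRegression().fit(X, y[idx]).coef_
    print(coef)  # the pair swings wildly across resamples; only its sum is stable
```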

7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of a coefficient, the more important the corresponding feature. This is rarely true because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely they are to be multi-collinear, and the less reliable it is to interpret feature importance from coefficients.
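Point (a) takes only a few lines to demonstrate (scikit-learn, synthetic data): rescaling a feature from dollars to cents shrinks its coefficient 100x while the fitted model's predictions are unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
amount_dollars = rng.uniform(1, 500, size=300).reshape(-1, 1)
y = 0.05 * amount_dollars.ravel() + rng.normal(scale=1.0, size=300)

coef_dollars = LinearRegression().fit(amount_dollars, y).coef_[0]
coef_cents = LinearRegression().fit(amount_dollars * 100, y).coef_[0]
print(coef_dollars, coef_cents)  # coef_cents is ~100x smaller; same model, same fit
```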

So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.
