Feature Selection in Python with Scikit
Not all data attributes are created equal. More is not always better when it comes to attributes or columns in your dataset.
In this post you will discover how to select attributes in your data before creating a machine learning model using the scikit-learn library.
Select Features
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
Having too many irrelevant features in your data can decrease the accuracy of the models. Three benefits of performing feature selection before modeling your data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.
Two different feature selection methods provided by the scikit-learn Python library are Recursive Feature Elimination and feature importance ranking.
Recursive Feature Elimination
The Recursive Feature Elimination (RFE) method is a feature selection approach. It works by recursively removing attributes and building a model on those attributes that remain. In scikit-learn, each fitted model's coefficients or feature importances are used to rank the remaining attributes, and the weakest are dropped until the requested number of attributes is left.
This recipe shows the use of RFE on the Iris flowers dataset to select 3 attributes.
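To make the elimination loop concrete, here is a minimal hand-written sketch of the idea. It is a simplified illustration, not scikit-learn's actual implementation: the variable names (remaining, n_to_keep) and the choice to score each attribute by the sum of its squared coefficients across classes are assumptions made for this example.

```python
# Hand-rolled sketch of recursive elimination (illustrative only)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))  # column indices still in play
n_to_keep = 3                        # assumed target, matching the recipe

while len(remaining) > n_to_keep:
    # refit the model on the attributes that are left
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # coef_ has one row per class; aggregate to one score per attribute
    scores = (model.coef_ ** 2).sum(axis=0)
    # drop the attribute with the smallest score
    weakest = int(np.argmin(scores))
    removed = remaining.pop(weakest)
    print("dropped column", removed)

print("kept columns:", remaining)
```

In practice the RFE class handles this loop for you, along with step sizes and ranking bookkeeping.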
```python
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
# (max_iter is raised to avoid convergence warnings)
model = LogisticRegression(max_iter=1000)
# create the RFE model and select 3 attributes
# (n_features_to_select must be passed by keyword in recent scikit-learn releases)
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
```
For more information, see the RFE class in the API documentation.
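The boolean mask in support_ is easier to read when mapped back to attribute names, and the fitted selector can also produce the reduced dataset directly. The following is a small sketch under the same setup as the recipe; the variable names selected and X_reduced are introduced here for illustration.

```python
# Map the RFE selection back to attribute names and reduce the data
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(iris.data, iris.target)

# keep the names whose mask entry is True
selected = [name for name, keep in zip(iris.feature_names, rfe.support_) if keep]
print(selected)

# transform() returns only the selected columns
X_reduced = rfe.transform(iris.data)
print(X_reduced.shape)  # (150, 3)
```

Attributes with a ranking_ value of 1 are the selected ones; larger values indicate attributes that were eliminated earlier.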
Feature Importance
Methods that use ensembles of decision trees (like Random Forest or Extra Trees) can also compute the relative importance of each attribute. These importance values can be used to inform a feature selection process.
This recipe shows the construction of an Extra Trees ensemble on the Iris flowers dataset and the display of the relative feature importance.
```python
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# display the relative importance of each attribute
print(model.feature_importances_)
```
For more information, see the ExtraTreesClassifier class in the API documentation.
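One way to turn importance values into an actual selection step is scikit-learn's SelectFromModel, which keeps the attributes whose importance meets a threshold. The sketch below is one possible approach, not part of the original recipe; the "median" threshold, n_estimators=100, and random_state=0 are assumptions chosen for illustration.

```python
# Use Extra Trees importances to select attributes (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
# assumed settings: 100 trees, fixed seed, keep attributes at or above the median importance
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
selector.fit(iris.data, iris.target)

# importances of the fitted ensemble (they sum to 1)
importances = selector.estimator_.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"{name}: {score:.3f}")

# reduced dataset containing only the selected attributes
X_selected = selector.transform(iris.data)
print(X_selected.shape)
```

Because Extra Trees is randomized, the importance values vary from run to run unless a seed is fixed.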
Summary
Feature selection methods can give you useful information on the relative importance or relevance of features for a given problem. You can use this information to create filtered versions of your dataset and increase the accuracy of your models.
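Whether a filtered dataset actually improves accuracy is worth checking rather than assuming. A minimal sketch of such a check follows, comparing cross-validated accuracy with and without RFE; the 5-fold setting and the step names in the pipeline are assumptions for this example. Placing the selector inside a Pipeline means selection is refit on each training fold, which avoids leaking information from the held-out data.

```python
# Compare cross-validated accuracy with and without feature selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# baseline: all four attributes
baseline = LogisticRegression(max_iter=1000)
baseline_scores = cross_val_score(baseline, X, y, cv=5)

# filtered: RFE keeps 3 attributes, refit inside every fold
filtered = Pipeline([
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)),
    ("model", LogisticRegression(max_iter=1000)),
])
filtered_scores = cross_val_score(filtered, X, y, cv=5)

print("all attributes:      %.3f" % baseline_scores.mean())
print("selected attributes: %.3f" % filtered_scores.mean())
```

On a small, clean dataset like Iris the difference may be negligible; the benefit tends to show up on datasets with many irrelevant attributes.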
In this post you discovered two feature selection methods you can apply in Python using the scikit-learn library.