
10 Standard Datasets for Practicing Applied Machine Learning

The key to getting good at applied machine learning is practicing on lots of different datasets.

This is because each problem is different, requiring subtly different data preparation and modeling methods.

In this post, you will discover 10 top standard machine learning datasets that you can use for practice.

Let’s dive in.

  • Update March/2018: Added alternate link to download the Pima Indians and Boston Housing datasets as the originals appear to have been taken down.

Overview

A Structured Approach

Each dataset is summarized in a consistent way. This makes them easy to compare and navigate for you to practice a specific data preparation technique or modeling method.

The aspects that you need to know about each dataset are:

  1. Name: How to refer to the dataset.
  2. Problem Type: Whether the problem is regression or classification.
  3. Inputs and Outputs: The number and known names of the input and output features.
  4. Performance: Baseline performance for comparison using the Zero Rule algorithm, as well as the best known performance (if known); a minimal sketch of the Zero Rule baseline follows this list.
  5. Sample: A snapshot of the first 5 rows of raw data.
  6. Links: Where you can download the dataset and learn more.
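
For readers working in Python, here is a minimal sketch of the Zero Rule baseline using scikit-learn's dummy estimators; the tiny arrays are placeholders rather than any of the datasets below.

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Classification flavor of the Zero Rule: always predict the most frequent class.
X = np.array([[1], [2], [3], [4], [5]])
y_class = np.array([0, 0, 0, 1, 1])
zero_rule = DummyClassifier(strategy="most_frequent").fit(X, y_class)
print("baseline accuracy:", zero_rule.score(X, y_class))  # 0.6 on this toy data

# Regression flavor of the Zero Rule: always predict the mean of the training targets.
y_reg = np.array([10.0, 12.0, 20.0, 25.0, 33.0])
zero_rule = DummyRegressor(strategy="mean").fit(X, y_reg)
rmse = np.sqrt(np.mean((zero_rule.predict(X) - y_reg) ** 2))
print("baseline RMSE:", rmse)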

Standard Datasets

Below is a list of the 10 datasets we’ll cover.

Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets consist of tabular data with no (explicitly) missing values.

  1. Swedish Auto Insurance Dataset.
  2. Wine Quality Dataset.
  3. Pima Indians Diabetes Dataset.
  4. Sonar Dataset.
  5. Banknote Dataset.
  6. Iris Flowers Dataset.
  7. Abalone Dataset.
  8. Ionosphere Dataset.
  9. Wheat Seeds Dataset.
  10. Boston House Price Dataset.

1. Swedish Auto Insurance Dataset

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. It comprises 63 observations with one input variable and one output variable. The variable names are as follows:

  1. Number of claims.
  2. Total payment for all claims in thousands of Swedish Kronor.

The baseline performance of predicting the mean value is an RMSE of approximately 72.251 thousand Kronor.

A sample of the first 5 rows is listed below.

108,392.5
19,46.2
13,15.7
124,422.2
40,119.4

Below is a scatter plot of the entire dataset.

Scatter plot of the Swedish Auto Insurance Dataset
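
As a rough illustration of comparing a model to that baseline, the sketch below fits a plain linear regression to just the five sample rows shown above; since it uses only 5 of the 63 observations, the scores will not match the full-dataset figures.

import numpy as np
from sklearn.linear_model import LinearRegression

claims = np.array([[108], [19], [13], [124], [40]])      # number of claims (the 5 sample rows above)
payment = np.array([392.5, 46.2, 15.7, 422.2, 119.4])    # total payment in thousands of Kronor

baseline_rmse = np.sqrt(np.mean((payment - payment.mean()) ** 2))   # Zero Rule: predict the mean
model = LinearRegression().fit(claims, payment)
model_rmse = np.sqrt(np.mean((model.predict(claims) - payment) ** 2))
print(f"mean baseline RMSE: {baseline_rmse:.1f}  linear fit RMSE: {model_rmse:.1f}")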

2. Wine Quality Dataset

The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.

It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).

The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.

A sample of the first 5 rows is listed below.

7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
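
A quick way to confirm the class imbalance and choose a framing is sketched below; it assumes the data has been saved locally as winequality-white.csv with the twelve comma-separated columns shown in the sample (filename and layout are assumptions).

import pandas as pd

columns = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
           "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
           "ph", "sulphates", "alcohol", "quality"]
wine = pd.read_csv("winequality-white.csv", header=None, names=columns)

print(wine["quality"].value_counts())           # confirms the classes are not balanced
X = wine.drop(columns="quality")
y_classification = wine["quality"]              # multi-class framing
y_regression = wine["quality"].astype(float)    # regression framing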

3. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
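
Because the missing values are believed to be encoded as zeros, a sensible first step is to mark them as missing before modeling. The sketch below assumes a local copy named pima-indians-diabetes.csv with the nine comma-separated columns listed above (the filename is an assumption).

import numpy as np
import pandas as pd

columns = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "class"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=columns)

# Zero is not a plausible value for these measurements, so treat it as missing.
zero_as_missing = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
pima[zero_as_missing] = pima[zero_as_missing].replace(0, np.nan)
print(pima.isna().sum())   # how many values were really missing in each column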

4. Sonar Dataset

The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:

  1. Sonar returns at different angles
  2. Class (M for mine and R for rock)

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.

A sample of the first 5 rows is listed below.

0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1.0000,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.3250,0.3200,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.1840,0.1970,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.0530,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,0.6333,0.7060,0.5544,0.5320,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.4293,0.3648,0.5331,0.2413,0.5070,0.8533,0.6036,0.8514,0.8512,0.5045,0.1862,0.2709,0.4232,0.3043,0.6116,0.6756,0.5375,0.4719,0.4647,0.2587,0.2129,0.2222,0.2111,0.0176,0.1348,0.0744,0.0130,0.0106,0.0033,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,0.0881,0.1992,0.0184,0.2261,0.1729,0.2131,0.0693,0.2281,0.4060,0.3973,0.2741,0.3690,0.5556,0.4846,0.3140,0.5334,0.5256,0.2520,0.2090,0.3559,0.6260,0.7340,0.6120,0.3497,0.3953,0.3012,0.5408,0.8814,0.9857,0.9167,0.6121,0.5006,0.3210,0.3202,0.4295,0.3654,0.2655,0.1576,0.0681,0.0294,0.0241,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,0.4152,0.3952,0.4256,0.4135,0.4528,0.5326,0.7306,0.6193,0.2032,0.4636,0.4148,0.4292,0.5730,0.5399,0.3161,0.2285,0.6995,1.0000,0.7262,0.4724,0.5103,0.5459,0.2881,0.0981,0.1951,0.4181,0.4604,0.3217,0.2828,0.2430,0.1979,0.2444,0.1847,0.0841,0.0692,0.0528,0.0357,0.0085,0.0230,0.0046,0.0156,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
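
Most libraries expect a numeric class label, so the string labels need encoding first. The sketch below assumes a local copy named sonar.csv with 60 numeric columns followed by the M/R label, as in the sample above (the filename is an assumption).

import pandas as pd
from sklearn.preprocessing import LabelEncoder

sonar = pd.read_csv("sonar.csv", header=None)
X = sonar.iloc[:, :-1].values
y = LabelEncoder().fit_transform(sonar.iloc[:, -1])   # M/R -> integer labels (alphabetical: M=0, R=1)
print(X.shape, y.shape)                               # expect (208, 60) and (208,)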

5. Banknote Dataset

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Variance of Wavelet Transformed image (continuous).
  2. Skewness of Wavelet Transformed image (continuous).
  3. Kurtosis of Wavelet Transformed image (continuous).
  4. Entropy of image (continuous).
  5. Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.

A sample of the first few rows is listed below.

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
4.3684,9.6718,-3.9606,-3.1625,0

6. Iris Flowers Dataset

The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.

It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Sepal length in cm.
  2. Sepal width in cm.
  3. Petal length in cm.
  4. Petal width in cm.
  5. Class (Iris Setosa, Iris Versicolour, Iris Virginica).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 26%.

A sample of the first 5 rows is listed below.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
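
scikit-learn happens to ship a copy of this dataset, so it is an easy one to start with; the sketch below cross-validates a k-nearest neighbors model to get a score to compare against the baseline quoted above.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris is bundled with scikit-learn, so no download or parsing is needed.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean())   # compare against the Zero Rule baseline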

7. Abalone Dataset

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.

It is a multi-class classification problem, but can also be framed as a regression. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:

  1. Sex (M, F, I).
  2. Length.
  3. Diameter.
  4. Height.
  5. Whole weight.
  6. Shucked weight.
  7. Viscera weight.
  8. Shell weight.
  9. Rings.

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.

A sample of the first 5 rows is listed below.

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
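
The Sex column is the only categorical input among these datasets, so it needs encoding before most algorithms can use it. The sketch below assumes a local copy named abalone.csv with the nine comma-separated columns listed above (the filename is an assumption).

import pandas as pd

columns = ["sex", "length", "diameter", "height", "whole_weight",
           "shucked_weight", "viscera_weight", "shell_weight", "rings"]
abalone = pd.read_csv("abalone.csv", header=None, names=columns)

# One-hot encode the categorical Sex column (M/F/I -> three indicator columns).
X = pd.get_dummies(abalone.drop(columns="rings"), columns=["sex"])
y = abalone["rings"]   # predict rings as classes, or cast to float for the regression framing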

8. Ionosphere Dataset

The Ionosphere Dataset requires the prediction of structure in the atmosphere given radar returns targeting free electrons in the ionosphere.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:

  1. 17 pairs of radar return data.
  2. Class (g for good and b for bad).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.

A sample of the first 5 rows is listed below.

1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
1,0,1,-0.03365,1,0.00485,1,-0.12062,0.88965,0.01198,0.73082,0.05346,0.85443,0.00827,0.54591,0.00299,0.83775,-0.13644,0.75535,-0.08540,0.70887,-0.27502,0.43385,-0.12062,0.57528,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
1,0,1,-0.45161,1,1,0.71216,-1,0,0,0,0,0,0,-1,0.14516,0.54094,-0.39330,-1,-0.54467,-0.69975,1,0,0,1,0.90695,0.51613,1,1,-0.20099,0.25682,1,-0.32382,1,b
1,0,1,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,0.52798,-0.20275,0.56409,-0.00712,0.34395,-0.27457,0.52940,-0.21780,0.45107,-0.17813,0.05982,-0.35575,0.02309,-0.52879,0.03286,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g

9. Wheat Seeds Dataset

The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

  1. Area.
  2. Perimeter.
  3. Compactness.
  4. Length of kernel.
  5. Width of kernel.
  6. Asymmetry coefficient.
  7. Length of kernel groove.
  8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 28%.

A sample of the first 5 rows is listed below.

15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1

10. Boston House Price Dataset

The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:

  1. CRIM: per capita crime rate by town.
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of nonretail business acres per town.
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. NOX: nitric oxides concentration (parts per 10 million).
  6. RM: average number of rooms per dwelling.
  7. AGE: proportion of owner-occupied units built prior to 1940.
  8. DIS: weighted distances to five Boston employment centers.
  9. RAD: index of accessibility to radial highways.
  10. TAX: full-value property-tax rate per $10,000.
  11. PTRATIO: pupil-teacher ratio by town.
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. LSTAT: % lower status of the population.
  14. MEDV: Median value of owner-occupied homes in $1000s.

The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.

A sample of the first 5 rows is listed below.

0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60
0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80 392.83 4.03 34.70
0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70 394.63 2.94 33.40
0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.0622 3 222.0 18.70 396.90 5.33 36.20
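
Note that, unlike the other datasets here, the raw file is whitespace-delimited rather than comma-separated (as the sample above shows), so it needs a different separator when loading. The sketch below assumes a local copy named housing.data with the 14 columns listed above (the filename is an assumption).

import pandas as pd

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
           "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
boston = pd.read_csv("housing.data", header=None, names=columns, sep=r"\s+")

X = boston.drop(columns="MEDV")
y = boston["MEDV"]   # median home value in $1000s; the mean-prediction RMSE baseline is ~9.21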

Summary

In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.

Here is your next step:

  1. Pick one dataset.
  2. Grab your favorite tool (like Weka, scikit-learn or R).
  3. See how much you can beat the standard scores (a minimal sketch of this step follows the list).
  4. Report your results in the comments below.
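
For step 3, here is a minimal scikit-learn sketch of comparing a model against the Zero Rule baseline; your_dataset.csv is a placeholder for whichever dataset you picked, with numeric inputs and the class label in the last column (filename and layout are assumptions).

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# "your_dataset.csv" is a placeholder: numeric inputs, class label in the last column.
data = pd.read_csv("your_dataset.csv", header=None)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
model = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5).mean()
print(f"baseline accuracy: {baseline:.3f}  model accuracy: {model:.3f}")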
