Standard Machine Learning Datasets To Practice in Weka

阿新 • • Published: 2019-01-12

It is a good idea to have small, well-understood datasets on hand when you are getting started in machine learning or learning a new tool.

The Weka machine learning workbench provides a directory of small, well-understood datasets in its installation directory.

In this post you will discover some of these datasets distributed with Weka, their details and where to learn more about them.

We will focus on a handful of datasets of differing types. After reading this post you will know:

Where the sample datasets are located or where to download them afresh if you need them.
Specific standard datasets you can use to explore different aspects of classification and regression predictive models.
Where to go for more information about specific datasets and state of the art results.

Let’s get started.

Standard Machine Learning Datasets Used For Practice in Weka
Photo by Marvin Foushee, some rights reserved.

Standard Weka Datasets

An installation of the open source Weka machine learning workbench includes a data/ directory full of standard machine learning problems.

Weka Installation Directory

This is very useful when you are getting started in machine learning or learning the Weka platform. It provides standard machine learning datasets for common classification and regression problems. For example, below is a snapshot from this directory:

Provided Datasets in Weka Installation Directory

All datasets are in the Weka native ARFF file format and can be loaded directly into Weka, meaning you can start developing practice models immediately.
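To get a feel for what an ARFF file contains before loading it into Weka, it can help to look at its header. An ARFF file is plain text: a header with an @relation line and one @attribute line per variable, followed by an @data section with the instances, and lines starting with % are comments. The sketch below reads that header in Python. The function name read_arff_header and the tiny sample file are made up for illustration, and this is a simplified reader under those assumptions, not Weka's own loader.

```python
def read_arff_header(lines):
    """Return (relation_name, [(attribute_name, attribute_type), ...]).

    Reads up to the @data line. Handles only plain or single-quoted
    attribute names; this is an illustrative sketch, not a full ARFF parser.
    """
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            rest = line.split(None, 1)[1]
            if rest.startswith("'"):
                # Quoted name, e.g. 'petal width'
                name, _, attr_type = rest[1:].partition("'")
            else:
                name, attr_type = rest.split(None, 1)
            attributes.append((name, attr_type.strip()))
        elif lower.startswith("@data"):
            break  # the header ends where the data section begins
    return relation, attributes


# A tiny hand-written ARFF file (hypothetical, not one of the Weka datasets).
SAMPLE = """\
% A toy file for illustration
@relation toy-iris
@attribute sepallength numeric
@attribute 'petal width' numeric
@attribute class {Iris-setosa,Iris-versicolor}
@data
5.1,0.2,Iris-setosa
7.0,1.4,Iris-versicolor
"""

relation, attributes = read_arff_header(SAMPLE.splitlines())
print(relation)
for name, attr_type in attributes:
    print(name, "->", attr_type)
```

To inspect a real file, you could pass an open file object instead of the sample text, for example read_arff_header(open("data/iris.arff")), assuming that path matches where your data/ directory lives.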

There are some special distributions of Weka that may not include the data/ directory. If you have chosen to install one of these distributions, you can download the .zip distribution of Weka, unzip it, and copy the data/ directory to somewhere that you can access easily from Weka.

There are many datasets to play with in the data/ directory. In the following sections I will point out a few that you can focus on for practicing and investigating predictive modeling problems.

Binary Classification Datasets

Binary classification is where the output variable to be predicted is nominal and has exactly two classes.

This is perhaps the best-studied type of predictive modeling problem and a good type of problem to start with.

There are three standard binary classification problems in the data/ directory that you can focus on:

Pima Indians Onset of Diabetes: (diabetes.arff) Each instance represents medical details for one patient and the task is to predict whether the patient will have an onset of diabetes within the next five years. There are 8 numerical input variables, all of which have varying scales. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 77% accuracy.
Breast Cancer: (breast-cancer.arff) Each instance represents medical details of a patient and samples of their tumor tissue and the task is to predict whether or not the patient has a recurrence of breast cancer. There are 9 input variables, all of which are nominal. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 75% accuracy.
Ionosphere: (ionosphere.arff) Each instance describes the properties of radar returns from the atmosphere and the task is to predict whether or not there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 98% accuracy.
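Accuracy figures like the ones above are easier to interpret next to a simple baseline: the accuracy you would get by always predicting the most common class, which is what Weka's ZeroR classifier does. The sketch below computes that baseline in Python. The function name majority_baseline_accuracy and the toy label list are illustrative assumptions, not counts taken from any of the datasets above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical class labels for a two-class problem (not real dataset counts).
toy_labels = ["negative"] * 13 + ["positive"] * 7

baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline: {baseline:.0%}")
```

A model is only doing useful work on a dataset to the extent that it beats this number, so it is worth computing before comparing algorithms.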

Multi-Class Classification Datasets

Many classification problems have an output variable with more than two classes. These are called multi-class classification problems.

This is a good type of problem to look at after you have some confidence with binary classification.

Three standard multi-class classification problems in the data/ directory that you can focus on are:

Iris Flowers Classification: (iris.arff) Each instance describes measurements of an iris flower and the task is to predict to which of 3 iris species the observation belongs. There are 4 numerical input variables with the same units and generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 96% accuracy.
Large Soybean Database: (soybean.arff) Each instance describes properties of a crop of soybeans and the task is to predict which of 19 diseases the crop suffers from. There are 35 nominal input variables. You can learn more about this dataset on the UCI Machine Learning Repository.
Glass Identification: (glass.arff) Each instance describes the chemical composition of a sample of glass and the task is to predict the type or use of the glass from one of 7 classes. There are 10 numeric attributes that describe the chemical properties of the glass and its refractive index. You can learn more about this dataset on the UCI Machine Learning Repository.

Regression Datasets

Regression problems are those where you must predict a real valued output.

The selection of regression problems in the data/ directory is small. Regression is an important class of predictive modeling problem. As such I recommend downloading the free add-on pack of regression problems collected from the UCI Machine Learning Repository.

It is available from the datasets page on the Weka web page and is the first in the list called:

A jar file containing 37 regression problems, obtained from various sources (datasets-numeric.jar)

It is a .jar file, which is a type of compressed Java archive. You should be able to unzip it with most modern unzip programs.

If you have Java installed (which you very likely do if you use Weka), you can also unzip the .jar file manually on the command line using the following command in the directory where the jar was downloaded:

jar -xvf datasets-numeric.jar

Unzipping the file will create a new directory called numeric that contains 37 regression datasets in ARFF native Weka format.
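Because a .jar file uses the zip archive format, the same extraction can also be sketched with Python's standard zipfile module if the jar tool is not on your path. The function name extract_jar and the destination directory are illustrative choices, and the sketch assumes the archive has already been downloaded.

```python
import zipfile
from pathlib import Path


def extract_jar(jar_path, dest_dir="."):
    """Extract a .jar (zip-format) archive and return the .arff files it held."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(jar_path) as archive:
        archive.extractall(dest)  # creates dest and subdirectories as needed
        names = archive.namelist()
    # Keep only the dataset files; a jar also carries META-INF metadata.
    return sorted(dest / name for name in names if name.lower().endswith(".arff"))


# Example call, assuming datasets-numeric.jar is in the current directory:
#   for path in extract_jar("datasets-numeric.jar"):
#       print(path)
```

The function returns the paths of the extracted ARFF files, which makes it easy to confirm that the datasets landed where you expect before opening them in Weka.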

Three regression datasets in the numeric/ directory that you can focus on are:

Longley Economic Dataset: (longley.arff) Each instance describes the gross economic properties of a nation for a given year and the task is to predict the number of people employed as an integer. There are 6 numeric input variables of varying scales.
Boston House Price Dataset: (housing.arff) Each instance describes the properties of a Boston suburb and the task is to predict the house prices in thousands of dollars. There are 13 numerical input variables with varying scales describing the properties of suburbs. You can learn more about this dataset on the UCI Machine Learning Repository.
Sleep in Mammals Dataset: (sleep.arff) Each instance describes the properties of different mammals and the task is to predict the number of hours of total sleep they require on average. There are 7 numeric input variables of different scales and measures.

Summary

In this post you discovered the standard machine learning datasets distributed with the Weka machine learning platform.

Specifically, you learned:

Three popular binary classification problems you can use for practice: diabetes, breast-cancer and ionosphere.
Three popular multi-class classification problems you can use for practice: iris, soybean and glass.
Three popular regression problems you can use for practice: longley, housing and sleep.

Do you have any questions about standard machine learning datasets in Weka or about this post? Ask your questions in the comments and I will do my best to answer.
