Standard Machine Learning Datasets To Practice in Weka

It is a good idea to have small, well-understood datasets on hand when you are getting started in machine learning or learning a new tool.

The Weka machine learning workbench provides a directory of small, well-understood datasets in its installation directory.

In this post you will discover some of the small, well-understood datasets distributed with Weka, their details, and where to learn more about them.

We will focus on a handful of datasets of differing types. After reading this post you will know:

  • Where the sample datasets are located or where to download them afresh if you need them.
  • Specific standard datasets you can use to explore different aspects of classification and regression predictive models.
  • Where to go for more information about specific datasets and state-of-the-art results.

Let’s get started.

Standard Machine Learning Datasets Used For Practice in Weka
Photo by Marvin Foushee, some rights reserved.

Standard Weka Datasets

An installation of the open source Weka machine learning workbench includes a data/ directory full of standard machine learning problems.

Weka Installation Directory

This is very useful when you are getting started in machine learning or learning the Weka platform. The directory provides standard machine learning datasets for common classification and regression problems. For example, below is a snapshot from this directory:

Provided Datasets in Weka Installation Directory

All datasets are in the Weka native ARFF file format and can be loaded directly into Weka, meaning you can start developing practice models immediately.
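To make the ARFF format concrete, here is a minimal Python sketch (not part of Weka) that reads the header and data rows of a simple ARFF file. The sample text is an abbreviated illustration modelled on iris.arff, and read_arff is a hypothetical helper name. It assumes dense data with no quoted commas and no attribute names containing spaces, so treat it as a sketch rather than a full ARFF parser.

```python
# Abbreviated, illustrative ARFF text modelled on the layout of iris.arff.
# Only three data rows are reproduced here.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def read_arff(text):
    """Return (relation, attributes, rows) from simple dense ARFF text.

    attributes is a list of (name, type) pairs, where type is either a
    lowercase keyword such as 'real' or a list of nominal labels.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            name = name.strip("'\"")
            spec = spec.strip()
            if spec.startswith("{"):
                # Nominal attribute: the allowed labels are listed in braces.
                labels = [v.strip().strip("'\"") for v in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif keyword == "@data":
            in_data = True  # everything after this line is a data row
    return relation, attributes, rows


relation, attributes, rows = read_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))  # prints: iris 5 3
```

The header declares each attribute's name and type, and the rows after @DATA hold one comma-separated instance each, which is why Weka can load these files without any extra configuration.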

There are some special distributions of Weka that may not include the data/ directory. If you have installed one of these distributions, you can download the .zip distribution of Weka, unzip it, and copy the data/ directory to somewhere that you can access easily from Weka.

There are many datasets to play with in the data/ directory. In the following sections I will point out a few that you can focus on for practicing and investigating predictive modeling problems.

Binary Classification Datasets

Binary classification is where the output variable to be predicted is nominal and has exactly two classes.

This is perhaps the most well-studied type of predictive modeling problem and a good type of problem to start with.

There are three standard binary classification problems in the data/ directory that you can focus on:

  1. Pima Indians Onset of Diabetes: (diabetes.arff) Each instance represents medical details for one patient and the task is to predict whether the patient will have an onset of diabetes within the next five years. There are 8 numerical input variables all of which have varying scales. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 77% accuracy.
  2. Breast Cancer: (breast-cancer.arff) Each instance represents medical details of a patient and samples of their tumor tissue and the task is to predict whether or not the patient's breast cancer recurs. There are 9 input variables, all of which are nominal. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 75% accuracy.
  3. Ionosphere: (ionosphere.arff) Each instance describes the properties of radar returns from the atmosphere and the task is to predict whether or not there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 98% accuracy.

Multi-Class Classification Datasets

Many classification problems have an output variable with more than two classes. These are called multi-class classification problems.

This is a good type of problem to look at after you have some confidence with binary classification.

Three standard multi-class classification problems in the data/ directory that you can focus on are:

  1. Iris Flowers Classification: (iris.arff) Each instance describes measurements of an iris flower and the task is to predict which of 3 iris species the observation belongs to. There are 4 numerical input variables with the same units and generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 96% accuracy.
  2. Large Soybean Database: (soybean.arff) Each instance describes properties of a crop of soybeans and the task is to predict which of 19 diseases the crop suffers from. There are 35 nominal input variables. You can learn more about this dataset on the UCI Machine Learning Repository.
  3. Glass Identification: (glass.arff) Each instance describes the chemical composition of a sample of glass and the task is to predict the type or use of the glass from one of 7 classes. There are 10 numeric attributes that describe the chemical properties of the glass and its refractive index. You can learn more about this dataset on the UCI Machine Learning Repository.

Regression Datasets

Regression problems are those where you must predict a real-valued output.
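The three problem types described in this post can be told apart from how the output attribute is declared in an ARFF header. The Python sketch below illustrates that idea. The function name problem_type and the example declarations are hypothetical, written in the style of ARFF headers rather than copied from the distributed files, and the sketch assumes you already know which attribute is the output (typically the last one).

```python
def problem_type(class_declaration):
    """Classify a problem from the ARFF declaration of its output attribute."""
    parts = class_declaration.split(None, 2)
    if len(parts) != 3 or parts[0].lower() != "@attribute":
        raise ValueError("expected a line of the form '@attribute <name> <type>'")
    spec = parts[2].strip()
    if spec.startswith("{"):
        # Nominal output: count the labels listed between the braces.
        labels = [v for v in spec.strip("{}").split(",") if v.strip()]
        if len(labels) == 2:
            return "binary classification"
        if len(labels) > 2:
            return "multi-class classification"
        return "other"
    if spec.lower() in ("numeric", "real", "integer"):
        # Numeric output: a real-valued prediction.
        return "regression"
    return "other"


# Hypothetical declarations used only for illustration.
print(problem_type("@attribute class {tested_negative,tested_positive}"))
print(problem_type("@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}"))
print(problem_type("@attribute price numeric"))
```

A nominal output with two labels is a binary problem, a nominal output with more labels is multi-class, and a numeric output is a regression problem.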

The selection of regression problems in the data/ directory is small, yet regression is an important class of predictive modeling problem. As such, I recommend downloading the free add-on pack of regression problems collected from the UCI Machine Learning Repository.

It is available from the datasets page on the Weka website and is the first item in the list:

  • A jar file containing 37 regression problems, obtained from various sources (datasets-numeric.jar)

It is a .jar file, which is a type of compressed Java archive. You should be able to unzip it with most modern unzip programs.

If you have Java installed (which you very likely do if you use Weka), you can also unzip the .jar file manually on the command line by running the following command in the directory where the jar was downloaded:

jar -xvf datasets-numeric.jar

Unzipping the file will create a new directory called numeric that contains 37 regression datasets in Weka's native ARFF format.
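If you would rather not use the jar tool, the same archive can be opened with any zip library, because a .jar file uses the zip format. Below is a minimal Python sketch using the standard zipfile module. The function name extract_arff_files is hypothetical, and the sketch extracts only the members ending in .arff, skipping anything else in the archive.

```python
import zipfile


def extract_arff_files(jar_path, dest_dir):
    """Extract every .arff member of a .jar (zip) archive; return their names."""
    with zipfile.ZipFile(jar_path) as archive:
        members = [name for name in archive.namelist()
                   if name.lower().endswith(".arff")]
        # Directory structure inside the archive is preserved under dest_dir.
        archive.extractall(dest_dir, members=members)
    return sorted(members)
```

For example, calling extract_arff_files("datasets-numeric.jar", ".") in the download directory should leave the ARFF files under the current directory and return the list of extracted file names.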

Three regression datasets in the numeric/ directory that you can focus on are:

  1. Longley Economic Dataset: (longley.arff) Each instance describes the gross economic properties of a nation for a given year and the task is to predict the number of people employed as an integer. There are 6 numeric input variables of varying scales.
  2. Boston House Price Dataset: (housing.arff) Each instance describes the properties of a Boston suburb and the task is to predict the house prices in thousands of dollars. There are 13 numerical input variables with varying scales describing the properties of suburbs. You can learn more about this dataset on the UCI Machine Learning Repository.
  3. Sleep in Mammals Dataset: (sleep.arff) Each instance describes the properties of different mammals and the task is to predict the number of hours of total sleep they require on average. There are 7 numeric input variables of different scales and measures.

Summary

In this post you discovered the standard machine learning datasets distributed with the Weka machine learning platform.

Specifically, you learned:

  • Three popular binary classification problems you can use for practice: diabetes, breast-cancer and ionosphere.
  • Three popular multi-class classification problems you can use for practice: iris, soybean and glass.
  • Three popular regression problems you can use for practice: longley, housing and sleep.

Do you have any questions about standard machine learning datasets in Weka or about this post? Ask your questions in the comments and I will do my best to answer.

