Data, Learning and Modeling

There are key concepts in machine learning that lay the foundation for understanding the field.

In this post, you will learn the nomenclature (standard terms) that is used when describing data and datasets.

You will also learn the concepts and terms used to describe learning and modeling from data that will provide a valuable intuition for your journey through the field of machine learning.

Data

Machine learning methods learn from examples. It is important to have a good grasp of input data and the terminology used to describe it. In this section, you will learn the terminology used in machine learning when referring to data.

When I think of data, I think of rows and columns, like a database table or an Excel spreadsheet. This is a traditional structure for data and is what is common in the field of machine learning. Other data, such as images, videos, and text (so-called unstructured data), is not considered at this time.

Table of Data Showing an Instance, Feature, and Train-Test Datasets

Instance: A single row of data is called an instance. It is an observation from the domain.

Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute of a data instance. Some features may be inputs to a model (the predictors) and others may be outputs or the features to be predicted.

Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or ordinal value. You can have strings, dates, times, and more complex types, but typically they are reduced to real or categorical values when working with traditional machine learning methods.
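As a rough illustration of reducing features to real values, the sketch below encodes one real-valued, one categorical, and one ordinal feature as numbers. The feature names (`height_cm`, `colour`, `size`) and their values are made up for this example, and one-hot encoding is just one common choice for categorical values.

```python
# Hypothetical raw instances with mixed data types (all values are made up).
rows = [
    {"height_cm": 172.0, "colour": "red", "size": "small"},
    {"height_cm": 180.5, "colour": "blue", "size": "large"},
    {"height_cm": 165.2, "colour": "red", "size": "medium"},
]

# Ordinal values have a natural order, so map them to increasing numbers.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

# Categorical values have no order, so use one indicator column per category.
colours = sorted({row["colour"] for row in rows})


def encode(row):
    """Turn one instance into a list of real values."""
    one_hot = [1.0 if row["colour"] == c else 0.0 for c in colours]
    return [row["height_cm"], float(SIZE_ORDER[row["size"]])] + one_hot


encoded = [encode(row) for row in rows]
for instance in encoded:
    print(instance)
```

Each encoded instance holds the height, the ordinal size, and then one indicator per colour in alphabetical order.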

Datasets: A collection of instances is a dataset and when working with machine learning methods we typically need a few datasets for different purposes.

Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.

Testing Dataset: A dataset that we use to validate the accuracy of our model but that is not used to train the model. It may also be called the validation dataset.

We may have to collect instances to form our datasets or we may be given a finite dataset that we must split into sub-datasets.
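Splitting a finite dataset into training and testing sub-datasets can be sketched as follows. This is a minimal version using only the standard library; the function name `train_test_split`, the 25% test fraction, and the toy instances are assumptions made for the example.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle a copy of the instances and split it into train and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Each instance is a row: (feature_1, feature_2, output). Values are made up.
dataset = [(float(i), float(i % 3), i % 2) for i in range(20)]

train, test = train_test_split(dataset)
print(len(train), "training instances,", len(test), "testing instances")
```

Shuffling before splitting avoids a split that simply follows the order in which the instances were collected, and the fixed seed makes the split repeatable.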

Learning

Machine learning is indeed about automated learning with algorithms.

In this section, we will consider a few high-level concepts about learning.

Induction: Machine learning algorithms learn through a process called induction or inductive learning. Induction is a reasoning process that makes generalizations (a model) from specific information (training data).

Generalization: Generalization is required because the model that is prepared by a machine learning algorithm needs to make predictions or decisions based on specific data instances that were not seen during training.

Over-Learning: When a model learns the training data too closely and does not generalize, this is called over-learning. The result is poor performance on data other than the training dataset. This is also called over-fitting.

Under-Learning: When a model has not learned enough structure from the training data, for example because the learning process was terminated early, this is called under-learning. The model behaves similarly on seen and unseen data but performs poorly on all of it, including the training dataset. This is also called under-fitting.
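Over-learning and under-learning can be made concrete by comparing errors on a training and a testing dataset. The sketch below is only an illustration on made-up data: a constant predictor stands in for an under-learned model (here it is too simple, rather than stopped early), a nearest-instance lookup stands in for an over-learned model, and a straight-line fit sits in between. The fixed +/-1 pattern plays the role of noise so that the example is deterministic.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Too simple: ignores x and always predicts the average output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line, which matches the structure of this data."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Too close: returns the output of the nearest training instance."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: y = 2x plus a fixed +/-1 pattern standing in for noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
test_x = [i + 0.25 for i in range(10)]
test_y = [2 * x + (1 if i % 4 < 2 else -1) for i, x in enumerate(test_x)]

errors = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train={errors[name][0]:.2f} test={errors[name][1]:.2f}")
```

The memorizer reproduces the training data exactly but does worse than the line on the testing data, while the constant predictor does poorly on both.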

Online Learning: Online learning is when a method is updated with data instances from the domain as they become available. Online learning requires methods that are robust to noisy data but can produce models that are more in tune with the current state of the domain.

Offline Learning: Offline learning is when a model is created on pre-prepared data and is then used operationally on unobserved data. The training process can be controlled and can be tuned carefully because the scope of the training data is known. The model is not updated after it has been prepared, and performance may decrease if the domain changes.
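The difference between online and offline learning can be sketched with the simplest possible model, an estimate of the average value in the domain. The `OnlineMean` class and the numbers below are assumptions for illustration only; real online methods update richer models in the same incremental spirit.

```python
class OnlineMean:
    """A tiny 'model' that is updated one instance at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to store earlier instances.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Offline: the estimate is prepared once on pre-prepared data and then frozen.
prepared = [10.0, 12.0, 11.0, 9.0]
offline_estimate = sum(prepared) / len(prepared)

# Online: the same data arrives first, then the domain shifts to larger values.
online = OnlineMean()
for value in prepared + [20.0, 22.0, 21.0, 19.0]:
    online.update(value)

print("offline estimate:", offline_estimate)
print("online estimate: ", online.mean)
```

After the shift, the online estimate has moved toward the new values while the offline estimate still reflects only the prepared data.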

Supervised Learning: This is a learning process for generalizing on problems where a prediction is required. A “teaching process” compares predictions by the model to known answers and makes corrections in the model.
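One way to picture the "teaching process" is the perceptron learning rule, chosen here purely as an illustration: the model predicts, the prediction is compared to the known answer, and the weights are nudged by the error. The toy instances (a logical AND), the learning rate, and the number of epochs are assumptions for this sketch.

```python
def predict(weights, bias, features):
    """Predict 1 if the weighted sum of the features is positive, else 0."""
    activation = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if activation > 0 else 0


def train_perceptron(instances, epochs=20, learning_rate=1.0):
    """Compare each prediction to the known answer and correct the model."""
    weights = [0.0] * len(instances[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, answer in instances:
            error = answer - predict(weights, bias, features)  # 0 when correct
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Made-up labeled instances: the output is 1 only when both inputs are 1.
instances = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(instances)
predictions = [predict(weights, bias, features) for features, _ in instances]
print(predictions)
```

When a prediction is already correct the error is zero and the model is left unchanged, so corrections stop once every training instance is predicted correctly.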

Unsupervised Learning: This is a learning process for generalizing the structure in the data where no prediction is required. Natural structures are identified and exploited for relating instances to each other.
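As a small illustration of finding natural structure without any known answers, the sketch below groups values of a single feature into two clusters with a basic k-means procedure. The choice of k-means, the two clusters, the starting centers, and the data are all assumptions made for the example.

```python
def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on one feature; returns (labels, centers)."""
    centers = [min(values), max(values)]  # simple deterministic starting point
    labels = []
    for _ in range(iterations):
        # Assign each instance to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each center to the mean of the instances assigned to it.
        for k in (0, 1):
            members = [v for v, label in zip(values, labels) if label == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers


# Made-up unlabeled data with two visible groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]

labels, centers = kmeans_1d(values)
print(labels, centers)
```

No output values are supplied; the grouping comes only from how the instances relate to each other.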

We have covered supervised and unsupervised learning before in the post on machine learning algorithms. These terms can be useful for classifying algorithms by their behavior.

Modeling

The artefact created by a machine learning process could be considered a program in its own right.

Model Selection: We can think of the process of configuring and training the model as a model selection process. With each iteration we have a new model that we could choose to use or to modify. Even the choice of machine learning algorithm is part of that model selection process. Of all the possible models that exist for a problem, a given algorithm and algorithm configuration on the chosen training dataset will provide a finally selected model.
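A minimal sketch of model selection is to train several candidate configurations and keep the one with the lowest error on held-out data. Here the algorithm is a k-nearest-neighbour average and the configuration is the value of `k`; the algorithm, the candidate values, and the made-up data are assumptions for illustration.

```python
def knn_predict(train, x, k):
    """Average the outputs of the k training instances nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, validation, k):
    """Mean squared error of the k-nearest-neighbour model on held-out data."""
    return sum((knn_predict(train, x, k) - y) ** 2
               for x, y in validation) / len(validation)


# Made-up data: the outputs follow y = 2x, with a fixed +/-1 pattern added to
# the training outputs to stand in for noise.
train = [(float(i), 2.0 * i + (1 if i % 2 == 0 else -1)) for i in range(10)]
validation = [(i + 0.5, 2.0 * (i + 0.5)) for i in range(9)]

candidates = [1, 2, 4]  # each value of k is a different candidate model
scores = {k: validation_error(train, validation, k) for k in candidates}
best_k = min(scores, key=scores.get)

print(scores)
print("selected k:", best_k)
```

The same loop could range over different algorithms as well as configurations, since both are part of the selection.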

Inductive Bias: Bias is the set of limits imposed on the selected model. All models are biased, which introduces error in the model, and by definition all models have error (they are generalizations from observations). Biases are introduced by the generalizations made in the model, including the configuration of the model and the selection of the algorithm to generate the model. A machine learning method can create a model with a low or a high bias, and tactics can be used to reduce the bias of a highly biased model.

Model Variance: Variance is how sensitive the model is to the data on which it was trained. A machine learning method can have a high or a low variance when creating a model on a dataset. A tactic to reduce the variance of a model is to train it multiple times on a dataset with different initial conditions and take the average accuracy as the model's performance.
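Sensitivity to the training data can be measured directly by training the same method on slightly different datasets and looking at how much its prediction moves. The sketch below uses leave-one-out subsets as the perturbed training sets, which is an assumption made to keep the example deterministic, and compares a constant predictor with a nearest-instance lookup on made-up data.

```python
from statistics import pvariance


def fit_mean(data):
    """Always predict the average training output."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_nearest(data):
    """Predict the output of the nearest training instance."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


def prediction_spread(fit, data, query):
    """Variance of the prediction at `query` across leave-one-out training sets."""
    predictions = []
    for i in range(len(data)):
        subset = data[:i] + data[i + 1:]  # drop one instance
        predictions.append(fit(subset)(query))
    return pvariance(predictions)


# Made-up data: y = 2x plus a fixed +/-1 pattern standing in for noise.
data = [(float(i), 2.0 * i + (1 if i % 2 == 0 else -1)) for i in range(10)]

spread_mean = prediction_spread(fit_mean, data, query=4.0)
spread_nearest = prediction_spread(fit_nearest, data, query=4.0)
print("constant predictor spread:", spread_mean)
print("nearest-instance spread:  ", spread_nearest)
```

On this data the nearest-instance model's prediction changes more when a single training instance is removed, which is what a higher variance means in practice.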

Bias-Variance Tradeoff: Model selection can be thought of as a trade-off between bias and variance. A low-bias model will typically have a high variance and will need to be trained for a long time or many times to get a usable model. A high-bias model will typically have a low variance and will train quickly, but suffer poor and limited performance.
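For squared-error loss, the trade-off is often written as a decomposition of the expected prediction error. The notation is an assumption introduced here, not taken from the post: the outputs are $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$, $\hat{f}$ is the model produced from a randomly drawn training dataset, and the expectation is over training datasets and noise.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}\left[\hat{f}(x)\right] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Reducing one of the first two terms tends to increase the other, and the last term cannot be reduced by any choice of model.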

This post provided a useful glossary of terms that you can refer back to anytime for a clear definition.

Are there terms missing? Do you have a clearer description of one of the terms listed? Leave a comment and let us all know.
