Introduction to Machine Learning
1:Introduction To Machine Learning
In data science, we're often trying to understand a process or system using observational data.
Here are a few specific examples:
- How do the properties of a house affect its market value?
- How does an applicant's application affect whether they get into graduate school?
These questions are high-level and tough to answer in the abstract. We can start to narrow these questions to the following:
- How do the size of a house, the number of rooms, its neighborhood crime index, and its age affect its market value?
- How do an applicant's college GPA and GRE score affect whether they get into graduate school?
We can start to answer these more specific questions by applying machine learning techniques to past data.
In the first problem, we're interested in trying to predict a specific, real-valued number -- the market value of a house in dollars. Whenever we're trying to predict a real-valued number, the process is called regression.
In the second problem, we're interested in trying to predict a binary value -- acceptance or rejection into graduate school. Whenever we're trying to predict a binary value, the process is called classification.
In this mission, we'll focus on a specific regression problem.
2:Introduction To The Data
- How do the properties of a car impact its fuel efficiency?
To try to answer this question, we'll work with a dataset containing fuel efficiencies of several cars compiled by Carnegie Mellon University. The dataset is hosted by the University of California Irvine on their machine learning repository. As a side note, the UCI Machine Learning repository contains many small datasets which are useful when getting your hands dirty with machine learning.
You'll notice that the Data Folder contains a few different files. We'll be working with auto-mpg.data, which omits the 8 rows containing missing values for fuel efficiency (the mpg column). Even though the file's extension is .data, it's encoded as a plain text file and you can open it using any text editor. If you open auto-mpg.data in a text editor, you'll notice that the values in each line of the file are separated by a variable number of whitespace characters.
Since the file isn't formatted as a CSV file and instead uses a variable number of whitespace characters to delimit the columns, read_csv with its default comma separator won't split the columns correctly. You can instead use the read_table method, setting the delim_whitespace parameter to True so the file is parsed using the whitespace between values:
mpg = pd.read_table("auto-mpg.data", delim_whitespace=True)
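As a minimal sketch of this parsing step, the snippet below reads a few whitespace-delimited rows from an in-memory string rather than from auto-mpg.data. The sample rows are made up for illustration. It passes sep=r"\s+", a regular-expression separator that matches any run of whitespace; pandas 2.2 and later deprecate delim_whitespace in favor of this form, so which spelling runs without a warning depends on your installed version.

```python
import io
import pandas as pd

# Hypothetical rows (not copied from auto-mpg.data) with uneven spacing.
sample_text = (
    "18.0   8   307.0   3504.\n"
    "15.0 8  350.0     3693.\n"
    "24.0   4  113.0  2372.\n"
)

# sep=r"\s+" treats any run of whitespace as a single delimiter.
# header=None tells pandas the first line is data, not column names.
sample = pd.read_table(io.StringIO(sample_text), sep=r"\s+", header=None)

print(sample.shape)
print(sample.head())
```

Because no names are supplied here, pandas labels the columns 0 through 3.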
Unfortunately, the file doesn't contain the column names, so you'll have to extract them from auto-mpg.names and specify them manually. The column names can be found in the Attribute Information section. Just like auto-mpg.data, auto-mpg.names is a text file that can be opened using a standard text editor.
As specified in auto-mpg.names, the dataset contains 7 numerical features that could have an effect on a car's fuel efficiency:
- cylinders -- the number of cylinders in the engine.
- displacement -- the displacement of the engine.
- horsepower -- the horsepower of the engine.
- weight -- the weight of the car.
- acceleration -- the acceleration of the car.
- model year -- the year that car model was released (e.g. 70 corresponds to 1970).
- origin -- where the car was manufactured (1 if North America, 2 if Europe, 3 if Asia).
When reading in auto-mpg.data using the read_table method, you can use the names parameter to specify the column names as a list of strings. Let's now read the dataset into a DataFrame so we can explore it further.
Instructions
Read the dataset auto-mpg.data into a DataFrame named cars using the Pandas method read_table.
- Specify that you want the whitespace between values to be used as the delimiter.
- Use the column names provided in auto-mpg.names to set the column names for the cars DataFrame.
- Display the cars DataFrame using a print statement or by checking the variable inspector below the code box.
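import pandas as pd
# Column names taken from the Attribute Information section of auto-mpg.names.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
# Parse the whitespace-delimited file, supplying the column names manually.
cars = pd.read_table("auto-mpg.data", delim_whitespace=True, names=columns)
print(cars.head(5))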
3:Exploratory Data Analysis
Using this dataset, we can work on a narrower problem:
- How do the number of cylinders, displacement, horsepower, weight, acceleration, and model year affect a car's fuel efficiency?
Let's perform some exploratory data analysis for a couple of the columns to see which one correlates better with fuel efficiency.
Instructions
- Create a grid of subplots containing 2 rows and 1 column.
- Generate the following data visualizations:
  - Top chart: Scatter plot with the weight column on the x-axis and the mpg column on the y-axis.
  - Bottom chart: Scatter plot with the acceleration column on the x-axis and the mpg column on the y-axis.
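import matplotlib.pyplot as plt
# Two stacked subplots: 2 rows, 1 column.
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
# Top: weight vs. mpg; bottom: acceleration vs. mpg.
cars.plot("weight", "mpg", kind="scatter", ax=ax1)
cars.plot("acceleration", "mpg", kind="scatter", ax=ax2)
plt.show()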
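The visual comparison can be complemented with a number. As a minimal sketch, using a small made-up DataFrame in place of cars (the values are illustrative and not taken from the dataset), the pandas corr method computes the Pearson correlation between columns. Values near -1 or 1 indicate a strong linear relationship, and values near 0 a weak one.

```python
import pandas as pd

# Made-up values standing in for the real cars DataFrame.
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "acceleration": [17.0, 15.0, 16.0, 13.5, 14.0],
    "mpg": [34.0, 29.0, 24.5, 19.0, 15.0],
})

# Pearson correlation of every column with mpg.
correlations = toy.corr()["mpg"]
print(correlations)
```

In this toy data the weight column has a correlation with mpg close to -1, while acceleration has a weaker positive one. The real dataset will give different numbers.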
4:Linear Relationship
The scatter plots hint that there's a strong negative linear relationship between the weight and mpg columns and a weak, positive linear relationship between the acceleration and mpg columns. Let's now try to quantify the relationship between weight and mpg.
A machine learning model is the equation that represents how the input is mapped to the output. Said another way, machine learning is the process of determining the relationship between the independent variable(s) and the dependent variable. In this case, the dependent variable is the fuel efficiency and the independent variables are the other columns in the dataset.
In this mission and the next few missions, we'll focus on a family of machine learning models known as linear models. These models take the form of:
$$y = mx + b$$
The input is represented as x, transformed using the parameters m (slope) and b (intercept), and the output is represented as y. We expect m to be a negative number since the relationship is a negative linear one.
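To make the roles of m and b concrete, here is a small sketch that evaluates a linear model by hand. The function name linear_model and the slope and intercept values are invented for illustration; they are not the parameters fitted later in this mission.

```python
def linear_model(x, m, b):
    """Return the output y = m * x + b for a single input x."""
    return m * x + b

# Hypothetical parameters: a negative slope means y falls as x rises.
m = -0.01
b = 50.0

for x in (2000, 3000, 4000):
    print(x, linear_model(x, m, b))
```

With a negative m, each increase in x lowers the output, which matches the downward trend in the weight versus mpg scatter plot.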
The process of finding the equation that best fits the data is called fitting. We won't dive into how a model is fit to the data in this mission and will instead focus on interpreting the model. We'll use the Python library scikit-learn to handle fitting the model to the data.
5:Scikit-Learn
To fit the model to the data, we'll use the machine learning library scikit-learn. Scikit-learn is one of the most popular Python libraries for working with machine learning models on small to medium-sized datasets. Even when working with larger datasets that don't fit in memory, scikit-learn is commonly used to prototype and explore machine learning models on a subset of the larger dataset.
Scikit-learn uses an object-oriented style, so each machine learning model must be instantiated before it can be fit to a dataset (similar to creating a figure in Matplotlib before you plot values). We'll be working with the LinearRegression class from sklearn.linear_model:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
To fit a model to the data, we use the conveniently named fit method:
lr.fit(inputs, output)
where inputs is an n_rows by n_columns matrix and output is an n_rows by 1 matrix. The dataset we're working with contains 398 rows and 9 columns, but since we only want to use the weight column, we need to pass in a matrix containing 398 rows and 1 column. The catch, however, is that if you just select the weight column and pass it in as the first argument to the fit method, an error will be raised. This is because scikit-learn converts Series and DataFrame objects to NumPy arrays, and a single column selected as a Series becomes a one-dimensional array rather than the two-dimensional input that fit expects.
You can use the values attribute to see which NumPy object is returned:
cars["weight"].values
A one-dimensional NumPy array with 398 elements will be returned instead of a matrix containing rows and columns. You can confirm this by using the shape attribute:
cars["weight"].values.shape
The value (398,) will be returned, indicating a one-dimensional array of 398 elements with no separate column dimension. If you instead use double bracket notation:
cars[["weight"]].values
you'll get back a two-dimensional NumPy array with 398 rows and 1 column.
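The difference between single and double brackets can be checked on a tiny example. This sketch uses a made-up three-row DataFrame as a stand-in for cars, and also shows NumPy's reshape(-1, 1) as one alternative way to get a single-column array from a one-dimensional one.

```python
import pandas as pd

# Made-up stand-in for the cars DataFrame.
toy = pd.DataFrame({"weight": [3504, 3693, 3436], "mpg": [18.0, 15.0, 18.0]})

one_d = toy["weight"].values      # Series -> 1-D array
two_d = toy[["weight"]].values    # DataFrame -> 2-D array

print(one_d.shape)  # (3,)
print(two_d.shape)  # (3, 1)

# reshape(-1, 1) turns a 1-D array into a single-column 2-D array.
reshaped = one_d.reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```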
Instructions
- Import the LinearRegression class from sklearn.linear_model.
- Instantiate a LinearRegression instance and assign it to lr.
- Use the fit method to fit a linear regression model using the weight column as the input and the mpg column as the output.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Double brackets keep the input two-dimensional (n_rows by 1).
lr.fit(cars[["weight"]], cars[["mpg"]])
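Once fit has run, the learned slope and intercept are stored on the model as the coef_ and intercept_ attributes. The sketch below uses made-up data generated from a known line (slope -0.01, intercept 50) so that the recovered parameters can be checked. It is an illustration, not the result for the cars dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line mpg = -0.01 * weight + 50.
toy = pd.DataFrame({"weight": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]})
toy["mpg"] = -0.01 * toy["weight"] + 50.0

model = LinearRegression()
# A 1-D target is also accepted; coef_ is then a 1-D array.
model.fit(toy[["weight"]], toy["mpg"])

slope = model.coef_[0]        # m in y = mx + b
intercept = model.intercept_  # b in y = mx + b
print(slope, intercept)
```

Because the toy points lie exactly on a line, the fitted values match the generating slope and intercept up to floating-point error. Real data will not fit this cleanly.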
6:Making Predictions
Now that we have a trained linear regression model, we can use it to make predictions. Recall that this model takes in a weight value, in pounds, and outputs a fuel efficiency value, in miles per gallon. To use a model to make predictions, use the LinearRegression method predict. The predict method has a single required parameter, the n_samples by n_features input matrix, and returns the predicted values as a NumPy array with one prediction per sample.
You may be wondering why we'd want to make predictions for the data we trained the model on, since we already know the true fuel efficiency values. Making predictions on data used for training is the first step in the testing & evaluation process. If the model can't do a good job of even capturing the structure of the training data, then we can't expect it to do a good job on data it wasn't trained on. This is known as underfitting, since the model underperforms on the data it was fit on.
Instructions
- Use the LinearRegression method predict to make predictions using the values from the weight column.
- Assign the resulting predictions to predictions.
- Display the first 5 elements in predictions and the first 5 elements in the mpg column to compare the predicted values with the actual values.
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True)
lr.fit(cars[["weight"]], cars[["mpg"]])
# Predict mpg for the same weights the model was trained on.
predictions = lr.predict(cars[["weight"]])
print(predictions[0:5])
print(cars["mpg"][0:5])
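The same predict call also works for inputs the model has not seen, as long as they have the same two-dimensional layout as the training input. A minimal sketch with made-up training data and two hypothetical new weights (none of these values come from the cars dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data, standing in for the cars DataFrame.
train = pd.DataFrame({
    "weight": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0],
    "mpg": [34.0, 29.0, 24.5, 19.0, 15.0],
})

model = LinearRegression()
model.fit(train[["weight"]], train["mpg"])

# Hypothetical unseen cars; the column name matches the training input.
new_cars = pd.DataFrame({"weight": [2200.0, 3800.0]})
new_predictions = model.predict(new_cars[["weight"]])
print(new_predictions)
```

Since the fitted slope is negative for this toy data, the heavier hypothetical car receives the lower predicted fuel efficiency.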
7:Plotting The Model
We can now plot the actual fuel efficiency values for each car alongside the predicted fuel efficiency values to gain a visual understanding of the model's effectiveness.
Instructions
On the same subplot:
- Generate a scatter plot with weight on the x-axis and the mpg column on the y-axis. Specify that you want the dots in the scatter plot to be red.
- Generate a scatter plot with weight on the x-axis and the predicted values on the y-axis. Specify that you want the dots in the scatter plot to be blue.
# Actual values in red, predicted values in blue, on the same axes.
plt.scatter(cars["weight"], cars["mpg"], c="red")
plt.scatter(cars["weight"], predictions, c="blue")
plt.show()
8:Error Metrics
The plot from the last step gave us a visual idea of how well the linear regression model performs. To obtain a more quantitative understanding, we can calculate the model's error, or the mismatch between a model's predictions and the actual values.
One commonly used error metric for regression is mean squared error, or MSE for short. You calculate MSE by first computing the squared error between each predicted value and the actual value:
$$(\hat{Y}_i - Y_i)^2$$
where $\hat{Y}_i$ is a predicted value for fuel efficiency and $Y_i$ is the actual mpg value. Then, you compute the mean of all of the squared errors:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$
Here's the same formula in pseudo-code:
sum = 0
for each data point:
    diff = predicted_value - actual_value
    squared_diff = diff ** 2
    sum += squared_diff
mse = sum/n
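The pseudo-code translates almost line for line into runnable Python. This is a sketch for illustration: the function name mean_squared_error_manual is hypothetical, and the input values are made up.

```python
def mean_squared_error_manual(predicted, actual):
    """Average of the squared differences between paired values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total = 0.0
    for predicted_value, actual_value in zip(predicted, actual):
        diff = predicted_value - actual_value
        total += diff ** 2
    return total / len(predicted)

# Made-up values for illustration.
predicted = [20.0, 25.0, 30.0]
actual = [22.0, 25.0, 26.0]
mse_manual = mean_squared_error_manual(predicted, actual)
print(mse_manual)
```

Here the differences are -2, 0, and 4, so the squared errors are 4, 0, and 16, and their mean is 20/3.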
We'll use the mean_squared_error function from scikit-learn to calculate MSE. We'll leave it to you to import the function and understand how to use it, so that you become more accustomed to reading documentation.
Instructions
- Import the mean_squared_error function.
- Use the mean_squared_error function to calculate the MSE of the predicted values and assign it to mse.
- Display the MSE value using a print statement or the variables display below the code cell after you run your code.
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
lr.fit(cars[["weight"]], cars[["mpg"]])
predictions = lr.predict(cars[["weight"]])
# mean_squared_error expects the true values first, then the predictions.
mse = mean_squared_error(cars["mpg"], predictions)
print(mse)
9:Root Mean Squared Error
There are many error metrics you can use, each with its own advantages and disadvantages. While the specific properties of each of the different error metrics are outside the scope of this mission, we'll introduce another error metric here.
Root mean squared error, or RMSE for short, is the square root of the MSE. Like the MSE, it penalizes large errors more heavily than small ones, but it is easier to interpret since its units are the same as those of the values being predicted. When computing MSE, we calculated the difference between each predicted and actual value, squared those differences, then averaged them. This means that the MSE value will be in miles per gallon squared while the RMSE value will be in miles per gallon.
Instructions
- Calculate the RMSE of the predicted values and assign it to rmse.
- Display the RMSE value using a print statement or the variables display below the code cell after you run your code.
mse = mean_squared_error(cars["mpg"], predictions)
rmse = mse ** (1/2)
print(rmse)
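To see the difference in units on a small scale, the sketch below computes both metrics by hand for three made-up predictions. The values are illustrative and unrelated to the cars dataset.

```python
import math

# Made-up predicted and actual fuel efficiencies, in miles per gallon.
predicted = [20.0, 25.0, 30.0]
actual = [22.0, 25.0, 26.0]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse_value = sum(squared_errors) / len(squared_errors)  # mpg squared
rmse_value = math.sqrt(mse_value)                      # mpg

print(mse_value, rmse_value)
```

The individual errors here are 2, 0, and 4 miles per gallon. The RMSE of roughly 2.58 sits on that same scale, whereas the MSE of roughly 6.67 does not.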
10:Next Steps
In this mission, we explored the basics of machine learning to better understand how the weight of a car relates to its fuel efficiency. We focused on regression, a class of machine learning techniques where the output is a continuous value.
Next up is a challenge where you can practice the concepts you learned in this mission.
Reposted from: https://my.oschina.net/Bettyty/blog/751261