
Introduction to Machine Learning


1:Introduction To Machine Learning

In data science, we're often trying to understand a process or system using observational data.

Here are a few specific examples:

  • How do the properties of a house affect its market value?
  • How does an applicant's application affect whether they get into graduate school?

These questions are high-level and tough to answer in the abstract. We can start to narrow these questions to the following:

  • How do the size of a house, the number of rooms, its neighborhood crime index, and its age affect its market value?
  • How do an applicant's college GPA and GRE score affect whether they get into graduate school?

We can start to answer these more specific questions by applying machine learning techniques to past data.

In the first problem, we're trying to predict a specific, real-valued number -- the market value of a house in dollars. Whenever we're trying to predict a real-valued number, the process is called regression.

In the second problem, we're trying to predict a binary value -- acceptance or rejection into graduate school. Whenever we're trying to predict a binary value, the process is called classification.

In this mission, we'll focus on a specific regression problem.

2:Introduction To The Data

  • How do the properties of a car impact its fuel efficiency?

To try to answer this question, we'll work with a dataset containing fuel efficiencies of several hundred cars compiled by Carnegie Mellon University. The dataset is hosted by the University of California Irvine on their machine learning repository. As a side note, the UCI Machine Learning Repository contains many small datasets that are useful when getting your hands dirty with machine learning.

You'll notice that the Data Folder contains a few different files. We'll be working with auto-mpg.data, which omits the 8 rows containing missing values for fuel efficiency (the mpg column). Even though the file's extension is .data, it's encoded as a plain text file and you can open it using any text editor. If you open auto-mpg.data in a text editor, you'll notice that the values in each line of the file are separated by a variable number of whitespace characters:

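[Image: auto-mpg.data opened in a text editor, with the values on each line separated by varying amounts of whitespace]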

Since the file isn't formatted as a CSV file and instead uses a variable number of whitespace characters to delimit the columns, you can't use read_csv with its default comma separator to read it into a DataFrame. Instead, use the read_table function, setting the delim_whitespace parameter to True so the file is parsed using the whitespace between values:

mpg = pd.read_table("auto-mpg.data", delim_whitespace=True)
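As a minimal sketch of whitespace-delimited parsing, the example below reads a few sample rows from an in-memory string rather than from auto-mpg.data. It assumes sep=r"\s+" in place of delim_whitespace=True (both split on runs of whitespace, and newer pandas releases favor the former). The names sample and sample_df and the rows themselves are for illustration only.

```python
import io

import pandas as pd

# Illustrative rows laid out like auto-mpg.data: values separated by a
# varying number of spaces, with no header line.
sample = io.StringIO(
    "18.0   8   307.0   130.0   3504.   12.0   70  1\n"
    "15.0   8   350.0   165.0   3693.   11.5   70  1\n"
    "24.0   4   113.0   95.00   2372.   15.0   70  3\n"
)

# sep=r"\s+" splits on any run of whitespace; header=None tells pandas
# that the first line is data rather than column names.
sample_df = pd.read_csv(sample, sep=r"\s+", header=None)
print(sample_df.shape)
```

Without header=None (or an explicit list of names), pandas would treat the first data row as the header, which is why the column names need to be supplied separately.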

Unfortunately, the file doesn't contain the column names, so you'll have to extract them from auto-mpg.names and specify them manually. The column names can be found in the Attribute Information section. Just like auto-mpg.data, auto-mpg.names is a text file that can be opened using a standard text editor.

As specified in auto-mpg.names, the dataset contains 7 numerical features that could have an effect on a car's fuel efficiency:

  • cylinders -- the number of cylinders in the engine.
  • displacement -- the displacement of the engine.
  • horsepower -- the horsepower of the engine.
  • weight -- the weight of the car.
  • acceleration -- the acceleration of the car.
  • model year -- the year that car model was released (e.g. 70 corresponds to 1970).
  • origin -- where the car was manufactured (1 if North America, 2 if Europe, 3 if Asia).

When reading in auto-mpg.data using read_table, you can use the names parameter to specify the column names as a list of strings. Let's now read the dataset into a DataFrame so we can explore it further.

Instructions

Read the dataset auto-mpg.data into a DataFrame named cars using the pandas function read_table.

  • Specify that you want the whitespace between values to be used as the delimiter.
  • Use the column names provided in auto-mpg.names to set the column names for the cars DataFrame.
  • Display the cars DataFrame using a print statement or by checking the variable inspector below the code box.

import pandas as pd
# Column names taken from the Attribute Information section of auto-mpg.names.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
cars = pd.read_table("auto-mpg.data", delim_whitespace=True, names=columns)
print(cars.head(5))

3:Exploratory Data Analysis

Using this dataset, we can work on a narrower problem:

  • How does the number of cylinders, displacement, horsepower, weight, acceleration, and model year affect a car's fuel efficiency?

Let's perform some exploratory data analysis for a couple of the columns to see which one correlates best with fuel efficiency.

Instructions

  • Create a grid of subplots containing 2 rows and 1 column.
  • Generate the following data visualizations:
    • Top chart: Scatter plot with the weight column on the x-axis and the mpg column on the y-axis.
    • Bottom chart: Scatter plot with the acceleration column on the x-axis and the mpg column on the y-axis.

import matplotlib.pyplot as plt
fig = plt.figure()
# Two stacked subplots: weight vs. mpg on top, acceleration vs. mpg below.
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
cars.plot(x="weight", y="mpg", kind="scatter", ax=ax1)
cars.plot(x="acceleration", y="mpg", kind="scatter", ax=ax2)
plt.show()

4:Linear Relationship

The scatter plots hint that there's a strong negative linear relationship between the weight and mpg columns and a weak, positive linear relationship between the acceleration and mpg columns. Let's now try to quantify the relationship between weight and mpg.
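One rough way to put a number on what a scatter plot suggests is the Pearson correlation coefficient. The sketch below is only an illustration: the toy DataFrame holds hypothetical values, not rows from auto-mpg.data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy values, not rows from auto-mpg.data.
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "acceleration": [18.0, 14.0, 17.0, 13.0, 15.0],
    "mpg": [34.0, 29.0, 23.0, 19.0, 14.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
# Values near -1 or 1 indicate a strong linear relationship; values
# near 0 indicate a weak one.
weight_corr = toy["weight"].corr(toy["mpg"])
accel_corr = toy["acceleration"].corr(toy["mpg"])
print(weight_corr, accel_corr)
```

On this toy data the weight correlation is strongly negative and the acceleration correlation is positive but noticeably weaker, mirroring the pattern described for the plots.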

A machine learning model is the equation that represents how the input is mapped to the output. Said another way, machine learning is the process of determining the relationship between the independent variable(s) and the dependent variable. In this case, the dependent variable is the fuel efficiency and the independent variables are the other columns in the dataset.

In this mission and the next few missions, we'll focus on a family of machine learning models known as linear models. These models take the form:

$$y = mx + b$$

The input is represented as x, transformed using the parameters m (slope) and b (intercept), and the output is represented as y. We expect m to be a negative number since the relationship between weight and mpg is a negative linear one.
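To make the roles of m and b concrete, here is a minimal sketch of a linear model as a plain Python function. The function name linear_model and the parameter values are hypothetical, chosen only for illustration; the real values come from fitting the model to the data.

```python
def linear_model(x, m, b):
    """Return the prediction y = m * x + b for a single input x."""
    return m * x + b


# Hypothetical parameters, not values fitted to auto-mpg.data.
m = -0.01  # each extra pound lowers the predicted mpg by 0.01
b = 50.0   # predicted mpg at a weight of 0 (the intercept)

print(linear_model(3000, m, b))
print(linear_model(4000, m, b))
```

Because m is negative, the heavier car receives the lower prediction.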

The process of finding the equation that best fits the data is called fitting. We won't dive into how a model is fit to the data in this mission and will instead focus on interpreting the model. We'll use the Python library scikit-learn to handle fitting the model to the data.

5:Scikit-Learn

To fit the model to the data, we'll use the machine learning library scikit-learn. Scikit-learn is one of the most popular libraries for working with machine learning models on small to medium-sized datasets. Even when working with larger datasets that don't fit in memory, scikit-learn is commonly used to prototype and explore machine learning models on a subset of the larger dataset.

Scikit-learn uses an object-oriented style, so each machine learning model must be instantiated before it can be fit to a dataset (similar to creating a figure in Matplotlib before you plot values). We'll be working with the LinearRegression class from sklearn.linear_model:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()

To fit a model to the data, we use the conveniently named fit method:

lr.fit(inputs, output)

where inputs is an n_rows by n_columns matrix and output is an n_rows by 1 matrix. The dataset we're working with contains 398 rows and 9 columns, but since we only want to use the weight column, we need to pass in a matrix containing 398 rows and 1 column. The catch is that if you select just the weight column and pass it in as the first argument to the fit method, an error will be raised. This is because scikit-learn converts Series and DataFrame objects to NumPy arrays, and a single column selected as a Series becomes a one-dimensional array rather than the two-dimensional matrix that fit expects.

You can use the values attribute to see which NumPy object is returned:

cars["weight"].values

A one-dimensional NumPy array with 398 elements will be returned instead of a matrix containing rows and columns. You can confirm this by using the shape attribute:

cars["weight"].values.shape

The value (398,) will be returned, representing a one-dimensional array of 398 elements with no column dimension. If you instead use double bracket notation:

cars[["weight"]].values

you'll get back a two-dimensional NumPy array with 398 rows and 1 column.
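The difference between single and double brackets can be checked on a small stand-in for the cars DataFrame. This is a sketch under the assumption that a three-row toy frame behaves like the full dataset; the names toy, single, double, and reshaped are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cars DataFrame.
toy = pd.DataFrame({"weight": [3504, 3693, 3436], "mpg": [18.0, 15.0, 18.0]})

single = toy["weight"].values    # Series -> one-dimensional array
double = toy[["weight"]].values  # one-column DataFrame -> two-dimensional array

print(single.shape)  # (3,)
print(double.shape)  # (3, 1)

# A one-dimensional array can also be turned into a single column explicitly.
reshaped = single.reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```

Either the double-bracket selection or the explicit reshape gives the n_rows by 1 layout that the fit method expects for its inputs.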

Instructions

  • Import the LinearRegression class from sklearn.linear_model.
  • Instantiate a LinearRegression instance and assign it to lr.
  • Use the fit method to fit a linear regression model using the weight column as the input and the mpg column as the output.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Double brackets keep weight and mpg as one-column DataFrames (2-D inputs).
lr.fit(cars[["weight"]], cars[["mpg"]])
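Once a model is fit, its slope and intercept can be read back from the coef_ and intercept_ attributes of the LinearRegression instance. The sketch below uses hypothetical toy data generated from a known line, not auto-mpg.data, so the recovered parameters are easy to check. It passes the target as a one-dimensional Series so that coef_ is a flat array.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from mpg = -0.01 * weight + 50.
toy = pd.DataFrame({"weight": [2000.0, 2500.0, 3000.0, 3500.0]})
toy["mpg"] = -0.01 * toy["weight"] + 50

model = LinearRegression()
model.fit(toy[["weight"]], toy["mpg"])

# coef_ holds one slope per input column; intercept_ holds b.
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

On data that lies exactly on a line, the fitted slope and intercept match the generating values up to floating point error. On real data they describe the best-fitting line instead.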

6:Making Predictions

Now that we have a trained linear regression model, we can use it to make predictions. Recall that this model takes in a weight value, in pounds, and outputs a fuel efficiency value, in miles per gallon. To use a model to make predictions, use the LinearRegression method predict. The predict method has a single required parameter, the n_samples by n_features input matrix, and returns the predicted values as an n_samples by 1 matrix (a NumPy array).

You may be wondering why we'd want to make predictions for the data we trained the model on, since we already know the true fuel efficiency values. Making predictions on data used for training is the first step in the testing and evaluation process. If the model can't do a good job of even capturing the structure of the training data, then we can't expect it to do a good job on data it wasn't trained on. This is known as underfitting, since the model underperforms on the data it was fit on.

Instructions

  • Use the LinearRegression method predict to make predictions using the values from the weight column.
  • Assign the resulting array of predictions to predictions.
  • Display the first 5 elements in predictions and the first 5 elements in the mpg column to compare the predicted values with the actual values.

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True)
lr.fit(cars[["weight"]], cars[["mpg"]])
# Predict mpg for the same weight values the model was trained on.
predictions = lr.predict(cars[["weight"]])
print(predictions[0:5])
print(cars["mpg"][0:5])

7:Plotting The Model

We can now plot the actual fuel efficiency values for each car alongside the predicted fuel efficiency values to gain a visual understanding of the model's effectiveness.

Instructions

On the same subplot:

  • Generate a scatter plot with weight on the x-axis and the mpg column on the y-axis. Specify that you want the dots in the scatter plot to be red.
  • Generate a scatter plot with weight on the x-axis and the predicted values on the y-axis. Specify that you want the dots in the scatter plot to be blue.

# Actual values in red, model predictions in blue, on the same axes.
plt.scatter(cars["weight"], cars["mpg"], c="red")
plt.scatter(cars["weight"], predictions, c="blue")
plt.show()

8:Error Metrics

The plot from the last step gave us a visual idea of how well the linear regression model performs. To obtain a more quantitative understanding, we can calculate the model's error, or the mismatch between a model's predictions and the actual values.

One commonly used error metric for regression is mean squared error, or MSE for short. You calculate MSE by first computing the squared error between each predicted value and the actual value:

$$(\hat{Y}_i - Y_i)^2$$

where $\hat{Y}_i$ is a predicted value for fuel efficiency and $Y_i$ is the actual mpg value. Then, you compute the mean of all of the squared errors:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$

Here's the same formula in pseudo-code:

sum = 0
for each data point:
 diff = predicted_value - actual_value
 squared_diff = diff ** 2
 sum += squared_diff
mse = sum/n
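The pseudo-code above translates almost line for line into runnable Python. This is a minimal sketch: the function name mean_squared_error_manual and the sample values are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between predictions and actual values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = 0.0
    for y_true, y_pred in zip(actual, predicted):
        diff = y_pred - y_true
        total += diff ** 2
    return total / len(actual)


# Hypothetical values, not taken from auto-mpg.data.
actual = [18.0, 15.0, 16.0]
predicted = [20.0, 14.0, 16.0]

# Squared errors are 4, 1, and 0, so the result is 5 / 3.
manual_mse = mean_squared_error_manual(actual, predicted)
print(manual_mse)
```

The accumulator is named total rather than sum so that it doesn't shadow Python's built-in sum function.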

We'll use the mean_squared_error function from scikit-learn to calculate MSE. We'll leave it to you to import the function and understand how to use it, so that you become more accustomed to reading documentation.

Instructions

  • Import the mean_squared_error function.
  • Use the mean_squared_error function to calculate the MSE of the predicted values and assign it to mse.
  • Display the MSE value using a print statement or the variables display below the code cell after you run your code.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
lr.fit(cars[["weight"]], cars[["mpg"]])
predictions = lr.predict(cars[["weight"]])
# mean_squared_error takes the actual values first, then the predictions.
mse = mean_squared_error(cars[["mpg"]], predictions)
print(mse)

9:Root Mean Squared Error

There are many error metrics you can use, each with its own advantages and disadvantages. While the specific properties of each of the different error metrics are outside the scope of this mission, we'll introduce another error metric here.

Root mean squared error, or RMSE for short, is the square root of the MSE. Like the MSE, it penalizes large errors more heavily than small ones, and it is easier to interpret since its units are the same as those of the data. When computing MSE, we calculated the difference between each predicted and actual value, squared each difference, then averaged the squared differences. This means that the MSE value will be in miles per gallon squared while the RMSE value will be in miles per gallon.
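To see how RMSE treats large errors, compare two sets of predictions whose errors add up to the same total. This is a small sketch with hypothetical mpg values; the function name rmse and the variable names are assumptions made for the example.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error, in the units of the target."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical mpg values. Both sets of predictions are off by 8 mpg in total.
actual = [20.0, 20.0, 20.0, 20.0]
even_errors = [22.0, 22.0, 22.0, 22.0]    # four errors of 2 mpg each
one_big_error = [20.0, 20.0, 20.0, 28.0]  # a single error of 8 mpg

print(rmse(actual, even_errors))    # 2.0
print(rmse(actual, one_big_error))  # 4.0
```

The single 8 mpg miss doubles the RMSE even though the total error is unchanged, and both results read directly as miles per gallon.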

Instructions

  • Calculate the RMSE of the predicted values and assign it to rmse.
  • Display the RMSE value using a print statement or the variables display below the code cell after you run your code.

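mse = mean_squared_error(cars["mpg"], predictions)
# Taking the square root returns the error to miles per gallon.
# 0.5 is used instead of 1/2, which evaluates to 0 under Python 2 integer division.
rmse = mse ** 0.5
print(rmse)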

10:Next Steps

In this mission, we explored the basics of machine learning to better understand how the weight of a car relates to its fuel efficiency. We focused on regression, a class of machine learning techniques where the output is a continuous value.

Next up is a challenge where you can practice the concepts you learned in this mission.

Reposted from: https://my.oschina.net/Bettyty/blog/751261