
A Few Simple Classification Models in MLlib (Python)

# This dataset comes from www.kaggle.com/c/stumbleupon/data and is used to predict whether a web page
# is ephemeral or evergreen (long-lived). Here it is stored in HDFS under /user/yy/stumbleupon/
rawData = sc.textFile("hdfs:///user/yy/stumbleupon/train_noheader.tsv")
records = rawData.map(lambda line: line.split('\t'))
records.first()
# Each sample is the record of one web page

[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"',
 u'"4042"',
 u'"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg ... by 2015, your mobile phone will project a 3-d image of anyone who calls and your laptop will be powered by kinetic energy. at least that\u2019s what international business machines corp. sees in its crystal ball."",""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"',
 u'"business"',
 u'"0.789131"',
 u'"2.055555556"',
 u'"0.676470588"',
 u'"0.205882353"',
 u'"0.047058824"',
 u'"0.023529412"',
 u'"0.443783175"',
 u'"0"',
 u'"0"',
 u'"0.09077381"',
 u'"0"',
 u'"0.245831182"',
 u'"0.003883495"',
 u'"1"',
 u'"1"',
 u'"24"',
 u'"0"',
 u'"5424"',
 u'"170"',
 u'"8"',
 u'"0.152941176"',
 u'"0.079129575"',
 u'"0"']

from pyspark.mllib.regression import LabeledPoint # the labeled-point type: (label, features), where features is usually a Vectors.dense vector
from pyspark.mllib.linalg import Vectors

# A quick two-line demo of what the DenseVector type buys us
a = sc.parallelize([Vectors.dense(1,2),Vectors.dense(3,4)])
a.sum() # an RDD of DenseVectors can be aggregated column-wise
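
Since DenseVector supports element-wise arithmetic, a.sum() here should come out to DenseVector([4.0, 6.0]), the column-wise sum of the two rows.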

# Strip the surrounding quotes from every field, replace missing values ('?') with 0.0,
# and clamp negative values to 0 so the same data can later be fed to NaiveBayes
sizeAndData = records.map(lambda record: (len(record), map(lambda e: e.replace('\"', ''), record)))
labelAndFeature = sizeAndData.map(lambda (size, data): (int(data[size-1]), map(lambda d: 0.0 if d == '?' else float(d), data[4: size-1])))
data = labelAndFeature.map(lambda (label, feature): LabeledPoint(label, Vectors.dense(map(lambda d: 0.0 if d < 0 else d, feature))))
data.first()

LabeledPoint(0.0, [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])

data.cache()
data.count()

7395

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.tree import DecisionTree
# from pyspark.mllib.tree.configuration import Algo 
# from pyspark.mllib.tree.impurity import Entropy

numIterations = 10
maxTreeDepth = 5
dataPoint = data.first()
dataPoint

LabeledPoint(0.0, [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])

# Train a logistic regression model
lrModel = LogisticRegressionWithSGD.train(data, numIterations)

# Use the logistic regression model on a single data point
prediction = lrModel.predict(dataPoint.features)
print "Predicted: %d, actual: %d" % (prediction, dataPoint.label)

# Compare each logistic regression prediction with its true label
predictionsAndRealities = data.map(lambda point: lrModel.predict(point.features) == point.label)

# Compute the accuracy of logistic regression
lrTotalCorrect = data.map(lambda point :
                         1 if lrModel.predict(point.features) == point.label else 0
                         ).sum()
lrAccuracy = float(lrTotalCorrect) / data.count()
print "Logistic regression accuracy:", lrAccuracy

# Compute the area under the PR (precision-recall) curve and the ROC curve
# (receiver operating characteristic: true positive rate vs. false positive rate), i.e. the AUC
from pyspark.mllib.evaluation import BinaryClassificationMetrics
scoreAndLabels = data.map(lambda point: (float(lrModel.predict(point.features)), point.label))

metrics = BinaryClassificationMetrics(scoreAndLabels)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (metrics.areaUnderPR * 100, metrics.areaUnderROC * 100)

Predicted: 1, actual: 0

Logistic regression accuracy: 0.514672075727

Area under PR: 75.6759%, Area under ROC: 50.1418%
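
Note that the AUC here is computed from hard 0/1 predictions, so the "curve" collapses to a single operating point. A minimal sketch of a score-based AUC instead (variable names are mine, not from the original): the SGD-based models can return their raw score from predict() once the decision threshold is cleared.

lrModel.clearThreshold()  # predict() now returns the raw score instead of 0/1
rawScoresAndLabels = data.map(lambda point: (float(lrModel.predict(point.features)), point.label))
rawMetrics = BinaryClassificationMetrics(rawScoresAndLabels)
print 'Score-based Area under ROC: %2.4f%%' % (rawMetrics.areaUnderROC * 100)
lrModel.setThreshold(0.5)  # restore hard 0/1 predictions for the rest of the section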

# Train an SVM model
svmModel = SVMWithSGD.train(data, numIterations)

# Use the SVM model on a single data point
prediction = svmModel.predict(dataPoint.features)
print "Predicted: %d, actual: %d" % (prediction, dataPoint.label)

# Compare each SVM prediction with its true label
predictionsAndRealities = data.map(lambda point: svmModel.predict(point.features) == point.label)

# Compute the accuracy of the SVM
svmTotalCorrect = data.map(lambda point :
                         1 if svmModel.predict(point.features) == point.label else 0
                         ).sum()
svmAccuracy = float(svmTotalCorrect) / data.count()
print "SVM accuracy:", svmAccuracy

# Compute the area under the PR and ROC curves (AUC), as before
scoreAndLabels = data.map(lambda point: (float(svmModel.predict(point.features)), point.label))

metrics = BinaryClassificationMetrics(scoreAndLabels)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (metrics.areaUnderPR * 100, metrics.areaUnderROC * 100)

Predicted: 1, actual: 0

SVM accuracy: 0.514672075727

Area under PR: 75.6759%, Area under ROC: 50.1418%

# Train a naive Bayes model
nbModel = NaiveBayes.train(data)

# Use the naive Bayes model on a single data point
prediction = nbModel.predict(dataPoint.features)
print "Predicted: %d, actual: %d" % (prediction, dataPoint.label)

# Compare each naive Bayes prediction with its true label
predictionsAndRealities = data.map(lambda point: nbModel.predict(point.features) == point.label)

# Compute the accuracy of naive Bayes
nbTotalCorrect = data.map(lambda point :
                         1 if nbModel.predict(point.features) == point.label else 0
                         ).sum()
nbAccuracy = float(nbTotalCorrect) / data.count()
print "Naive Bayes accuracy:", nbAccuracy

# Compute the area under the PR and ROC curves (AUC), as before
scoreAndLabels = data.map(lambda point: (float(nbModel.predict(point.features)), point.label))

metrics = BinaryClassificationMetrics(scoreAndLabels)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (metrics.areaUnderPR * 100, metrics.areaUnderROC * 100)

Predicted: 1, actual: 0

Naive Bayes accuracy: 0.580392156863

Area under PR: 68.0851%, Area under ROC: 58.3559%

# Train a decision tree model: 2 classes, no categorical features, depth maxTreeDepth
dtModel = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=maxTreeDepth)

# Use the decision tree model on a single data point
prediction = dtModel.predict(dataPoint.features)
print "Predicted: %d, actual: %d" % (prediction, dataPoint.label)

# Overall decision tree predictions. In Python the tree model can only be called on the
# driver, so predict on an RDD of feature vectors instead of inside a map over points
predictions = dtModel.predict(data.map(lambda point: point.features))
results = predictions.zip(data.map(lambda point: point.label))

# Compute the accuracy of the decision tree
dtTotalCorrect = results.filter(lambda (pred, actual): pred == actual).count()
dtAccuracy = float(dtTotalCorrect) / data.count()
print "Decision tree accuracy:", dtAccuracy

# Compute the area under the PR and ROC curves (AUC); results already holds (prediction, label) pairs
metrics = BinaryClassificationMetrics(results)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (metrics.areaUnderPR * 100, metrics.areaUnderROC * 100)

Predicted: 0, actual: 0

Decision tree accuracy: 0.648275862069

Area under PR: 74.2894%, Area under ROC: 64.8916%

# The results above are barely better than random guessing, because we simply fed the raw features into each model
# Next: improving model performance through feature standardization, extra features, and parameter tuning
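
One caveat before tuning: every accuracy above was measured on the very data the models were trained on. A minimal sketch of a more honest evaluation on a held-out split (the 80/20 ratio and variable names are mine, not from the original):

# Sketch: hold out 20% of the data and measure accuracy there instead of on the training set
trainData, testData = data.randomSplit([0.8, 0.2], seed=42)
heldOutModel = LogisticRegressionWithSGD.train(trainData, numIterations)
heldOutCorrect = testData.filter(lambda point: heldOutModel.predict(point.features) == point.label).count()
print "Held-out accuracy:", float(heldOutCorrect) / testData.count()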

# from pyspark.mllib.linalg.distributed import RowMatrix

matrix = data.map(lambda point: point.features)
print "矩陣的第一行:\n", matrix.first()

result = matrix.stats()
print "Column means:\n", result.mean()
print "Column minima:\n", result.min()
print "Column maxima:\n", result.max()
print "Column standard deviations:\n", result.stdev()
print "Column variances:\n", result.variance()

First row of the matrix:
[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]

Column means:
[0.412258052995,2.76182319199,0.468230473286,0.214079926384,0.092062360719,0.0492621604391,2.25510345221,0.000914886880189,0.0,0.0564227449842,0.02123056119,0.233778176655,0.369015325398,0.615551048005,0.66031102096,30.0770791075,0.0397565922921,5716.59824206,178.754563895,4.96064908722,0.17286405047,0.101220791893]

Column minima:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.04556422 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]

Column maxima:
[9.99426000e-01 3.63000000e+02 1.00000000e+00 1.00000000e+00 9.80392157e-01
 9.80392157e-01 2.10000000e+01 2.50000000e-01 0.00000000e+00 4.44444444e-01
 1.00000000e+00 7.16883117e-01 1.13333333e+02 1.00000000e+00 1.00000000e+00
 1.00000000e+02 1.00000000e+00 2.07952000e+05 4.99700000e+03 2.20000000e+01
 1.00000000e+00 1.00000000e+00]

Column standard deviations:
[3.31251568e-01 8.61920979e+00 2.03119645e-01 1.46732834e-01 9.59717238e-02
 7.26238264e-02 5.70392733e+00 7.42829375e-03 0.00000000e+00 4.14428371e-02
 1.44152088e-01 5.24830153e-02 1.87878356e+00 4.86464752e-01 4.73603607e-01
 2.03917219e+01 1.95386810e-01 8.87483231e+03 1.79454063e+02 3.23289275e+00
 1.83273269e-01 7.92255262e-02]

Column variances:
[0.109727601589,74.2907773273,0.041257590017,0.0215305244407,0.00921057176901,0.00527422015895,32.5347869854,5.51795480353e-05,0.0,0.00171750875055,0.0207798244616,0.00275446689512,3.52982764865,0.236647955305,0.224300376559,415.822321151,0.0381760056614,78762648.5733,32203.7608565,10.4515955219,0.0335890912886,0.00627668399539]

from pyspark.mllib.feature import StandardScaler

# withMean=True subtracts each column's mean; withStd=True scales each column by its standard deviation
scaler = StandardScaler(withMean=True, withStd=True).fit(matrix)
scaledData = scaler.transform(matrix)
scaledData = data.map(lambda lp: lp.label).zip(scaledData).map(lambda (label, feature): LabeledPoint(label, feature))

print "標準化前:\n", data.first().features
print "標準化後:\n", scaledData.first().features

# Cross-check by standardizing the first row manually with numpy
import numpy as np
vector = np.array(matrix.first())
mean = np.array(result.mean())
stdev = np.array(result.stdev())
print "Manual standardization:\n", (vector - mean) / stdev

Before standardization:
[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]

After standardization:
[1.1376473365,-0.0819355716929,1.02513981289,-0.0558635644254,-0.468893253129,-0.354305326308,-0.317535217236,-0.123154125351,0.0,0.828822173315,-0.147268943346,0.229639823578,-0.194331667814,0.790238049918,0.717194729453,-0.297996816496,-0.20346257793,-0.0329672096969,-0.0487811297558,0.940069975117,-0.108698488525,-0.278820782314]

Manual standardization:

[ 1.13772426 -0.08194111  1.02520913 -0.05586734 -0.46892496 -0.35432928
 -0.31755669 -0.12316245         nan  0.82887822 -0.1472789   0.22965535
 -0.19434481  0.79029149  0.71724323 -0.29801697 -0.20347634 -0.03296944
 -0.04878443  0.94013354 -0.10870584 -0.27883964]

The nan comes from the all-zero ninth column: its standard deviation is 0, so the manual division is 0/0, whereas StandardScaler maps zero-variance columns to 0.0.

# Retrain the models on the standardized data. The decision tree is skipped because it is
# invariant to feature scaling, and naive Bayes is skipped because it requires non-negative features

# Train a logistic regression model on the standardized data
lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numIterations)

# Count how many predictions are correct overall
lrTotalCorrectScaled = scaledData.filter(lambda point: lrModelScaled.predict(point.features) == point.label).count()
print "Correctly predicted samples:", lrTotalCorrectScaled

# Compute the accuracy of logistic regression
lrAccuracyScaled = float(lrTotalCorrectScaled) / data.count()
print "Logistic regression accuracy:", lrAccuracyScaled

# Compute the area under the PR and ROC curves (AUC)
lrPredictionsVsTrue = scaledData.map(lambda point: (float(lrModelScaled.predict(point.features)), point.label))
lrMetricsScaled = BinaryClassificationMetrics(lrPredictionsVsTrue)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (lrMetricsScaled.areaUnderPR * 100, lrMetricsScaled.areaUnderROC * 100)

Correctly predicted samples: 4609

Logistic regression accuracy: 0.623258958756

Area under PR: 73.0204%, Area under ROC: 62.2292%

# Add the page's category as a feature, e.g. whether a page belongs to "sports" or "business"
categories = dict(records.map(lambda record: record[3].replace('\"', '')).distinct().zipWithIndex().collect())
print categories
numCategories = len(categories)
print numCategories

# Broadcast the lookup tables so each worker gets a single read-only copy
categories = sc.broadcast(categories)
numCategories = sc.broadcast(numCategories)

{u'gaming': 7, u'recreation': 0, u'business': 1, u'computer_internet': 2, u'unknown': 8, u'culture_politics': 3, u'science_technology': 9, u'law_crime': 4, u'sports': 10, u'religion': 11, u'weather': 12, u'health': 5, u'?': 6, u'arts_entertainment': 13}
14

def buildCategoriesVector(data):
    # 1-of-k encode the category column and prepend it to the numeric features
    vector = [0] * numCategories.value
    vector[categories.value[data[3]]] = 1
    return vector + data[4:-1]

sizeAndData = records.map(lambda record : (len(record), map(lambda e: e.replace('\"', ''), record)))
labelAndFeature = sizeAndData.map(lambda (size, data): (int(data[-1]), map(lambda d: 0.0 if d == '?' else float(d), buildCategoriesVector(data))))
dataCategories = labelAndFeature.map(lambda (label, feature): LabeledPoint(label, Vectors.dense(map(lambda d: 0.0 if d < 0 else d, feature))))
dataPointCats = dataCategories.first()
print labelAndFeature.first()
print dataPointCats 

(0, [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.789131, 2.055555556, 0.676470588, 0.205882353, 0.047058824, 0.023529412, 0.443783175, 0.0, 0.0, 0.09077381, 0.0, 0.245831182, 0.003883495, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.152941176, 0.079129575])

(0.0,[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])

# Standardize the extended feature matrix
labels = dataCategories.map(lambda lp: lp.label)
featuresMatrix = dataCategories.map(lambda lp: lp.features)
scalerCats = StandardScaler(withMean=True, withStd=True).fit(featuresMatrix)
scaledDataCats = scalerCats.transform(featuresMatrix).zip(labels).map(lambda (features, label): LabeledPoint(label, features))
scaledDataCats.first()

LabeledPoint(0.0, [-0.446421204794,2.72073665645,-0.204182210579,-0.220526884579,-0.0648775723926,-0.270999069693,-0.680752790425,-0.101894690972,-0.028494000387,-0.201654052319,-0.232727977095,-0.0991499193088,-0.0232621058984,-0.381813223243,1.1376473365,-0.0819355716929,1.02513981289,-0.0558635644254,-0.468893253129,-0.354305326308,-0.317535217236,-0.123154125351,0.0,0.828822173315,-0.147268943346,0.229639823578,-0.194331667814,0.790238049918,0.717194729453,-0.297996816496,-0.20346257793,-0.0329672096969,-0.0487811297558,0.940069975117,-0.108698488525,-0.278820782314])

lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCats, numIterations)

# Count how many predictions are correct overall
lrTotalCorrectScaledCats = scaledDataCats.filter(lambda point: lrModelScaledCats.predict(point.features) == point.label).count()
print "Correctly predicted samples:", lrTotalCorrectScaledCats

# Compute the accuracy of logistic regression
lrAccuracyScaledCats = float(lrTotalCorrectScaledCats) / scaledDataCats.count()
print "Logistic regression accuracy:", lrAccuracyScaledCats

# Compute the area under the PR and ROC curves (AUC)
lrPredictionsVsTrue = scaledDataCats.map(lambda point: (float(lrModelScaledCats.predict(point.features)), point.label))
lrMetricsScaledCats = BinaryClassificationMetrics(lrPredictionsVsTrue)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (lrMetricsScaledCats.areaUnderPR * 100, lrMetricsScaledCats.areaUnderROC * 100)

Correctly predicted samples: 4928

Logistic regression accuracy: 0.666396213658

Area under PR: 75.8535%, Area under ROC: 66.6127%
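
The parameter tuning promised above can be sketched as a simple grid over the two main knobs of the SGD optimizer, iteration count and step size (the helper name and grid values here are my own, not from the original):

# Sketch: sweep SGD iteration count and step size, scoring each model by AUC
def trainWithParams(dataset, numIter, stepSize):
    model = LogisticRegressionWithSGD.train(dataset, iterations=numIter, step=stepSize)
    scores = dataset.map(lambda p: (float(model.predict(p.features)), p.label))
    return BinaryClassificationMetrics(scores).areaUnderROC

for numIter in [1, 5, 10, 50]:
    for stepSize in [0.001, 0.01, 0.1, 1.0]:
        auc = trainWithParams(scaledDataCats, numIter, stepSize)
        print "iterations=%d, step=%g: AUC=%2.4f%%" % (numIter, stepSize, auc * 100)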

# Use only the 1-of-k encoded category features
def onlyCategoriesVector(data):
    vector = [0] * numCategories.value
    vector[categories.value[data[3]]] = 1
    return vector

sizeAndData = records.map(lambda record : (len(record), map(lambda e: e.replace('\"', ''), record)))
labelAndFeature = sizeAndData.map(lambda (size, data): \
                                  (int(data[-1]), map(lambda d: 0.0 if d == '?' else float(d), onlyCategoriesVector(data))))
dataNB = labelAndFeature.map(lambda (label, feature): \
                             LabeledPoint(label, Vectors.dense(map(lambda d: 0.0 if d < 0 else d, feature))))

# Train a naive Bayes model on the category-only features
nbModelCats = NaiveBayes.train(dataNB)

# Compute the accuracy of naive Bayes
nbTotalCorrectCats = dataNB.map(lambda point :
                         1 if nbModelCats.predict(point.features) == point.label else 0
                         ).sum()
nbAccuracyCats = float(nbTotalCorrectCats) / dataNB.count()
print "Naive Bayes accuracy:", nbAccuracyCats

nbPredictionsVsTrueCats = dataNB.map(lambda point: (float(nbModelCats.predict(point.features)), point.label))
nbMetricsCats = BinaryClassificationMetrics(nbPredictionsVsTrueCats)
print 'Area under PR: %2.4f%%, Area under ROC: %2.4f%%' % \
    (nbMetricsCats.areaUnderPR * 100, nbMetricsCats.areaUnderROC * 100)

Naive Bayes accuracy: 0.609601081812

Area under PR: 74.0522%, Area under ROC: 60.5138%
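
Pulling the numbers together: standardizing the numeric features lifted logistic regression from 51.5% to 62.3% accuracy, adding the 1-of-k category features lifted it further to 66.6% (AUC 66.61%), and naive Bayes on the category features alone reached 61.0%. Most of the gain here came from feature preparation rather than from switching models.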
