Learning Naive Bayes
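Naive Bayes is called "naive" because it assumes that all features are independent of one another given the class (this is often loosely described as "independent and identically distributed"). The assumption is certainly not fully realistic: in practice, features are almost always related to each other in many ways. Still, it is a reasonable simplification, and on certain tasks naive Bayes performs quite well. Since "practice is the sole criterion for testing truth", that is enough to make the model worthwhile. In this respect it resembles the Markov assumption.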
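The idea behind naive Bayes is to start from existing material, usually called the training corpus. This data contains a lot of information, and that real-world information reflects some underlying pattern. Naive Bayes is a model that fits this latent pattern, not perfectly, but well enough to be useful.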
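For example, suppose we have records of which husbands women actually chose. From these records we can use naive Bayes to try to find the pattern behind the choice (not a 100% accurate pattern, of course, but a reasonably accurate one).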
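The core idea of naive Bayes is this: for a particular man with four features x1, x2, x3, x4, if p(marry|x1,x2,x3,x4) > p(not marry|x1,x2,x3,x4), then a woman would most likely be willing to marry him; otherwise, most likely not.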
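By Bayes' theorem:
p(marry|x1,x2,x3,x4) = p(x1,x2,x3,x4|marry) * p(marry) / p(x1,x2,x3,x4)
Under the naive Bayes assumption that the features are independent, the formula above can be written as:
p(marry|x1,x2,x3,x4) = p(x1|marry) * p(x2|marry) * p(x3|marry) * p(x4|marry) * p(marry) / ( p(x1) * p(x2) * p(x3) * p(x4) )
The formula for p(not marry|x1,x2,x3,x4) has the same form, with the same denominator.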
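The quantities p(x1), p(x2), p(x3), p(x4), p(marry), p(x1|marry), p(x2|marry), p(x3|marry) and p(x4|marry) can all be obtained easily from the training corpus by counting, so the question of whether p(marry|x1,x2,x3,x4) > p(not marry|x1,x2,x3,x4) can be answered.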
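The counting step can be sketched as follows. This is a minimal illustration, not the article's implementation: the rows below are hypothetical toy data (not the contents of trainData.txt), and the function name estimate and the 0/1 encoding are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows (NOT the article's trainData.txt): four binary
# features followed by the label, where 1 = marry and 0 = not marry.
rows = [
    (1, 1, 1, 1, 1),
    (1, 0, 1, 1, 1),
    (0, 1, 1, 0, 1),
    (0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0),
    (0, 1, 0, 1, 0),
]


def estimate(rows):
    """Estimate probabilities by relative frequency.

    Returns (prior, cond) where
      prior[c]        approximates p(c)
      cond[c][(i, v)] approximates p(x_i = v | c)
    """
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = defaultdict(Counter)
    for *features, label in rows:
        for i, value in enumerate(features):
            feature_counts[label][(i, value)] += 1

    prior = {c: n / len(rows) for c, n in class_counts.items()}
    # Each conditional is divided by the size of its own class,
    # not by the total number of rows.
    cond = {
        c: {key: n / class_counts[c] for key, n in counts.items()}
        for c, counts in feature_counts.items()
    }
    return prior, cond


prior, cond = estimate(rows)
print(prior[1])         # estimate of p(marry)
print(cond[1][(0, 1)])  # estimate of p(x1 = 1 | marry)
```

Feature/value pairs that never occur within a class are simply absent from cond, which corresponds to an estimated probability of zero.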
Now suppose a man has the features: not handsome, bad personality, short, not ambitious. The probabilities needed are:
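p(not handsome)=5/12, p(bad personality)=4/12, p(short)=7/12, p(not ambitious)=5/12, p(marry)=6/12.
p(not handsome|marry)=3/6, p(bad personality|marry)=1/6, p(short|marry)=1/6, p(not ambitious|marry)=1/6 (each count is divided by the 6 "marry" rows, not by all 12 rows).
Therefore: p(marry|x1,x2,x3,x4) = 3/6*1/6*1/6*1/6*6/12 / ( 5/12*4/12*7/12*5/12 ) = 24/700 ≈ 0.034
Likewise, with p(not handsome|not marry)=2/6, p(bad personality|not marry)=3/6, p(short|not marry)=6/6, p(not ambitious|not marry)=4/6: p(not marry|x1,x2,x3,x4) = 2/6*3/6*6/6*4/6*6/12 / ( 5/12*4/12*7/12*5/12 ) = 1152/700 ≈ 1.646. This value exceeds 1 because the denominator p(x1)*p(x2)*p(x3)*p(x4) is only an approximation of p(x1,x2,x3,x4); since both classes share the same denominator, the comparison is still valid.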
Clearly, a woman would most likely not be willing to marry this man.
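The arithmetic can be checked with exact fractions. The sketch below assumes the counts implied by the example: 12 rows, 6 of them "marry", with 3, 1, 1 and 1 of the "marry" rows having the four feature values; the "not marry" counts are the overall counts minus those. Because both posteriors share the same denominator, the sketch drops it and normalizes the two numerators instead, which yields values that sum to 1. The variable and function names are illustrative.

```python
from fractions import Fraction
from math import prod

TOTAL = 12                       # rows in the training corpus
N_MARRY = 6                      # rows labelled "marry"
N_NOT_MARRY = TOTAL - N_MARRY    # rows labelled "not marry"

# Rows having each feature value of the example man:
# not handsome, bad personality, short, not ambitious.
overall_counts = [5, 4, 7, 5]    # over all 12 rows
marry_counts = [3, 1, 1, 1]      # among the "marry" rows (assumed reading)
not_marry_counts = [a - m for a, m in zip(overall_counts, marry_counts)]


def class_score(counts, class_size):
    """Numerator of the posterior: p(c) * product of p(x_i | c)."""
    prior = Fraction(class_size, TOTAL)
    return prior * prod(Fraction(c, class_size) for c in counts)


score_marry = class_score(marry_counts, N_MARRY)
score_not = class_score(not_marry_counts, N_NOT_MARRY)

# Normalize so the two posteriors sum to 1.
evidence = score_marry + score_not
p_marry = score_marry / evidence
p_not = score_not / evidence

print(p_marry, p_not)  # 1/49 48/49
```

The ratio between the two classes is 48 to 1 whichever denominator is used, so the conclusion is the same: "not marry" is far more likely.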
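The concrete implementation follows. It randomly generates 10 men and uses the training corpus to decide whether a woman would most likely be willing to marry each of them.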
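It is implemented in both Python and Java. One point to watch in the Java version: the conditional probabilities for the "not marry" class must be divided by the number of "not marry" rows. Because this training set has equally many "marry" and "not marry" rows, dividing by the "marry" count would happen to give the same result.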
Python code:

    # Naive Bayes with scikit-learn's Gaussian Naive Bayes model
    import random

    from sklearn.naive_bayes import GaussianNB


    def encode(token):
        # Tokens containing "不" (not) or "矮" (short) are the negative value 0;
        # every other token (帥, 好, 高, 上進, 嫁) is the positive value 1.
        # Use "in" rather than str.find(): find() returns -1 (truthy) when the
        # substring is absent and 0 (falsy) when it is at the start.
        return 0 if ("不" in token or "矮" in token) else 1


    a = []  # feature vectors: [handsome, good personality, tall, ambitious]
    b = []  # labels: 1 = marry, 0 = not marry
    with open("trainData.txt", "r", encoding="utf-8") as f:
        for l in f:
            temp = [encode(m) for m in l.split()]
            if not temp:
                continue  # skip blank lines
            a.append(temp[:4])
            b.append(temp[-1])

    # Create a Gaussian classifier and train it on the training set
    model = GaussianNB()
    model.fit(a, b)

    # (label for value 0, label for value 1) for each of the four features
    names = [
        ("not handsome", "handsome"),
        ("bad personality", "good personality"),
        ("short", "tall"),
        ("not ambitious", "ambitious"),
    ]

    # Randomly generate 10 men and predict the outcome for each
    for _ in range(10):
        x = [random.randint(0, 1) for _ in names]
        description = [pair[value] for pair, value in zip(names, x)]
        predicted = model.predict([x])
        print(*description, "not marry" if predicted[0] == 0 else "marry")
Java code:

    package bayesTest;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class bayesTest {

        // Substring that marks the negative value of each feature in a training row
        private static final String[] NEGATIVE_MARKERS = {"不帥", "不好", "矮", "不上進"};
        // Printed labels: LABELS[i][0] is the negative value, LABELS[i][1] the positive one
        private static final String[][] LABELS = {
            {"not handsome", "handsome"},
            {"bad personality", "good personality"},
            {"short", "tall"},
            {"not ambitious", "ambitious"}
        };

        public static void main(String[] args) throws IOException {
            int[][] feature = new int[4][2];   // counts among "marry" rows: feature[i][value]
            int[][] feature2 = new int[4][2];  // counts among "not marry" rows
            int lineNum = 0;
            int marryCount = 0;

            try (BufferedReader br = Files.newBufferedReader(
                    Paths.get("Data", "trainData.txt"), StandardCharsets.UTF_8)) {
                String str;
                while ((str = br.readLine()) != null) {
                    if (str.trim().isEmpty()) {
                        continue; // skip blank lines
                    }
                    boolean marry = !str.contains("不嫁");
                    if (marry) {
                        marryCount++;
                    }
                    int[][] counts = marry ? feature : feature2;
                    for (int i = 0; i < 4; i++) {
                        int value = str.contains(NEGATIVE_MARKERS[i]) ? 0 : 1;
                        counts[i][value]++;
                    }
                    lineNum++;
                }
            }
            int notMarryCount = lineNum - marryCount;

            // p(marry|x1,x2,x3,x4) = p(x1|marry)*p(x2|marry)*p(x3|marry)*p(x4|marry)*p(marry)
            //                        / (p(x1)*p(x2)*p(x3)*p(x4))
            for (int n = 0; n < 10; n++) {
                StringBuilder description = new StringBuilder();
                double answer1 = (double) marryCount / lineNum;     // starts at p(marry)
                double answer2 = (double) notMarryCount / lineNum;  // starts at p(not marry)
                double evidence = 1.0;                              // p(x1)*p(x2)*p(x3)*p(x4)
                for (int i = 0; i < 4; i++) {
                    int x = Math.random() > 0.5 ? 0 : 1;
                    description.append(LABELS[i][x]).append(",");
                    answer1 *= (double) feature[i][x] / marryCount;
                    // Divide by the "not marry" count here, not by marryCount
                    answer2 *= (double) feature2[i][x] / notMarryCount;
                    evidence *= (double) (feature[i][x] + feature2[i][x]) / lineNum;
                }
                answer1 /= evidence;
                answer2 /= evidence;
                String verdict = answer1 > answer2 ? "marry" : "not marry";
                System.out.println(description + verdict + "," + answer1 + "," + answer2);
            }
        }
    }
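As this shows, Python is indeed well suited to machine learning work: the code is much more concise, largely because scikit-learn supplies the model, while the Java version computes the probabilities by hand.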