Kaggle Competition Practice 1: What's Cooking?

Introduction to Kaggle

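Kaggle is a competition platform for data analysis (https://www.kaggle.com/). Companies and researchers can publish their data, a problem description, and the target metric on Kaggle and solicit solutions from data scientists in the form of a competition, similar to KDD-CUP (the international knowledge discovery and data mining competition). Participants download the data, analyze it, build models using machine learning and data mining techniques, and submit their results. Top-ranked entries may win substantial prizes.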

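Most of us are not experts, so what does Kaggle offer us? In my view, practice. Knowledge gained only on paper always stays shallow. Turning the machine learning algorithms from books and papers into code and applying them to real problems improves both your understanding of the algorithms and your coding skills (starting with the simplest ones). I also want to keep improving outside of work, so I plan to work through some Kaggle problems and write them up here for sharing and discussion. The code in this blog is experimental, with only moderate accuracy. It will not rank highly and is shared purely as practice.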

What Dish Is It? (What's Cooking)

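Each training record contains an ID, the cuisine, and the list of ingredients. The cuisine is the class to predict, and the ingredients can be treated as features. Each test record contains only the ID and the ingredients, and the cuisine has to be predicted. The data is real data provided by Yummly.
The data for this problem is provided in JSON format.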
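As a rough illustration of that format, here is a minimal sketch of how such records can be turned into binary indicator matrices. The field names `id`, `cuisine`, and `ingredients` are the ones the script later in this post reads. The two sample records and the helper name `encode` are invented for illustration and are not taken from the real data set.

```python
import json
import numpy as np

# Hypothetical records in the same shape as train.json (values invented).
sample_json = '''
[
  {"id": 1, "cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
  {"id": 2, "cuisine": "mexican", "ingredients": ["tomato", "corn tortillas", "lime"]}
]
'''

def encode(records):
    """Build binary feature rows X and one-hot label rows T from the records."""
    ingredients = sorted({ing for r in records for ing in r["ingredients"]})
    cuisines = sorted({r["cuisine"] for r in records})
    # Dictionaries give constant-time lookup of a column index by name.
    ing_index = {name: j for j, name in enumerate(ingredients)}
    cui_index = {name: k for k, name in enumerate(cuisines)}
    X = np.zeros((len(records), len(ingredients)))
    T = np.zeros((len(records), len(cuisines)))
    for i, r in enumerate(records):
        T[i, cui_index[r["cuisine"]]] = 1      # one-hot cuisine label
        for ing in r["ingredients"]:
            X[i, ing_index[ing]] = 1           # 1 if the recipe uses the ingredient
    return X, T, ingredients, cuisines

records = json.loads(sample_json)
X, T, ingredients, cuisines = encode(records)
print(X.shape, T.shape)
```

Each row of `X` marks which ingredients a recipe uses, and each row of `T` has a single 1 in the column of its cuisine.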

Logistic Regression (LR) Solution

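This post uses multiclass logistic regression, optimized with stochastic gradient descent, and reaches a prediction accuracy of 76%. The full script is listed below.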
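Before the listing, it may help to write out the per-sample update it applies. The update rule in the last line is read directly from the script. The objective E is an assumption: it is the usual cross-entropy loss with an L2 penalty, whose gradient matches that update. Here x is the binary ingredient vector of one recipe, t its one-hot cuisine label, η the learning rate, and α the regularization coefficient.

```latex
\begin{align}
y_j &= \frac{\exp(\mathbf{w}_j^\top \mathbf{x})}{\sum_{k=1}^{K} \exp(\mathbf{w}_k^\top \mathbf{x})} \\
E &= -\sum_{j=1}^{K} t_j \ln y_j + \frac{\alpha}{2} \sum_{j=1}^{K} \lVert \mathbf{w}_j \rVert^2 \\
\frac{\partial E}{\partial \mathbf{w}_j} &= (y_j - t_j)\,\mathbf{x} + \alpha\,\mathbf{w}_j \\
\mathbf{w}_j &\leftarrow \mathbf{w}_j - \eta\,(y_j - t_j)\,\mathbf{x} - \eta\,\alpha\,\mathbf{w}_j
\end{align}
```

In the script, η starts at 1 and is multiplied by 0.1 at the start of every pass over the data, so the first pass runs with η = 0.1.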
# -*- coding: utf-8 -*-
'''
Created on 2015-9-25
@author: joeyqzhou
'''
import json
import csv

import numpy as np

t_set = set()        # all cuisine labels
x_set = set()        # x: features (all ingredients)
learning_rate = 1
alpha = 0.00001      # regularization penalty coefficient
iter_time = 5        # number of passes over the training data

if __name__ == "__main__":
    with open("train.json") as json_file:
        train_data = json.load(json_file)  # list of dicts

    N = len(train_data)  # number of training records
    for datai in train_data:
        t_set.add(datai['cuisine'])
        for ingredient in datai['ingredients']:
            x_set.add(ingredient)

    K = len(t_set)  # number of classes
    M = len(x_set)  # number of features
    t_list = list(t_set)
    x_list = list(x_set)

    # Each label is a K-dimensional one-hot vector; all N labels are stacked in T.
    T = np.zeros([N, K])
    X = np.zeros([N, M])
    for i in range(N):
        datai = train_data[i]
        cuisine_i_index = t_list.index(datai['cuisine'])
        T[i, cuisine_i_index] = 1
        for itemj in datai['ingredients']:
            itemj_index = x_list.index(itemj)
            X[i, itemj_index] = 1

    # Optimization: learn W, initialized uniformly in [-0.5, 0.5)
    W = np.random.rand(K, M) - 0.5

    # Randomly mark which records are used for training.
    # To hold out 10% for testing (9:1 split), change 1.01 to 0.9.
    training_label = np.zeros(N)
    for i in range(N):
        if np.random.rand() < 1.01:
            training_label[i] = 1

    # Training: update W after every single record
    y = np.zeros(K)  # predicted class probabilities
    for it in range(iter_time):
        learning_rate = learning_rate * 0.1
        print("training: ", it, " time")
        for i in range(N):  # each record
            if training_label[i] == 1:
                summ = 0.0
                for j in range(K):  # each class
                    y[j] = np.exp(np.dot(W[j], X[i]))
                    summ += y[j]
                for j in range(K):  # normalization
                    y[j] = y[j] / summ
                for j in range(K):
                    W[j] = (W[j]
                            - learning_rate * (y[j] - T[i, j]) * X[i]
                            - alpha * learning_rate * W[j])
    print("Finish training")

    print("begin testing")
    with open("test.json") as json_file:
        test_data = json.load(json_file)  # list of dicts
    N_test = len(test_data)

    # Open the CSV file for the predictions
    with open('test_prediction.csv', 'w', newline='') as csvfile:
        fieldnames = ['id', 'cuisine']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        X_test = np.zeros([N_test, M])
        for i in range(N_test):
            datai = test_data[i]
            record_id = datai['id']
            for itemj in datai['ingredients']:
                if itemj in x_list:  # ignore ingredients unseen in training
                    itemj_index = x_list.index(itemj)
                    X_test[i, itemj_index] = 1
            # Unnormalized scores are enough to pick the most likely class
            for j in range(K):
                y[j] = np.exp(np.dot(W[j], X_test[i]))
            max_index = y.argmax(axis=0)
            cuisine = t_list[max_index]
            writer.writerow({'id': record_id, 'cuisine': cuisine})

    # If there is no test data, use the held-out part of the training data instead:
    # right_count = 0.0
    # all_count = 0.0
    # y = np.zeros(K)  # prediction y
    # for i in range(N):
    #     if training_label[i] == 0:
    #         all_count += 1
    #         for j in range(K):
    #             y[j] = np.exp(np.dot(W[j], X[i]))
    #         max_index = y.argmax(axis=0)
    #         if T[i, max_index] == 1:
    #             right_count += 1
    # print("precision: ", right_count / all_count)
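The training loop in the listing can also be written in vectorized form. The following is a minimal sketch on invented toy data, not the script above. It applies the same update, initialization range, and learning-rate decay. Two things are assumptions of the sketch rather than part of the original: the helper names `softmax` and `sgd_epoch`, and the subtraction of the maximum score before `np.exp`, a common trick to avoid overflow that the listing does not use.

```python
import numpy as np

def softmax(scores):
    """Softmax over a 1-D score vector, shifted by the max for numerical stability."""
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()

def sgd_epoch(W, X, T, learning_rate, alpha):
    """One pass of per-sample updates. W has shape (K, M)."""
    for x, t in zip(X, T):
        y = softmax(W @ x)                      # class probabilities, shape (K,)
        grad = np.outer(y - t, x) + alpha * W   # cross-entropy gradient plus L2 term
        W -= learning_rate * grad               # in-place update of all K rows
    return W

# Toy data (invented): 4 samples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

rng = np.random.default_rng(0)
W = rng.random((2, 3)) - 0.5    # uniform in [-0.5, 0.5), as in the listing

learning_rate = 1.0
for epoch in range(5):
    learning_rate *= 0.1        # same decay schedule as the listing
    W = sgd_epoch(W, X, T, learning_rate, alpha=0.00001)

pred = np.array([softmax(W @ x).argmax() for x in X])
print("training accuracy:", (pred == T.argmax(axis=1)).mean())
```

Computing all K class scores with one matrix-vector product replaces the three inner loops over classes, which matters once K and M reach the size of the real data set.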