Kaggle Competition Practice 1: What's Cooking?
阿新 • Published: 2018-12-17
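Introduction to Kaggle
Kaggle is a data analysis competition platform (https://www.kaggle.com/). Companies and researchers can publish data, a problem description, and a target metric on Kaggle and solicit solutions from data scientists in the form of a competition, similar to KDD Cup (the international knowledge discovery and data mining competition). Participants download the data, analyze it, build models using machine learning and data mining techniques, and submit their results. Top-ranked entries may win substantial prizes.
We are not experts, so what is Kaggle good for? In my view, practice. What you learn on paper always feels shallow: turning the machine learning algorithms from books and papers into code and applying them to real problems improves both your understanding of the algorithms and your coding skills (starting with the simplest ones). I also want to keep improving outside of work, so I plan to work through some Kaggle problems and write them up to share and discuss. The code in this blog is experimental code with mediocre accuracy; it will not rank highly and is meant only for practice and sharing.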
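What dish is it? (What's Cooking?)
The training data contains "ID, cuisine, ingredients": the cuisine is the class to predict, and the ingredients can be treated as features. The test data contains only "ID, ingredients", and the task is to predict which cuisine each recipe belongs to. This is real data provided by Yummly.
The data is provided in JSON format.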
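To make the format concrete, here is a minimal sketch of how such records could be parsed and turned into binary feature vectors. The two records are made up for illustration; only the field names `id`, `cuisine`, and `ingredients` match the competition files. The helper `build_vocab` is a hypothetical name, not part of any library.

```python
import json

import numpy as np

# Two made-up records in the same shape as train.json (illustrative only).
raw = '''
[
  {"id": 1, "cuisine": "greek",  "ingredients": ["feta cheese", "olive oil"]},
  {"id": 2, "cuisine": "indian", "ingredients": ["garam masala", "olive oil"]}
]
'''
records = json.loads(raw)


def build_vocab(values):
    """Map each distinct value to a column index (sorted for reproducibility)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


cuisine_index = build_vocab(r["cuisine"] for r in records)
ingredient_index = build_vocab(ing for r in records for ing in r["ingredients"])

# X[i, j] = 1 if recipe i uses ingredient j; T[i, k] = 1 if recipe i is cuisine k.
X = np.zeros((len(records), len(ingredient_index)))
T = np.zeros((len(records), len(cuisine_index)))
for i, r in enumerate(records):
    T[i, cuisine_index[r["cuisine"]]] = 1
    for ing in r["ingredients"]:
        X[i, ingredient_index[ing]] = 1
```

Each row of `X` marks which ingredients a recipe uses, and each row of `T` is a one-hot vector for its cuisine.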
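Logistic Regression (LR) Solution
This post uses multiclass logistic regression, optimized with stochastic gradient descent, and reaches 76% prediction accuracy. The code is given below.
- The data can be downloaded from https://www.kaggle.com/c/whats-cooking/data
- The script generates a test_prediction.csv file; upload it to get the prediction accuracy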
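Before the full listing, the per-example update it applies can be isolated in a few lines. This is only a sketch of one stochastic gradient step for L2-regularized multiclass logistic regression. The function names `softmax` and `sgd_step` and the toy inputs are assumptions made for illustration, and subtracting the maximum score before exponentiating is an added numerical safeguard that the script itself does not use.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def sgd_step(W, x, t, learning_rate, alpha):
    """One stochastic gradient step for L2-regularized softmax regression.

    W: (K, M) weights, x: (M,) features, t: (K,) one-hot label.
    Returns the updated weights; W itself is not modified.
    """
    y = softmax(W @ x)           # predicted class probabilities
    grad = np.outer(y - t, x)    # gradient of the cross-entropy loss w.r.t. W
    return W - learning_rate * grad - alpha * learning_rate * W


# Toy example: 3 classes, 4 binary features (values are illustrative).
rng = np.random.default_rng(0)
W = rng.random((3, 4)) - 0.5
x = np.array([1.0, 0.0, 1.0, 0.0])
t = np.array([0.0, 1.0, 0.0])

p_before = softmax(W @ x)[1]
W_new = sgd_step(W, x, t, learning_rate=0.1, alpha=0.00001)
p_after = softmax(W_new @ x)[1]
```

After the step, the predicted probability of the true class (`p_after`) is higher than before (`p_before`), which is the effect the training loop relies on.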
# -*- coding: utf-8 -*-
'''
Created on 2015-9-25
@author: joeyqzhou
'''
import csv
import json

import numpy as np

t_set = set()  # t: target classes (cuisines)
x_set = set()  # x: features (ingredients)
learning_rate = 1
alpha = 0.00001  # regularization penalty coefficient
iter_time = 5  # number of passes over the training data

if __name__ == "__main__":
    with open("train.json") as json_file:
        train_data = json.load(json_file)  # list of recipe dicts
    N = len(train_data)  # number of training examples
    for datai in train_data:
        t_set.add(datai['cuisine'])
        for ingredient in datai['ingredients']:
            x_set.add(ingredient)
    K = len(t_set)
    M = len(x_set)
    t_list = list(t_set)
    x_list = list(x_set)
    # Dict lookups replace the repeated O(M) list.index() searches
    t_index = {name: k for k, name in enumerate(t_list)}
    x_index = {name: j for j, name in enumerate(x_list)}

    # Each label is a K-dimensional one-hot vector;
    # the N label vectors are stacked in T
    T = np.zeros([N, K])
    X = np.zeros([N, M])
    for i in range(N):
        datai = train_data[i]
        T[i, t_index[datai['cuisine']]] = 1
        for itemj in datai['ingredients']:
            X[i, x_index[itemj]] = 1

    # optimization: find W
    # initialize every weight uniformly in [-0.5, 0.5)
    W = np.random.rand(K, M) - 0.5

    # randomly mark rows as training data
    # with threshold 1.01 every row is used for training;
    # use 0.9 for a training:testing split of 9:1
    training_label = np.zeros(N)
    for i in range(N):
        if np.random.rand() < 1.01:
            training_label[i] = 1

    # training: update W after each example
    y = np.zeros(K)  # prediction y
    for epoch in range(iter_time):
        learning_rate = learning_rate * 0.1  # decay the step size each pass
        print("training: ", epoch, " time")
        for i in range(N):  # each example
            if training_label[i] == 1:
                summ = 0.0
                for j in range(K):  # each class
                    y[j] = np.exp(np.dot(W[j], X[i]))
                    summ += y[j]
                for j in range(K):  # normalization
                    y[j] = y[j] / summ
                for j in range(K):
                    W[j] = W[j] - learning_rate * (y[j] - T[i, j]) * X[i] - alpha * learning_rate * W[j]
    print("Finish training")

    print("begin testing")
    with open("test.json") as json_file:
        test_data = json.load(json_file)  # list of recipe dicts
    N_test = len(test_data)
    # open the csv to write (text mode with newline='' for the csv module)
    with open('test_prediction.csv', 'w', newline='') as csvfile:
        fieldnames = ['id', 'cuisine']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        X_test = np.zeros([N_test, M])
        for i in range(N_test):
            datai = test_data[i]
            recipe_id = datai['id']
            for itemj in datai['ingredients']:
                if itemj in x_index:  # skip ingredients unseen in training
                    X_test[i, x_index[itemj]] = 1
            # calculate the prediction y; normalization is not needed for argmax
            for j in range(K):
                y[j] = np.exp(np.dot(W[j], X_test[i]))
            max_index = y.argmax(axis=0)
            cuisine = t_list[max_index]
            writer.writerow({'id': recipe_id, 'cuisine': cuisine})

    ### If we do not have testing data, use the held-out part of the training data
    # right_count = 0.0
    # all_count = 0.0
    # y = np.zeros(K)  # prediction y
    # for i in range(N):
    #     if training_label[i] == 0:
    #         all_count += 1
    #         for j in range(K):
    #             y[j] = np.exp(np.dot(W[j], X[i]))
    #         max_index = y.argmax(axis=0)
    #         if T[i, max_index] == 1:
    #             right_count += 1
    # print("accuracy: ", right_count / all_count)
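The commented-out check at the end of the listing can also be written as a small vectorized function. The sketch below is one possible version under the same conventions (one-hot labels, a per-row flag marking training data). The name `holdout_accuracy` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def holdout_accuracy(W, X, T, is_train):
    """Fraction of held-out rows whose arg-max class matches the one-hot label.

    W: (K, M) weights, X: (N, M) features, T: (N, K) one-hot labels,
    is_train: (N,) boolean mask; rows where it is False are evaluated.
    """
    held_out = ~is_train
    if not held_out.any():
        raise ValueError("no held-out rows to evaluate")
    # exp() is monotonic, so the arg-max of the raw scores is enough.
    scores = X[held_out] @ W.T
    predicted = scores.argmax(axis=1)
    actual = T[held_out].argmax(axis=1)
    return float((predicted == actual).mean())


# Toy check with hand-picked weights (illustrative values only).
W = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
is_train = np.array([True, True, False, False])
acc = holdout_accuracy(W, X, T, is_train)
```

With the script's arrays, a 9:1 split corresponds to a mask such as `np.random.rand(N) < 0.9`. Training would then use only the rows where the mask is true, and this function would score the rest.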