Kaggle Competition Practice 1: What's Cooking?

阿新 • Published: 2018-12-17

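Introduction to Kaggle

Kaggle is a competition platform for data analysis (https://www.kaggle.com/). Companies and researchers can publish data, a problem description, and a target metric on Kaggle and solicit solutions from data scientists in the form of a competition, similar to KDD-CUP (the international knowledge discovery and data mining competition). Participants download the data, analyze it, build models using machine learning and data mining techniques, and submit their results. Top-ranked entries may win substantial prizes.

We are not experts, so what is Kaggle good for? In my view, practice. Knowledge from books alone stays shallow; turning the machine learning algorithms from books and papers into code and applying them to real problems improves both one's understanding of the algorithms and one's coding skills. Starting from the simplest problems, and to keep improving outside of work, I plan to work through some Kaggle problems and write them up for sharing and discussion. The code in this blog is experimental code with mediocre accuracy; it will not rank near the top and is shared for practice only.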

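What cuisine is it? (What's Cooking)

The training data contains "ID, cuisine, ingredients" for each recipe. The cuisine is the class to predict, and the ingredients can be treated as features. The test data contains only "ID, ingredients", and the task is to predict which cuisine each recipe belongs to. The data is real data provided by Yummly.
The data is provided in JSON format.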
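To make the data layout concrete, here is a minimal sketch of how such records can be turned into binary feature vectors. The field names `id`, `cuisine`, and `ingredients` are the ones used by the script in this post; the two records themselves are made up for illustration and are not taken from the real data set.

```python
import json

# Two made-up records in the shape of train.json (illustrative values only).
raw = '''[
  {"id": 1, "cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
  {"id": 2, "cuisine": "mexican", "ingredients": ["tomato", "corn tortillas"]}
]'''
records = json.loads(raw)

# Vocabulary of all ingredients, sorted so that column order is reproducible.
vocab = sorted({ing for r in records for ing in r["ingredients"]})
index = {ing: j for j, ing in enumerate(vocab)}

# One row per recipe, one column per ingredient: 1 if the recipe uses it.
X = [[0] * len(vocab) for _ in records]
for i, r in enumerate(records):
    for ing in r["ingredients"]:
        X[i][index[ing]] = 1

labels = [r["cuisine"] for r in records]
```

Each recipe becomes a row of 0/1 indicators over the ingredient vocabulary, which is the representation the logistic regression script works on.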

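Logistic regression (LR) solution

This post uses multiclass logistic regression, optimized with stochastic gradient descent, and reaches a prediction accuracy of 76%. The full script is listed below.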
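Before the full script, the core update can be sketched in isolation. For one example with feature vector x and one-hot label t, the model computes y_j = exp(w_j · x) / sum_k exp(w_k · x) and then moves each row of W by W_j <- W_j - lr * ((y_j - t_j) * x + alpha * W_j). The helper names `softmax` and `sgd_step` and the toy numbers below are assumptions made for this illustration; subtracting the maximum score before `exp` is a standard stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sgd_step(W, x, t, lr, alpha):
    """One stochastic gradient step. W: (K, M), x: (M,), t: one-hot (K,)."""
    y = softmax(W.dot(x))
    # Gradient of the cross-entropy loss plus the L2 penalty term.
    return W - lr * (np.outer(y - t, x) + alpha * W)

# Toy problem: 3 classes, 4 binary features, true class is index 1.
W = np.zeros((3, 4))
x = np.array([1.0, 0.0, 1.0, 0.0])
t = np.array([0.0, 1.0, 0.0])
for _ in range(50):
    W = sgd_step(W, x, t, lr=0.1, alpha=1e-5)
probs = softmax(W.dot(x))
```

After repeated steps on the same example, the probability of the true class grows, while the weights of features that are absent from x are left untouched.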

The data can be downloaded from https://www.kaggle.com/c/whats-cooking/data
The script generates a file named test_prediction.csv; upload it to Kaggle to get the prediction accuracy.

# -*- coding: utf-8 -*-

'''
Created on 2015-9-25

@author: joeyqzhou
'''
import json
import numpy as np
import csv


t_set = set()  # t: target classes (cuisines)
x_set = set()  # x: features (ingredients)

# multiplied by 0.1 at the start of every pass, so the first pass uses 0.1
learning_rate = 1

alpha = 0.00001  # L2 regularization penalty coefficient
iter_time = 5  # number of passes over the training data

if __name__ == "__main__":
    with open("train.json", encoding="utf-8") as json_file:
        train_data = json.load(json_file)  # list of dicts

    N = len(train_data)  # number of training examples
    for datai in train_data:
        t_set.add(datai['cuisine'])
        for ingredient in datai['ingredients']:
            x_set.add(ingredient)

    K = len(t_set)  # number of classes
    M = len(x_set)  # number of features
    t_list = sorted(t_set)  # sorted so that indices are reproducible across runs
    x_list = sorted(x_set)
    # dict lookups avoid the O(M) cost of list.index() for every ingredient
    t_index = {name: k for k, name in enumerate(t_list)}
    x_index = {name: j for j, name in enumerate(x_list)}

    # T holds the one-hot (K-dimensional) label of each of the N examples.
    # X holds the binary ingredient indicators; it is a dense N x M array,
    # which needs a lot of memory on the full data set.
    T = np.zeros([N, K])
    X = np.zeros([N, M])
    for i in range(N):
        datai = train_data[i]
        T[i, t_index[datai['cuisine']]] = 1
        for itemj in datai['ingredients']:
            X[i, x_index[itemj]] = 1

    # optimization: learn the weight matrix W (K x M),
    # initialized uniformly at random in [-0.5, 0.5)
    W = np.random.rand(K, M) - 0.5

    # randomly mark examples as training data (1) or held-out data (0).
    # With threshold 1.01 every example is used for training;
    # use 0.9 for a training:testing split of roughly 9:1.
    training_label = (np.random.rand(N) < 1.01).astype(int)

    # training: update W after every single example (stochastic gradient descent)
    for it in range(iter_time):
        learning_rate = learning_rate * 0.1
        print("training: pass", it)
        for i in range(N):
            if training_label[i] == 1:
                scores = np.dot(W, X[i])
                # softmax; subtracting the max avoids overflow in exp
                y = np.exp(scores - scores.max())
                y = y / y.sum()
                # gradient step on the cross-entropy loss plus the L2 penalty
                W -= learning_rate * (np.outer(y - T[i], X[i]) + alpha * W)

    print("Finish training")

    print("begin testing")
    with open("test.json", encoding="utf-8") as json_file:
        test_data = json.load(json_file)  # list of dicts

    # open the csv to write (the csv module expects newline='' in Python 3)
    with open('test_prediction.csv', 'w', newline='') as csvfile:
        fieldnames = ['id', 'cuisine']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for datai in test_data:
            x_test = np.zeros(M)
            for itemj in datai['ingredients']:
                if itemj in x_index:  # ingredients unseen in training are ignored
                    x_test[x_index[itemj]] = 1

            # exp and normalization do not change the argmax, so raw scores suffice
            max_index = np.dot(W, x_test).argmax()
            cuisine = t_list[max_index]
            writer.writerow({'id': datai['id'], 'cuisine': cuisine})


### If there is no separate test set, evaluate on the held-out part of the training data
#    right_count = 0.0
#    all_count = 0.0
#    for i in range(N):
#        if training_label[i] == 0:
#            all_count += 1
#            max_index = np.dot(W, X[i]).argmax()
#            if T[i, max_index] == 1:
#                right_count += 1
#    print("accuracy:", right_count / all_count)
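The commented-out evaluation at the end of the script can also be written as two small helpers. This is only a sketch under stated assumptions: the names `split_mask` and `accuracy`, the fixed seed, and the tiny identity-matrix data are hypothetical and not part of the original script. It assumes the same layout as above, with W of shape (K, M), X of shape (N, M), and one-hot labels T of shape (N, K).

```python
import numpy as np

def split_mask(n, train_fraction=0.9, seed=0):
    """Boolean mask: True marks training rows, False marks held-out rows."""
    rng = np.random.RandomState(seed)  # fixed seed so the split is repeatable
    return rng.rand(n) < train_fraction

def accuracy(W, X, T, mask):
    """Fraction of rows selected by mask whose highest score matches the label."""
    pred = X[mask].dot(W.T).argmax(axis=1)
    truth = T[mask].argmax(axis=1)
    return float((pred == truth).mean())

# Tiny check: with identity features and identity weights every row is correct.
X_demo = np.eye(3)
T_demo = np.eye(3)
all_rows = np.ones(3, dtype=bool)
perfect = accuracy(np.eye(3), X_demo, T_demo, all_rows)
# Reversing the weight rows leaves only the middle class correct.
flipped = accuracy(np.eye(3)[::-1], X_demo, T_demo, all_rows)

train_mask = split_mask(1000, train_fraction=0.9)
```

In the script, one would train only on rows where the mask is True and call `accuracy(W, X, T, ~train_mask)` on the rest to estimate accuracy without uploading a submission.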
