
CS224n (Stanford): Assignment 1

When I read the simplified version of the problems I could not make sense of them at all; only after reading the original assignment did I understand what was being asked. Below I restate the original problems and then work through the solutions one by one.

       

Problem statement:

(a) Prove the following property of the softmax function: adding the same constant to every entry of the input does not change the softmax output. In practice this constant is usually taken to be the maximum entry of the input.

(b) Given an N x D input matrix, compute the softmax of every row. Use a vectorized implementation, since it will be a building block for later code; a non-vectorized implementation will not receive full credit.

Solutions to the two parts:

(a)

The shift invariance follows directly from the properties of the exponential function.
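
Written out, for any constant c:

\mathrm{softmax}(x + c)_i
  = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}
  = \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}}
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \mathrm{softmax}(x)_i .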

(b) Implementation

A few points to pay attention to in this problem:

1. Vectorized operations matter a great deal for this kind of problem, so vectorize whenever possible. For beginners, the vectorized way of thinking can be hard to grasp at first, but it is worth getting familiar with and applying constantly. The indispensable library for vectorization is NumPy.

A NumPy tutorial: https://www.jianshu.com/p/358948fbbc6e

2. The trick in this problem is to apply the shift from part (a), using the maximum value as the offset; otherwise the test cases provided on the course site overflow.

import numpy as np

def softmax(x):
    """Compute the softmax function for each row of the input x.

    It is crucial that this function is optimized for speed because
    it will be used frequently in later code. You might find numpy
    functions np.exp, np.sum, np.reshape, np.max, and numpy
    broadcasting useful for this task.

    Numpy broadcasting documentation:
    http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

    You should also make sure that your code works for a single
    D-dimensional vector (treat the vector as a single row) and
    for N x D matrices. This may be useful for testing later. Also,
    make sure that the dimensions of the output match the input.

    You must implement the optimization in problem 1(a) of the
    written assignment!

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.

    Return:
    x -- You are allowed to modify x in-place
    """
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix
        x = x - np.max(x, axis=1, keepdims=True)
        x = np.exp(x)/np.sum(np.exp(x), axis=1, keepdims=True)
    else:
        # Vector
        x = x - np.max(x)
        x = np.exp(x)/np.sum(np.exp(x))

    assert x.shape == orig_shape
    return x

def test_softmax_basic():
    """
    Some simple tests to get you started.
    Warning: these are not exhaustive.
    """
    print "Running basic tests..."
    test1 = softmax(np.array([1,2]))
    print test1
    ans1 = np.array([0.26894142,  0.73105858])
    assert np.allclose(test1, ans1, rtol=1e-05, atol=1e-06)

    test2 = softmax(np.array([[1001,1002],[3,4]]))
    print test2
    ans2 = np.array([
        [0.26894142, 0.73105858],
        [0.26894142, 0.73105858]])
    assert np.allclose(test2, ans2, rtol=1e-05, atol=1e-06)

    test3 = softmax(np.array([[-1001,-1002]]))
    print test3
    ans3 = np.array([0.73105858, 0.26894142])
    assert np.allclose(test3, ans3, rtol=1e-05, atol=1e-06)

    print "You should be able to verify these results by hand!\n"


if __name__ == "__main__":
    test_softmax_basic()

Test results:

I commented out the last print statement; the screenshot of the output is below:

Checking the program by hand: the graph of the exponential function looks as follows:

So for very large inputs the exponentials blow up, and for very negative inputs they shrink toward zero, which is why exploiting the shift invariance is such a good remedy.
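
A small sketch of why the shift matters numerically (the values mirror the [1001, 1002] test case above):

import numpy as np

x = np.array([1001.0, 1002.0])

# Naive softmax: np.exp(1001) overflows float64 (the limit is around exp(709)),
# so both entries become inf and the division yields nan.
naive = np.exp(x) / np.sum(np.exp(x))                            # array([nan, nan])

# Shifted softmax: subtract the row maximum first, so every exponent is <= 0.
shifted = np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)))
print(shifted)                                                   # [0.26894142  0.73105858]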

In the provided skeleton, assert performs an assertion used to catch errors, i.e., to check how far the computed values deviate from the expected ones.

Problem statement:

(a) Derive the gradient of the sigmoid function and express it in terms of the function value itself. Assume the input x is a scalar.

(b) Derive the gradient of the softmax function with cross-entropy loss, where the class labels are one-hot vectors (exactly one entry is 1 and the rest are 0).

(c) Derive the gradients, with respect to the input x, of a neural network with a single hidden layer, using the cross-entropy loss, sigmoid activations in the hidden layer, a softmax over the output layer, and one-hot labels. (This is just the standard backpropagation derivation.)

(d) How many parameters does the network above have, assuming the input is Dx-dimensional, the output is Dy-dimensional, and there are H hidden units?

(e) Implement the sigmoid activation function and its gradient.

(f) Implement a gradient checker.

(g) Implement forward and backward propagation for a neural network with one sigmoid hidden layer.

Solutions to these problems:

(a) Derivative of the sigmoid function
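
Written out:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)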

 

(b) Gradient at the output layer
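
With logits \theta, prediction \hat{y} = \mathrm{softmax}(\theta), and one-hot label y, the standard result is:

J = \mathrm{CE}(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i, \qquad
\frac{\partial J}{\partial \theta} = \hat{y} - y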

(c) This is mainly an exercise in applying the chain rule. The main thing to watch is the matrix dimensions, i.e., whether or not a transpose is needed, which has to be decided from how the dimensions work out.
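
A minimal sketch of the chain-rule steps, in the row-vector convention used by the forward_backward_prop code further below (y is one-hot, \odot is element-wise multiplication):

z_1 = x W_1 + b_1, \quad h = \sigma(z_1), \quad z_2 = h W_2 + b_2, \quad \hat{y} = \mathrm{softmax}(z_2), \quad J = \mathrm{CE}(y, \hat{y})

\delta_2 = \hat{y} - y, \qquad
\delta_1 = (\delta_2 W_2^{\top}) \odot h \odot (1 - h)

\frac{\partial J}{\partial W_2} = h^{\top}\delta_2, \quad
\frac{\partial J}{\partial b_2} = \delta_2, \quad
\frac{\partial J}{\partial W_1} = x^{\top}\delta_1, \quad
\frac{\partial J}{\partial b_1} = \delta_1, \quad
\frac{\partial J}{\partial x} = \delta_1 W_1^{\top}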

 

(d) Count the parameters of the first (hidden) layer plus the second (output) layer.
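
Counting weights and biases (this matches the shape of params built in the sanity check further below):

N_{\text{params}} = (D_x H + H) + (H D_y + D_y) = (D_x + 1)\,H + (H + 1)\,D_y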

(e) Implement the sigmoid function and its derivative.

#!/usr/bin/env python

import numpy as np


def sigmoid(x):
    """
    Compute the sigmoid function for the input here.

    Arguments:
    x -- A scalar or numpy array.

    Return:
    s -- sigmoid(x)
    """
    s = 1 / (1 + np.exp(-x))
    return s


def sigmoid_grad(s):
    """
    Compute the gradient for the sigmoid function here. Note that
    for this implementation, the input s should be the sigmoid
    function value of your original input x.

    Arguments:
    s -- A scalar or numpy array.

    Return:
    ds -- Your computed gradient.
    """
    ds = s * (1 - s)

    return ds


def test_sigmoid_basic():
    """
    Some simple tests to get you started.
    Warning: these are not exhaustive.
    """
    print "Running basic tests..."
    x = np.array([[1, 2], [-1, -2]])
    f = sigmoid(x)
    g = sigmoid_grad(f)
    print f
    f_ans = np.array([
        [0.73105858, 0.88079708],
        [0.26894142, 0.11920292]])
    assert np.allclose(f, f_ans, rtol=1e-05, atol=1e-06)
    print g
    g_ans = np.array([
        [0.19661193, 0.10499359],
        [0.19661193, 0.10499359]])
    assert np.allclose(g, g_ans, rtol=1e-05, atol=1e-06)
    print "You should verify these results by hand!\n"


if __name__ == "__main__":
    test_sigmoid_basic()

Test results:

(f) Gradient checking, using the centered (two-sided) difference, which gives better accuracy than a one-sided difference.

#!/usr/bin/env python

import numpy as np
import random

# First implement a gradient checker by filling in the following functions
def gradcheck_naive(f, x):
    """ Gradient check for a function f.

    Arguments:
    f -- a function that takes a single argument and outputs the
         cost and its gradients
    x -- the point (numpy array) to check the gradient at
    """

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x)  # Evaluate function value at original point
    h = 1e-4         # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        # Try modifying x[ix] with h defined above to compute numerical
        # gradients (numgrad).

        # Use the centered difference of the gradient.
        # It has smaller asymptotic error than forward / backward difference
        # methods. If you are curious, check out here:
        # https://math.stackexchange.com/questions/2326181/when-to-use-forward-or-central-difference-approximations

        # Make sure you call random.setstate(rndstate)
        # before calling f(x) each time. This will make it possible
        # to test cost functions with built in randomness later.
        x[ix] += h
        random.setstate(rndstate)   # reset the RNG so both evaluations of f see the same randomness
        f1 = f(x)[0]
        x[ix] -= 2 * h
        random.setstate(rndstate)
        f2 = f(x)[0]
        x[ix] += h                  # restore the original value
        numgrad = (f1 - f2) / (2 * h)

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print "Gradient check failed."
            print "First gradient error found at index %s" % str(ix)
            print "Your gradient: %f \t Numerical gradient: %f" % (
                grad[ix], numgrad)
            return

        it.iternext() # Step to next dimension

    print "Gradient check passed!"

def sanity_check():
    """
    Some basic sanity checks.
    """
    quad = lambda x: (np.sum(x ** 2), x * 2)

    print "Running sanity checks..."
    gradcheck_naive(quad, np.array(123.456))      # scalar test
    gradcheck_naive(quad, np.random.randn(3,))    # 1-D test
    gradcheck_naive(quad, np.random.randn(4,5))   # 2-D test
    print ""

if __name__ == "__main__":
    sanity_check()

Code notes: in gradcheck_naive(f, x), f is a function that takes a single argument and returns a tuple of two items, the value of the cost function and its gradient; x is the point at which to check the gradient, and it can be a scalar or a matrix (vector). The random state is saved so that every call to f is made under the same random seed and produces consistent results. np.nditer is a multi-index iterator; for each index it visits, we perform the two-sided gradient check and then step to the next element.

[Only by actually implementing things do you discover the places where your understanding was fuzzy; those are exactly the places where mistakes happen, and it is better to make them early.]

I ran into a small bug here: when the input is a single scalar, something like f(x[ix] + h) happens to work, but with multiple elements it breaks, because it passes a single number to f instead of the whole perturbed array.

(g) The last item is to implement the two-layer neural network (one hidden layer and one output layer).

#!/usr/bin/env python

import numpy as np
import random

from q1_softmax import softmax
from q2_sigmoid import sigmoid, sigmoid_grad
from q2_gradcheck import gradcheck_naive

def forward_backward_prop(X, labels, params, dimensions):
    """
    Forward and backward propagation for a two-layer sigmoidal network

    Compute the forward propagation and for the cross entropy cost,
    the backward propagation for the gradients for all parameters.

    Notice the gradients computed here are different from the gradients in
    the assignment sheet: they are w.r.t. weights, not inputs.

    Arguments:
    X -- M x Dx matrix, where each row is a training example x.
    labels -- M x Dy matrix, where each row is a one-hot vector.
    params -- Model parameters, these are unpacked for you.
    dimensions -- A tuple of input dimension, number of hidden units
                  and output dimension
    """

    ### Unpack network parameters (do not modify)
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Note: compute cost based on `sum` not `mean`.
    z1 = X.dot(W1) + b1
    a1 = sigmoid(z1)
    z2 = a1.dot(W2) + b2
    a2 = softmax(z2)
    cost = -np.sum(labels * np.log(a2))

    gradz2 = (a2 - labels)
    gradW2 = a1.T.dot(gradz2)
    gradb2 = np.sum(gradz2, axis=0, keepdims=True)
    grada1 = gradz2.dot(W2.T)
    gradz1 = grada1*sigmoid_grad(a1)
    gradW1 = X.T.dot(gradz1)
    gradb1 = np.sum(gradz1, axis=0, keepdims=True)


    ### Stack gradients (do not modify)
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(), gradW2.flatten(), gradb2.flatten()))
    grad.resize((len(grad), 1))

    return cost, grad

def sanity_check():
    """
    Set up fake data and parameters for the neural network, and test using
    gradcheck.
    """
    print "Running sanity check..."

    N = 20
    dimensions = [10, 5, 10]
    data = np.random.randn(N, dimensions[0])   # each row will be a datum
    labels = np.zeros((N, dimensions[2]))
    for i in xrange(N):
        labels[i, random.randint(0, dimensions[2]-1)] = 1

    params = np.random.randn((dimensions[0] + 1) * dimensions[1] + (
        dimensions[1] + 1) * dimensions[2], 1)

    gradcheck_naive(lambda params: forward_backward_prop(data, labels, params, dimensions), params)



if __name__ == "__main__":
    sanity_check()

In implementing this I also did something silly: I was fuzzy about the sigmoid derivative, namely which quantity it should be expressed in terms of, so let me pin it down again.

In this code, calling sigmoid_grad(a1) computes exactly that derivative: since a1 = sigmoid(z1), sigmoid_grad(a1) = a1 * (1 - a1), which is the derivative of the sigmoid at z1.

Test results:

 

 

Problem statement:

(a) The center word has index c, and we predict whether the word with index o lies within the window of the center word; u_w denotes the "output" vector of each word in the vocabulary. In effect two sets of word vectors are used, which decouples things and simplifies learning. In short, the problem asks for a gradient.

(b) Again, a gradient to derive.

(c) Parts (a) and (b) use the traditional, plain-softmax formulation of word2vec, but negative sampling is known to be more efficient, and this part verifies that claim: the speed-up ratio is the running time of the CE (softmax) loss divided by the running time of the negative-sampling loss.

(d) word2vec comes in two flavors, CBOW and skip-gram. With a window of size m, derive the gradients for both. The problem builds up step by step, from the abstract case to the concrete one.

(e) Complete the word2vec models and train your own word vectors with stochastic gradient descent. In the provided file you need to write: a function that normalizes the rows of a matrix, the softmax cost and gradient function, the negative-sampling cost and gradient function, and the skip-gram cost and gradients (in other words, implement skip-gram).

(f) Implement the stochastic gradient descent optimizer.

(g) Download real-world data and train word vectors with the code you completed. The corpus is the Stanford Sentiment Treebank; the trained vectors will be used for the sentiment analysis task in the next part.

(h) As an extension, implement the CBOW model as well.

Solutions:

(a)(b) (These involve matrix calculus, and my derivation still has a few issues; I will fix this "bug" after I have studied matrix differentiation properly.)
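
For reference, the standard results, which the softmaxCostAndGradient code further below implements: with U the matrix of output vectors (one per row), \hat{y} = \mathrm{softmax}(U v_c), and y the one-hot vector of the target word o,

J = -\log \hat{y}_o, \qquad
\frac{\partial J}{\partial v_c} = U^{\top}(\hat{y} - y), \qquad
\frac{\partial J}{\partial U} = (\hat{y} - y)\, v_c^{\top}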

(c) With negative sampling, each iteration only has to touch K + 1 word vectors, whereas the traditional softmax formulation touches W + 1, so the speed-up is roughly (W + 1) / (K + 1).
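
For negative sampling with K sampled words u_1, \dots, u_K, the loss and gradients (these are what negSamplingCostAndGradient further below implements) are:

J = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^{\top} v_c)

\frac{\partial J}{\partial v_c} = \bigl(\sigma(u_o^{\top} v_c) - 1\bigr)\, u_o + \sum_{k=1}^{K} \bigl(1 - \sigma(-u_k^{\top} v_c)\bigr)\, u_k

\frac{\partial J}{\partial u_o} = \bigl(\sigma(u_o^{\top} v_c) - 1\bigr)\, v_c, \qquad
\frac{\partial J}{\partial u_k} = \bigl(1 - \sigma(-u_k^{\top} v_c)\bigr)\, v_c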

(d) For skip-gram, the derivation goes as follows:

Here we consider the concrete case: only v_c is updated, and the other input vectors v_j are left untouched, so only the parameters involved in the current window get updated.

For CBOW, the derivation goes as follows:

Put simply, skip-gram predicts the words in the context window given the center word, while CBOW predicts the center word given the context, where the context is represented by averaging (summing) the context word vectors. Once this is clear, it is easy to see which quantities the loss depends on and which it does not.
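
Writing this out in symbols (my own shorthand; F(w_o, v) stands for either the softmax-CE cost or the negative-sampling cost above):

J_{\text{skip-gram}}(w_{c-m}, \dots, w_{c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j},\, v_c)

\hat{v} = \sum_{-m \le j \le m,\ j \ne 0} v_{c+j}, \qquad
J_{\text{CBOW}}(w_{c-m}, \dots, w_{c+m}) = F(w_c,\, \hat{v})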

(e) This part had me stuck for a long time, especially the negative-sampling function, where I was stuck for ages. A difficult birth, but it came out in the end.

In this file, the first function, normalizeRows(x), normalizes the input row by row: every entry of a row is divided by that row's Euclidean norm. (This can be inferred from the provided test, test_normalize_rows.)

softmaxCostAndGradient is implemented from the formulas in (a) and (b). Pay close attention to matrix dimensions, especially arrays of shape (3L,), which easily lead to all sorts of baffling problems; it is best to reshape everything into consistent 2-D shapes, which makes the computation easier.

getNegativeSamples uses the dataset's random sampler to obtain the indices of K negatively sampled words.

negSamplingCostAndGradient is implemented from the formulas in (c); again, watch how the matrix dimensions change.

skipgram (and cbow) use the formulas from (d) to tie each center word to its context words and accumulate the final cost and gradients.

word2vec_sgd_wrapper is something like a framework that wraps whichever word2vec model you choose, i.e., it can run with skipgram or with cbow.

test_word2vec is the testing entry point.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import numpy as np
import random

from q1_softmax import softmax
from q2_gradcheck import gradcheck_naive
from q2_sigmoid import sigmoid, sigmoid_grad

def normalizeRows(x):
    """ Row normalization function
    # Normalize each row by dividing it by its Euclidean norm.
    Implement a function that normalizes each row of a matrix to have
    unit length.
    """

    x = x / (np.sqrt(np.sum(x*x, axis=1, keepdims=True)))
    return x


def test_normalize_rows():
    print "Testing normalizeRows..."
    x = normalizeRows(np.array([[3.0,4.0],[1, 2]]))
    print x
    ans = np.array([[0.6,0.8],[0.4472136,0.89442719]])
    assert np.allclose(x, ans, rtol=1e-05, atol=1e-06)
    print ""


def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, assuming the softmax prediction function and cross
    entropy loss.

    Arguments:
    predicted -- numpy ndarray, predicted word vector (\hat{v} in
                 the written component)
    target -- integer, the index of the target word
    outputVectors -- "output" vectors (as rows) for all tokens
    dataset -- needed for negative sampling, unused here.

    Return:
    cost -- cross entropy cost for the softmax word prediction
    gradPred -- the gradient with respect to the predicted word
           vector
    grad -- the gradient with respect to all the other word
           vectors

    We will not provide starter code for this function, but feel
    free to reference the code you previously wrote for this
    assignment!
    """
    # To avoid shape bugs, reshape the arrays into the shapes you need; softmax operates on rows.
    predicted = predicted.reshape([1, predicted.shape[0]])
    y_hot = softmax(predicted.dot(outputVectors.T)).reshape([outputVectors.shape[0], 1])
    y_real = np.zeros_like(y_hot)
    y_real[target] = 1
    cost = -np.log(y_hot[target])
    gradPred = (y_hot-y_real).T.dot(outputVectors)
    grad = (y_hot-y_real).dot(predicted)
    return cost, gradPred, grad


def getNegativeSamples(target, dataset, K):

    """ Samples K indexes which are not the target
    Randomly sample K indices for negative sampling.
    """

    indices = [None] * K
    for k in xrange(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == target:
            newidx = dataset.sampleTokenIdx()
        indices[k] = newidx
    return indices


def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                               K=10):
    """ Negative sampling cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, using the negative sampling technique. K is the sample
    size.

    Note: See test_word2vec below for dataset's initialization.

    Arguments/Return Specifications: same as softmaxCostAndGradient
    """

    # Sampling of indices is done for you. Do not modify this if you
    # wish to match the autograder and receive points!



    indices = [target]
    indices.extend(getNegativeSamples(target, dataset, K))

    predicted = predicted.reshape([predicted.shape[0], 1])
    gradPred = np.zeros(predicted.shape)
    cost = 0

    soft_vc = sigmoid(outputVectors[target, :].dot(predicted))  # [1, D]*[D, 1]=[1, 1]
    cost -= np.log(soft_vc)
    gradPred += (soft_vc-1.0) * outputVectors[target, :].reshape(predicted.shape)  # [D,1]
    grad_temp = np.zeros([outputVectors.shape[0], 1])    # [M, 1]
    grad_temp[target] = soft_vc-1.0
    for i in range(1, len(indices)):
        soft_vk = sigmoid(-outputVectors[indices[i], :].dot(predicted))
        cost -= np.log(soft_vk)
        gradPred -= (soft_vk-1.0) * outputVectors[indices[i], :].reshape(predicted.shape)
        grad_temp[indices[i]] -= (soft_vk-1.0)
    grad = grad_temp.dot(predicted.T)   # [M, 1]*[1, D]=[M, D]

    return cost, gradPred, grad

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentWord -- a string of the current center word
    C -- integer, context size
    contextWords -- list of no more than 2*C strings, the context words
    tokens -- a dictionary that maps words to their indices in
              the word vector list
    inputVectors -- "input" word vectors (as rows) for all tokens
    outputVectors -- "output" word vectors (as rows) for all tokens
    word2vecCostAndGradient -- the cost and gradient function for
                               a prediction vector given the target
                               word vectors, could be one of the two
                               cost functions you implemented above.

    Return:
    cost -- the cost function value for the skip-gram model
    grad -- the gradient with respect to the word vectors
    """

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    for word in contextWords:
        cost_1, gradPred1, grad1 = word2vecCostAndGradient(inputVectors[tokens[currentWord]], tokens[word],
                                                              outputVectors, dataset)
        cost += cost_1
        gradIn[tokens[currentWord], :] += np.squeeze([gradPred1])
        gradOut += grad1

    return cost, gradIn, gradOut


def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """CBOW model in word2vec

    Implement the continuous bag-of-words model in this function.

    Arguments/Return specifications: same as the skip-gram model

    Extra credit: Implementing CBOW is optional, but the gradient
    derivations are not. If you decide not to implement CBOW, remove
    the NotImplementedError.
    """

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    ### YOUR CODE HERE
    raise NotImplementedError
    ### END YOUR CODE

    return cost, gradIn, gradOut


#############################################
# Testing functions below. DO NOT MODIFY!   #
#############################################

def word2vec_sgd_wrapper(word2vecModel, tokens, wordVectors, dataset, C,
                         word2vecCostAndGradient=softmaxCostAndGradient):
    batchsize = 50
    cost = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    inputVectors = wordVectors[:N/2,:]
    outputVectors = wordVectors[N/2:,:]
    for i in xrange(batchsize):
        C1 = random.randint(1,C)
        centerword, context = dataset.getRandomContext(C1)
        if word2vecModel == skipgram:
            denom = 1
        else:
            denom = 1
        c, gin, gout = word2vecModel(
            centerword, C1, context, tokens, inputVectors, outputVectors,
            dataset, word2vecCostAndGradient)
        cost += c / batchsize / denom
        grad[:N/2, :] += gin / batchsize / denom
        grad[N/2:, :] += gout / batchsize / denom

    return cost, grad


def test_word2vec():
    """ Interface to the dataset for negative sampling """
    dataset = type('dummy', (), {})()
    def dummySampleTokenIdx():
        return random.randint(0, 4)

    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0,4)], \
            [tokens[random.randint(0,4)] for i in xrange(2*C)]
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10,3))
    dummy_tokens = dict([("a",0), ("b",1), ("c",2),("d",3),("e",4)])
    print "==== Gradient check for skip-gram ===="
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, softmaxCostAndGradient),
        dummy_vectors)
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient),
        dummy_vectors)
    # print "\n==== Gradient check for CBOW      ===="
    # gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
    #     cbow, dummy_tokens, vec, dataset, 5, softmaxCostAndGradient),
    #     dummy_vectors)
    # gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
    #     cbow, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient),
    #     dummy_vectors)

    print "\n=== Results ==="
    print skipgram("c", 3, ["a", "b", "e", "d", "b", "c"],
        dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
    print skipgram("c", 1, ["a", "b"],
        dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset,
        negSamplingCostAndGradient)
    # print cbow("a", 2, ["a", "b", "c", "a"],
    #     dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
    # print cbow("a", 2, ["a", "b", "a", "c"],
    #     dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset,
    #     negSamplingCostAndGradient)


if __name__ == "__main__":
    test_normalize_rows()
    test_word2vec()

(f) This implements the SGD function; the part to fill in is fairly simple.

When given a skeleton like this to complete, the first step is to figure out exactly which parts you are supposed to fill in; the second is to start reading from if __name__ == '__main__', since that is the program's entry point, and then follow the flow of the program from there.

sanity_check is just a small harness that calls sgd.

The meaning of each of sgd's arguments is documented below. In the loop we update the variable using the gradient; in particular, do not forget the postprocessing function: since this is an iterative process, it is not enough to preprocess (normalize) the variable only at the start, it has to be done during the iterations as well.

save_params saves the parameters every fixed number of iterations.

load_saved_params loads previously saved parameters.

#!/usr/bin/env python

# Save parameters every a few SGD iterations as fail-safe
SAVE_PARAMS_EVERY = 5000

import glob
import random
import numpy as np
import os.path as op
import cPickle as pickle


def load_saved_params():
    """
    A helper function that loads previously saved parameters and resets
    iteration start.
    """
    st = 0
    for f in glob.glob("saved_params_*.npy"):
        iter = int(op.splitext(op.basename(f))[0].split("_")[2])
        if (iter > st):
            st = iter

    if st > 0:
        with open("saved_params_%d.npy" % st, "r") as f:
            params = pickle.load(f)
            state = pickle.load(f)
        return st, params, state
    else:
        return st, None, None


def save_params(iter, params):
    with open("saved_params_%d.npy" % iter, "w") as f:
        pickle.dump(params, f)
        pickle.dump(random.getstate(), f)


def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """ Stochastic Gradient Descent

    Implement the stochastic gradient descent method in this function.

    Arguments:
    f -- the function to optimize, it should take a single
         argument and yield two outputs, a cost and the gradient
         with respect to the arguments
    x0 -- the initial point to start SGD from
    step -- the step size for SGD
    iterations -- total iterations to run SGD for
    postprocessing -- postprocessing function for the parameters
                      if necessary. In the case of word2vec we will need to
                      normalize the word vectors to have unit length.
    PRINT_EVERY -- specifies how many iterations to output loss

    Return:
    x -- the parameter value after SGD finishes
    """

    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)

        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    expcost = None

    for iter in xrange(start_iter + 1, iterations + 1):
        # Don't forget to apply the postprocessing after every iteration!
        # You might want to print the progress every few iterations.

        cost = None
        ### YOUR CODE HERE
        cost, grad = f(x)
        x -= step * grad
        # postprocessing (e.g. row normalization) returns a new array, so reassign it
        x = postprocessing(x)
        ### END YOUR CODE

        if iter % PRINT_EVERY == 0:
            if not expcost:
                expcost = cost
            else:
                expcost = .95 * expcost + .05 * cost
            print "iter %d: %f" % (iter, expcost)

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x


def sanity_check():
    quad = lambda x: (np.sum(x ** 2), x * 2)

    print "Running sanity checks..."
    t1 = sgd(quad, 0.5, 0.01, 1000, PRINT_EVERY=100)
    print "test 1 result:", t1
    assert abs(t1) <= 1e-6

    t2 = sgd(quad, 0.0, 0.01, 1000, PRINT_EVERY=100)
    print "test 2 result:", t2
    assert abs(t2) <= 1e-6

    t3 = sgd(quad, -1.5, 0.01, 1000, PRINT_EVERY=100)
    print "test 3 result:", t3
    assert abs(t3) <= 1e-6

    print ""




if __name__ == "__main__":
    sanity_check()

(g) Train on a real corpus; the code is below.

The dataset is the Stanford Sentiment Treebank. The word vectors are 10-dimensional and the context window size is 5. wordVectors holds two blocks of vectors, the u and v vectors we usually talk about. The outer layer is the sgd function, which is applied to word2vec_sgd_wrapper.

40,000 iterations; it really takes a long time.

Finally, a selection of words is taken from the learned vectors and projected down to two dimensions to see how the words end up arranged.

#!/usr/bin/env python

import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time

from q3_word2vec import *
from q3_sgd import *

# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)

# We are going to train 10-dimensional vectors for this assignment
dimVectors = 10

# Context size
C = 5

# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)

startTime=time.time()
wordVectors = np.concatenate(
    ((np.random.rand(nWords, dimVectors) - 0.5) /
       dimVectors, np.zeros((nWords, dimVectors))),
    axis=0)
wordVectors = sgd(
    lambda vec: word2vec_sgd_wrapper(skipgram, tokens, vec, dataset, C,
        negSamplingCostAndGradient),
    wordVectors, 0.3, 40000, None, True, PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug,
# normalizing during training loses the notion of length.

print "sanity check: cost at convergence should be around or below 10"
print "training took %d seconds" % (time.time() - startTime)

# concatenate the input and output word vectors
wordVectors = np.concatenate(
    (wordVectors[:nWords,:], wordVectors[nWords:,:]),
    axis=0)
# wordVectors = wordVectors[:nWords,:] + wordVectors[nWords:,:]

visualizeWords = [
    "the", "a", "an", ",", ".", "?", "!", "``", "''", "--",
    "good", "great", "cool", "brilliant", "wonderful", "well", "amazing",
    "worth", "sweet", "enjoyable", "boring", "bad", "waste", "dumb",
    "annoying"]

visualizeIdx = [tokens[word] for word in visualizeWords]
visualizeVecs = wordVectors[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U,S,V = np.linalg.svd(covariance)
coord = temp.dot(U[:,0:2])

for i in xrange(len(visualizeWords)):
    plt.text(coord[i,0], coord[i,1], visualizeWords[i],
        bbox=dict(facecolor='green', alpha=0.1))

plt.xlim((np.min(coord[:,0]), np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]), np.max(coord[:,1])))

plt.savefig('q3_word_vectors.png')

The resulting plot of the words is shown below:

 

Problem statement: use the trained word vectors to carry out sentiment analysis. For every sentence in the corpus, we use the average of its words' vectors as the sentence feature and predict the sentence's sentiment, on a 5-level scale, by training a softmax classifier.

Very negative (0), negative (1), neutral (2), positive (3), very positive (4)

(a) Implement the computation of the sentence feature vector, represented as the average of the word vectors in the sentence.

(b) Explain why regularization is needed when training the classifier (regularization, as opposed to normalization).

(c) Fill in the hyperparameter-search code to find the best hyperparameter. You should reach at least 36.5% accuracy on the dev and test sets.

(d) Run the sentiment classifier with your own trained word vectors, then with the pretrained GloVe vectors, and compare the accuracy on the training, dev, and test sets. Why do the pretrained GloVe vectors do better? Give at least three distinct reasons.

(e) Plot the training and dev accuracy obtained with the GloVe vectors, using a log scale for the regularization value, and briefly explain what can be read from the curves.

(f) Analyze where the model makes mistakes and briefly interpret the confusion matrix.

(g) Pick three misclassified examples and explain them, briefly describing what kinds of features tend to get misclassified; try as hard as you can to identify the causes of the errors.

Solutions:

(a) Use the average of the word vectors of every word in the sentence as the sentence feature. The code is given further below.

My own code is not particularly efficient, though, because it does not make good use of vectorization; something to keep learning.

(b) The point of regularization is to prevent overfitting. The common choices are L1 and L2 regularization. L2 regularization shrinks the parameter values, so that when the data shift a little the result is not affected much; in other words, it improves the model's ability to generalize to unseen examples.
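
A minimal sketch of the L2-regularized objective (\lambda is the regularization strength; in the classifier code further below, sklearn's LogisticRegression is used with C = 1/reg, which plays the same role):

J_{\text{reg}}(\theta) = J_{\text{CE}}(\theta) + \lambda \lVert \theta \rVert_2^2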

(c) In getRegularizationValues, assign values a range of floating-point values (generated with np.logspace in the code below).

In chooseBestModel, the results are ranked by the "dev" key of each dictionary (the dev-set accuracy) and the best one is chosen.

(d) Results:

With --yourvectors:

=== Recap ===
Reg             Train   Dev     Test
1.00E-02        30.946  32.334  29.910
1.10E-02        30.922  32.334  29.955
1.20E-02        30.946  32.334  29.910
1.32E-02        30.922  32.243  29.955
1.45E-02        30.840  32.153  30.000
1.59E-02        30.770  32.153  30.000
1.75E-02        30.735  32.243  29.910
1.92E-02        30.817  31.789  29.955
2.10E-02        30.735  31.698  29.955
2.31E-02        30.770  31.698  29.955
2.54E-02        30.618  31.608  30.000
2.78E-02        30.501  31.608  30.090
3.05E-02        30.524  31.698  29.910
3.35E-02        30.431  31.608  29.955
3.68E-02        30.360  31.698  29.819
4.04E-02        30.325  31.608  29.864
4.43E-02        30.302  31.880  30.045
4.86E-02        30.349  31.880  30.136
5.34E-02        30.384  31.971  29.955
5.86E-02        30.396  32.062  29.955
6.43E-02        30.349  32.153  30.000
7.05E-02        30.372  32.243  30.045
7.74E-02        30.325  32.425  30.045
8.50E-02        30.290  32.062  30.136
9.33E-02        30.302  31.880  29.955
1.02E-01        30.302  31.971  29.910
1.12E-01        30.314  31.789  29.729
1.23E-01        30.279  31.971  29.638
1.35E-01        30.185  31.789  29.774
1.48E-01        30.162  31.880  29.638
1.63E-01        30.044  31.789  29.502
1.79E-01        29.998  32.062  29.367
1.96E-01        29.963  31.971  29.412
2.15E-01        29.740  32.062  29.502
2.36E-01        29.635  31.698  29.321
2.60E-01        29.717  31.971  29.095
2.85E-01        29.494  32.062  29.005
3.13E-01        29.459  31.880  28.824
3.43E-01        29.506  31.789  28.778
3.76E-01        29.459  31.517  28.371
4.13E-01        29.424  31.335  28.326
4.53E-01        29.295  31.335  28.281
4.98E-01        29.260  31.244  28.326
5.46E-01        29.377  30.790  28.100
5.99E-01        29.412  31.244  28.054
6.58E-01        29.436  31.153  28.145
7.22E-01        29.389  30.881  28.145
7.92E-01        29.377  30.336  27.873
8.70E-01        29.190  29.973  27.783
9.55E-01        28.968  29.882  27.240
1.05E+00        28.816  29.609  27.059
1.15E+00        28.862  29.064  26.561
1.26E+00        28.663  28.520  26.335
1.38E+00        28.640  28.065  25.928
1.52E+00        28.546  27.520  25.928
1.67E+00        28.500  27.430  25.701
1.83E+00        28.265  27.339  25.339
2.01E+00        27.926  26.794  25.204
2.21E+00        27.938  26.431  24.887
2.42E+00        27.961  26.703  24.615
2.66E+00        27.891  26.703  24.525
2.92E+00        27.680  26.431  24.208
3.20E+00        27.680  26.158  23.846
3.51E+00        27.575  25.704  23.846
3.85E+00        27.551  25.704  23.665
4.23E+00        27.446  25.522  23.620
4.64E+00        27.353  25.522  23.394
5.09E+00        27.353  25.522  23.213
5.59E+00        27.317  25.704  23.122
6.14E+00        27.306  25.704  23.122
6.73E+00        27.271  25.704  23.122
7.39E+00        27.247  25.522  23.122
8.11E+00        27.247  25.522  23.122
8.90E+00        27.235  25.522  23.122
9.77E+00        27.247  25.522  23.077
1.07E+01        27.235  25.522  23.077
1.18E+01        27.235  25.522  23.077
1.29E+01        27.235  25.522  23.077
1.42E+01        27.235  25.522  23.077
1.56E+01        27.235  25.522  23.032
1.71E+01        27.235  25.522  23.032
1.87E+01        27.235  25.522  23.032
2.06E+01        27.235  25.522  23.032
2.26E+01        27.235  25.522  23.032
2.48E+01        27.235  25.522  23.032
2.72E+01        27.235  25.522  23.032
2.98E+01        27.247  25.522  23.032
3.27E+01        27.247  25.522  23.032
3.59E+01        27.247  25.522  23.032
3.94E+01        27.247  25.522  23.032
4.33E+01        27.247  25.522  23.032
4.75E+01        27.247  25.522  23.032
5.21E+01        27.247  25.522  23.032
5.72E+01        27.247  25.522  23.032
6.28E+01        27.247  25.522  23.032
6.89E+01        27.247  25.522  23.032
7.56E+01        27.247  25.522  23.032
8.30E+01        27.247  25.522  23.032
9.11E+01        27.247  25.522  23.032
1.00E+02        27.247  25.522  23.032

Best regularization value: 7.74E-02
Test accuracy (%): 30.045249

With --pretrained:

=== Recap ===
Reg             Train   Dev     Test
1.00E-02        39.923  36.331  37.195
1.10E-02        39.934  36.331  37.195
1.20E-02        39.911  36.240  37.195
1.32E-02        39.899  36.240  37.195
1.45E-02        39.899  36.421  37.285
1.59E-02        39.888  36.694  37.285
1.75E-02        39.876  36.603  37.240
1.92E-02        39.841  36.603  37.195
2.10E-02        39.853  36.694  37.285
2.31E-02        39.864  36.421  37.240
2.54E-02        39.864  36.421  37.421
2.78E-02        39.853  36.331  37.285
3.05E-02        39.853  36.421  37.376
3.35E-02        39.876  36.421  37.466
3.68E-02        39.841  36.331  37.511
4.04E-02        39.817  36.331  37.511
4.43E-02        39.829  36.331  37.466
4.86E-02        39.923  36.240  37.511
5.34E-02        39.888  36.240  37.466
5.86E-02        39.853  36.331  37.421
6.43E-02        39.876  36.331  37.330
7.05E-02        39.864  36.331  37.195
7.74E-02        39.864  36.421  37.195
8.50E-02        39.864  36.331  37.240
9.33E-02        39.853  36.331  37.195
1.02E-01        39.817  36.240  37.149
1.12E-01        39.735  36.240  37.195
1.23E-01        39.771  36.512  37.285
1.35E-01        39.735  36.512  37.466
1.48E-01        39.794  36.512  37.511
1.63E-01        39.806  36.512  37.466
1.79E-01        39.841  36.421  37.376
1.96E-01        39.747  36.421  37.330
2.15E-01        39.724  36.512  37.240
2.36E-01        39.665  36.512  37.195
2.60E-01        39.654  36.512  37.195
2.85E-01        39.560  36.421  37.330
3.13E-01        39.583  36.240  37.285
3.43E-01        39.642  36.240  37.285
3.76E-01        39.630  36.331  37.285
4.13E-01        39.630  36.331  37.285
4.53E-01        39.607  36.149  37.330
4.98E-01        39.618  36.149  37.330
5.46E-01        39.583  36.149  37.330
5.99E-01        39.583  36.149  37.285
6.58E-01        39.607  36.421  37.240
7.22E-01        39.548  36.512  37.285
7.92E-01        39.537  36.512  37.285
8.70E-01        39.525  36.603  37.376
9.55E-01        39.525  36.603  37.285
1.05E+00        39.478  36.512  37.330
1.15E+00        39.525  36.603  37.285
1.26E+00        39.537  36.512  37.330
1.38E+00        39.548  36.512  37.330
1.52E+00        39.490  36.512  37.285
1.67E+00        39.490  36.603  37.059
1.83E+00        39.466  36.694  37.195
2.01E+00        39.501  36.876  37.240
2.21E+00        39.431  36.694  37.240
2.42E+00        39.408  36.603  37.240
2.66E+00        39.302  36.876  37.195
2.92E+00        39.279  36.876  37.195
3.20E+00        39.197  36.966  37.104
3.51E+00        39.022  36.876  37.240
3.85E+00        39.115  36.694  37.285
4.23E+00        39.092  36.785  37.376
4.64E+00        39.010  36.876  37.285
5.09E+00        38.963  36.694  37.376
5.59E+00        39.010  36.512  37.466
6.14E+00        38.951  36.785  37.466
6.73E+00        38.928  36.876  37.783
7.39E+00        38.893  36.785  37.828
8.11E+00        38.846  36.694  37.828
8.90E+00        38.729  36.694  37.828
9.77E+00        38.647  36.876  37.738
1.07E+01        38.706  36.876  37.466
1.18E+01        38.659  37.239  37.511
1.29E+01        38.542  37.057  37.421
1.42E+01        38.448  36.876  37.330
1.56E+01        38.343  36.694  37.149
1.71E+01        38.319  36.966  37.240
1.87E+01        38.191  36.694  37.149
2.06E+01        38.191  36.603  37.059
2.26E+01        38.003  36.421  36.968
2.48E+01        37.863  36.331  36.833
2.72E+01        37.746  36.694  36.923
2.98E+01        37.617  36.966  36.697
3.27E+01        37.500  36.876  36.471
3.59E+01        37.512  36.785  36.290
3.94E+01        37.477  36.603  36.425
4.33E+01        37.383  36.512  36.244
4.75E+01        37.161  36.512  36.199
5.21E+01        37.125  36.421  36.290
5.72E+01        36.997  36.149  36.063
6.28E+01        36.821  35.786  36.154
6.89E+01        36.809  35.695  36.018
7.56E+01        36.610  35.876  35.882
8.30E+01        36.575  35.332  35.611
9.11E+01        36.482  34.968  35.656
1.00E+02        36.330  35.059  35.701

Best regularization value: 1.18E+01
Test accuracy (%): 37.511312

Reasons why GloVe performs better than the word vectors I trained myself:

1. The GloVe vectors have a higher dimensionality, 50, while our own trained vectors are only 10-dimensional.

2. GloVe was trained on a very large corpus and therefore gives a more comprehensive result, whereas our training corpus is not large enough to yield unbiased word vectors.

3. GloVe makes use of global information, namely word co-occurrence statistics, whereas word2vec only uses local context relations.

(e) Training and dev accuracy for the different regularization values (using the GloVe vectors):

The plot shows how the regularization coefficient affects training and dev accuracy. As the coefficient keeps increasing, training accuracy keeps dropping, while dev accuracy rises slightly over a small range, which shows that regularization does help avoid overfitting the training data. When the coefficient becomes too large, both accuracies are low, which means the model has become too simple and no longer fits the data well.

(f) The confusion matrix (using the GloVe vectors) is shown below:

Reading the confusion matrix: very negative examples are mostly predicted as negative, with positive the next most common prediction; negative examples are mostly predicted as negative, but a fair number are predicted as positive; neutral examples are mostly predicted as negative, then positive; positive examples are mostly predicted as positive, with a smaller portion predicted as negative; very positive examples are mostly predicted as positive, then very positive.

In the model we built, classification works best on positive examples, then negative, then very positive, then neutral, and worst on very negative examples.

Overall, the model does better on the positive side, and only moderately well on neutral and negative examples.

(g) Analysis of three misclassified examples.

The examples:

True	Predicted	Text
3	4	it 's a lovely film with lovely performances by buy and accorsi .
2	1	no one goes unindicted here , which is probably for the best .
3	1	and if you 're not nearly moved to tears by a couple of scenes , you 've got ice water in your veins .

First example:

A positive sentence predicted as very positive. The boundary between these two is blurry, so this is not a serious mistake like classifying a clearly positive sentence as negative.

Second example:

It probably contains a few too many negation and hedging words, which pushes the prediction toward the negative side.

Third example:

It relies on a rhetorical, almost ironic construction and contains several negative-leaning words; the model has not learned that kind of meaning, so it gets this one wrong.

Code walkthrough: the entry point is the main function, whose argument is the parsed command line returned by getArguments.

First, getArguments uses the argparse package, Python's built-in module for parsing command-line options and arguments: you declare the arguments your program needs, and argparse parses them out of sys.argv and automatically generates help and usage messages. The script is meant to be run from the command line.

Detailed tutorials: https://www.jianshu.com/p/fef2d215b91d

https://blog.csdn.net/u013177568/article/details/62432761/

I very nearly died at this point, but in the end I solved the problem. The issue was simply that I was unfamiliar with argparse and did not know how to run the script; see the links above for details.

One more issue: since I am on Windows, I have to cd into the project folder before running the command line; otherwise it reports a file-not-found error, because it looks for the files starting from the C: system drive, where of course they are not.

One more link, https://www.cnblogs.com/wangguoyuan-09/p/6866798.html, which explains how to run command-line programs from PyCharm.

Next, the main function: the dataset is the Stanford sentiment analysis dataset; based on the (mutually exclusive) command-line flags it chooses between pretrained and yourvectors; it then loads the training, dev, and test splits and extracts the features; it then trains with the different regularization values and prints the results. For the pretrained case it also produces the plots used for error analysis.

getSentenceFeatures is part (a): it uses the average of the word vectors of every word in the sentence as the feature of the whole sentence.

getRegularizationValues is part (c): it produces a series of regularization values and returns them sorted.

chooseBestModel, also part (c), selects a good model based on dev-set accuracy.

accuracy computes the accuracy in a vectorized way.

plotRegVsAccuracy plots how accuracy changes with the regularization value.

outputConfusionMatrix draws the confusion matrix.

There is quite a bit worth learning here:

1) How to plot a confusion matrix: it is not enough to just obtain the matrix; it can also be rendered as an image, which looks very nice.

2) There is a lot here worth borrowing for my own work.

Here is the matplotlib documentation link for later reference: https://matplotlib.org/api/pyplot_summary.html

outputPredictions writes a txt file that records, for each dev example, the true label, the predicted label, and the text.

The final implementation is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import argparse
import numpy as np
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import itertools

from utils.treebank import StanfordSentiment
import utils.glove as glove

from q3_sgd import load_saved_params, sgd

# We will use sklearn here because it will run faster than implementing
# ourselves. However, for other parts of this assignment you must implement
# the functions yourself!
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


def getArguments():
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--pretrained", dest="pretrained", action="store_true",
                       help="Use pretrained GloVe vectors.")
    group.add_argument("--yourvectors", dest="yourvectors", action="store_true",
                       help="Use your vectors from q3.")
    return parser.parse_args()


def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its
    word vectors
    """

    # Implement computation for the sentence features given a sentence.

    # Inputs:
    # tokens -- a dictionary that maps words to their indices in
    #           the word vector list
    # wordVectors -- word vectors (each row) for all tokens
    # sentence -- a list of words in the sentence of interest

    # Output:
    # - sentVector: feature vector for the sentence

    sentVector = np.zeros((wordVectors.shape[1],))

    for word in sentence:
        sentVector += wordVectors[tokens[word]]
    sentVector *= 1.0/len(sentence)

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector


def getRegularizationValues():
    """Try different regularizations

    Return a sorted list of values to try.
    """
    # Assign a list of floats in the block below
    values = np.logspace(-2, 2, num=100, base=10)
    return sorted(values)


def chooseBestModel(results):
    """Choose the best model based on dev set performance.

    Arguments:
    results -- A list of python dictionaries of the following format:
        {
            "reg": regularization,
            "clf": classifier,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy
        }

    Each dictionary represents the performance of one model.

    Returns:
    Your chosen result dictionary.
    """
    # Rank the results by the "dev" key (dev-set accuracy) and keep the best one.
    bestResult = max(results, key=lambda x: x['dev'])

    return bestResult


def accuracy(y, yhat):
    """ Precision for classifier """
    assert(y.shape == yhat.shape)
    return np.sum(y == yhat) * 100.0 / y.size


def plotRegVsAccuracy(regValues, results, filename):
    """ Make a plot of regularization vs accuracy """
    plt.plot(regValues, [x["train"] for x in results])
    plt.plot(regValues, [x["dev"] for x in results])
    plt.xscale('log')
    plt.xlabel("regularization")
    plt.ylabel("accuracy")
    plt.legend(['train', 'dev'], loc='upper left')
    plt.savefig(filename)


def outputConfusionMatrix(features, labels, clf, filename):
    """ Generate a confusion matrix """
    pred = clf.predict(features)
    cm = confusion_matrix(labels, pred, labels=range(5))
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Reds)
    plt.colorbar()
    classes = ["- -", "-", "neut", "+", "+ +"]
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.savefig(filename)


def outputPredictions(dataset, features, labels, clf, filename):
    """ Write the predictions to file """
    pred = clf.predict(features)
    with open(filename, "w") as f:
        print >> f, "True\tPredicted\tText"
        for i in xrange(len(dataset)):
            print >> f, "%d\t%d\t%s" % (
                labels[i], pred[i], " ".join(dataset[i][0]))


def main(args):
    """ Train a model to do sentiment analyis"""

    # Load the dataset
    dataset = StanfordSentiment()
    tokens = dataset.tokens()
    nWords = len(tokens)

    if args.yourvectors:
        _, wordVectors, _ = load_saved_params()
        wordVectors = np.concatenate(
            (wordVectors[:nWords,:], wordVectors[nWords:,:]),
            axis=1)
    elif args.pretrained:
        wordVectors = glove.loadWordVectors(tokens)
    dimVectors = wordVectors.shape[1]

    # Load the train set
    trainset = dataset.getTrainSentences()
    nTrain = len(trainset)
    trainFeatures = np.zeros((nTrain, dimVectors))
    trainLabels = np.zeros((nTrain,), dtype=np.int32)
    for i in xrange(nTrain):
        words, trainLabels[i] = trainset[i]
        trainFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare dev set features
    devset = dataset.getDevSentences()
    nDev = len(devset)
    devFeatures = np.zeros((nDev, dimVectors))
    devLabels = np.zeros((nDev,), dtype=np.int32)
    for i in xrange(nDev):
        words, devLabels[i] = devset[i]
        devFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare test set features
    testset = dataset.getTestSentences()
    nTest = len(testset)
    testFeatures = np.zeros((nTest, dimVectors))
    testLabels = np.zeros((nTest,), dtype=np.int32)
    for i in xrange(nTest):
        words, testLabels[i] = testset[i]
        testFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # We will save our results from each run
    results = []
    regValues = getRegularizationValues()
    for reg in regValues:
        print "Training for reg=%f" % reg
        # Note: add a very small number to regularization to please the library
        clf = LogisticRegression(C=1.0/(reg + 1e-12))
        clf.fit(trainFeatures, trainLabels)

        # Test on train set
        pred = clf.predict(trainFeatures)
        trainAccuracy = accuracy(trainLabels, pred)
        print "Train accuracy (%%): %f" % trainAccuracy

        # Test on dev set
        pred = clf.predict(devFeatures)
        devAccuracy = accuracy(devLabels, pred)
        print "Dev accuracy (%%): %f" % devAccuracy

        # Test on test set
        # Note: always running on test is poor style. Typically, you should
        # do this only after validation.
        pred = clf.predict(testFeatures)
        testAccuracy = accuracy(testLabels, pred)
        print "Test accuracy (%%): %f" % testAccuracy

        results.append({
            "reg": reg,
            "clf": clf,
            "train": trainAccuracy,
            "dev":