pytorch + visdom: using an RNN to classify which country a name comes from
阿新 · Published: 2019-01-01
Environment
System: Win10
CPU: i7-6700HQ
GPU: GTX 965M
Python: 3.6
PyTorch: 0.3
Dataset
After downloading the dataset, unzip it into the project root directory.
On the data side there are three problems to solve: character encoding, consolidating the files, and choosing a representation for each sample (this post uses one-hot encoding, one vector per character).
Each name is built into a tensor of shape [number of characters, 1 (batch size; we don't batch, so it is 1), n_letters], with a one-hot vector marking each character's position.
import string

# string.ascii_letters gives all upper- and lowercase letters (string.digits would give the digits)
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
print(n_letters)
# >> 57
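The line_to_tensor helper used later in this post is never shown in the original; here is a minimal sketch that produces the [len(name), 1, n_letters] one-hot tensor described above (the letter_to_index helper name is my own):

import torch

# index of a character inside all_letters
def letter_to_index(letter):
    return all_letters.find(letter)

# turn a name into a [len(line), 1, n_letters] one-hot tensor
def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for i, letter in enumerate(line):
        tensor[i][0][letter_to_index(letter)] = 1
    return tensor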
For the transcoding part: NFD is a Unicode normalization form that decomposes each accented character into a base letter plus combining marks, and 'Mn' is the Unicode category for those nonspacing marks (the accents). Dropping the marks converts every name into plain ASCII:
import unicodedata

# convert unicode to standard ASCII
def unicode_to_ascii(s):
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if unicodedata.category(c) != 'Mn' and c in all_letters)
    return s
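For example, accented letters collapse to their base letters:

print(unicode_to_ascii('Ślusàrski'))
# >> Slusarski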
Read the name data and store it in a dictionary:
# read a file and return the list of names in it
def readline(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]
import glob
import os

category_lines = {}
all_categories = []
# collect all .txt filenames in the folder
filenames = glob.glob('data/names/*.txt')
# store all data as {'country': [names]}, and the labels as [country]
for filename in filenames:
    # strip the directory and the .txt extension to get the country name
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    category_lines[category] = readline(filename)
n_categories = len(all_categories)
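A quick sanity check; with the official tutorial data there should be 18 country files (that count is an assumption about which dataset you downloaded):

print(n_categories)
# >> 18 with the official tutorial data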
Neural network
The original post shows a diagram here (not reproduced): it is a very simple RNN, and its special feature is that the hidden state is fed back in at every step, which is what lets the network capture how the characters within a name relate to each other.
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # both layers see the current input concatenated with the previous hidden state
        self.in2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.in2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.in2h(combined)    # next hidden state
        output = self.in2o(combined)    # class scores
        output = self.softmax(output)   # log-probabilities (pairs with NLLLoss)
        return output, hidden
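The original doesn't show the network being instantiated; a minimal sketch, assuming a hidden size of 128 (the size is an assumption, not stated in the post):

n_hidden = 128  # assumed; any reasonable hidden size works
net = RNN(n_letters, n_hidden, n_categories)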
Randomly pick training data:
import random
import numpy as np
from torch.autograd import Variable

def random_choice(categories):
    return categories[random.randint(0, len(categories) - 1)]

# randomly pick a country, then a name from it, for one training step
def random_train_example():
    category = random_choice(all_categories)
    line = random_choice(category_lines[category])
    category_tensor = Variable(torch.from_numpy(np.array([all_categories.index(category)])).long())
    line_tensor = Variable(line_to_tensor(line))
    return category, line, category_tensor, line_tensor

category, line, category_tensor, line_tensor = random_train_example()
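Printing a few samples is a quick way to verify the pipeline (output varies run to run):

for _ in range(3):
    category, line, _, _ = random_train_example()
    print('category =', category, '| line =', line)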
Loss and optimizer: the original tutorial uses a learning rate of 0.005; in my own tests 0.001 gave somewhat higher accuracy (though still not great).
import torch.optim as optim

loss_f = nn.NLLLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001)
Train for two runs of 100,000 iterations each:
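The original post does not include the training loop at this point; here is a minimal sketch consistent with the log below (the zero initial hidden state, the log interval of 5,000 iterations, and the category_from_output helper are assumptions inferred from the output format):

import time

# pick the category with the highest log-probability (assumed helper, used in the log below)
def category_from_output(output):
    top_v, top_i = output.data.topk(1)
    category_i = top_i[0][0]
    return all_categories[category_i], category_i

def train(category_tensor, line_tensor):
    # fresh zero hidden state for each name
    hidden = Variable(torch.zeros(1, n_hidden))
    optimizer.zero_grad()
    # feed the name one character at a time
    for i in range(line_tensor.size()[0]):
        output, hidden = net(line_tensor[i], hidden)
    loss = loss_f(output, category_tensor)
    loss.backward()
    optimizer.step()
    return output, loss.data[0]

n_iters = 100000
start = time.time()
current_loss, correct = 0, 0
for it in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = random_train_example()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss
    guess, _ = category_from_output(output)
    correct += (guess == category)
    if it % 5000 == 0:
        print('%d | %.1f%% | loss: %.4f| acc: %.2f%%| time: %.2f' % (
            it, it / n_iters * 100, current_loss / 5000,
            correct / it * 100, time.time() - start))
        current_loss = 0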
The results are as follows:
90000 | 90.0% | loss: 1.1007| acc: 64.26%| time: 79.80
95000 | 95.0% | loss: 1.0988| acc: 64.70%| time: 84.15
100000 | 100.0% | loss: 1.1016| acc: 65.18%| time: 88.52
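The evaluate helper used in the next step is never defined in the original; a minimal sketch that mirrors one training pass without the backward step:

# run one name through the network and return the final output
def evaluate(line_tensor):
    hidden = Variable(torch.zeros(1, n_hidden))
    for i in range(line_tensor.size()[0]):
        output, hidden = net(line_tensor[i], hidden)
    return output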
The accuracy is not high. To see where it fails, run the whole dataset through the network, collect a confusion matrix, and plot it:
confusion = torch.zeros(n_categories, n_categories)
for category in all_categories:
    for line in category_lines[category]:
        line_tensor = Variable(line_to_tensor(line))
        output = evaluate(line_tensor)
        guess, guess_i = category_from_output(output)
        category_i = all_categories.index(category)
        confusion[category_i][guess_i] += 1
# the number of names differs per country, so normalize each row to proportions
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()
import visdom

# visualize as a heatmap
viz = visdom.Visdom()
viz.heatmap(X=confusion, opts=dict(
    columnnames=all_categories,
    rownames=all_categories,
    colormap="Jet",
    xlabel='guess',
    ylabel='category',
    marginleft=100,
    marginbottom=100,
))
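Note that the visdom server has to be running first (start it with python -m visdom.server and open http://localhost:8097), otherwise the heatmap call has nowhere to render.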
From the plot we can see that English names seem the least distinctive and are classified worst, while Chinese and Korean are easily confused, which is understandable: if you tell me someone's surname is Li/Lee, I can't tell whether they're Chinese or Korean either.