Word Alignment with GIZA++

阿新 • Published: 2018-12-09

Preparing the bilingual corpus

zh.txt: source language

海洋 是 一個 非常 複雜 的 事物 。
人類 的 健康 也 是 一 件 非常 複雜 的 事情 。
將 兩者 統一 起來 看 起來 是 一 件 艱鉅 的 任務 。 但 我 想 要 試圖 去 說明 的 是 即使 是 如此 複雜 的 情況 ， 也 存在 一些 我 認為 簡單 的 話題 ， 一些 如果 我們 能 理解 ， 就 很 容易 向前 發展 的 話題 。
這些 簡單 的 話題 確實 不 是 有關 那 複雜 的 科學 有 了 怎樣 的 發展 ， 而是 一些 我們 都 恰好 知道 的 事情 。
接下來 我 就 來說 一個 。 如果 老 媽 不 高興 了 ， 大家 都 別 想 開心 。

en.txt: target language

It can be a very complicated thing , the ocean . 
And it can be a very complicated thing , what human health is .
And bringing those two together might seem a very daunting task , but what I 'm going to try to say is that even in that complexity , there 's some simple themes that I think , if we understand , we can really move forward .
And those simple themes aren 't really themes about the complex science of what 's going on , but things that we all pretty well know .
And I 'm going to start with this one : If momma ain 't happy , ain 't nobody happy .

Make sure both files are UTF-8 encoded.
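A quick sanity check before running the tools can save time. The sketch below is not part of GIZA++; the function name is made up for illustration. It verifies that both files decode as UTF-8 and, since the corpus is paired line by line, that they have the same number of lines.

```python
def count_parallel_lines(src_bytes, tgt_bytes):
    """Return the number of sentence pairs, or raise if the corpus looks broken.

    Hypothetical helper: raises UnicodeDecodeError if either side is not
    valid UTF-8, and ValueError if the two sides differ in line count.
    """
    src_lines = src_bytes.decode("utf-8").rstrip("\n").split("\n")
    tgt_lines = tgt_bytes.decode("utf-8").rstrip("\n").split("\n")
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            "line count mismatch: %d vs %d" % (len(src_lines), len(tgt_lines))
        )
    return len(src_lines)

# Typical use (file names taken from the article):
#   with open("zh.txt", "rb") as zh, open("en.txt", "rb") as en:
#       print(count_parallel_lines(zh.read(), en.read()))
```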

Downloading and compiling GIZA++

$ git clone https://github.com/moses-smt/giza-pp.git
$ cd giza-pp
$ make

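After compilation, the following executables are generated under GIZA++-v2/ and mkcls-v2/: plain2snt.out, snt2cooc.out, GIZA++, mkcls

Move these four programs into the working directory, workspace.

Running the commands for word alignment

Numbering the words in the text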

./plain2snt.out zh.txt en.txt

This produces four files: en.vcb, zh.vcb, en_zh.snt and zh_en.snt.

en.vcb / zh.vcb: vocabulary files, one entry per line in the form id token count

2 海洋 1
3 是 6
4 一個 2
5 非常 2
6 複雜 4
7 的 12
8 事物 1
9 。 7
10 人類 1
...
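To read the ids in the later output files, it helps to load a vocabulary file into a lookup table. The following is a minimal sketch that assumes the three-column layout shown above; the function name is illustrative and not part of GIZA++.

```python
def parse_vcb(lines):
    """Map word id -> (token, count) from the lines of a .vcb file.

    Lines that do not have exactly three whitespace-separated fields
    are skipped.
    """
    vocab = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        word_id, token, count = parts
        vocab[int(word_id)] = (token, int(count))
    return vocab

# Typical use:
#   with open("zh.vcb", encoding="utf-8") as f:
#       zh_vocab = parse_vcb(f)
```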

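en_zh.snt / zh_en.snt: the sentence pairs written as word ids. Each pair takes three lines: the first line is the number of times the sentence pair occurs, and the next two lines are the two sentences as id sequences.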

1
2 3 4 5 6 7 8 9
2 3 4 5 6 7 8 9 10 11 12 13
1
10 7 11 12 3 13 14 5 6 7 15 9
14 15 4 5 6 7 8 9 10 16 17 18 19 13
...
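The three-line grouping can be read back with a few lines of Python. This sketch assumes the file contains only complete groups of three lines, as in the sample above; the function name is illustrative.

```python
def parse_snt(lines):
    """Return a list of (count, first_ids, second_ids) from a .snt file."""
    lines = [line.strip() for line in lines]
    pairs = []
    # Step through the file three lines at a time.
    for k in range(0, len(lines) - 2, 3):
        count = int(lines[k])
        first = [int(x) for x in lines[k + 1].split()]
        second = [int(x) for x in lines[k + 2].split()]
        pairs.append((count, first, second))
    return pairs
```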

Generating the co-occurrence files

./snt2cooc.out zh.vcb en.vcb zh_en.snt > zh_en.cooc
./snt2cooc.out en.vcb zh.vcb en_zh.snt > en_zh.cooc

zh_en.cooc / en_zh.cooc

Generating word classes

./mkcls -pzh.txt -Vzh.vcb.classes opt
./mkcls -pen.txt -Ven.vcb.classes opt

***** 1 runs. (algorithm:TA)*****
;KategProblem:cats: 100   words: 68

start-costs: MEAN: 262.907 (262.907-262.907)  SIGMA:0
  end-costs: MEAN: 190.591 (190.591-190.591)  SIGMA:0
   start-pp: MEAN: 3.52623 (3.52623-3.52623)  SIGMA:0
     end-pp: MEAN: 1.95873 (1.95873-1.95873)  SIGMA:0
 iterations: MEAN: 50117 (50117-50117)  SIGMA:0
       time: MEAN: 1.468 (1.468-1.468)  SIGMA:0

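Options:
-c   number of word classes
-n   number of optimization runs (default 1; the more the better)
-p   input file
-V   output file
opt  optimized output

en.vcb.classes / zh.vcb.classes: the class id each word belongs to

en.vcb.classes.cats / zh.vcb.classes.cats: the set of words belonging to each class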

0:$,
1:
2:science,
3:seem,
4:things,
5:some,
6:start,
7:task,
...

GIZA++

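First create two output directories, z2e and e2z, in the current directory; otherwise the commands below fail and produce no output.

$ ./GIZA++ -S zh.vcb -T en.vcb -C zh_en.snt -CoocurrenceFile zh_en.cooc -o z2e -OutputPath z2e
$ ./GIZA++ -S en.vcb -T zh.vcb -C en_zh.snt -CoocurrenceFile en_zh.cooc -o e2z -OutputPath e2z

Options:
-o           output file prefix
-OutputPath  directory into which all output files are written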
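Because a missing output directory makes the run fail, it can be convenient to create the directory and build the command in one place. The sketch below is only a wrapper around the flags shown above; the function name and the workdir parameter are assumptions for illustration, and the file naming follows the zh_en / en_zh pattern used in this article.

```python
import os

def giza_command(src, tgt, name, workdir="."):
    """Create the output directory and return the GIZA++ argument list.

    src and tgt are language prefixes such as "zh" and "en"; name is used
    both as the output prefix (-o) and the output directory (-OutputPath).
    The command is meant to be run with workdir as the current directory.
    """
    os.makedirs(os.path.join(workdir, name), exist_ok=True)
    return [
        "./GIZA++",
        "-S", src + ".vcb",
        "-T", tgt + ".vcb",
        "-C", src + "_" + tgt + ".snt",
        "-CoocurrenceFile", src + "_" + tgt + ".cooc",
        "-o", name,
        "-OutputPath", name,
    ]

# Typical use:
#   import subprocess
#   subprocess.run(giza_command("zh", "en", "z2e"), check=True)
#   subprocess.run(giza_command("en", "zh", "e2z"), check=True)
```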

The output files in detail (using z2e as the example):

z2e.perp: perplexity

#trnsz  tstsz   iter    model   trn-pp          test-pp         trn-vit-pp              tst-vit-pp
5       0       0       Model1  80.5872         N/A             2250.77         N/A
5       0       1       Model1  36.0705         N/A             648.066         N/A
5       0       2       Model1  34.0664         N/A             523.575         N/A
5       0       3       Model1  32.628          N/A             423.928         N/A
5       0       4       Model1  31.5709         N/A             359.343         N/A
5       0       5       HMM     30.7896         N/A             314.58          N/A
5       0       6       HMM     31.1412         N/A             172.128         N/A
5       0       7       HMM     26.1343         N/A             111.444         N/A
5       0       8       HMM     22.177          N/A             79.3055         N/A
5       0       9       HMM     19.0506         N/A             58.4415         N/A
5       0       10      THTo3   32.6538         N/A             37.8575         N/A
5       0       11      Model3  11.1194         N/A             11.944          N/A
5       0       12      Model3  8.93033         N/A             9.50349         N/A
5       0       13      Model3  7.68766         N/A             8.19622         N/A
5       0       14      Model3  6.64154         N/A             7.04977         N/A
5       0       15      T3To4   6.17993         N/A             6.55567         N/A
5       0       16      Model4  6.16858         N/A             6.4715          N/A
5       0       17      Model4  6.0819          N/A             6.39317         N/A
5       0       18      Model4  6.04302         N/A             6.34387         N/A
5       0       19      Model4  5.95066         N/A             6.2234          N/A

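z2e.a3.final: i j l m p(i | j, l, m), where i is the source-language token position, j is the target-language token position, l is the source sentence length, m is the target sentence length, and p(i | j, l, m) is the probability that a source word in position i is moved to position j in a pair of sentences of length l and m.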

1 1 8 100 1
5 2 8 100 1
4 3 8 100 1
2 4 8 100 1
5 5 8 100 1
4 6 8 100 1
4 7 8 100 1
0 8 8 100 1
...

z2e.d3.final: similar to z2e.a3.final, except that the positions of i and j are swapped.

2 0 100 8 0.0491948
6 0 100 8 0.950805
3 1 100 8 1
5 2 100 8 1
4 3 100 8 1
2 4 100 8 0.175424
5 4 100 8 0.824576
2 5 100 8 1
4 6 100 8 1
4 7 100 8 1
...

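z2e.n3.final: source_id p0 p1 p2 … pn; the fertility table, giving for each source-language token the probability that its fertility is 0, 1, …, n. For example, p0 is the probability that the fertility is 0.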

2 1.22234e-05 0.781188 0.218799 0 0 0 0 0 0 0
3 0.723068 0.223864 0 0.053068 0 0 0 0 0 0
4 0.349668 0.439519 0.0423205 0.168493 0 0 0 0 0 0
5 0.457435 0.447043 0.0955223 0 0 0 0 0 0 0
6 0.214326 0.737912 0 0.0477612 0 0 0 0 0 0
7 1 0 0 0 0 0 0 0 0 0
8 1.48673e-05 0.784501 0.215484 0 0 0 0 0 0 0
...
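One way to get a feel for this table is to reduce each row to its expected fertility, the sum of k times pk over all k. The helper below is an illustrative sketch that assumes the row layout described above.

```python
def expected_fertility(line):
    """Return (source_id, expected fertility) for one row of a .n3.final file."""
    fields = line.split()
    source_id = int(fields[0])
    probs = [float(p) for p in fields[1:]]
    # probs[k] is the probability that the token has fertility k.
    return source_id, sum(k * p for k, p in enumerate(probs))
```

For the first sample row this gives 0.781188 + 2 × 0.218799 = 1.218786, that is, token 2 produces slightly more than one target word on average.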

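z2e.t3.final: s_id t_id p(t_id | s_id); the translation table after IBM Model 3 training, where p(t_id | s_id) is the probability that the source-language token is translated as the target-language token.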

0 3 0.196945
0 7 0.74039
0 33 0.0626657
2 4 1
3 6 1
4 5 1
5 3 0.822024
5 6 0.177976
6 3 0.593075
...
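A common first look at the translation table is to pick, for every source id, the target id with the highest probability. This is a small sketch assuming the three-column layout above; the function name is illustrative. The ids can then be mapped back to words through the .vcb files.

```python
def best_translations(lines):
    """Map s_id -> (t_id, prob) for the most probable target of each source id."""
    best = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        s_id, t_id, prob = int(parts[0]), int(parts[1]), float(parts[2])
        if s_id not in best or prob > best[s_id][1]:
            best[s_id] = (t_id, prob)
    return best
```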

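z2e.A3.final: the one-directional alignment file. Each sentence pair takes three lines: a header, the target sentence, and the source tokens, each followed by the positions of the target tokens aligned to it. Positions are counted from 1.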

# Sentence pair (1) source length 8 target length 11 alignment score : 8.99868e-08
It can be a very complicated thing , the ocean .
NULL ({ 8 }) 海洋 ({ 1 }) 是 ({ 4 }) 一個 ({ 9 }) 非常 ({ 3 6 7 }) 複雜 ({ 2 5 }) 的 ({ }) 事物 ({ 10 }) 。 ({ 11 })

# Sentence pair (2) source length 12 target length 14 alignment score : 9.55938e-12
And it can be a very complicated thing , what human health is .
NULL ({ 9 }) 人類 ({ 2 11 }) 的 ({ }) 健康 ({ 12 }) 也 ({ }) 是 ({ 5 }) 一 ({ }) 件 ({ 13 }) 非常 ({ 4 7 8 }) 複雜 ({ 3 6 }) 的 ({ }) 事情 ({ 1 10 }) 。 ({ 14 })
...
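The third line of each block can be turned into position pairs with a regular expression. The sketch below is based only on the sample shown above and is not the parser used by the author's scripts; it drops the NULL entry and returns (source position, target position) pairs, both counted from 1.

```python
import re

# One "token ({ positions })" group; the position list may be empty.
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3_line(line):
    """Return sorted (source_pos, target_pos) pairs from an A3 alignment line."""
    pairs = []
    for pos, match in enumerate(TOKEN_RE.finditer(line)):
        if pos == 0:
            continue  # position 0 is the NULL token
        for tgt in match.group(2).split():
            pairs.append((pos, int(tgt)))
    return sorted(pairs)
```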

z2e.d4.final: IBM Model 4 translation table

# Translation tables for Model 4 .
# Table for head of cept.
F: 20 E: 26
SUM: 0.125337
9 0.125337

F: 20 E: 15
SUM: 0.0387214
-2 0.0387214

F: 20 E: 24
SUM: 0.0387214
21 0.0387214
...

z2e.D4.final: the distortion table of IBM Model 4

26 20 9 1
15 20 -2 1
24 20 21 1
2 20 -2 1
40 20 -4 1
22 20 -3 0.0841064
22 20 9 0.915894
32 20 28 1
21 20 24 1
29 2 -3 0.472234
29 2 1 0.527766
5 2 1 0.475592
...

z2e.gizacfg: the GIZA++ configuration file (hyperparameters)

adbackoff 0
c zh_en.snt
compactadtable 1
compactalignmentformat 0
coocurrencefile zh_en.cooc
corpusfile zh_en.snt
countcutoff 1e-06
countcutoffal 1e-05
countincreasecutoff 1e-06
countincreasecutoffal 1e-05
d
deficientdistortionforemptyword 0
depm4 76
depm5 68
dictionary
dopeggingyn 0
emalignmentdependencies 2
emalsmooth 0.2
emprobforempty 0.4
emsmoothhmm 2
hmmdumpfrequency 0
hmmiterations 5
l z2e/118-03-20.215009.gld.log
log 0
logfile z2e/118-03-20.215009.gld.log
m1 5
m2 0
m3 5
m4 5
m5 0
m5p0 -1
m6 0
manlexfactor1 0
manlexfactor2 0
manlexmaxmultiplicity 20
maxfertility 10
maxsentencelength 101
mh 5
mincountincrease 1e-07
ml 101
model1dumpfrequency 0
model1iterations 5
model23smoothfactor 0
model2dumpfrequency 0
model2iterations 0
model345dumpfrequency 0
model3dumpfrequency 0
model3iterations 5
model4iterations 5
model4smoothfactor 0.2
model5iterations 0
model5smoothfactor 0.1
model6iterations 0
nbestalignments 0
nodumps 0
nofiledumpsyn 0
noiterationsmodel1 5
noiterationsmodel2 0
noiterationsmodel3 5
noiterationsmodel4 5
noiterationsmodel5 0
noiterationsmodel6 0
nsmooth 64
nsmoothgeneral 0
numberofiterationsforhmmalignmentmodel 5
o z2e/z2e
onlyaldumps 0
outputfileprefix z2e/z2e
outputpath z2e/
p 0
p0 -1
peggedcutoff 0.03
pegging 0
probcutoff 1e-07
probsmooth 1e-07
readtableprefix
s zh.vcb
sourcevocabularyfile zh.vcb
t en.vcb
t1 0
t2 0
t2to3 0
t3 0
t345 0
targetvocabularyfile en.vcb
tc
testcorpusfile
th 0
transferdumpfrequency 0
v 0
verbose 0
verbosesentence -10

z2e.Decoder.config: used by the ISI Rewrite Decoder

# Template for Configuration File for the Rewrite Decoder
# Syntax:
#         <Variable> = <value>
#         '#' is the comment character
#================================================================
#================================================================
# LANGUAGE MODEL FILE
# The full path and file name of the language model file:
LanguageModelFile =
#================================================================
#================================================================
# TRANSLATION MODEL FILES
# The directory where the translation model tables as created
# by Giza are located:
#
# Notes: - All translation model "source" files are assumed to be in
#          TM_RawDataDir, the binaries will be put in TM_BinDataDir
#
#        - Attention: RELATIVE PATH NAMES DO NOT WORK!!!
#
#        - Absolute paths (file name starts with /) will override
#          the default directory.

TM_RawDataDir = z2e/
TM_BinDataDir = z2e/

# file names of the TM tables
# Notes:
# 1. TTable and InversTTable are expected to use word IDs not
#    strings (Giza produces both, whereby the *.actual.* files
#    use strings and are THE WRONG CHOICE.
# 2. FZeroWords, on the other hand, is a simple list of strings
#    with one word per line. This file is typically edited
#    manually. Hoeever, this one listed here is generated by GIZA

TTable = z2e.t3.final
InverseTTable = z2e.ti.final
NTable = z2e.n3.final
D3Table = z2e.d3.final
D4Table = z2e.D4.final
PZero = z2e.p0_3.final
Source.vcb = zh.vcb
Target.vcb = en.vcb
Source.classes = zh.vcb.classes
Target.classes = en.vcb.classes
FZeroWords       = z2e.fe0_3.final

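The two Python scripts used below are available on GitHub.

Symmetrizing the word alignments

The *.A3.final files obtained above are aligned in one direction only, so they need to be symmetrized. There are many symmetrization methods; here we use the most popular one, "grow-diag-final-and".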

python align_sym.py e2z.A3.final z2e.A3.final > aligned.grow-diag-final-and

1-1 2-4 3-1 3-9 4-3 4-6 4-7 5-2 5-5 7-10 8-11
1-2 1-11 1-12 3-2 4-13 5-5 6-13 7-13 8-4 8-7 8-8 9-3 9-6 9-10 11-1 12-14
1-2 2-2 2-25 3-2 3-26 4-2 4-11 5-36 6-6 6-29 7-8 8-22 9-22 10-2 12-7 12-21 12-42 12-46 13-1 14-23 15-15 15-19 16-16 16-20 17-25 18-29 19-24 19-31 20-6 23-5 23-30 24-8 25-5 25-10 26-9 26-14 28-4 29-3 30-9 31-5 31-43 32-3 33-35 34-36 34-45 35-33 37-13 37-44 38-17 39-3 40-16 40-18 41-30 41-34 42-41 43-5 44-17 45-15 45-16 45-17 46-24 47-29 47-38 48-27 48-39 48-40 49-32 51-31 52-47
1-5 1-23 1-25 2-4 2-8 4-7 4-19 4-22 5-9 6-6 8-5 9-26 10-14 12-9 13-13 14-6 14-20 15-5 15-17 16-15 17-3 18-2 18-16 19-10 20-2 21-15 21-21 22-6 23-10 24-11 24-12 24-13 24-24 26-1 27-27
1-15 2-2 3-2 3-3 3-4 3-5 4-12 5-7 6-1 6-21 7-20 8-19 8-20 9-8 10-14 10-18 11-7 11-9 11-10 11-11 12-14 13-4 14-13 15-14 16-6 16-13 17-3 18-6 18-17 19-21
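For readers who want to see what the heuristic does, here is a compact sketch of grow-diag-final-and as it is commonly described: start from the intersection of the two directional alignments, grow it with neighbouring points from the union, then add leftover points whose source and target words are both still unaligned. This is not the author's align_sym.py, and it assumes both inputs are already sets of (source position, target position) pairs in the same orientation.

```python
# The eight neighbours of an alignment point, including diagonals.
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag_final_and(src2tgt, tgt2src):
    """Symmetrize two directional alignments given as sets of (i, j) pairs."""
    union = src2tgt | tgt2src
    alignment = set(src2tgt & tgt2src)

    def src_free(i):
        return all(a != i for a, _ in alignment)

    def tgt_free(j):
        return all(b != j for _, b in alignment)

    # grow-diag: repeatedly add union points next to existing points,
    # as long as at least one of the two words is still unaligned.
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in NEIGHBORS:
                cand = (i + di, j + dj)
                if (cand in union and cand not in alignment
                        and (src_free(cand[0]) or tgt_free(cand[1]))):
                    alignment.add(cand)
                    added = True

    # final-and: add remaining directional points only if BOTH words
    # are still unaligned.
    for directional in (src2tgt, tgt2src):
        for i, j in sorted(directional):
            if src_free(i) and tgt_free(j):
                alignment.add((i, j))
    return alignment
```

Pairs read from e2z.A3.final come out as (en position, zh position), so they would need to be flipped before being combined with the pairs from z2e.A3.final.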

Visualizing the word alignment

Visualize the word alignment of the first sentence:

python align_plot.py en.txt zh.txt aligned.grow-diag-final-and 0
