Japanese NLP

By 阿新 · Published: 2018-11-04

An overview of Japanese tokenizers:

MeCab

Download: http://taku910.github.io/mecab/

http://mecab.sourceforge.net/

MeCab is an open-source tokenizer that can handle various languages, provided a dictionary is available for the language.

See the English documentation for details.

http://www.52nlp.cn/%E6%97%A5%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8-mecab-%E6%96%87%E6%A1%A3

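MeCab is a Japanese tokenization (morphological analysis) system developed by Taku Kudo (工藤拓) at the Nara Institute of Science and Technology. He has written several machine learning packages, the best known of which is CRF++. The author currently works at Google in Japan.

MeCab's basic design principle is to avoid depending on any particular language, dictionary, or corpus. It uses a Conditional Random Fields (CRF) model for parameter estimation and outperforms ChaSen, which uses a hidden Markov model. Its average parsing speed is also higher than that of other Japanese lexical analyzers such as ChaSen, JUMAN, and KAKASI.

The Japanese dictionaries currently available for MeCab include ipadic and neologd.

ipadic is MeCab's standard dictionary. It has seen little maintenance since March 2015, so many new words are not segmented correctly. neologd, whose full name is mecab-ipadic-neologd, contains many new words and can be used together with MeCab. The neologd dictionary can also be converted to another format for use with Juman/Juman++.

MeCab can be integrated into C/C++ programs and can be called from scripting languages such as Perl and Python.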
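To illustrate calling MeCab from Python, here is a minimal sketch that parses MeCab's default text output (one token per line as `surface<TAB>features`, terminated by `EOS`) when the ipadic dictionary is used. The helper names `parse_mecab_output` and `analyze`, the field names in `IPADIC_FIELDS`, and the sample output string are illustrative assumptions; `analyze` additionally assumes that the `mecab-python3` binding and a dictionary are installed.

```python
from typing import Dict, List

# Names chosen here for the comma-separated ipadic feature fields (illustrative).
IPADIC_FIELDS = [
    "pos", "pos_detail1", "pos_detail2", "pos_detail3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
]


def parse_mecab_output(raw: str) -> List[Dict[str, str]]:
    """Turn MeCab's default text output into a list of token dicts."""
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":  # EOS marks the end of a sentence
            continue
        surface, _, feature = line.partition("\t")
        token = {"surface": surface}
        # Unknown words may carry fewer fields; zip() simply stops early.
        token.update(zip(IPADIC_FIELDS, feature.split(",")))
        tokens.append(token)
    return tokens


def analyze(text: str) -> List[Dict[str, str]]:
    """Run MeCab via the mecab-python3 binding (assumed to be installed)."""
    import MeCab  # imported lazily so the parser above works without MeCab

    return parse_mecab_output(MeCab.Tagger().parse(text))


# Illustrative output in the ipadic format, not captured from a real run.
SAMPLE = (
    "東京\t名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー\n"
    "に\t助詞,格助詞,一般,*,*,*,に,ニ,ニ\n"
    "EOS\n"
)

if __name__ == "__main__":
    for tok in parse_mecab_output(SAMPLE):
        print(tok["surface"], tok["pos"], tok.get("base_form"))
```

Keeping the parser separate from the MeCab call makes it possible to test the parsing logic on saved output without a MeCab installation.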

Installing MeCab

% tar zxfv mecab-XX.tar.gz
% cd mecab-XX
% ./configure --prefix=***
% make
% make check
% make install

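MeCab ipadic dictionary encoding

Unless stated otherwise, the dictionary uses EUC encoding by default. To use Shift-JIS or UTF-8 instead, change the charset option of the dictionary's configure script and rebuild the dictionary. This produces a dictionary encoded in Shift-JIS or UTF-8.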

% tar zxfv mecab-ipadic-2.7.0-xxxx
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=sjis
% make
% make install

% tar zxfv mecab-ipadic-2.7.0-xxxx
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=utf8
% make
% make install
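The dictionary charset matters because the same Japanese text has different byte sequences in EUC-JP, Shift-JIS, and UTF-8, so input text has to be in the encoding the dictionary was built with. The following is a minimal sketch using only Python's standard codecs; the helper name `transcode` and the sample string are hypothetical.

```python
SAMPLE = "日本語"

# Python codec names for the three encodings discussed above.
CHARSETS = ("euc_jp", "shift_jis", "utf-8")


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one charset to another (e.g. EUC-JP to UTF-8)."""
    return data.decode(src).encode(dst)


# Byte length of the same three-character string in each charset.
sizes = {name: len(SAMPLE.encode(name)) for name in CHARSETS}

if __name__ == "__main__":
    for name in CHARSETS:
        print(name, SAMPLE.encode(name).hex(" "))
    # Convert EUC-JP input before passing it to a UTF-8 dictionary.
    utf8_bytes = transcode(SAMPLE.encode("euc_jp"), "euc_jp", "utf-8")
    print(utf8_bytes.decode("utf-8"))
```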

Documentation: http://www.flickering.cn/nlp/2014/06/%E6%97%A5%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8-mecab-%E6%96%87%E6%A1%A3/

Latest version covered by that documentation: MeCab 0.97 (updated 2008-02-03)

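Feature	MeCab	ChaSen	JUMAN	KAKASI
Analysis model	bi-gram Markov model	variable-length Markov model	bi-gram Markov model	longest match
Cost estimation	learned from a corpus	learned from a corpus	set by hand	no concept of cost
Learning model	CRF (discriminative model)	HMM (generative model)	-	-
Dictionary lookup algorithm	Double Array	Double Array	Patricia Tree	Hash?
Search algorithm	Viterbi	Viterbi	Viterbi	deterministic?
Connection table implementation	2-D table	automaton	2-D table?	no connection table?
POS hierarchy	unlimited multi-level POS	unlimited multi-level POS	fixed at 2 levels	no concept of POS?
Unknown word handling	character type (behavior configurable)	character type (not configurable)	character type (not configurable)	-
Constrained parsing	supported	supported since 2.4.0	not supported	not supported
N-best solutions	supported	not supported	not supported	not supported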
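The MeCab column of the table (a bi-gram model with word costs, a 2-D connection table, and Viterbi search) can be illustrated with a toy lattice. This is a minimal sketch under stated assumptions, not MeCab's implementation: the dictionary `TOY_DICT`, the costs in `CONNECTION_COST`, the default cost, and the function names are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: surface -> list of (part of speech, word cost).
TOY_DICT: Dict[str, List[Tuple[str, int]]] = {
    "東京": [("noun", 3)],
    "京都": [("noun", 3)],
    "東": [("noun", 6)],
    "都": [("suffix", 2)],
}

# Hypothetical bi-gram connection costs, indexed by (left POS, right POS).
CONNECTION_COST: Dict[Tuple[str, str], int] = {
    ("noun", "suffix"): 0,
    ("noun", "noun"): 3,
}
DEFAULT_CONNECTION_COST = 2


def conn_cost(left: str, right: str) -> int:
    return CONNECTION_COST.get((left, right), DEFAULT_CONNECTION_COST)


def viterbi_segment(text: str) -> List[Tuple[str, str]]:
    """Return the lowest-cost segmentation of `text` as (surface, POS) pairs."""
    n = len(text)
    # best[i][pos] = (cost, back pointer) of the cheapest path ending at
    # offset i with a token whose part of speech is `pos`.
    best: List[Dict[str, tuple]] = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0, None)
    for start in range(n):
        for end in range(start + 1, n + 1):
            for pos, word_cost in TOY_DICT.get(text[start:end], []):
                for prev_pos, (prev_cost, _) in best[start].items():
                    cost = prev_cost + conn_cost(prev_pos, pos) + word_cost
                    if pos not in best[end] or cost < best[end][pos][0]:
                        best[end][pos] = (cost, (start, prev_pos))
    if not best[n]:
        raise ValueError("no segmentation found with the toy dictionary")
    # Pick the best final state, including the cost of connecting to EOS.
    pos = min(best[n], key=lambda p: best[n][p][0] + conn_cost(p, "EOS"))
    tokens = []
    end = n
    while end > 0:  # follow back pointers to the beginning of the sentence
        start, prev_pos = best[end][pos][1]
        tokens.append((text[start:end], pos))
        end, pos = start, prev_pos
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    print(viterbi_segment("東京都"))
```

With these made-up costs, the path 東京 + 都 is cheaper than 東 + 京都, because both the word costs and the noun-to-suffix connection cost favor it.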

Juman

Juman and Juman++ are developed by the Kurohashi-Kawahara Laboratory at Kyoto University, Japan, a group specializing in natural language processing (http://nlp.ist.i.kyoto-u.ac.jp). Both provide tokenization and part-of-speech (POS) tagging.

Juman copes well with ambiguous Japanese writing styles and, thanks to a large Web-based dictionary, with newly coined words.

Juman also reports the semantic meaning of words.

Juman++

Juman++ is a tokenizer developed by the Kurohashi laboratory at Kyoto University, Japan.

Juman++ is the successor to Juman. It adopts an RNN model for tokenization.

Juman++ copes well with ambiguous Japanese writing styles and, thanks to a large Web-based dictionary, with newly coined words.

Juman++ also reports the semantic meaning of words.

KyTea

KyTea (the Kyoto Text Analysis Toolkit) is a tokenizer developed by Graham Neubig.

KyTea uses a different algorithm from MeCab and Juman.

http://www.phontron.com/kytea/

ChaSen

http://chasen-legacy.osdn.jp/

Kuromoji

http://www.atilika.org

Usage guide: http://rensanning.iteye.com/blog/2008575

Written in Java.

Kuromoji supports standard morphological analysis features such as:

Word segmentation: segmenting text into words (morphemes)
Part-of-speech tagging: assigning word categories (nouns, verbs, particles, adjectives, etc.)
Lemmatization: getting dictionary forms for inflected verbs and adjectives
Readings: extracting readings for kanji

Gosen

Website: https://github.com/westei/stanbol-gosen

http://code.google.com/p/lucene-gosen/

Supports sentence splitting, tokenization, POS tagging, and NER. Whether it is a standalone tool remains to be confirmed.

Python packages:

JapaneseTokenizer

A Python wrapper around several tokenizers, including MeCab, Juman, Juman++, and KyTea. Download: https://pypi.org/project/JapaneseTokenizer/1.3.0/

natto-py

https://pypi.org/project/natto-py/

Resources for Japanese NLP

Stopwords:

https://www.ranks.nl/stopwords/japanese

これ
それ
あれ
この
その
あの
ここ
そこ
あそこ
こちら
どこ
だれ
なに
なん
何
私
貴方
貴方方
我々
私達
あの人
あのかた
彼女
彼
です
あります
おります
います
は
が
の
に
を
で
え
から
まで
より
も
どの
と
し
それで
しかし
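A stopword list like the one above is typically applied after tokenization. The following is a minimal sketch that assumes the text has already been split into tokens by one of the tokenizers described earlier; the names `STOPWORDS` and `remove_stopwords` are hypothetical.

```python
from typing import Iterable, List

# The stopword list above, stored as a set for fast membership tests.
STOPWORDS = frozenset(
    "これ それ あれ この その あの ここ そこ あそこ こちら どこ だれ なに なん 何 "
    "私 貴方 貴方方 我々 私達 あの人 あのかた 彼女 彼 です あります おります います "
    "は が の に を で え から まで より も どの と し それで しかし".split()
)


def remove_stopwords(tokens: Iterable[str]) -> List[str]:
    """Drop tokens whose surface form is in the stopword list."""
    return [tok for tok in tokens if tok not in STOPWORDS]


if __name__ == "__main__":
    # Tokens as a tokenizer might produce them for 「これは本です」.
    print(remove_stopwords(["これ", "は", "本", "です"]))
```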

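Checking whether a character is Japanese

A commonly used regular-expression range is \u0800-\u4e00. This range is coarse: it covers hiragana and katakana but also many non-Japanese scripts, and it ends at the first CJK unified ideograph (U+4E00), so most kanji fall outside it.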
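As a more targeted alternative, the check can use the Unicode blocks for hiragana (U+3040 to U+309F), katakana (U+30A0 to U+30FF), and CJK unified ideographs (U+4E00 to U+9FFF). This is a minimal sketch; the function names are hypothetical, and because kanji are shared with Chinese, the presence of kana is usually the more reliable signal that a text is Japanese.

```python
import re

# The coarse range quoted in the text, kept for comparison.
COARSE_RANGE = re.compile(r"[\u0800-\u4e00]")

HIRAGANA = r"\u3040-\u309f"
KATAKANA = r"\u30a0-\u30ff"
KANJI = r"\u4e00-\u9fff"  # CJK unified ideographs, shared with Chinese

KANA_RE = re.compile(f"[{HIRAGANA}{KATAKANA}]")
JAPANESE_RE = re.compile(f"[{HIRAGANA}{KATAKANA}{KANJI}]")


def is_japanese_char(ch: str) -> bool:
    """True if the single character is hiragana, katakana, or a kanji."""
    return JAPANESE_RE.fullmatch(ch) is not None


def contains_kana(text: str) -> bool:
    """True if the text contains at least one hiragana or katakana character."""
    return KANA_RE.search(text) is not None


if __name__ == "__main__":
    print(contains_kana("これは本です"))  # Japanese sentence with kana
    print(contains_kana("自然語言處理"))  # Chinese text, kanji only
    print(is_japanese_char("漢"), COARSE_RANGE.search("漢"))
```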

Full-width to half-width conversion
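One way to do this in Python is sketched below. Full-width ASCII characters (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their half-width counterparts, and the ideographic space (U+3000) maps to an ordinary space. The function name `to_halfwidth` is hypothetical. The standard library's `unicodedata.normalize("NFKC", ...)` is a broader alternative that also converts half-width katakana to full-width.

```python
import unicodedata

# Translation table: full-width ASCII -> half-width ASCII, plus the
# ideographic space -> ordinary space. Kana and kanji are left untouched.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20


def to_halfwidth(text: str) -> str:
    """Convert full-width letters, digits, and punctuation to half-width."""
    return text.translate(FULL_TO_HALF)


if __name__ == "__main__":
    print(to_halfwidth("ＭｅＣａｂ　０．９７"))
    # NFKC handles the same characters and also widens half-width katakana.
    print(unicodedata.normalize("NFKC", "ＡＢＣ１２３ ｶﾀｶﾅ"))
```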
