coreseek一元切分模式中英文單詞不切分問題
(http://www.coreseek.cn/products-install/ngram_len_cjk/ 文件地址,此處僅列出主要部分)
#部分文件:
ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ......略..
# end
其中: ngram_chars 表示要進行一元字元切分模式的字符集;
charset_table 表示可被一元字元切分模式認可的有效字符集;
仔細對比字符集開頭,發現ngram_chars中沒有數字與英文字母的集合,呵呵!終於找到原因了,將charset_table字符集開頭:“U+FF10..U+FF19->0..9,0..9,U+FF41..U+FF5A->a..z,U+FF21..U+FF3A->a..z,A..Z->a..z, a..z,”部分,複製到ngram_char字符集前頭如下:
ngram_chars =U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ......略..
重新執行索引,問題解決。