Text Classification Based on GibbsLDA
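The previous few posts covered document topic models, but my primary task is still classification. I bring topic models in mainly for text representation, because a topic model can substantially reduce the dimensionality of document vectors. Topic models can of course do much more than that, but for a classification task I think this alone is enough.
LDA, discussed in an earlier post, is a fairly mature document topic model. This time I used JGibbsLDA, an open-source LDA implementation, for the LDA work. It is simple to use; the usage instructions are on the official site, or you can search for them.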
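Following the parameter and format conventions on the official site, you can train on a corpus and generate the results. Training produces the following files:
- model-final.twords: topic-word, i.e. the most likely words for each topic
- model-final.others: some of the LDA parameters (such as alpha, beta, and the number of topics)
- model-final.phi: a (number of topics) × (number of words) matrix; each row is the word distribution of one topic
- model-final.tassign: the topic assigned to each word of each training document (not tf-idf statistics)
- model-final.theta: this is the one we need; it gives the topic probabilities of each document
- wordmap.txt: the mapping between words and their integer IDs (not word frequencies)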
What we need is model-final.theta, which serves as the input document vectors for the neural network classifier.
Now on to the experiment:
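Corpus: 20_newsgroups, news posts in 20 categories, split 1:1 into a training set and a test set
Environment: JDK 1.8, Windows 7
LDA tool: JGibbsLDA (open source)
Classifier: a simple three-layer BP (backpropagation) neural network of size 100*300*20, built with JOONE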
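To make the 100*300*20 shape concrete, here is a minimal plain-Java sketch of a forward pass through such a network: 100 inputs (one per topic), 300 hidden units, and 20 outputs (one per class). It does not use JOONE's API. The class name `TinyMlp`, the sigmoid activation, the random initialization, and the seed are all assumptions made for illustration, and training by backpropagation is left out.

```java
import java.util.Random;

/** Illustrative only: the shape of a three-layer feed-forward network, not JOONE code. */
final class TinyMlp {
    private final double[][] hiddenWeights; // [hidden][input]
    private final double[][] outputWeights; // [output][hidden]

    TinyMlp(int inputs, int hidden, int outputs, long seed) {
        Random rng = new Random(seed);
        hiddenWeights = randomMatrix(hidden, inputs, rng);
        outputWeights = randomMatrix(outputs, hidden, rng);
    }

    private static double[][] randomMatrix(int rows, int cols, Random rng) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int j = 0; j < cols; j++) {
                row[j] = (rng.nextDouble() - 0.5) * 0.1; // small random weights (assumption)
            }
        }
        return m;
    }

    /** One fully connected layer followed by a sigmoid (biases omitted for brevity). */
    private static double[] layer(double[][] weights, double[] x) {
        double[] y = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < x.length; j++) {
                sum += weights[i][j] * x[j];
            }
            y[i] = 1.0 / (1.0 + Math.exp(-sum));
        }
        return y;
    }

    /** Maps a topic vector to one score per class. */
    double[] forward(double[] topicVector) {
        return layer(outputWeights, layer(hiddenWeights, topicVector));
    }

    /** Index of the largest score, i.e. the predicted class. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

Calling `new TinyMlp(100, 300, 20, 42L).forward(vector)` on a 100-dimensional topic vector returns 20 scores, and `TinyMlp.argMax` picks the predicted class from them.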
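First, the corpus is preprocessed by removing stop words and irrelevant tokens (dates, years, e-mail addresses, and so on). This experiment does no stemming. I originally planned to use Lucene's stemmer, but its results were poor: it turns does into doe and integrate into intergr, which defeats the purpose. I then tried Stanford CoreNLP. Its output was good, but its processing is context-dependent, which made it too slow to be practical, so in the end I skipped stemming altogether.
Because LDA does not work well on short texts, the corpus was also filtered to keep only articles whose text length is greater than 5000.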
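The filtering steps described above could be sketched roughly as follows. The class name `Preprocessor`, the tiny stop-word list, the token rules, and the reading of "length" as a character count are all assumptions made for illustration; the original preprocessing code is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the stop-word list and token rules are illustrative only. */
final class Preprocessor {
    // Tiny sample list; a real run would load a full stop-word list from a file.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on", "at"));

    /** Lower-cases the text and keeps alphabetic tokens that are not stop words. */
    static List<String> clean(String rawText) {
        List<String> kept = new ArrayList<>();
        for (String token : rawText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.contains("@")) {
                continue; // looks like an e-mail address
            }
            // Strip digits and punctuation from both ends; dates and years become empty.
            String word = token.replaceAll("^[^a-z]+|[^a-z]+$", "");
            if (word.matches("[a-z]{2,}") && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    /** Length filter; assumes the threshold is measured in characters. */
    static boolean longEnough(String rawText, int minLength) {
        return rawText.length() > minLength;
    }
}
```

For example, `Preprocessor.clean("Mail bob@example.com on 1993-04-05: the Darwin fish.")` keeps only `mail`, `darwin`, and `fish`.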
After processing, the training text trainScale looks like this (only the count line and one document are shown):
126
archive atheism resources alt atheism archive resources modified december version atheist resources addresses atheist organizations usa freedom religion foundation darwin fish bumper stickers assorted atheist paraphernalia freedom religion foundation write ffrf box madison wi telephone evolution designs evolution designs sell darwin fish fish symbol christians stick cars feet word darwin written inside deluxe moulded plastic fish postpaid write evolution designs laurel canyon north hollywood san francisco bay area darwin fish lynn gold mailing net lynn directly price fish american atheist press aap publish atheist books critiques bible lists biblical contradictions book bible handbook ball foote american atheist press isbn edition bible contradictions absurdities atrocities immoralities ball foote bible contradicts aap based king james version bible write american atheist press box austin tx cameron road austin tx telephone fax prometheus books sell books including haught holy horrors write east amherst street buffalo york telephone alternate address newer older prometheus books glenn drive buffalo ny african americans humanism organization promoting black secular humanism uncovering history black freethought publish quarterly newsletter aah examiner write norm allen jr african americans humanism box buffalo ny united kingdom rationalist press association national secular society islington high street holloway road london ew london nl british humanist association south place ethical society lamb conduit passage conway hall london wc rh red lion square london wc rl fax national secular society publish freethinker monthly magazine founded germany ibka internationaler bund der konfessionslosen und atheisten postfach berlin germany ibka publish journal miz materialien und informationen zur zeit politisches journal der konfessionslosesn und atheisten hrsg ibka miz vertrieb postfach berlin germany atheist books write ibdk internationaler ucherdienst der konfessionslosen 
postfach hannover germany telephone books fiction thomas disch santa claus compromise short story ultimate proof santa exists characters events fictitious similarity living dead gods uh walter miller jr canticle leibowitz gem atomic doomsday novel monks spent lives copying blueprints saint leibowitz filling sheets paper ink leaving white lines letters edgar pangborn davy atomic doomsday novel set clerical church example forbids produce describe substance atoms philip dick philip dick dick wrote philosophical thought provoking short stories novels stories bizarre times approachable wrote sf wrote truth religion technology believed met sort god remained sceptical novels relevance galactic pot healer fallible alien deity summons group earth craftsmen women remote planet raise giant cathedral beneath oceans deity demand faith earthers pot healer joe fernwright unable comply polished ironic amusing novel maze death noteworthy description technology based religion valis schizophrenic hero searches hidden mysteries gnostic christianity reality fired brain pink laser beam unknown divine origin accompanied dogmatic dismissively atheist friend assorted odd characters divine invasion god invades earth making young woman pregnant returns star system terminally ill assisted dead man brain wired hour listening music margaret atwood handmaid tale story based premise congress mysteriously assassinated fundamentalists charge nation set book diary woman life live christian theocracy women property revoked bank accounts closed sinful luxuries outlawed radio readings bible crimes punished retroactively doctors performed legal abortions hunted hanged atwood writing style difficult tale grows chilling authors bible dull rambling work criticized worth reading ll fuss exists versions true version books fiction peter de rosa vicars christ bantam press de rosa christian catholic enlighting history papal immoralities adulteries fallacies german translation gottes erste diener die dunkle 
seite des papsttums droemer knaur michael martin atheism philosophical justification temple university press philadelphia usa detailed scholarly justification atheism outstanding appendix defining terminology usage tendentious area argues negative atheism belief existence god positive atheism belief existence god includes refutations challenging arguments god attention paid refuting contempory theists platinga swinburne isbn hardcover paperback case christianity temple university press comprehensive critique christianity considers contemporary defences christianity ultimately demonstrates unsupportable incoherent isbn james turner god creed johns hopkins university press baltimore md usa subtitled origins unbelief america examines unbelief agnostic atheistic mainstream alternative view focusses period considering france britain emphasis american england developments religious history secularization atheism god creed intellectual history fate single idea belief god exists isbn hardcover paper george seldes editor thoughts ballantine books york usa dictionary quotations kind concentrating statements writings explicitly implicitly person philosophy view includes obscure suppressed opinions popular observations traces expressed twisted idea centuries number quotations derived cardiff men religion noyes views religion isbn paper richard swinburne existence god revised edition clarendon paperbacks oxford book second volume trilogy began coherence theism concluded faith reason work swinburne attempts construct series inductive arguments existence god arguments tendentious rely imputation late century western christian values aesthetics god supposedly simple conceived decisively rejected mackie miracle theism revised edition existence god swinburne includes appendix incoherent attempt rebut mackie mackie miracle theism oxford posthumous volume comprehensive review principal arguments existence god ranges classical philosophical positions descartes anselm berkeley hume al 
moral arguments newman kant sidgwick restatements classical theses plantinga swinburne addresses positions push concept god realm rational kierkegaard kung philips replacements god lelie axiarchism book delight read formalistic written martin works refreshingly direct compared hand waving swinburne james haught holy horrors illustrated history religious murder madness prometheus books religious persecution ancient times christians library congress catalog card number norm allen jr african american humanism anthology listing african americans humanism gordon stein anthology atheism rationalism prometheus books anthology covering wide range subjects including devil evil morality history freethought comprehensive bibliography edmund cohen mind bible believer prometheus books study christian fundamentalists net resources small mail based archive server mantis uk carries archives alt atheism moderated articles assorted files send mail archive uk send atheism mail reply mathew ?
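Apart from the first line, each line represents one document, and the words on the line are the words of that document. A bag-of-words model is used, so word order has no effect on the result.
The 126 on the first line means the file contains 126 documents.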
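As a small illustration of this layout, the sketch below builds such a file from tokenized documents: the document count on the first line, then one space-separated document per line. The class and method names are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical writer for the "count line + one document per line" layout. */
final class CorpusFormat {
    /** Returns the file content: document count first, then one document per line. */
    static String format(List<List<String>> documents) {
        StringBuilder sb = new StringBuilder();
        sb.append(documents.size()).append('\n');
        for (List<String> tokens : documents) {
            sb.append(String.join(" ", tokens)).append('\n');
        }
        return sb.toString();
    }

    /** Writes the formatted corpus to the given file (UTF-8 is an assumption). */
    static void write(Path file, List<List<String>> documents) throws IOException {
        Files.write(file, format(documents).getBytes(StandardCharsets.UTF_8));
    }
}
```

Two documents `[darwin, fish]` and `[atheism]` would produce the three lines `2`, `darwin fish`, and `atheism`.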
We then run LDA on this training text. The main code is as follows:
import jgibblda.Estimator;
import jgibblda.LDACmdOption;

public void lda() {
    LDACmdOption ldaOption = new LDACmdOption();
    ldaOption.est = true;                          // estimate a new model from scratch
    ldaOption.K = 100;                             // 100 topics
    ldaOption.beta = 0.1;                          // beta hyperparameter
    ldaOption.alpha = 10.0 / ldaOption.K;          // alpha hyperparameter
    ldaOption.niters = 500;                        // number of Gibbs sampling iterations
    ldaOption.savestep = 200;                      // save the model every 200 iterations
    ldaOption.modelName = "model-train";           // model name
    ldaOption.dir = "D:\\J2ee_workspace\\LDATest"; // directory containing the training text
    ldaOption.dfile = "trainScale";                // training text file
    Estimator estimator = new Estimator();
    estimator.init(ldaOption);
    estimator.estimate();                          // start parameter estimation
}
The parameters are all commented in the code. The resulting model-final.theta looks like this (only part of the file is shown):
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.3045054945054945;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.004505494505494505;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.5671428571428572;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.017692307692307695;0.0078021978021978015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.027582417582417584;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.02208791208791209;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;0.01989010989010989;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.007802197802197
8015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.35058168942842693;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.0010622154779969652;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.640414769853313;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.0010622154779969652;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.058168942
8426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;
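Note that I modified part of the JGibbsLDA code so that its output matches the input format my neural network classifier requires. The first 20 columns above carry the class label: the position of the 1 marks the class, so in the rows above the first 20 columns say that the document belongs to class 1. The columns after the first 20 are the document's topic distribution; I used 100 topics, so there are 100 numbers.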
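The topic values in the sample rows are consistent with the usual smoothed estimate theta[k] = (n_k + alpha) / (N + K * alpha), where n_k is the number of words in the document assigned to topic k and N is the document's word count. With alpha = 10.0 / 100 = 0.1 and K = 100, the smallest value in the first row, 1.0989010989010989E-4, equals 0.1 / 910, which would correspond to a topic with no words in a 900-word document. That reading is inferred from the numbers rather than stated anywhere, and the class below is a hypothetical sketch, not JGibbsLDA's code.

```java
/** Hypothetical sketch of the smoothed document-topic estimate. */
final class ThetaEstimate {
    /** theta[k] = (topicCounts[k] + alpha) / (N + K * alpha), with N the sum of the counts. */
    static double[] theta(int[] topicCounts, double alpha) {
        int numTopics = topicCounts.length;
        long totalWords = 0;
        for (int count : topicCounts) {
            totalWords += count;
        }
        double denominator = totalWords + numTopics * alpha;
        double[] theta = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
            theta[k] = (topicCounts[k] + alpha) / denominator;
        }
        return theta;
    }
}
```

Under the same reading, a topic holding exactly one word gives 1.1 / 910 ≈ 0.0012088, which also appears in the first sample row.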
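With the LDA model trained on the training text, we can generate vectors for the test documents from it. There are several ways to do this. The simplest is to throw the test documents in with the training documents and rerun LDA from scratch. That is obviously time-consuming, so I do not recommend it, although it may be worth a try when there are many test documents and few training documents. The second approach is more common: generate the vectors of the new documents on top of the model built from the training documents. Usually this means running Gibbs sampling over the new documents only, while the model's topic-word distribution (twords) stays fixed. JGibbsLDA makes this fairly easy: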
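import jgibblda.Inferencer;
import jgibblda.LDACmdOption;
import jgibblda.Model;

public void generateWithLDAModel() {
    LDACmdOption ldaOption = new LDACmdOption();
    ldaOption.inf = true;    // inference mode: reuse an existing model
    ldaOption.estc = false;  // do not continue estimating the old model
    ldaOption.dir = "D:\\J2ee_workspace\\LDATest";
    ldaOption.modelName = "model-final"; // model built from the training documents; note that its files must be in the root directory
    ldaOption.dfile = "testScale";       // test document file
    Inferencer inferencer = new Inferencer();
    inferencer.init(ldaOption);
    Model newModel = inferencer.inference();
    newModel.saveModelTheta("./vector/test/testScale"); // where the new document vectors are saved
}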
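For intuition about what "sampling only the new documents" means, the sketch below computes the topic weights for one word of a new document, using the commonly described collapsed Gibbs update with the trained word-topic counts held fixed. It is a simplified illustration and not JGibbsLDA's source: the class name, method names, and parameter layout are hypothetical, and the caller is assumed to have already removed the current word's own assignment from the new-document counts.

```java
import java.util.Random;

/** Hypothetical sketch of one Gibbs sampling step during inference. */
final class InferenceStep {
    /**
     * Unnormalized weight of each topic for one word of a new document:
     * (trained + new word-topic count + beta) / (trained + new topic total + V * beta)
     * multiplied by (new document-topic count + alpha).
     */
    static double[] topicWeights(int[] trainWordTopic, int[] trainTopicTotal,
                                 int[] newWordTopic, int[] newTopicTotal,
                                 int[] newDocTopic, int vocabSize,
                                 double alpha, double beta) {
        int numTopics = trainTopicTotal.length;
        double[] weights = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
            double wordGivenTopic = (trainWordTopic[k] + newWordTopic[k] + beta)
                    / (trainTopicTotal[k] + newTopicTotal[k] + vocabSize * beta);
            weights[k] = wordGivenTopic * (newDocTopic[k] + alpha);
        }
        return weights;
    }

    /** Draws a topic index with probability proportional to its weight. */
    static int sample(double[] weights, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double u = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int k = 0; k < weights.length; k++) {
            cumulative += weights[k];
            if (u < cumulative) {
                return k;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```

Because the trained counts only ever appear as fixed terms, the learned topics do not drift while the new documents are being assigned topics.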
The generated test document vector file looks like this (only a few rows are shown):
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;0.001765650080256822;0.004975922953451044;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.4158908507223114;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.07078651685393259;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.0033707865168539327;1.6051364365971107E-4;0.09486356340288925;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.38218298555377206;1.6051364365971107E-4;1.6051364365971
107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.006581059390048154;
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;0.06335379892555641;3.8372985418265546E-5;4.22102839600921E-4;0.24102072141212588;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;8.058326937835764E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;0.6876822716807367;3.8372985418265546E-5;0.0011895625479662318;
3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;
These rows have the same meaning as the training document vectors shown earlier.
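Since the training and test files share this layout, one small reader can serve both. The sketch below parses a single row into a class index and a topic vector. It assumes each vector sits on one physical line in the real file (the samples here are wrapped for display), that fields are separated by `;`, and that the label is a one-hot prefix; the class name `LabeledVector` is hypothetical.

```java
/** Hypothetical reader for one row: a one-hot class prefix followed by topic probabilities. */
final class LabeledVector {
    final int label;       // zero-based position of the 1 in the one-hot prefix, or -1 if absent
    final double[] topics; // the document's topic distribution

    LabeledVector(int label, double[] topics) {
        this.label = label;
        this.topics = topics;
    }

    static LabeledVector parse(String line, int numClasses) {
        // String.split drops trailing empty strings, so a trailing ';' is harmless.
        String[] fields = line.trim().split(";");
        if (fields.length <= numClasses) {
            throw new IllegalArgumentException("row has no topic columns: " + line);
        }
        int label = -1;
        for (int i = 0; i < numClasses; i++) {
            if (Double.parseDouble(fields[i]) == 1.0) {
                label = i;
                break;
            }
        }
        double[] topics = new double[fields.length - numClasses];
        for (int i = 0; i < topics.length; i++) {
            topics[i] = Double.parseDouble(fields[numClasses + i]);
        }
        return new LabeledVector(label, topics);
    }
}
```

With the format described in this post, `LabeledVector.parse(row, 20)` would yield a label between 0 and 19 and a 100-element topic vector.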
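With these files in hand, we can feed them to the JOONE neural network classifier (the simple three-layer 100*300*20 BP network) for classification.
The classification results are as follows:
Of the 121 test cases, 100 were classified correctly, an accuracy of about 81%. I find this result acceptable. It may well be worse than a simple tf-idf + SVM model, but the point of this experiment was to find out whether LDA-based dimensionality reduction is viable for classification, so with document vectors of only 100 dimensions I consider 81% barely acceptable.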