
Naive Bayes Classifier in OpenNLP

The OpenNLP project of the Apache Software Foundation is a machine learning toolkit for text analytics.

For many years, OpenNLP did not carry a Naive Bayes classifier implementation.

OpenNLP has finally included a Naive Bayes classifier implementation in the trunk (it is not yet available in a stable release).

Naive Bayes classifiers are very useful when there is little to no labelled data available.

Labelled data is usually needed in large quantities to train classifiers.

However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data.  Unlabelled data is usually far easier to get your hands on, and far cheaper to collect, than labelled data.  The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labeled and Unlabeled Documents using EM” by Kamal Nigam et al.
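
To make the idea concrete, here is a rough sketch of that bootstrapping loop against the OpenNLP doccat API.  The shape of this code is my own illustration, not anything OpenNLP ships: the class and method names, the fixed iteration count, and the use of hard best-category assignments are all simplifications (Nigam et al. actually weight unlabelled documents by their class probabilities, which is the proper EM formulation).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

public class NaiveBayesBootstrapper {

  // `labelled` is a small seed of hand-labelled samples; `unlabelled` holds
  // tokenized documents without categories; `params` selects the Naive Bayes
  // trainer (see the tutorial below).
  public static DoccatModel bootstrap(List<DocumentSample> labelled,
                                      List<String[]> unlabelled,
                                      TrainingParameters params) throws IOException {
    // Initial model trained on the labelled seed alone.
    DoccatModel model = train(labelled, params);

    for (int i = 0; i < 10; i++) {
      DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
      List<DocumentSample> expanded = new ArrayList<>(labelled);

      // E-step (hard version): label each unlabelled document with the
      // current model's best guess.
      for (String[] tokens : unlabelled) {
        double[] probs = categorizer.categorize(tokens);
        expanded.add(new DocumentSample(categorizer.getBestCategory(probs), tokens));
      }

      // M-step: retrain on the seed plus the self-labelled documents.
      model = train(expanded, params);
    }
    return model;
  }

  private static DoccatModel train(List<DocumentSample> samples,
                                   TrainingParameters params) throws IOException {
    return DocumentCategorizerME.train("en",
        ObjectStreamUtils.createObjectStream(samples.toArray(new DocumentSample[0])),
        params);
  }
}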

So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.

Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.

Tutorial

Training a Naive Bayes classifier is a lot like training a maximum entropy classifier.  In fact, you still have to use the DocumentCategorizerME class to do it.

But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.

Here is some code for training a classifier (adapted from the OpenNLP manual), in this case the default maximum entropy classifier.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  // With no training parameters supplied, this trains a maxent model by default!
  model = DocumentCategorizerME.train("en", sampleStream);
}
catch (IOException e) {
  // Failed to read or parse the training data, so training failed.
  e.printStackTrace();
}
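
Once trained, the model is used the same way whatever the underlying algorithm.  A minimal sketch; the whitespace tokenization here is just a stand-in for a real tokenizer:

// Categorize a new document with the trained model. The whitespace split
// is a stand-in; in practice, tokenize the same way the training data was.
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
String[] tokens = "what a wonderful , heartwarming film .".split("\\s+");
double[] outcomes = categorizer.categorize(tokens);
System.out.println("Best category: " + categorizer.getBestCategory(outcomes));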

Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.

// In addition to the imports above:
import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
import opennlp.tools.util.TrainingParameters;

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  TrainingParameters params = new TrainingParameters();
  // A cutoff of 0 keeps every feature, however rare.
  params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
  // This is the parameter that selects the Naive Bayes trainer
  // instead of the default maxent trainer.
  params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

  model = DocumentCategorizerME.train("en", sampleStream, params);
}
catch (IOException e) {
  // Failed to read or parse the training data, so training failed.
  e.printStackTrace();
}
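
Whichever trainer you choose, the resulting DoccatModel is saved and reloaded like any other OpenNLP model.  A quick sketch; the file name is my own placeholder:

// Persist the trained model to disk, then reload it for later use.
// Also import java.io.BufferedOutputStream, java.io.FileOutputStream,
// and java.io.OutputStream.
try (OutputStream modelOut =
         new BufferedOutputStream(new FileOutputStream("en-sentiment.bin"))) {
  model.serialize(modelOut);
}
try (InputStream modelIn = new FileInputStream("en-sentiment.bin")) {
  DoccatModel restored = new DoccatModel(modelIn);
  DocumentCategorizerME categorizer = new DocumentCategorizerME(restored);
}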

Evaluation

I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).
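
As a side note, accuracies like the ones below can be computed with OpenNLP's own DocumentCategorizerEvaluator.  A minimal sketch, assuming a held-out split in a file I'll call en-sentiment.test:

// Evaluate a trained model on a held-out test set and report its accuracy.
// Also import opennlp.tools.doccat.DocumentCategorizerEvaluator.
try (InputStream testIn = new FileInputStream("en-sentiment.test")) {
  ObjectStream<String> lines = new PlainTextByLineStream(testIn, "UTF-8");
  ObjectStream<DocumentSample> testStream = new DocumentSampleStream(lines);

  DocumentCategorizerEvaluator evaluator =
      new DocumentCategorizerEvaluator(new DocumentCategorizerME(model));
  evaluator.evaluate(testStream);
  System.out.printf("Accuracy: %.2f%%%n", evaluator.getAccuracy() * 100);
}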

Here are the numbers.

1. Subjectivity Classification

I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:

Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%

2. Sentiment Polarity Classification

For this experiment I used the Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).

With 350 of each as training and the rest as test, I get:

Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%