
Naive Bayes Classifier in OpenNLP

The OpenNLP project of the Apache Software Foundation is a machine learning toolkit for text analytics.

For many years, OpenNLP did not carry a Naive Bayes classifier implementation.

OpenNLP has finally included a Naive Bayes classifier implementation in the trunk (it is not yet available in a stable release).

Naive Bayes classifiers are very useful when there is little to no labelled data available.

Labelled data is usually needed in large quantities to train classifiers.

However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data.  Unlabelled data is usually far easier to get your hands on, and far cheaper to collect, than labelled data.  The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labeled and Unlabeled Documents using EM” by Kamal Nigam et al.
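
To make the idea concrete, here is a rough sketch of that bootstrapping loop against the OpenNLP doccat API.  The shape of this code is my own illustration, not anything OpenNLP ships: the class and method names, the fixed iteration count, and the use of hard best-category assignments are all simplifications (Nigam et al. actually weight unlabelled documents by their class probabilities, which is the proper EM formulation).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

public class NaiveBayesBootstrapper {

  // `labelled` is a small seed of hand-labelled samples; `unlabelled` holds
  // tokenized documents without categories; `params` selects the Naive Bayes
  // trainer (see the tutorial below).
  public static DoccatModel bootstrap(List<DocumentSample> labelled,
                                      List<String[]> unlabelled,
                                      TrainingParameters params) throws IOException {
    // Initial model trained on the labelled seed alone.
    DoccatModel model = train(labelled, params);

    for (int i = 0; i < 10; i++) {
      DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
      List<DocumentSample> expanded = new ArrayList<>(labelled);

      // E-step (hard version): label each unlabelled document with the
      // current model's best guess.
      for (String[] tokens : unlabelled) {
        double[] probs = categorizer.categorize(tokens);
        expanded.add(new DocumentSample(categorizer.getBestCategory(probs), tokens));
      }

      // M-step: retrain on the seed plus the self-labelled documents.
      model = train(expanded, params);
    }
    return model;
  }

  private static DoccatModel train(List<DocumentSample> samples,
                                   TrainingParameters params) throws IOException {
    return DocumentCategorizerME.train("en",
        ObjectStreamUtils.createObjectStream(samples.toArray(new DocumentSample[0])),
        params);
  }
}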

So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.

Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.

Tutorial

Training a Naive Bayes classifier is a lot like training a maximum entropy classifier.  In fact, you still have to use the DocumentCategorizerME class to do it.

But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.

Here is some code for training a classifier (adapted from the OpenNLP manual), in this case the default maximum entropy classifier.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  // With no training parameters supplied, this trains a maxent model by default!
  model = DocumentCategorizerME.train("en", sampleStream);
}
catch (IOException e) {
  // Failed to read or parse the training data, so training failed.
  e.printStackTrace();
}
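
Once trained, the model is used the same way whatever the underlying algorithm.  A minimal sketch; the whitespace tokenization here is just a stand-in for a real tokenizer:

// Categorize a new document with the trained model. The whitespace split
// is a stand-in; in practice, tokenize the same way the training data was.
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
String[] tokens = "what a wonderful , heartwarming film .".split("\\s+");
double[] outcomes = categorizer.categorize(tokens);
System.out.println("Best category: " + categorizer.getBestCategory(outcomes));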

Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.

// In addition to the imports above:
import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
import opennlp.tools.util.TrainingParameters;

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  TrainingParameters params = new TrainingParameters();
  // A cutoff of 0 keeps every feature, however rare.
  params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
  // This is the parameter that selects the Naive Bayes trainer
  // instead of the default maxent trainer.
  params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

  model = DocumentCategorizerME.train("en", sampleStream, params);
}
catch (IOException e) {
  // Failed to read or parse the training data, so training failed.
  e.printStackTrace();
}
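
Whichever trainer you choose, the resulting DoccatModel is saved and reloaded like any other OpenNLP model.  A quick sketch; the file name is my own placeholder:

// Persist the trained model to disk, then reload it for later use.
// Also import java.io.BufferedOutputStream, java.io.FileOutputStream,
// and java.io.OutputStream.
try (OutputStream modelOut =
         new BufferedOutputStream(new FileOutputStream("en-sentiment.bin"))) {
  model.serialize(modelOut);
}
try (InputStream modelIn = new FileInputStream("en-sentiment.bin")) {
  DoccatModel restored = new DoccatModel(modelIn);
  DocumentCategorizerME categorizer = new DocumentCategorizerME(restored);
}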

Evaluation

I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).
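
As a side note, accuracies like the ones below can be computed with OpenNLP's own DocumentCategorizerEvaluator.  A minimal sketch, assuming a held-out split in a file I'll call en-sentiment.test:

// Evaluate a trained model on a held-out test set and report its accuracy.
// Also import opennlp.tools.doccat.DocumentCategorizerEvaluator.
try (InputStream testIn = new FileInputStream("en-sentiment.test")) {
  ObjectStream<String> lines = new PlainTextByLineStream(testIn, "UTF-8");
  ObjectStream<DocumentSample> testStream = new DocumentSampleStream(lines);

  DocumentCategorizerEvaluator evaluator =
      new DocumentCategorizerEvaluator(new DocumentCategorizerME(model));
  evaluator.evaluate(testStream);
  System.out.printf("Accuracy: %.2f%%%n", evaluator.getAccuracy() * 100);
}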

Here are the numbers.

1. Subjectivity Classification

I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:

Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%

2. Sentiment Polarity Classification

For this experiment I used the Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).

With 350 of each as training and the rest as test, I get:

Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%