Package org.apache.mahout.classifier.bayes

Introduction

Class Summary
TestClassifier -- Test the Naive Bayes classifier with improved weighting. To run the twenty newsgroups example, refer to http://cwiki.apache.org/MAHOUT/twentynewsgroups.html
TrainClassifier -- Train the Naive Bayes classifier with improved weighting.
 

Package org.apache.mahout.classifier.bayes Description

Introduction

This package provides an implementation of a MapReduce-enabled Naïve Bayes classifier. It is a very simple classifier: it counts how often each word occurs in association with each label, and those counts are then used to estimate the likelihood that a new document, given its words, should be assigned a particular label.
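
The counting idea described above can be illustrated with a minimal in-memory sketch. It is not the code in this package: the class name NaiveBayesSketch, its method names, and the add-one smoothing are assumptions made for illustration, and the MapReduce job structure and the package's own weighting are left out entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny in-memory Naive Bayes, not Mahout's implementation.
class NaiveBayesSketch {
    // label -> (word -> number of times the word was seen with that label)
    private final Map<String, Map<String, Integer>> wordCounts =
            new HashMap<String, Map<String, Integer>>();
    // label -> total number of words seen with that label
    private final Map<String, Integer> totalWords = new HashMap<String, Integer>();
    // label -> number of training documents with that label
    private final Map<String, Integer> docCounts = new HashMap<String, Integer>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int totalDocs = 0;

    // Count every word of one training document under its label.
    void train(String label, List<String> words) {
        Map<String, Integer> counts = wordCounts.get(label);
        if (counts == null) {
            counts = new HashMap<String, Integer>();
            wordCounts.put(label, counts);
        }
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    // log P(label) + sum of log P(word | label), using add-one smoothing
    // (the smoothing is an assumption of this sketch).
    double logScore(String label, List<String> words) {
        Map<String, Integer> counts = wordCounts.get(label);
        int total = totalWords.getOrDefault(label, 0);
        double score = Math.log((double) docCounts.getOrDefault(label, 0) / totalDocs);
        for (String w : words) {
            int c = (counts == null) ? 0 : counts.getOrDefault(w, 0);
            score += Math.log((c + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    // Return the label with the highest log score, or null if nothing was trained.
    String classify(List<String> words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            double s = logScore(label, words);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }
}
```

Scores are kept in log space so that multiplying many small probabilities does not underflow. Trained on one hockey document and one football document, the sketch assigns a new document to whichever label saw its words more often relative to that label's total word count.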

Implementation

The implementation is divided into three parts:

  1. The Trainer -- responsible for counting the words and the labels
  2. The Model -- responsible for holding the training data in a useful way
  3. The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents

The Trainer

The trainer is manifested in several classes:

  1. org.apache.mahout.classifier.bayes.BayesDriver -- Creates the Hadoop Naive Bayes job and outputs the model. This driver encapsulates many intermediate MapReduce classes.
  2. org.apache.mahout.classifier.bayes.common.BayesFeatureDriver
  3. org.apache.mahout.classifier.bayes.common.BayesTfIdfDriver
  4. org.apache.mahout.classifier.bayes.common.BayesWeightSummerDriver
  5. org.apache.mahout.classifier.bayes.BayesThetaNormalizerDriver

The trainer assumes that the input files are in the KeyValueTextInputFormat, i.e. the first token on each line is the label, and it is separated from the remaining tokens on the line by a tab. The remaining tokens are the unique features (words). Thus, input documents might look like:

 hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
 football field football pigskin referee helmet turf tackle
 

where hockey and football are the labels and the remaining words are the features associated with those particular labels.
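
As a rough illustration of this line format, the following sketch splits one input line into its label and its features. The class LabeledLine and its parse method are hypothetical names introduced here; they are not part of this package, and the real trainer relies on Hadoop's input format rather than on code like this. Treating a line without a tab as a label with no features is likewise a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only: one parsed training line.
class LabeledLine {
    final String label;
    final List<String> features;

    LabeledLine(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    // Split "<label>\t<feature> <feature> ..." at the first tab.
    static LabeledLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No tab: this sketch treats the whole line as a label with no features.
            return new LabeledLine(line, Collections.<String>emptyList());
        }
        String label = line.substring(0, tab);
        String rest = line.substring(tab + 1).trim();
        // Features are separated by runs of whitespace.
        List<String> features = rest.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(rest.split("\\s+"));
        return new LabeledLine(label, features);
    }
}
```

Splitting only at the first tab means that any later tabs simply act as whitespace between features.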

The output from the trainer is a SequenceFile.

The Model

The org.apache.mahout.classifier.bayes.BayesModel is the data structure used to represent the results of training for use by the org.apache.mahout.classifier.bayes.BayesClassifier. A model can be created by hand or, if using the org.apache.mahout.classifier.bayes.BayesDriver, from the SequenceFile that the driver outputs. To create it from the SequenceFile, use the SequenceFileModelReader located in the io subpackage.

The Classifier

The org.apache.mahout.classifier.bayes.BayesClassifier is responsible for using an org.apache.mahout.classifier.bayes.BayesModel to classify documents into categories.



Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.