Package org.apache.mahout.classifier.bayes

Introduction


Interface Summary
Algorithm -- The interface for implementing variations of the Bayes algorithm
Datastore -- The Datastore interface for the Algorithm to use
 

Class Summary
BayesAlgorithm -- Class implementing the Naive Bayes classifier algorithm
BayesParameters -- Used for passing parameters to the Map/Reduce jobs; parameters include gramSize,
ByScoreLabelResultComparator -- Compares two classification results and returns the lowest valued one
CBayesAlgorithm -- Class implementing the Complementary Naive Bayes classifier algorithm
ClassifierContext -- The classifier wrapper used for choosing the Algorithm and Datastore
InMemoryBayesDatastore -- Class implementing the Datastore that lets Algorithms read an in-memory model
SequenceFileModelReader -- Reads the different interim files created during the training stage, as well as the model file during testing
TestClassifier -- Tests the Naive Bayes classifier with improved weighting. To run the twenty newsgroups example, refer to http://cwiki.apache.org/MAHOUT/twentynewsgroups.html
TrainClassifier -- Trains the Naive Bayes classifier with improved weighting
 

Exception Summary
InvalidDatastoreException -- Thrown when an illegal access is made on the datastore or when the backend storage goes down
 

Package org.apache.mahout.classifier.bayes Description

Introduction

This package provides an implementation of a MapReduce-enabled Naïve Bayes classifier. It is a very simple classifier that counts the occurrences of words in association with a label. Those counts can then be used to estimate the likelihood that a new document, given its words, should be assigned a particular label.
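
To make the counting idea concrete, here is a minimal single-JVM sketch of a plain multinomial Naive Bayes classifier with add-one smoothing. It is not the code in this package: the class name NaiveBayesCountSketch and all of its methods are hypothetical, and it leaves out MapReduce, the TF-IDF weighting, and the normalization steps that the drivers in this package perform.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of org.apache.mahout.classifier.bayes.
class NaiveBayesCountSketch {
    // label -> (word -> number of times the word was seen with that label)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // label -> total number of word occurrences seen with that label
    private final Map<String, Integer> totalWords = new HashMap<>();
    // label -> number of training documents carrying that label
    private final Map<String, Integer> docCounts = new HashMap<>();
    // every distinct word seen during training
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Count the words of one training document under its label.
    void train(String label, String[] words) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    // log P(label) + sum over words of log P(word | label), with add-one smoothing.
    double logScore(String label, String[] words) {
        Map<String, Integer> counts = wordCounts.getOrDefault(label, new HashMap<>());
        int total = totalWords.getOrDefault(label, 0);
        double score = Math.log((double) docCounts.getOrDefault(label, 0) / totalDocs);
        for (String word : words) {
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    // Return the label with the highest score, or null if nothing was trained.
    String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = logScore(label, words);
            if (best == null || score > bestScore) {
                best = label;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesCountSketch nb = new NaiveBayesCountSketch();
        nb.train("hockey", "puck stick goalie ice slapshot".split(" "));
        nb.train("football", "field pigskin turf tackle".split(" "));
        System.out.println(nb.classify(new String[] {"puck", "ice"}));
    }
}
```

Scores are summed in log space so that long documents do not underflow to zero, and the add-one term keeps a word that was never seen with a label from forcing that label's probability to zero.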

Implementation

The implementation is divided up into three parts:

  1. The Trainer -- responsible for doing the counting of the words and the labels
  2. The Model -- responsible for holding the training data in a useful way
  3. The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents

The Trainer

The trainer is implemented in several classes:

  1. BayesDriver -- Creates the Hadoop Naive Bayes job and outputs the model. This driver encapsulates many intermediate Map-Reduce classes
  2. BayesFeatureDriver
  3. BayesTfIdfDriver
  4. BayesWeightSummerDriver
  5. BayesThetaNormalizerDriver

The trainer assumes that the input files are in the KeyValueTextInputFormat, i.e. the first token on each line is the label, and it is separated from the remaining tokens on the line by a tab. The remaining tokens are the unique features (words). Thus, input documents might look like:

 hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
 football field football pigskin referee helmet turf tackle
 

where hockey and football are the labels and the remaining words are the features associated with those particular labels.
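
As a rough illustration of that line layout, the sketch below splits one input line into its label and features. It uses only the Java standard library rather than Hadoop's KeyValueTextInputFormat, and the class LabeledLineSketch is a hypothetical name introduced here. It assumes the features after the tab are separated by whitespace, and it rejects lines that contain no tab, which is a choice made for this sketch only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of this package or of Hadoop.
class LabeledLineSketch {
    final String label;
    final List<String> features;

    LabeledLineSketch(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    // Split "label<TAB>feature feature ..." at the first tab character.
    static LabeledLineSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Sketch-only choice: treat a line without a tab as malformed.
            throw new IllegalArgumentException("no tab separator in line: " + line);
        }
        String label = line.substring(0, tab);
        String rest = line.substring(tab + 1).trim();
        List<String> features = rest.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(rest.split("\\s+"));
        return new LabeledLineSketch(label, features);
    }

    public static void main(String[] args) {
        LabeledLineSketch doc = parse("hockey\tpuck stick goalie");
        System.out.println(doc.label + " -> " + doc.features);
    }
}
```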

The output from the trainer is a SequenceFile.



Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.