Interface Summary | |
---|---|
Algorithm | The Algorithm interface for implementing variations of the Bayes algorithm |
Datastore | The Datastore interface for the Algorithm to use |
Class Summary | |
---|---|
BayesAlgorithm | Class implementing the Naive Bayes Classifier Algorithm |
BayesParameters | BayesParameters is used for passing parameters to the Map/Reduce jobs. Parameters include gramSize, |
ByScoreLabelResultComparator | Compare two results of classification and return the lowest valued one |
CBayesAlgorithm | Class implementing the Complementary Naive Bayes Classifier Algorithm |
ClassifierContext | The Classifier Wrapper used for choosing the Algorithm and Datastore |
InMemoryBayesDatastore | Class implementing the Datastore for Algorithms to read In-Memory model |
SequenceFileModelReader | This Class reads the different interim files created during the Training stage as well as the Model File during testing. |
TestClassifier | Test the Naive Bayes classifier with improved weighting. To run the twenty newsgroups example, refer to http://cwiki.apache.org/MAHOUT/twentynewsgroups.html |
TrainClassifier | Train the Naive Bayes classifier with improved weighting. |
Exception Summary | |
---|---|
InvalidDatastoreException | Exception thrown when the datastore is accessed illegally or when the backend storage goes down. |
This package provides an implementation of a MapReduce-enabled Naïve Bayes classifier. It is a very simple classifier that counts the occurrences of words in association with a label. These counts can then be used to determine the likelihood that a new document, and its words, should be assigned a particular label.
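The counting idea can be illustrated with a minimal, single-machine sketch. This is not the package's implementation: the class name `WordCountBayesSketch`, its methods, and the use of add-one (Laplace) smoothing are all assumptions made for illustration, and it ignores MapReduce and the improved weighting mentioned for the trainer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not a class from this package. */
class WordCountBayesSketch {
    // label -> (word -> number of times the word was seen with that label)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // label -> total number of words seen with that label
    private final Map<String, Integer> totalWords = new HashMap<>();
    // label -> number of training documents with that label
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Counts the words of one training document under its label. */
    void train(String label, String[] words) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** log P(label) + sum of log P(word | label), with add-one smoothing (an assumption). */
    double logScore(String label, String[] words) {
        Map<String, Integer> counts = wordCounts.getOrDefault(label, new HashMap<>());
        int total = totalWords.getOrDefault(label, 0);
        double score = Math.log((double) docCounts.getOrDefault(label, 0) / totalDocs);
        for (String w : words) {
            int c = counts.getOrDefault(w, 0);
            score += Math.log((c + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the trained label with the highest score, or null if nothing was trained. */
    String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            double s = logScore(label, words);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        WordCountBayesSketch nb = new WordCountBayesSketch();
        nb.train("hockey", "puck stick goalie forward defenseman referee ice checking slapshot helmet".split(" "));
        nb.train("football", "field football pigskin referee helmet turf tackle".split(" "));
        System.out.println(nb.classify(new String[] {"puck", "ice"}));
    }
}
```

Scores are summed in log space so that long documents do not underflow to zero; smoothing keeps a word that was never seen with a label from forcing that label's score to negative infinity.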
The implementation is divided up into three parts:
The trainer is manifested in several classes:
BayesDriver -- Creates the Hadoop Naive Bayes job and outputs the model. This Driver encapsulates a lot of intermediate Map-Reduce Classes
BayesFeatureDriver
BayesTfIdfDriver
BayesWeightSummerDriver
BayesThetaNormalizerDriver
The trainer assumes that the input files are in the KeyValueTextInputFormat, i.e. the first token of the line is the label, and it is separated from the remaining tokens on the line by a tab. The remaining tokens are the unique features (words). Thus, input documents might look like:
hockey	puck stick goalie forward defenseman referee ice checking slapshot helmet
football	field football pigskin referee helmet turf tackle
where hockey and football are the labels and the remaining words are the features associated with those particular labels.
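As a rough sketch of how one line in this format breaks down, the following splits a line at its first tab into a label and a list of features. It uses plain Java rather than Hadoop's KeyValueTextInputFormat, and the class name `LabeledLineSketch` and its `parse` method are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not a class from this package. */
class LabeledLineSketch {
    final String label;
    final List<String> features;

    LabeledLineSketch(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    /** Splits "label<TAB>feature feature ..." at the first tab. */
    static LabeledLineSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected label<TAB>features: " + line);
        }
        String label = line.substring(0, tab).trim();
        String rest = line.substring(tab + 1).trim();
        List<String> features = new ArrayList<>();
        if (!rest.isEmpty()) {
            // Features are assumed to be separated by whitespace.
            features.addAll(Arrays.asList(rest.split("\\s+")));
        }
        return new LabeledLineSketch(label, features);
    }

    public static void main(String[] args) {
        LabeledLineSketch doc = parse("football\tfield football pigskin referee helmet turf tackle");
        System.out.println(doc.label + " -> " + doc.features);
    }
}
```

Splitting at the first tab only means that a feature may repeat the label text, as "football" does in the second example line, without being mistaken for the label.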
The output from the trainer is a SequenceFile.