Package org.apache.mahout.classifier.bayes

The Bayes example package provides some helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data.

Interface Summary
SplitBayesInput.SplitCallback Used to pass information back to a caller once a file has been split without the need for a data object
 

Class Summary
PrepareTwentyNewsgroups Prepare the 20 Newsgroups files for training using the BayesFileFormatter.
SplitBayesInput A utility for splitting files in the input format used by the Bayes classifiers into training and test sets in order to perform cross-validation.
WikipediaDatasetCreatorDriver Create and run the Wikipedia Dataset Creator.
WikipediaDatasetCreatorMapper Maps over the Wikipedia XML format and outputs every document that has a category listed in the input category file.
WikipediaDatasetCreatorReducer Can also be used as a local Combiner
WikipediaXmlSplitter Splits the Wikipedia XML file into chunks of the size specified by a command-line parameter.
XmlInputFormat Reads records that are delimited by a specific begin/end tag.
XmlInputFormat.XmlRecordReader Reads through a given XML document and outputs, as records, the XML blocks delimited by the specified start tag and end tag.
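
SplitBayesInput is described above as splitting classifier input files into training and test sets. The following is a minimal sketch of that general idea in plain Java, not the Mahout implementation: the class name TrainTestSplit, the method split, and the testFraction and seed parameters are all hypothetical names chosen for illustration, and the input is assumed to be one example per line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; this is not part of org.apache.mahout.
public class TrainTestSplit {

    /** Holds the two halves of a split. */
    static final class Split {
        final List<String> training;
        final List<String> test;

        Split(List<String> training, List<String> test) {
            this.training = training;
            this.test = test;
        }
    }

    /**
     * Shuffles the input lines with a fixed seed and moves the first
     * round(size * testFraction) shuffled lines into the test set.
     */
    static Split split(List<String> lines, double testFraction, long seed) {
        if (testFraction < 0.0 || testFraction > 1.0) {
            throw new IllegalArgumentException("testFraction must be in [0, 1]");
        }
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<String> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<String> training = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return new Split(training, test);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            lines.add("label" + (i % 2) + "\tdocument " + i);
        }
        Split s = split(lines, 0.2, 42L);
        System.out.println("training=" + s.training.size() + " test=" + s.test.size());
    }
}
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing classifier results across experiments.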
 

Package org.apache.mahout.classifier.bayes Description

The Bayes example package provides some helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data. See PrepareTwentyNewsgroups for details on running the trainer and formatting the Twenty Newsgroups data properly for the training.

The easiest way to prepare the data is to use the ant task in core/build.xml:

ant extract-20news-18828

This runs the arg line:

-p ${working.dir}/20news-18828/ -o ${working.dir}/20news-18828-collapse -a ${analyzer} -c UTF-8

To run the Wikipedia examples (assumes you've built the Mahout Job jar):

  1. Download the Wikipedia Dataset. Use the Ant target: ant enwiki-files
  2. Chunk the data using the WikipediaXmlSplitter (from the Hadoop home): bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.x org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64
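
The class summary describes XmlInputFormat as reading records delimited by a specific begin/end tag. The sketch below illustrates that idea on an in-memory string using only the Java standard library. It is not the Mahout or Hadoop code: the class TagDelimitedRecords and its extract method are hypothetical, and the <page> tag pair is assumed here purely as an example of a record delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of org.apache.mahout.
public class TagDelimitedRecords {

    /**
     * Returns every block that begins with startTag and ends with the next
     * endTag, tags included. An unterminated trailing block is ignored.
     */
    static List<String> extract(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int begin = xml.indexOf(startTag, from);
            if (begin < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, begin + startTag.length());
            if (end < 0) {
                break; // start tag without a matching end tag
            }
            end += endTag.length();
            records.add(xml.substring(begin, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A</title></page>\n"
                   + "<page><title>B</title></page></mediawiki>";
        for (String record : extract(xml, "<page>", "</page>")) {
            System.out.println(record);
        }
    }
}
```

A real input format would additionally have to handle records that straddle chunk or split boundaries, which this in-memory sketch does not attempt.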



Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.