Package org.apache.mahout.classifier.bayes

The Bayes example package provides some helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data.

Interface Summary
SplitBayesInput.SplitCallback - Used to pass information back to a caller once a file has been split, without the need for a data object.

Class Summary
PrepareTwentyNewsgroups - Prepares the 20 Newsgroups files for training using the BayesFileFormatter.
SplitBayesInput - A utility for splitting files in the input format used by the Bayes classifiers into training and test sets in order to perform cross-validation.
WikipediaDatasetCreatorDriver - Creates and runs the Wikipedia Dataset Creator.
WikipediaDatasetCreatorMapper - Maps over the Wikipedia XML format and outputs every document that has a category listed in the input category file.
WikipediaDatasetCreatorOutputFormat - Extends MultipleOutputFormat, allowing the output data to be written to different output files in SequenceFile output format.
WikipediaDatasetCreatorReducer - Reducer for the Wikipedia dataset creator; can also be used as a local Combiner.
WikipediaXmlSplitter - Splits the Wikipedia XML file into chunks of the size specified by a command-line parameter.
XmlInputFormat - Reads records that are delimited by a specific begin/end tag.
XmlInputFormat.XmlRecordReader - Record reader that steps through a given XML document and emits the XML blocks delimited by the configured start and end tags as records.

Package org.apache.mahout.classifier.bayes Description

The Bayes example package provides some helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data. See PrepareTwentyNewsgroups for details on running the trainer and on formatting the Twenty Newsgroups data properly for training.
The easiest way to prepare the data is to use the ant task in core/build.xml:
    ant extract-20news-18828
  
This runs the preparation with the following argument line:
    -p ${working.dir}/20news-18828/ -o ${working.dir}/20news-18828-collapse -a ${analyzer} -c UTF-8
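For reference, the same preparation step can be driven directly from Java. The sketch below is a minimal example assuming PrepareTwentyNewsgroups exposes a main(String[]) entry point; the paths and the Lucene StandardAnalyzer stand in for the ${working.dir} and ${analyzer} Ant properties above and are placeholders only.

    // Minimal sketch, not the canonical driver: passes the same flags the Ant target uses.
    // Assumes PrepareTwentyNewsgroups has a main(String[]) entry point; all paths and the
    // analyzer class are placeholders to adjust for your environment.
    public class PrepareNewsgroupsExample {
      public static void main(String[] args) throws Exception {
        org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.main(new String[] {
            "-p", "/path/to/20news-18828/",           // parent directory of the raw newsgroup folders
            "-o", "/path/to/20news-18828-collapse",   // output directory for the formatted files
            "-a", "org.apache.lucene.analysis.standard.StandardAnalyzer", // analyzer used to tokenize the text
            "-c", "UTF-8"                             // character set of the input files
        });
      }
    }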
  
 
To run the Wikipedia examples (this assumes you have built the Mahout Job jar):
  1. Download the Wikipedia Dataset. Use the Ant target: ant enwiki-files
  2. Chunk the data using the WikipediaXmlSplitter (run from the Hadoop home); a sketch of how the resulting chunks are read back as XML records follows this list:
    bin/hadoop jar <PATH TO MAHOUT>/target/mahout-examples-0.2 org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64
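The chunks written in step 2 are later read back as XML records via XmlInputFormat, which (per the class summary above) delimits records by a begin/end tag. Below is a minimal configuration sketch assuming the format reads its tags from the "xmlinput.start" and "xmlinput.end" keys; check the class constants before relying on these names.

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: the configuration keys and the <page> tags are assumptions inferred from the
    // XmlInputFormat description; consult the class source for the exact names it expects.
    public class WikipediaXmlConfigExample {
      public static Configuration withPageTags(Configuration conf) {
        conf.set("xmlinput.start", "<page>");  // assumed key naming the begin tag of a record
        conf.set("xmlinput.end", "</page>");   // assumed key naming the end tag of a record
        return conf;
      }
    }

The dataset-creator job would additionally register XmlInputFormat as its input format so that each <page>...</page> block reaches the mapper as a single record.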

Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.