Interface Summary | |
---|---|
SplitBayesInput.SplitCallback | Used to pass information back to a caller once a file has been split, without the need for a data object |
Class Summary | |
---|---|
PrepareTwentyNewsgroups | Prepares the 20 Newsgroups files for training using the BayesFileFormatter. |
SplitBayesInput | A utility for splitting files in the input format used by the Bayes classifiers into training and test sets in order to perform cross-validation. |
WikipediaDatasetCreatorDriver | Create and run the Wikipedia Dataset Creator. |
WikipediaDatasetCreatorMapper | Maps over the Wikipedia XML format and outputs all documents having a category listed in the input category file |
WikipediaDatasetCreatorReducer | Can also be used as a local Combiner |
WikipediaXmlSplitter | Splits the Wikipedia XML file into chunks of the size specified by a command-line parameter |
XmlInputFormat | Reads records that are delimited by a specific begin/end tag. |
XmlInputFormat.XmlRecordReader | XMLRecordReader class that reads through a given XML document and outputs XML blocks as records, as delimited by the start and end tags |
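The idea behind XmlInputFormat.XmlRecordReader — scanning a document and emitting every block between a start tag and an end tag as one record — can be sketched outside Hadoop. This is an illustrative simplification, not Mahout's implementation: the real reader works on byte streams within input splits, while this sketch just scans an in-memory string.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only: collect every block between startTag and endTag
 * (inclusive), the way XmlInputFormat.XmlRecordReader emits XML blocks as
 * records. The class and method names here are hypothetical.
 */
public class TagRecordScanner {
    public static List<String> records(String text, String startTag, String endTag) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) break;                        // no more records
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) break;                          // unterminated block: stop
            out.add(text.substring(start, end + endTag.length()));
            from = end + endTag.length();                // resume after this record
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<docs><page>a</page><page>b</page></docs>";
        // Prints each <page>…</page> block on its own line.
        for (String r : records(xml, "<page>", "</page>")) {
            System.out.println(r);
        }
    }
}
```

With `<page>` / `</page>` as the delimiters, each Wikipedia article element would become one record.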
The Bayes example package provides helper classes for training the Naive Bayes classifier
on the Twenty Newsgroups data. See PrepareTwentyNewsgroups
for details on running the trainer and on
formatting the Twenty Newsgroups data properly for training.
The easiest way to prepare the data is to use the ant task in core/build.xml:
ant extract-20news-18828
This runs the arg line:
-p ${working.dir}/20news-18828/ -o ${working.dir}/20news-18828-collapse -a ${analyzer} -c UTF-8
To run the Wikipedia examples (assuming you've built the Mahout Job jar):
ant enwiki-files
bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.x
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
-d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml
-o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64
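The `-c 64` argument above gives the chunk size WikipediaXmlSplitter should produce (interpreting the units as megabytes is an assumption here). The core loop — accumulate input until a size bound is reached, then start a new chunk — can be sketched as follows. This is a simplification: the real splitter also keeps each `<page>` element intact and writes valid XML headers around every chunk, which this sketch does not do.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only (hypothetical class name): cut a document into
 * chunks of at least chunkBytes bytes, breaking on line boundaries so no
 * line straddles two chunks.
 */
public class SizeChunker {
    public static List<String> chunk(String text, int chunkBytes) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n")) {
            current.append(line).append('\n');
            // Flush the current chunk once it reaches the size bound.
            if (current.toString().getBytes(StandardCharsets.UTF_8).length >= chunkBytes) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());   // remainder becomes the last chunk
        }
        return chunks;
    }
}
```

Each chunk can then be processed as an independent map input, which is what makes the subsequent WikipediaDatasetCreatorDriver job parallelizable.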