org.apache.mahout.classifier
Class BayesFileFormatter

java.lang.Object
  extended by org.apache.mahout.classifier.BayesFileFormatter

public final class BayesFileFormatter
extends java.lang.Object

Flattens a file into a format that can be read by the Bayes MapReduce (M/R) job.

Each output line holds one document: the first token is the label, followed by a tab, and the rest of the line contains the terms.
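
As an illustration of this line layout, the following minimal sketch builds one such line. The class BayesLineSketch and its toLine method are hypothetical and not part of Mahout. The single space between terms is an assumption; the description above only specifies the tab after the label.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Mahout: builds one line in the layout described above. */
final class BayesLineSketch {

    private BayesLineSketch() {
    }

    /**
     * Returns the label, a tab, then the terms.
     * Assumption: terms are separated by single spaces.
     */
    static String toLine(String label, List<String> terms) {
        return label + "\t" + String.join(" ", terms);
    }

    public static void main(String[] args) {
        // Prints the label "sports", a tab, then the three terms.
        System.out.println(toLine("sports", Arrays.asList("team", "wins", "final")));
    }
}
```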


Method Summary
static void collapse(java.lang.String label, org.apache.lucene.analysis.Analyzer analyzer, java.io.File inputDir, java.nio.charset.Charset charset, java.io.File outputFile)
          Collapse all the files in inputDir into a single file in the Bayes format, one document per line
static void format(java.lang.String label, org.apache.lucene.analysis.Analyzer analyzer, java.io.File input, java.nio.charset.Charset charset, java.io.File outDir)
          Write the input files to outDir, one output file per input file
static void main(java.lang.String[] args)
          Run the BayesFileFormatter from the command line
static java.lang.String[] readerToDocument(org.apache.lucene.analysis.Analyzer analyzer, java.io.Reader reader)
          Convert the contents of a Reader into an array of tokens
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

collapse

public static void collapse(java.lang.String label,
                            org.apache.lucene.analysis.Analyzer analyzer,
                            java.io.File inputDir,
                            java.nio.charset.Charset charset,
                            java.io.File outputFile)
                     throws java.io.IOException
Collapse all the files in inputDir into a single file in the Bayes format, one document per line

Parameters:
label - The label written as the first token of each output line
analyzer - The analyzer to use
inputDir - The input Directory
charset - The charset of the input files
outputFile - The file to collapse to
Throws:
java.io.IOException
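
The idea behind collapse can be sketched without Mahout or Lucene on the classpath. This is not the library's implementation: CollapseSketch is a hypothetical class, java.nio.file.Path replaces java.io.File, and a lowercase whitespace split stands in for the Analyzer. Sorting the input files and separating terms with single spaces are also assumptions made for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for illustration only; not Mahout's implementation. */
final class CollapseSketch {

    private CollapseSketch() {
    }

    static void collapse(String label, Path inputDir, Charset charset, Path outputFile)
            throws IOException {
        // Collect the regular files first and sort them so the output order is deterministic.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files);

        try (BufferedWriter out = Files.newBufferedWriter(outputFile, charset)) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), charset);
                // Assumption: lowercase + whitespace split replaces the Lucene Analyzer.
                String[] terms = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
                // One document per line: label, tab, then the terms.
                out.write(label);
                out.write('\t');
                out.write(String.join(" ", terms));
                out.write('\n');
            }
        }
    }
}
```

Because every file in the directory receives the same label, a caller would typically invoke such a method once per category directory, writing each category to its own output file.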

format

public static void format(java.lang.String label,
                          org.apache.lucene.analysis.Analyzer analyzer,
                          java.io.File input,
                          java.nio.charset.Charset charset,
                          java.io.File outDir)
                   throws java.io.IOException
Write the input files to outDir, one output file per input file

Parameters:
label - The label of the file
analyzer - The analyzer to use
input - The input file or directory. Must not be null
charset - The charset of the input files
outDir - The output directory. Files will be written there with the same name as the input file
Throws:
java.io.IOException

readerToDocument

public static java.lang.String[] readerToDocument(org.apache.lucene.analysis.Analyzer analyzer,
                                                  java.io.Reader reader)
                                           throws java.io.IOException
Convert the contents of a Reader into an array of tokens

Parameters:
analyzer - The Analyzer to use
reader - The reader to feed to the Analyzer
Returns:
An array of unique tokens
Throws:
java.io.IOException
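
The Reader-to-array contract can be illustrated with a plain-Java sketch. ReaderToTokensSketch is a hypothetical class, and a lowercase whitespace split again stands in for the Lucene Analyzer. The sketch takes the documented "unique tokens" literally and drops repeats; keeping tokens in first-occurrence order is an assumption, and the behavior of the real method may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for illustration only; not Mahout's implementation. */
final class ReaderToTokensSketch {

    private ReaderToTokensSketch() {
    }

    static String[] readerToTokens(Reader reader) throws IOException {
        // LinkedHashSet removes duplicates while preserving first-occurrence order.
        Set<String> tokens = new LinkedHashSet<>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            // Assumption: lowercase + whitespace split replaces the Analyzer.
            for (String token : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```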

main

public static void main(java.lang.String[] args)
                 throws java.lang.ClassNotFoundException,
                        java.lang.IllegalAccessException,
                        java.lang.InstantiationException,
                        java.io.IOException
Run the BayesFileFormatter from the command line

Parameters:
args - The command-line arguments. Run with -h to see the help
Throws:
java.lang.ClassNotFoundException - if the Analyzer can't be found
java.lang.IllegalAccessException - if the Analyzer can't be constructed
java.lang.InstantiationException - if the Analyzer can't be constructed
java.io.IOException - if the input or output files cannot be read or written


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.