org.apache.poi.hwpf.extractor
Class WordExtractor

java.lang.Object
  extended by org.apache.poi.hwpf.extractor.WordExtractor

public class WordExtractor
extends java.lang.Object

Extracts the text from a Word document. Use getParagraphText() or getText() unless you have a strong reason to use getTextFromPieces() instead.

Author:
Nick Burch (nick at torchbox dot com)

Constructor Summary
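WordExtractor(HWPFDocument doc)
          Create a new Word Extractor from an existing HWPFDocument
WordExtractor(java.io.InputStream is)
          Create a new Word Extractor from an InputStream containing the Word file
WordExtractor(POIFSFileSystem fs)
          Create a new Word Extractor from a POIFSFileSystem containing the Word file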
 
Method Summary
 java.lang.String[] getParagraphText()
          Get the text from the Word file, as an array with one String per paragraph
 java.lang.String getText()
          Grab the text, based on the paragraphs.
 java.lang.String getTextFromPieces()
          Grab the text out of the text pieces.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordExtractor

public WordExtractor(java.io.InputStream is)
              throws java.io.IOException
Create a new Word Extractor that reads the Word file from an InputStream.

Parameters:
is - InputStream containing the Word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(POIFSFileSystem fs)
              throws java.io.IOException
Create a new Word Extractor that reads the Word file from a POIFSFileSystem.

Parameters:
fs - POIFSFileSystem containing the Word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(HWPFDocument doc)
              throws java.io.IOException
Create a new Word Extractor for an already loaded HWPFDocument.

Parameters:
doc - The HWPFDocument to extract from
Throws:
java.io.IOException

Method Detail

getParagraphText

public java.lang.String[] getParagraphText()
Get the text from the Word file, as an array with one String per paragraph.
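
As a minimal sketch of consuming this array, the helper below takes a String[] of the kind getParagraphText() returns, strips trailing line terminators, and drops empty paragraphs. The class ParagraphCleaner and its clean method are hypothetical and not part of POI. That paragraphs may end in \r or \n is an assumption, not something this page states.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphCleaner {
    // Hypothetical helper, not part of POI.
    // Assumption: each paragraph string may carry a trailing \r and/or \n.
    public static List<String> clean(String[] paragraphs) {
        List<String> out = new ArrayList<>();
        for (String p : paragraphs) {
            if (p == null) {
                continue; // defensive: skip missing entries
            }
            // Walk back over any trailing line terminators.
            int end = p.length();
            while (end > 0 && (p.charAt(end - 1) == '\r' || p.charAt(end - 1) == '\n')) {
                end--;
            }
            String trimmed = p.substring(0, end);
            // Keep only paragraphs that still contain something.
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

With POI on the classpath, a call might look like ParagraphCleaner.clean(extractor.getParagraphText()), where extractor is a WordExtractor.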


getTextFromPieces

public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. The result may include stray non-text content, but this method still works when the mapping from text pieces to paragraphs is broken. It is also faster than getText().


getText

public java.lang.String getText()
Grab the text, based on the paragraphs. The result should not include any stray non-text content, but this method is slightly slower than getTextFromPieces().
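
The trade-off between getText() and getTextFromPieces() suggests a simple fallback: prefer the paragraph-based text, and use the piece-based text only when the first attempt fails. The sketch below is one way to express that. The class TextFallback is hypothetical and not part of POI. The assumption that a broken mapping shows up as an unchecked exception or an empty result is ours, since this page does not say how the failure appears. Suppliers stand in for the two methods so the sketch compiles without POI.

```java
import java.util.function.Supplier;

public class TextFallback {
    // Hypothetical helper, not part of POI.
    // paragraphBased: supplies the paragraph-based text (for example getText()).
    // pieceBased: supplies the piece-based text (for example getTextFromPieces()).
    public static String extract(Supplier<String> paragraphBased, Supplier<String> pieceBased) {
        try {
            String text = paragraphBased.get();
            if (text != null && !text.isEmpty()) {
                return text; // preferred, cleaner result
            }
        } catch (RuntimeException e) {
            // Assumption: a broken piece-to-paragraph mapping surfaces as an
            // unchecked exception. Fall through to the piece-based text.
        }
        return pieceBased.get();
    }
}
```

Given a WordExtractor named extractor, the call might be TextFallback.extract(extractor::getText, extractor::getTextFromPieces). The piece-based result may contain stray content, so it is used only as a last resort.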



Copyright 2006 The Apache Software Foundation or its licensors, as applicable.