net.nutch.analysis.lang
Class NGramProfile

java.lang.Object
  extended by net.nutch.analysis.lang.NGramProfile

public class NGramProfile
extends Object

This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental. You have been warned. Methods are provided to build new NGramProfile instances.

Author:
Sami Siren

Field Summary
static Logger LOG
           
 
Constructor Summary
NGramProfile(String name)
          Construct a new ngram profile
NGramProfile(String name, int minlen, int maxlen)
          Construct a new ngram profile
 
Method Summary
 void addFromToken(Token t)
          Add ngrams from a token to this profile
 void addNGrams(StringBuffer word)
          Add ngrams from a single word to this profile
 void analyze(StringBuffer text)
          Analyze a piece of text
static NGramProfile createNgramProfile(String name, InputStream is, String encoding)
          Create a new language profile from a (preferably quite large) text file
 String getName()
           
 float getSimilarity(NGramProfile another)
          Calculate a score for how well two NGramProfiles match each other
 Vector getSorted()
          Return sorted vector of ngrams (sort done by 1. count 2. sequence)
 void load(InputStream is)
          Loads an ngram profile from InputStream (assumes UTF-8 encoded content)
static void main(String[] args)
          main method used for testing only
protected  void normalize()
          Normalize profile
 void save(OutputStream os)
          Writes NGramProfile content into OutputStream; content is written with UTF-8 encoding
 void setName(String name)
           
 String toString()
          Return ngramprofile as text
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG
Constructor Detail

NGramProfile

public NGramProfile(String name)
Construct a new ngram profile

Parameters:
name - Name of profile

NGramProfile

public NGramProfile(String name,
                    int minlen,
                    int maxlen)
Construct a new ngram profile

Parameters:
name - Name of profile
minlen - min length of ngram sequences
maxlen - max length of ngram sequences
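
To illustrate what the minlen and maxlen bounds mean, the following minimal sketch collects every character sequence of length minlen through maxlen from a single word. The class name NGramSketch and the method extract are hypothetical and not part of this API. Whether NGramProfile pads words with boundary markers is not stated on this page, so the sketch applies none.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Hypothetical helper: collect every substring of length minlen..maxlen.
    static List<String> extract(String word, int minlen, int maxlen) {
        List<String> ngrams = new ArrayList<>();
        for (int len = minlen; len <= maxlen; len++) {
            // Slide a window of the current length across the word.
            for (int start = 0; start + len <= word.length(); start++) {
                ngrams.add(word.substring(start, start + len));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [n, u, t, c, h, nu, ut, tc, ch]
        System.out.println(extract("nutch", 1, 2));
    }
}
```

A word shorter than minlen simply yields no ngrams in this sketch.
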
Method Detail

addFromToken

public void addFromToken(Token t)
Add ngrams from a token to this profile

Parameters:
t - Token to be added

analyze

public void analyze(StringBuffer text)
Analyze a piece of text

Parameters:
text - the text to be analyzed

normalize

protected void normalize()
Normalize profile


addNGrams

public void addNGrams(StringBuffer word)
Add ngrams from a single word to this profile

Parameters:
word - the word to extract ngrams from

getSorted

public Vector getSorted()
Return sorted vector of ngrams (sort done by 1. count 2. sequence)

Returns:
sorted vector of ngrams
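
The two-level ordering (count first, then sequence) can be sketched as below. This is an illustration only: the class SortedNGramSketch and its map-based representation are hypothetical, and putting higher counts first is an assumption, since this page does not state the sort direction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortedNGramSketch {
    // Hypothetical helper: order ngrams by count, then by sequence.
    static List<String> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // 1. count, higher first (assumed direction)
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // 2. sequence, in natural String order, to break ties
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("th", 2);
        counts.put("he", 5);
        counts.put("an", 2);
        // prints [he, an, th]
        System.out.println(sorted(counts));
    }
}
```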

toString

public String toString()
Return ngramprofile as text

Returns:
ngramprofile as text

getSimilarity

public float getSimilarity(NGramProfile another)
Calculate a score for how well two NGramProfiles match each other

Parameters:
another - ngram profile to compare against
Returns:
similarity score; 0 = exact match
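
The formula behind getSimilarity is not documented on this page. As one possible way to obtain a score where 0 means an exact match, the sketch below sums the absolute differences between normalized ngram frequencies of two profiles. The class SimilaritySketch, the method distance, and the map-of-frequencies representation are all assumptions made for illustration and may differ from the actual calculation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimilaritySketch {
    // Hypothetical distance: sum of absolute frequency differences over
    // the union of ngrams in both profiles. Identical profiles give 0.
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        float sum = 0f;
        for (String ngram : keys) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(ngram, 0f) - b.getOrDefault(ngram, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> english = new HashMap<>();
        english.put("th", 0.5f);
        english.put("he", 0.5f);
        Map<String, Float> other = new HashMap<>();
        other.put("th", 0.5f);
        other.put("en", 0.5f);
        System.out.println(distance(english, english)); // prints 0.0
        System.out.println(distance(english, other));   // prints 1.0
    }
}
```

With a measure of this kind, a lower score indicates a closer match, which is consistent with the documented return value.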

load

public void load(InputStream is)
          throws IOException
Loads an ngram profile from InputStream (assumes UTF-8 encoded content)

Throws:
IOException

createNgramProfile

public static NGramProfile createNgramProfile(String name,
                                              InputStream is,
                                              String encoding)
Create a new language profile from a (preferably quite large) text file

Parameters:
name - name of profile
is - stream to read the text from
encoding - encoding of stream
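
The general idea of building a profile from an encoded text stream can be sketched as follows. This is not the implementation of createNgramProfile: the class ProfileFromStreamSketch and the method countBigrams are hypothetical, only ngrams of length 2 are counted, and the lowercasing and whitespace splitting are assumptions chosen to keep the example short.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ProfileFromStreamSketch {
    // Hypothetical helper: decode the stream with the given encoding and
    // count every 2-character sequence inside each whitespace-separated word.
    static Map<String, Integer> countBigrams(InputStream is, String encoding) {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                    for (int i = 0; i + 2 <= word.length(); i++) {
                        counts.merge(word.substring(i, i + 2), 1, Integer::sum);
                    }
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature free of checked exceptions.
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] text = "aba ab".getBytes(StandardCharsets.UTF_8);
        Map<String, Integer> counts = countBigrams(new ByteArrayInputStream(text), "UTF-8");
        System.out.println(counts.get("ab")); // prints 2
        System.out.println(counts.get("ba")); // prints 1
    }
}
```

Passing the wrong encoding name for the input bytes would silently produce different character sequences, which is why the encoding parameter matters when training on text that is not UTF-8.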

save

public void save(OutputStream os)
          throws IOException
Writes NGramProfile content into OutputStream; content is written with UTF-8 encoding

Parameters:
os - Stream to output to
Throws:
IOException

main

public static void main(String[] args)
main method used for testing only

Parameters:
args -

getName

public String getName()
Returns:
Returns the name.

setName

public void setName(String name)
Parameters:
name - The name to set.


Copyright © 2005 The Nutch Organization.