org.apache.lucene.index
Class DocTermOrds

java.lang.Object
  extended by org.apache.lucene.index.DocTermOrds

public class DocTermOrds
extends Object

This class enables fast access to multiple term ords for a specified field across all docIDs. Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.AtomicReader) method and then seek by ord to get the term's bytes.

Term ords are normally of type long, but in this API they are int because the internal representation cannot address more than Integer.MAX_VALUE unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents.

There is also an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you exceed this limit, an IllegalStateException is thrown.

Deleted documents are skipped during uninversion; if you look one up, you get 0 ords.

The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. by the term's BytesRef comparator). They are also de-duplicated: if a document has the same term more than once in this field, that ord is returned only once.

This class tests whether the provided reader can retrieve terms by ord (i.e. it is single-segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that supports ord. The getOrdTermsEnum(org.apache.lucene.index.AtomicReader) method then returns this wrapped enum when necessary.

The RAM consumption of this class can be high!
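
The uninversion described above can be illustrated with a minimal, self-contained sketch. It is not Lucene's packed representation: it uses plain Java collections, and the class name UninvertSketch and the TreeMap stand-in for the terms dictionary are hypothetical. It only shows why the per-document ords come back sorted and de-duplicated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UninvertSketch {

    /**
     * Turns a sorted term -> docIDs map into docID -> term ords.
     * A term's ord is its position in sorted term order.
     */
    static int[][] uninvert(TreeMap<String, int[]> postings, int maxDoc) {
        List<List<Integer>> perDoc = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            perDoc.add(new ArrayList<Integer>());
        }
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                List<Integer> ords = perDoc.get(doc);
                // Terms are visited in ord order, so each list stays sorted;
                // skipping a repeat of the current ord de-duplicates.
                if (ords.isEmpty() || ords.get(ords.size() - 1) != ord) {
                    ords.add(ord);
                }
            }
            ord++;
        }
        int[][] result = new int[maxDoc][];
        for (int doc = 0; doc < maxDoc; doc++) {
            List<Integer> ords = perDoc.get(doc);
            result[doc] = new int[ords.size()];
            for (int i = 0; i < ords.size(); i++) {
                result[doc][i] = ords.get(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("cherry", new int[] {0, 2});
        postings.put("apple", new int[] {0, 0, 1}); // doc 0 repeats "apple"
        postings.put("banana", new int[] {1});
        int[][] ords = uninvert(postings, 4);       // doc 3 has no terms
        System.out.println(Arrays.deepToString(ords));
    }
}
```

With ords apple=0, banana=1, cherry=2, running main prints [[0, 2], [0, 1], [2], []]: doc 0 gets ord 0 only once, and the document without terms gets 0 ords.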

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
 class DocTermOrds.TermOrdsIterator
           
 
Field Summary
static int DEFAULT_INDEX_INTERVAL_BITS
           
protected  DocsEnum docsEnum
           
protected  String field
           
protected  int[] index
           
protected  BytesRef[] indexedTermsArray
           
protected  int maxTermDocFreq
           
protected  int numTermsInField
           
protected  int ordBase
           
protected  int phase1_time
           
protected  BytesRef prefix
           
protected  long sizeOfIndexedStrings
           
protected  long termInstances
           
protected  byte[][] tnums
           
protected  int total_time
           
 
Constructor Summary
  DocTermOrds(AtomicReader reader, String field)
          Uninverts all terms in the field.
  DocTermOrds(AtomicReader reader, String field, BytesRef termPrefix)
          Uninverts only terms starting with termPrefix.
  DocTermOrds(AtomicReader reader, String field, BytesRef termPrefix, int maxTermDocFreq)
          Uninverts only terms starting with termPrefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
  DocTermOrds(AtomicReader reader, String field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits)
          Uninverts only terms starting with termPrefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
          For use by subclasses. After constructing with this, be sure to call uninvert exactly once.
 
Method Summary
 TermsEnum getOrdTermsEnum(AtomicReader reader)
          Returns a TermsEnum that implements ord.
 boolean isEmpty()
           
 DocTermOrds.TermOrdsIterator lookup(int doc, DocTermOrds.TermOrdsIterator reuse)
          Returns an iterator to step through the term ords for this document.
 BytesRef lookupTerm(TermsEnum termsEnum, int ord)
           
 int numTerms()
           
 long ramUsedInBytes()
           
protected  void setActualDocFreq(int termNum, int df)
           
protected  void uninvert(AtomicReader reader, BytesRef termPrefix)
           
protected  void visitTerm(TermsEnum te, int termNum)
          Subclasses can override this method.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_INDEX_INTERVAL_BITS

public static final int DEFAULT_INDEX_INTERVAL_BITS
See Also:
Constant Field Values

maxTermDocFreq

protected final int maxTermDocFreq

field

protected final String field

numTermsInField

protected int numTermsInField

termInstances

protected long termInstances

total_time

protected int total_time

phase1_time

protected int phase1_time

index

protected int[] index

tnums

protected byte[][] tnums

sizeOfIndexedStrings

protected long sizeOfIndexedStrings

indexedTermsArray

protected BytesRef[] indexedTermsArray

prefix

protected BytesRef prefix

ordBase

protected int ordBase

docsEnum

protected DocsEnum docsEnum

Constructor Detail

DocTermOrds

public DocTermOrds(AtomicReader reader,
                   String field)
            throws IOException
Uninverts all terms in the field.

Throws:
IOException

DocTermOrds

public DocTermOrds(AtomicReader reader,
                   String field,
                   BytesRef termPrefix)
            throws IOException
Uninverts only terms starting with termPrefix.

Throws:
IOException

DocTermOrds

public DocTermOrds(AtomicReader reader,
                   String field,
                   BytesRef termPrefix,
                   int maxTermDocFreq)
            throws IOException
Uninverts only terms starting with termPrefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.

Throws:
IOException
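
As a rough illustration of the two filters named above, the following sketch selects terms from a toy postings map. The class TermFilterSketch and the TreeMap model are hypothetical stand-ins, not part of Lucene; docFreq is taken as the raw length of the postings array, mirroring the "not taking deletions into account" wording.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TermFilterSketch {

    /** Returns the terms that pass both the prefix filter and the docFreq cap. */
    static List<String> selectTerms(TreeMap<String, int[]> postings,
                                    String prefix,
                                    int maxTermDocFreq) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            String term = entry.getKey();
            int docFreq = entry.getValue().length; // raw docFreq, deletions ignored
            if (term.startsWith(prefix) && docFreq <= maxTermDocFreq) {
                selected.add(term);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("color:blue", new int[] {0, 1, 2}); // too frequent for a cap of 2
        postings.put("color:red", new int[] {1});
        postings.put("size:xl", new int[] {0});          // wrong prefix
        System.out.println(selectTerms(postings, "color:", 2));
    }
}
```

Here only "color:red" passes: "color:blue" appears in three documents, and "size:xl" does not start with the prefix.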

DocTermOrds

public DocTermOrds(AtomicReader reader,
                   String field,
                   BytesRef termPrefix,
                   int maxTermDocFreq,
                   int indexIntervalBits)
            throws IOException
Uninverts only terms starting with termPrefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).

Throws:
IOException
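
To make the indexing interval concrete, here is a small sketch of a sparse term index. It assumes the interval equals 2^indexIntervalBits, which is consistent with the stated default of every 128th term (7 bits) but is not confirmed by this page. The class SparseTermIndexSketch is hypothetical, and a sorted String array stands in for the full terms dictionary.

```java
import java.util.Arrays;

final class SparseTermIndexSketch {
    private final String[] allTerms;     // stands in for the full, sorted terms dictionary
    private final String[] indexedTerms; // every (1 << bits)-th term, the part kept in RAM
    private final int bits;

    /** sortedTerms must be sorted and free of duplicates. */
    SparseTermIndexSketch(String[] sortedTerms, int indexIntervalBits) {
        this.allTerms = sortedTerms;
        this.bits = indexIntervalBits;
        int interval = 1 << bits;
        int slots = (sortedTerms.length + interval - 1) >>> bits; // ceiling division
        this.indexedTerms = new String[slots];
        for (int slot = 0; slot < slots; slot++) {
            indexedTerms[slot] = sortedTerms[slot << bits];
        }
    }

    int indexedTermCount() {
        return indexedTerms.length;
    }

    /** Finds the nearest indexed term at or before ord, then steps forward. */
    String lookupByOrd(int ord) {
        String start = indexedTerms[ord >>> bits];
        int stepsForward = ord & ((1 << bits) - 1);
        int pos = Arrays.binarySearch(allTerms, start); // "seek" to the indexed term
        return allTerms[pos + stepsForward];
    }

    public static void main(String[] args) {
        String[] terms = new String[300];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("t%03d", i);
        }
        SparseTermIndexSketch index = new SparseTermIndexSketch(terms, 7);
        System.out.println(index.indexedTermCount()); // 3 indexed terms: ords 0, 128, 256
        System.out.println(index.lookupByOrd(130));   // t130
    }
}
```

A smaller indexIntervalBits keeps more terms in RAM but shortens the forward scan per lookup; a larger value trades the other way.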

DocTermOrds

protected DocTermOrds(String field,
                      int maxTermDocFreq,
                      int indexIntervalBits)
For use by subclasses. After constructing with this, be sure to call uninvert exactly once.

Method Detail

ramUsedInBytes

public long ramUsedInBytes()

getOrdTermsEnum

public TermsEnum getOrdTermsEnum(AtomicReader reader)
                          throws IOException
Returns a TermsEnum that implements ord. If the provided reader supports ord, its TermsEnum is returned directly; if it does not, a "private" terms index is built internally (WARNING: this consumes RAM) and used to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned. Returns null if there are no terms.

NOTE: you must pass the same reader that was used when creating this instance.

Throws:
IOException

numTerms

public int numTerms()
Returns:
The number of terms in this field

isEmpty

public boolean isEmpty()
Returns:
Whether this DocTermOrds instance is empty.

visitTerm

protected void visitTerm(TermsEnum te,
                         int termNum)
                  throws IOException
Subclasses can override this method.

Throws:
IOException

setActualDocFreq

protected void setActualDocFreq(int termNum,
                                int df)
                         throws IOException
Throws:
IOException

uninvert

protected void uninvert(AtomicReader reader,
                        BytesRef termPrefix)
                 throws IOException
Throws:
IOException

lookup

public DocTermOrds.TermOrdsIterator lookup(int doc,
                                           DocTermOrds.TermOrdsIterator reuse)
Returns an iterator to step through the term ords for this document. It's also possible to subclass this class and directly access members.
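
The reuse parameter suggests the common pattern of resetting a previously returned iterator instead of allocating a new one per document. The sketch below shows that pattern in isolation. ReuseSketch, OrdIterator, and their methods are hypothetical and are not the DocTermOrds.TermOrdsIterator API, whose members are not listed on this page.

```java
final class ReuseSketch {

    /** Toy iterator over one document's ords; reset() re-targets it. */
    static final class OrdIterator {
        private int[] ords = new int[0];
        private int upto;

        OrdIterator reset(int[] docOrds) {
            this.ords = docOrds;
            this.upto = 0;
            return this;
        }

        /** Copies up to buffer.length ords into buffer and returns how many were copied. */
        int read(int[] buffer) {
            int count = Math.min(buffer.length, ords.length - upto);
            System.arraycopy(ords, upto, buffer, 0, count);
            upto += count;
            return count;
        }
    }

    private final int[][] ordsByDoc;

    ReuseSketch(int[][] ordsByDoc) {
        this.ordsByDoc = ordsByDoc;
    }

    /** Reuses the given iterator when non-null, otherwise allocates one. */
    OrdIterator lookup(int doc, OrdIterator reuse) {
        OrdIterator iterator = (reuse != null) ? reuse : new OrdIterator();
        return iterator.reset(ordsByDoc[doc]);
    }

    public static void main(String[] args) {
        ReuseSketch sketch = new ReuseSketch(new int[][] {{0, 2}, {1}});
        int[] buffer = new int[5];
        OrdIterator iterator = null;
        for (int doc = 0; doc < 2; doc++) {
            iterator = sketch.lookup(doc, iterator); // same object after the first call
            int count = iterator.read(buffer);
            System.out.println("doc " + doc + " has " + count + " ords");
        }
    }
}
```

Passing null on the first call and the returned iterator on later calls keeps allocation to a single object across the whole loop.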


lookupTerm

public BytesRef lookupTerm(TermsEnum termsEnum,
                           int ord)
                    throws IOException
Throws:
IOException


Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.