java.lang.Object
  org.apache.lucene.codecs.PostingsFormat
    org.apache.lucene.codecs.block.BlockPostingsFormat
public final class BlockPostingsFormat
Block postings format, which encodes postings in packed int blocks for faster decode.
NOTE: this format is still experimental and subject to change without backwards compatibility.
Basic idea:
In a packed block, integers are encoded with the same bit width (packed format), and the block size (i.e. the number of integers inside the block) is fixed.
In a VInt block, integers are encoded as VInt, and the block size is variable.
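To make the difference between the two block kinds concrete, the following is a minimal sketch of how many bytes a run of non-negative integers would occupy under each scheme. The class and method names are hypothetical and exist only for illustration; this is not Lucene's encoder, and the rule of using at least one bit per value is an assumption of the sketch.

```java
/** Illustrative size comparison only; names are hypothetical, not Lucene API. */
final class BlockEncodingSketch {

  /** Bit width shared by every value in a packed block (assumed to be at least 1). */
  static int bitsRequired(int[] values) {
    int combined = 0;
    for (int v : values) {
      combined |= v; // OR-ing keeps the highest bit set by any value
    }
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(combined));
  }

  /** Bytes needed when all values are stored with the same bit width. */
  static int packedBytes(int[] values) {
    long totalBits = (long) values.length * bitsRequired(values);
    return (int) ((totalBits + 7) / 8); // round up to whole bytes
  }

  /** Bytes needed for one VInt: 7 data bits per byte, high bit marks continuation. */
  static int vIntBytes(int value) {
    int bytes = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      bytes++;
    }
    return bytes;
  }

  /** Total bytes needed when every value is written as its own VInt. */
  static int vIntBytes(int[] values) {
    int total = 0;
    for (int v : values) {
      total += vIntBytes(v);
    }
    return total;
  }

  public static void main(String[] args) {
    int[] deltas = {1, 2, 3, 127};
    System.out.println("packed bytes: " + packedBytes(deltas));
    System.out.println("VInt bytes:   " + vIntBytes(deltas));
  }
}
```

In a packed block one large value widens every slot, whereas a VInt grows only for the value that needs it, which is one way to see why the variable-size remainder is a natural fit for VInt.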
When the postings list is long enough, BlockPostingsFormat tries to encode most integer data as packed blocks.
Take a term with 259 documents as an example: the first 256 document ids are encoded as two packed blocks, while the remaining 3 are encoded as one VInt block.
Different kinds of data are always encoded separately into different packed blocks, but may be encoded into the same VInt block.
This strategy is applied to the tuples: <document number, frequency>, <position, payload length>, <position, offset start, offset length>, and <position, payload length, offset start, offset length>.
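The 259-document example can be worked through with a few lines of arithmetic. The sketch below assumes a packed block size of 128, which is what "256 document ids as two packed blocks" implies; the class and method names are hypothetical.

```java
/** Splits a postings count into full packed blocks plus a VInt remainder. */
final class BlockSplitSketch {

  // Assumption: 256 ids forming two packed blocks implies 128 integers per block.
  static final int ASSUMED_BLOCK_SIZE = 128;

  /** Number of full packed blocks for a list of the given length. */
  static int packedBlockCount(int count) {
    return count / ASSUMED_BLOCK_SIZE;
  }

  /** Number of trailing entries left over for the single VInt block. */
  static int vIntTailLength(int count) {
    return count % ASSUMED_BLOCK_SIZE;
  }

  public static void main(String[] args) {
    int docFreq = 259;
    System.out.println("packed blocks: " + packedBlockCount(docFreq));
    System.out.println("VInt entries:  " + vIntTailLength(docFreq));
  }
}
```

A list shorter than one block yields zero packed blocks under this split, so all of its entries fall into the VInt block.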
The structure of the skip table is quite similar to Lucene40PostingsFormat. The skip interval is the same as the block size, and each skip entry points to the beginning of a block. However, skip data is omitted for the first block.
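As a rough illustration of the skip rule, the sketch below lists the 0-based posting indices that lowest-level skip entries would point to, given one entry per block start except the first. For contrast it also counts entries under the convention the .doc notes attribute to MultiLevelSkipListWriter, where an entry is recorded at every skipInterval-th posting. All names are hypothetical, and higher skip levels are ignored.

```java
/** Lowest-level skip targets under the block-start rule; names are hypothetical. */
final class SkipEntrySketch {

  /**
   * 0-based indices of the postings that begin a block, excluding the first
   * block, for which skip data is omitted.
   */
  static int[] skipTargets(int docFreq, int blockSize) {
    int count = docFreq <= 0 ? 0 : (docFreq - 1) / blockSize;
    int[] targets = new int[count];
    for (int i = 0; i < count; i++) {
      targets[i] = (i + 1) * blockSize; // first posting of block i + 1
    }
    return targets;
  }

  /** Entry count when one entry is recorded at every skipInterval-th posting. */
  static int everyIntervalCount(int docFreq, int skipInterval) {
    return docFreq / skipInterval;
  }

  public static void main(String[] args) {
    int blockSize = 128; // assumption: the block size implied by the 259-document example
    for (int docFreq : new int[] {128, 256, 259}) {
      System.out.println("docFreq=" + docFreq
          + " blockStartEntries=" + skipTargets(docFreq, blockSize).length
          + " everyIntervalEntries=" + everyIntervalCount(docFreq, blockSize));
    }
  }
}
```

The two counts differ only when the document frequency is an exact multiple of the block size, because the final block start then has no posting to point to.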
A position is an integer indicating where the term occurs within one document. A payload is a blob of metadata associated with the current position. An offset is a pair of integers indicating the tokenized start/end offsets for the given term at the current position.
When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a null payload contributes one count). As mentioned under block structure, these three can be encoded either combined or separately.
In all cases, payloads and offsets are stored together. When encoded as a packed block, position data is separated out into .pos, while payloads and offsets are encoded in .pay (payload metadata is also stored directly in .pay). When encoded as a VInt block, all three are stored in .pos (as is the payload metadata).
With this strategy, the majority of payload and offset data lies outside the .pos file. So for queries that require only position data, running on a full index with payloads and offsets, this reduces disk pre-fetches.
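The routing rule above can be summarized as a small decision function. This is only a sketch of the described behavior: the enum, the method names, and the use of ".pos" and ".pay" as plain labels are assumptions for illustration, not Lucene code.

```java
/** Which file a piece of per-position data lands in, per the rule described above. */
final class PositionFileRoutingSketch {

  enum Kind { POSITION, PAYLOAD, OFFSET }

  /** True if the entry at the given 0-based index falls inside a full packed block. */
  static boolean inPackedBlock(long index, long totalPositions, int blockSize) {
    long packedEntries = (totalPositions / blockSize) * blockSize;
    return index < packedEntries;
  }

  /** File label for the given kind of data; the labels are illustrative strings. */
  static String fileFor(Kind kind, boolean inPackedBlock) {
    if (!inPackedBlock) {
      return ".pos"; // VInt block: positions, payloads and offsets share .pos
    }
    return kind == Kind.POSITION ? ".pos" : ".pay";
  }

  public static void main(String[] args) {
    long totalPositions = 259;
    int blockSize = 128; // assumption: the block size implied by the 259-document example
    for (long index : new long[] {0, 255, 256}) {
      boolean packed = inPackedBlock(index, totalPositions, blockSize);
      System.out.println("index " + index + ": payload stored in "
          + fileFor(Kind.PAYLOAD, packed));
    }
  }
}
```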
Files and detailed format:
The .tim file format is quite similar to Lucene40PostingsFormat, with a minor difference in MetadataBlock.
byte
SuffixLengthbyte
RootCodeLength, SumDocFreq, DocCount>
NumFieldsCodecHeader
Uint64
VInt
VLong
Notes:
DocIdSetIterator.advance(int)
.
The .tip file format is described in Lucene40PostingsFormat:TermIndex.
The .doc file contains the lists of documents which contain each term, along with the frequency of the term in that document (except when frequencies are omitted: FieldInfo.IndexOptions.DOCS_ONLY). It also saves skip data to the beginning of each packed or VInt block, when the length of the document list is larger than the packed block size.
CodecHeader
PackedInts
VInt
VLong
Notes:
In MultiLevelSkipListWriter, skip data is assumed to be saved for the skipInterval-th, 2*skipInterval-th ... posting in the list. However, in BlockPostingsFormat, the skip data is saved for the (skipInterval+1)-th, (2*skipInterval+1)-th ... posting (skipInterval==PackedBlockSize in this case). When DocFreq is a multiple of PackedBlockSize, MultiLevelSkipListWriter will expect one more skip data entry than BlockSkipWriter.
The .pos file contains the lists of positions at which each term occurs within documents. It also sometimes stores part of the payloads and offsets for speedup.
CodecHeader
PackedInts
VInt
byte
PayLength
Notes:
The .pay file stores payloads and offsets associated with certain term-document positions. Some payloads and offsets are separated out into the .pos file for speedup.
CodecHeader
PackedInts
VInt
byte
SumPayLength
Notes:
Field Summary | |
---|---|
static int | BLOCK_SIZE: Fixed packed block size, number of integers encoded in a single packed block. |
static String | DOC_EXTENSION: Filename extension for document number, frequencies, and skip data. |
static String | PAY_EXTENSION: Filename extension for payloads and offsets. |
static String | POS_EXTENSION: Filename extension for positions. |
Fields inherited from class org.apache.lucene.codecs.PostingsFormat |
---|
EMPTY |
Constructor Summary | |
---|---|
BlockPostingsFormat() | |
BlockPostingsFormat(int minTermBlockSize, int maxTermBlockSize) | |
Method Summary | |
---|---|
FieldsConsumer | fieldsConsumer(SegmentWriteState state) |
FieldsProducer | fieldsProducer(SegmentReadState state) |
String | toString() |
Methods inherited from class org.apache.lucene.codecs.PostingsFormat |
---|
availablePostingsFormats, forName, getName, reloadPostingsFormats |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String DOC_EXTENSION
public static final String POS_EXTENSION
public static final String PAY_EXTENSION
public static final int BLOCK_SIZE
Constructor Detail |
---|
public BlockPostingsFormat()
public BlockPostingsFormat(int minTermBlockSize, int maxTermBlockSize)
Method Detail |
---|
public String toString()
Overrides: toString in class PostingsFormat
public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException
Specified by: fieldsConsumer in class PostingsFormat
Throws: IOException
public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException
Specified by: fieldsProducer in class PostingsFormat
Throws: IOException