org.apache.hadoop.io
Class Text

java.lang.Object
  extended by org.apache.hadoop.io.Text
All Implemented Interfaces:
Comparable, Writable, WritableComparable

public class Text
extends Object
implements WritableComparable

This class stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. The length is an integer and is serialized using a zero-compressed format.

In addition, it provides methods for string traversal without converting the byte array to a string.

It also includes utilities for serializing/deserializing a string, encoding/decoding a string, checking whether a byte array contains valid UTF-8, and calculating the length of an encoded string.


Nested Class Summary
static class Text.Comparator
          A WritableComparator optimized for Text keys.
 
Constructor Summary
Text()
           
Text(byte[] utf8)
          Construct from a byte array.
Text(String string)
          Construct from a string.
Text(Text utf8)
          Construct from another text.
 
Method Summary
static int bytesToCodePoint(ByteBuffer bytes)
          Returns the next code point at the current position in the buffer.
 int charAt(int position)
          Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
 int compareTo(Object o)
          Compare two Texts bytewise using standard UTF8 ordering.
static String decode(byte[] utf8)
          Converts the provided byte array to a String using the UTF-8 encoding.
static String decode(byte[] utf8, int start, int length)
           
static String decode(byte[] utf8, int start, int length, boolean replace)
          Converts the provided byte array to a String using the UTF-8 encoding.
static ByteBuffer encode(String string)
          Converts the provided String to bytes using the UTF-8 encoding.
static ByteBuffer encode(String string, boolean replace)
          Converts the provided String to bytes using the UTF-8 encoding.
 boolean equals(Object o)
          Returns true iff o is a Text with the same contents.
 int find(String what)
           
 int find(String what, int start)
          Finds any occurrence of what in the backing buffer, starting at position start.
 byte[] getBytes()
          Returns the raw bytes.
 int getLength()
          Returns the number of bytes in the byte array
 int hashCode()
          hash function
 void readFields(DataInput in)
          Deserializes, checking that the received bytes are valid UTF-8.
static String readString(DataInput in)
          Read a UTF8 encoded string from in
 void set(byte[] utf8)
          Set to a utf8 byte array
 void set(String string)
          Set to contain the contents of a string.
 void set(Text other)
          Copies a text.
static void skip(DataInput in)
          Skips over one Text in the input.
 String toString()
          Convert text back to string
static int utf8Length(String string)
          For the given string, returns the number of UTF-8 bytes required to encode the string.
static void validateUTF(byte[] utf8, int start, int len)
           
static void validateUTF8(byte[] utf8)
          Check if a byte array contains valid utf-8
 void write(DataOutput out)
          Serializes this object to out. The length uses zero-compressed encoding.
static int writeString(DataOutput out, String s)
          Write a UTF8 encoded string to out
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Text

public Text()

Text

public Text(String string)
     throws CharacterCodingException
Construct from a string.

Throws:
CharacterCodingException - if the string contains invalid codepoints or unpaired surrogates

Text

public Text(Text utf8)
Construct from another text.


Text

public Text(byte[] utf8)
     throws CharacterCodingException
Construct from a byte array.

Throws:
CharacterCodingException - if the array has invalid UTF-8 bytes
Method Detail

getBytes

public byte[] getBytes()
Returns the raw bytes.


getLength

public int getLength()
Returns the number of bytes in the byte array


charAt

public int charAt(int position)
Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this method avoids using the converter or instantiating a String.


find

public int find(String what)

find

public int find(String what,
                int start)
Finds any occurrence of what in the backing buffer, starting at position start. The starting position is measured in bytes, and the return value is a byte position in the buffer. The backing buffer is not converted to a string for this operation.

Returns:
byte position of the first occurrence of the search string in the UTF-8 buffer, or -1 if not found
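
Because both the start argument and the result are byte positions, they can differ from char indices whenever the text contains non-ASCII characters. The following is a minimal sketch of a byte-level search using only JDK classes. It is not the Hadoop implementation, and the class name ByteFindSketch and its find method are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ByteFindSketch {
    /**
     * Returns the byte offset of the first occurrence of needle in haystack
     * at or after byte position start, or -1 if there is none.
     */
    public static int find(byte[] haystack, byte[] needle, int start) {
        for (int i = Math.max(start, 0); i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "héllo wörld": the accented letters take two bytes each in UTF-8.
        byte[] buffer = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        byte[] what = "w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(find(buffer, what, 0)); // prints 7 (the char index would be 6)
    }
}
```

The search never builds a String from the buffer, which matches the behaviour described for find above.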

set

public void set(String string)
         throws CharacterCodingException
Set to contain the contents of a string.

Throws:
CharacterCodingException - if the string contains invalid codepoints or unpaired surrogates

set

public void set(byte[] utf8)
         throws CharacterCodingException
Set to a utf8 byte array

Throws:
CharacterCodingException - if the array contains invalid UTF-8

set

public void set(Text other)
Copies a text.


toString

public String toString()
Convert text back to string

Overrides:
toString in class Object
See Also:
Object.toString()

readFields

public void readFields(DataInput in)
                throws IOException
Deserializes, checking that the received bytes are valid UTF-8. If they are not, throws MalformedInputException.

Specified by:
readFields in interface Writable
Throws:
IOException
See Also:
Writable.readFields(DataInput)

skip

public static void skip(DataInput in)
                 throws IOException
Skips over one Text in the input.

Throws:
IOException

write

public void write(DataOutput out)
           throws IOException
Serializes this object to out. The length uses zero-compressed encoding.

Specified by:
write in interface Writable
Throws:
IOException
See Also:
Writable.write(DataOutput)

compareTo

public int compareTo(Object o)
Compare two Texts bytewise using standard UTF8 ordering.

Specified by:
compareTo in interface Comparable
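
A bytewise comparison of UTF-8 data orders texts by code point when bytes are treated as unsigned, which is not always the order String.compareTo gives for the same text. The sketch below illustrates this with JDK classes only. It assumes an unsigned lexicographic comparison; BytewiseCompareSketch and compareBytes are hypothetical names, not part of this class.

```java
import java.nio.charset.StandardCharsets;

final class BytewiseCompareSketch {
    /** Lexicographic comparison of two byte arrays, treating each byte as unsigned. */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String fullwidthTilde = "\uFF5E";   // U+FF5E, three UTF-8 bytes
        String emoji = "\uD83D\uDE00";      // U+1F600, four UTF-8 bytes
        byte[] a = fullwidthTilde.getBytes(StandardCharsets.UTF_8);
        byte[] b = emoji.getBytes(StandardCharsets.UTF_8);
        // Byte order follows code points: U+FF5E sorts before U+1F600.
        System.out.println(compareBytes(a, b) < 0);
        // UTF-16 code unit order disagrees: 0xFF5E is greater than 0xD83D.
        System.out.println(fullwidthTilde.compareTo(emoji) > 0);
    }
}
```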

equals

public boolean equals(Object o)
Returns true iff o is a Text with the same contents.

Overrides:
equals in class Object

hashCode

public int hashCode()
hash function

Overrides:
hashCode in class Object

decode

public static String decode(byte[] utf8)
                     throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. If the input is malformed, throws a MalformedInputException.

Throws:
CharacterCodingException

decode

public static String decode(byte[] utf8,
                            int start,
                            int length)
                     throws CharacterCodingException
Throws:
CharacterCodingException

decode

public static String decode(byte[] utf8,
                            int start,
                            int length,
                            boolean replace)
                     throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.

Throws:
CharacterCodingException
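
The replace flag corresponds closely to the two error actions of the JDK's CharsetDecoder. The following sketch shows that mapping under the assumption that decoding is delegated to java.nio.charset; DecodeSketch is a hypothetical class written for illustration and is not the code of this class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodeSketch {
    /**
     * Decodes length bytes of utf8 beginning at start. Malformed input is
     * replaced with U+FFFD when replace is true and reported as an
     * exception otherwise.
     */
    public static String decode(byte[] utf8, int start, int length, boolean replace)
            throws CharacterCodingException {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(utf8, start, length)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = {0x61, (byte) 0xFF, 0x62}; // 0xFF can never appear in UTF-8
        System.out.println(decode(bad, 0, bad.length, true).equals("a\uFFFDb")); // prints true
        try {
            decode(bad, 0, bad.length, false);
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```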

encode

public static ByteBuffer encode(String string)
                         throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, throws a MalformedInputException.

Returns:
ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
Throws:
CharacterCodingException

encode

public static ByteBuffer encode(String string,
                                boolean replace)
                         throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.

Returns:
ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
Throws:
CharacterCodingException
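
The note on the return value matters because the backing array of an encoded buffer can be longer than the encoded data, so callers should read only up to limit(). The sketch below illustrates this with the JDK's CharsetEncoder. It assumes encoding is delegated to java.nio.charset; EncodeSketch, its encode method, and the validBytes helper are hypothetical names used for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodeSketch {
    /** Encodes s as UTF-8, replacing or reporting malformed input. */
    public static ByteBuffer encode(String s, boolean replace)
            throws CharacterCodingException {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return encoder.encode(CharBuffer.wrap(s));
    }

    /** Copies only the encoded bytes, ignoring any slack in the backing array. */
    public static byte[] validBytes(ByteBuffer buf) {
        int from = buf.arrayOffset() + buf.position();
        int to = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), from, to);
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer buf = encode("h\u00e9llo", false);
        System.out.println(buf.limit());                          // prints 6
        System.out.println(buf.array().length >= buf.limit());    // prints true
        System.out.println(validBytes(buf).length);               // prints 6
    }
}
```

With replace set to false, a string holding an unpaired surrogate such as "\uD800" makes this sketch throw a MalformedInputException.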

readString

public static String readString(DataInput in)
                         throws IOException
Read a UTF8 encoded string from in

Throws:
IOException

writeString

public static int writeString(DataOutput out,
                              String s)
                       throws IOException
Write a UTF8 encoded string to out

Throws:
IOException
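
The readString, writeString, and skip methods all rely on a length prefix preceding the UTF-8 bytes. The following sketch shows that pattern with JDK classes only. It deliberately uses a fixed 4-byte length written with writeInt as a stand-in for the zero-compressed length described above, so it is not wire-compatible with this class. Returning the number of UTF-8 bytes from writeString is also an assumption, and LengthPrefixedStringSketch is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedStringSketch {
    /** Writes a length prefix followed by the UTF-8 bytes of s. */
    public static int writeString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // stand-in for a zero-compressed length
        out.write(bytes);
        return bytes.length;        // assumed return value: UTF-8 byte count
    }

    /** Reads one length-prefixed UTF-8 string. */
    public static String readString(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Skips one length-prefixed string without decoding it. */
    public static void skip(DataInput in) throws IOException {
        int remaining = in.readInt();
        while (remaining > 0) {
            int n = in.skipBytes(remaining);
            if (n <= 0) {
                throw new EOFException("input ended inside a string");
            }
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        writeString(out, "first");
        writeString(out, "s\u00e9cond");
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(sink.toByteArray()));
        skip(in);                                            // jump over "first"
        System.out.println(readString(in).equals("s\u00e9cond")); // prints true
    }
}
```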

validateUTF8

public static void validateUTF8(byte[] utf8)
                         throws MalformedInputException
Check if a byte array contains valid utf-8

Parameters:
utf8 - byte array
Throws:
MalformedInputException - if the byte array contains invalid utf-8

validateUTF

public static void validateUTF(byte[] utf8,
                               int start,
                               int len)
                        throws MalformedInputException
Throws:
MalformedInputException

bytesToCodePoint

public static int bytesToCodePoint(ByteBuffer bytes)
Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any mark set on this buffer will be changed by this method!
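
Reading code points straight from the buffer is what allows traversal without building a String. The following is a minimal sketch of such a decoder, assuming the buffer holds well-formed UTF-8. It is written from the UTF-8 bit layout rather than from this class's source, CodePointSketch and nextCodePoint are hypothetical names, and unlike the method documented above it only advances the position and leaves any mark untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CodePointSketch {
    /**
     * Decodes one code point at the buffer's current position and advances
     * the position past it. Returns -1 if the first byte is not a valid
     * UTF-8 lead byte. Continuation bytes are not validated.
     */
    public static int nextCodePoint(ByteBuffer bytes) {
        int lead = bytes.get() & 0xFF;
        int extra;
        int cp;
        if (lead < 0x80) {
            return lead;                  // single-byte (ASCII) sequence
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            return -1;                    // continuation byte or invalid lead
        }
        for (int i = 0; i < extra; i++) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (bytes.get() & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) {
        // 'a', the euro sign U+20AC, and U+1F600 as a surrogate pair.
        ByteBuffer buf = ByteBuffer.wrap("a\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            System.out.println(Integer.toHexString(nextCodePoint(buf)));
        }
        // prints 61, 20ac, 1f600 on separate lines
    }
}
```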


utf8Length

public static int utf8Length(String string)
For the given string, returns the number of UTF-8 bytes required to encode the string.

Parameters:
string - text to encode
Returns:
number of UTF-8 bytes required to encode
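
The byte count can be derived from the code points alone, without producing the encoded bytes. The sketch below shows one way to do this with JDK methods. It is an illustration rather than this class's implementation, Utf8LengthSketch is a hypothetical name, and it counts an unpaired surrogate as three bytes, which is an assumption about how such input should be handled.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthSketch {
    /** Returns the number of bytes needed to encode s in UTF-8. */
    public static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            // Supplementary code points occupy two chars in the String.
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u20AC \uD83D\uDE00";
        // Both values agree for well-formed strings.
        System.out.println(utf8Length(s));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```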


Copyright © 2006 The Apache Software Foundation