Hadoop Source Code Analysis: Text
Text is one of Hadoop's Writable classes. It defines one of Hadoop's data types together with the operations on it.
[Figure: Writable class hierarchy diagram]
This class stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. The length is an integer and is serialized using a zero-compressed format.
In addition, it provides methods for string traversal without converting the byte array to a string. It also includes utilities for serializing/deserializing a string, encoding/decoding a string, checking whether a byte array contains valid UTF-8, and calculating the length of an encoded string.
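As the Writable class hierarchy diagram above shows, the vast majority of Hadoop's data types implement the Writable and WritableComparable interfaces, so let us look at these two interfaces first and work from the top down.
The Writable interface is defined as follows: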
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
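void write(DataOutput out) throws IOException
/* Serializes this object's fields and writes the resulting bytes to the output stream out.
   Parameter: out - the output stream that receives the serialized bytes. */
void readFields(DataInput in) throws IOException
/* Reads bytes from the input stream in, deserializes them, and stores the result in this object's fields.
   Parameter: in - the input stream the bytes are read from. */
DataInput and DataOutput are the basic binary input and output interfaces in java.io; stream classes such as DataInputStream and DataOutputStream implement them. These two interfaces will be covered in a separate post.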
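As a rough illustration of the write/readFields contract, the sketch below round-trips an object through a byte array using only the JDK. The interface SimpleWritable and the classes Point and WritableRoundTrip are hypothetical names invented for this example; SimpleWritable merely mirrors the shape of Writable so that the code runs without Hadoop on the classpath, and a real type would implement org.apache.hadoop.io.Writable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {
  // Serializes any SimpleWritable into a fresh byte array.
  static byte[] serialize(SimpleWritable w) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      w.write(new DataOutputStream(buffer));
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Fills an existing object from previously serialized bytes.
  static void deserialize(SimpleWritable w, byte[] data) {
    try {
      w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = serialize(new Point(3, 4));
    Point copy = new Point();
    deserialize(copy, data);
    System.out.println(data.length + " bytes, x=" + copy.x + ", y=" + copy.y);
  }
}

// Hypothetical stand-in with the same two methods as Hadoop's Writable.
interface SimpleWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical example type: two int fields, written and read in the same order.
class Point implements SimpleWritable {
  int x;
  int y;

  Point() {}

  Point(int x, int y) {
    this.x = x;
    this.y = y;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(x);
    out.writeInt(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readInt();
    y = in.readInt();
  }
}
```

The point of the design is that readFields refills an existing object rather than constructing a new one, so one instance can be reused across many records.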
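That completes the walkthrough of the Writable interface. All of this can be understood from the API documentation as well; I wrote it out only because I wanted to understand the Writable classes in detail.
The WritableComparable interface is defined as follows: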
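package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
At first glance WritableComparable has no methods of its own; in fact all of its methods are inherited. Writable was analyzed above, so WritableComparable has the following two methods:
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
It also inherits the method of Comparable, an interface in java.lang that declares a single method:
int compareTo(T other);
/* Compares this object with the specified object other for order. Returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.
   Parameter: other - the object to compare with. */
In short, a class that implements WritableComparable is both writable and comparable.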
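To make "writable and comparable" concrete, here is a minimal sketch of a key type that can be both serialized and sorted. SimpleWritableComparable, IdKey, and ComparableKeyDemo are hypothetical names for this example only; the interface is a JDK-only stand-in that mirrors WritableComparable, not the Hadoop interface itself.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

class ComparableKeyDemo {
  // Wraps the ids in keys, sorts the keys through compareTo, and unwraps them again.
  static int[] sortedIds(int... ids) {
    IdKey[] keys = new IdKey[ids.length];
    for (int i = 0; i < ids.length; i++) {
      keys[i] = new IdKey(ids[i]);
    }
    Arrays.sort(keys); // relies on IdKey.compareTo
    int[] sorted = new int[keys.length];
    for (int i = 0; i < keys.length; i++) {
      sorted[i] = keys[i].id;
    }
    return sorted;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(sortedIds(42, 7, 19))); // prints [7, 19, 42]
  }
}

// Hypothetical stand-in: Writable's two methods plus Comparable<T>.
interface SimpleWritableComparable<T> extends Comparable<T> {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical key type ordered by its numeric id.
class IdKey implements SimpleWritableComparable<IdKey> {
  int id;

  IdKey(int id) {
    this.id = id;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(id);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readInt();
  }

  @Override
  public int compareTo(IdKey other) {
    return Integer.compare(id, other.id);
  }
}
```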
Now let us analyze the Text class itself. Its declaration and definition are as follows:
public class Text extends BinaryComparable implements WritableComparable<BinaryComparable>

package org.apache.hadoop.io;

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/** This class stores text using standard UTF8 encoding.  It provides methods
 * to serialize, deserialize, and compare texts at byte level.  The type of
 * length is integer and is serialized using zero-compressed format.  <p>In
 * addition, it provides methods for string traversal without converting the
 * byte array to a string.  <p>Also includes utilities for
 * serializing/deserializing a string, coding/decoding a string, checking if a
 * byte array contains valid UTF8 code, calculating the length of an encoded
 * string.
 */
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {
  private static final Log LOG = LogFactory.getLog(Text.class);

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
      protected CharsetDecoder initialValue() {
        return Charset.forName("UTF-8").newDecoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };

  private static final byte [] EMPTY_BYTES = new byte[0];

  private byte[] bytes;
  private int length;

  public Text() {
    bytes = EMPTY_BYTES;
  }

  /** Construct from a string.
   */
  public Text(String string) {
    set(string);
  }

  /** Construct from another text. */
  public Text(Text utf8) {
    set(utf8);
  }

  /** Construct from a byte array.
   */
  public Text(byte[] utf8) {
    set(utf8);
  }

  /**
   * Returns the raw bytes; however, only data up to {@link #getLength()} is
   * valid.
   */
  public byte[] getBytes() {
    return bytes;
  }

  /** Returns the number of bytes in the byte array */
  public int getLength() {
    return length;
  }

  /**
   * Returns the Unicode Scalar Value (32-bit integer value)
   * for the character at <code>position</code>. Note that this
   * method avoids using the converter or doing String instantiation
   * @return the Unicode scalar value at position or -1
   *          if the position is invalid or points to a
   *          trailing byte
   */
  public int charAt(int position) {
    if (position > this.length) return -1; // too long
    if (position < 0) return -1; // duh.

    ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
    return bytesToCodePoint(bb.slice());
  }

  public int find(String what) {
    return find(what, 0);
  }

  /**
   * Finds any occurrence of <code>what</code> in the backing
   * buffer, starting at position <code>start</code>. The starting
   * position is measured in bytes and the return value is in
   * terms of byte position in the buffer. The backing buffer is
   * not converted to a string for this operation.
   * @return byte position of the first occurrence of the search
   *         string in the UTF-8 buffer or -1 if not found
   */
  public int find(String what, int start) {
    try {
      ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
      ByteBuffer tgt = encode(what);
      byte b = tgt.get();
      src.position(start);

      while (src.hasRemaining()) {
        if (b == src.get()) { // matching first byte
          src.mark(); // save position in loop
          tgt.mark(); // save position in target
          boolean found = true;
          int pos = src.position()-1;
          while (tgt.hasRemaining()) {
            if (!src.hasRemaining()) { // src expired first
              tgt.reset();
              src.reset();
              found = false;
              break;
            }
            if (!(tgt.get() == src.get())) {
              tgt.reset();
              src.reset();
              found = false;
              break; // no match
            }
          }
          if (found) return pos;
        }
      }
      return -1; // not found
    } catch (CharacterCodingException e) {
      // can't get here
      e.printStackTrace();
      return -1;
    }
  }

  /** Set to contain the contents of a string.
   */
  public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    } catch (CharacterCodingException e) {
      throw new RuntimeException("Should not have happened " + e.toString());
    }
  }

  /** Set to a utf8 byte array
   */
  public void set(byte[] utf8) {
    set(utf8, 0, utf8.length);
  }

  /** copy a text. */
  public void set(Text other) {
    set(other.getBytes(), 0, other.getLength());
  }

  /**
   * Set the Text to range of bytes
   * @param utf8 the data to copy from
   * @param start the first position of the new string
   * @param len the number of bytes of the new string
   */
  public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
  }

  /**
   * Append a range of bytes to the end of the given text
   * @param utf8 the data to copy from
   * @param start the first position to append from utf8
   * @param len the number of bytes to append
   */
  public void append(byte[] utf8, int start, int len) {
    setCapacity(length + len, true);
    System.arraycopy(utf8, start, bytes, length, len);
    length += len;
  }

  /**
   * Clear the string to empty.
   */
  public void clear() {
    length = 0;
  }

  /*
   * Sets the capacity of this Text object to <em>at least</em>
   * <code>len</code> bytes. If the current buffer is longer,
   * then the capacity and existing content of the buffer are
   * unchanged. If <code>len</code> is larger
   * than the current capacity, the Text object's capacity is
   * increased to match.
   * @param len the number of bytes we need
   * @param keepData should the old data be kept
   */
  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      byte[] newBytes = new byte[len];
      if (bytes != null && keepData) {
        System.arraycopy(bytes, 0, newBytes, 0, length);
      }
      bytes = newBytes;
    }
  }

  /**
   * Convert text back to string
   * @see java.lang.Object#toString()
   */
  public String toString() {
    try {
      return decode(bytes, 0, length);
    } catch (CharacterCodingException e) {
      throw new RuntimeException("Should not have happened " + e.toString());
    }
  }

  /** deserialize
   */
  public void readFields(DataInput in) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    setCapacity(newLength, false);
    in.readFully(bytes, 0, newLength);
    length = newLength;
  }

  /** Skips over one Text in the input. */
  public static void skip(DataInput in) throws IOException {
    int length = WritableUtils.readVInt(in);
    WritableUtils.skipFully(in, length);
  }

  /** serialize
   * write this object to out
   * length uses zero-compressed encoding
   * @see Writable#write(DataOutput)
   */
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  /** Returns true iff <code>o</code> is a Text with the same contents.  */
  public boolean equals(Object o) {
    if (o instanceof Text)
      return super.equals(o);
    return false;
  }

  public int hashCode() {
    return super.hashCode();
  }

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
    }
  }

  static {
    // register this comparator
    WritableComparator.define(Text.class, new Comparator());
  }

  /// STATIC UTILITIES FROM HERE DOWN
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If the input is malformed,
   * replace by a default value.
   */
  public static String decode(byte[] utf8) throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8), true);
  }

  public static String decode(byte[] utf8, int start, int length)
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), true);
  }

  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   */
  public static String decode(byte[] utf8, int start, int length, boolean replace)
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), replace);
  }

  private static String decode(ByteBuffer utf8, boolean replace)
    throws CharacterCodingException {
    CharsetDecoder decoder = DECODER_FACTORY.get();
    if (replace) {
      decoder.onMalformedInput(
          java.nio.charset.CodingErrorAction.REPLACE);
      decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    String str = decoder.decode(utf8).toString();
    // set decoder back to its default value: REPORT
    if (replace) {
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return str;
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If the input is malformed,
   * invalid chars are replaced by a default value.
   * @return ByteBuffer: bytes stores at ByteBuffer.array()
   *                     and length is ByteBuffer.limit()
   */

  public static ByteBuffer encode(String string)
    throws CharacterCodingException {
    return encode(string, true);
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   * @return ByteBuffer: bytes stores at ByteBuffer.array()
   *                     and length is ByteBuffer.limit()
   */
  public static ByteBuffer encode(String string, boolean replace)
    throws CharacterCodingException {
    CharsetEncoder encoder = ENCODER_FACTORY.get();
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPLACE);
      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    ByteBuffer bytes =
      encoder.encode(CharBuffer.wrap(string.toCharArray()));
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPORT);
      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return bytes;
  }

  /** Read a UTF8 encoded string from in
   */
  public static String readString(DataInput in) throws IOException {
    int length = WritableUtils.readVInt(in);
    byte [] bytes = new byte[length];
    in.readFully(bytes, 0, length);
    return decode(bytes);
  }

  /** Write a UTF8 encoded string to out
   */
  public static int writeString(DataOutput out, String s) throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  // states for validateUTF8

  private static final int LEAD_BYTE = 0;

  private static final int TRAIL_BYTE_1 = 1;

  private static final int TRAIL_BYTE = 2;

  /**
   * Check if a byte array contains valid utf-8
   * @param utf8 byte array
   * @throws MalformedInputException if the byte array contains invalid utf-8
   */
  public static void validateUTF8(byte[] utf8) throws MalformedInputException {
    validateUTF8(utf8, 0, utf8.length);
  }

  /**
   * Check to see if a byte array is valid utf-8
   * @param utf8 the array of bytes
   * @param start the offset of the first byte in the array
   * @param len the length of the byte sequence
   * @throws MalformedInputException if the byte array contains invalid bytes
   */
  public static void validateUTF8(byte[] utf8, int start, int len)
    throws MalformedInputException {
    int count = start;
    int leadByte = 0;
    int length = 0;
    int state = LEAD_BYTE;
    while (count < start+len) {
      int aByte = ((int) utf8[count] & 0xFF);

      switch (state) {
      case LEAD_BYTE:
        leadByte = aByte;
        length = bytesFromUTF8[aByte];

        switch (length) {
        case 0: // check for ASCII
          if (leadByte > 0x7F)
            throw new MalformedInputException(count);
          break;
        case 1:
          if (leadByte < 0xC2 || leadByte > 0xDF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 2:
          if (leadByte < 0xE0 || leadByte > 0xEF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 3:
          if (leadByte < 0xF0 || leadByte > 0xF4)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        default:
          // too long! Longest valid UTF-8 is 4 bytes (lead + three)
          // or if < 0 we got a trail byte in the lead byte position
          throw new MalformedInputException(count);
        } // switch (length)
        break;

      case TRAIL_BYTE_1:
        if (leadByte == 0xF0 && aByte < 0x90)
          throw new MalformedInputException(count);
        if (leadByte == 0xF4 && aByte > 0x8F)
          throw new MalformedInputException(count);
        if (leadByte == 0xE0 && aByte < 0xA0)
          throw new MalformedInputException(count);
        if (leadByte == 0xED && aByte > 0x9F)
          throw new MalformedInputException(count);
        // falls through to regular trail-byte test!!
      case TRAIL_BYTE:
        if (aByte < 0x80 || aByte > 0xBF)
          throw new MalformedInputException(count);
        if (--length == 0) {
          state = LEAD_BYTE;
        } else {
          state = TRAIL_BYTE;
        }
        break;
      } // switch (state)
      count++;
    }
  }

  /**
   * Magic numbers for UTF-8. These are the number of bytes
   * that <em>follow</em> a given lead byte. Trailing bytes
   * have the value -1. The values 4 and 5 are presented in
   * this table, even though valid UTF-8 cannot include the
   * five and six byte sequences.
   */
  static final int[] bytesFromUTF8 =
  { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0,
    // trail bytes
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
    3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };

  /**
   * Returns the next code point at the current position in
   * the buffer. The buffer's position will be incremented.
   * Any mark set on this buffer will be changed by this method!
   */
  public static int bytesToCodePoint(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
    if (extraBytesToRead < 0) return -1; // trailing byte!
    int ch = 0;

    switch (extraBytesToRead) {
    case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 0: ch += (bytes.get() & 0xFF);
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    return ch;
  }


  static final int offsetsFromUTF8[] =
  { 0x00000000, 0x00003080,
    0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };

  /**
   * For the given string, returns the number of UTF-8 bytes
   * required to encode the string.
   * @param string text to encode
   * @return number of UTF-8 bytes required to encode
   */
  public static int utf8Length(String string) {
    CharacterIterator iter = new StringCharacterIterator(string);
    char ch = iter.first();
    int size = 0;
    while (ch != CharacterIterator.DONE) {
      if ((ch >= 0xD800) && (ch < 0xDC00)) {
        // surrogate pair?
        char trail = iter.next();
        if ((trail > 0xDBFF) && (trail < 0xE000)) {
          // valid pair
          size += 4;
        } else {
          // invalid pair
          size += 3;
          iter.previous(); // rewind one
        }
      } else if (ch < 0x80) {
        size++;
      } else if (ch < 0x800) {
        size += 2;
      } else {
        // ch < 0x10000, that is, the largest char value
        size += 3;
      }
      ch = iter.next();
    }
    return size;
  }
}

Text extends the BinaryComparable base class and implements the WritableComparable<BinaryComparable> interface.
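The framing used by write and readFields in the listing above (a zero-compressed length followed by the raw UTF-8 bytes) can be sketched with the JDK alone. The class VIntSketch and its methods are hypothetical names for this example. The encoding is modeled on my understanding of Hadoop's WritableUtils.writeVLong and readVLong and is not copied from Hadoop, so the exact byte layout should be checked against the WritableUtils source for your Hadoop version before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class VIntSketch {
  // Small values (-112..127) take one byte. Larger values take a marker byte
  // that encodes the sign and the byte count, followed by the magnitude bytes
  // in big-endian order. (Assumed layout, see the note above.)
  static byte[] encode(long value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    if (value >= -112 && value <= 127) {
      buffer.write((int) value);
      return buffer.toByteArray();
    }
    int marker = -112;
    if (value < 0) {
      value = ~value; // store the one's complement of negative values
      marker = -120;
    }
    for (long tmp = value; tmp != 0; tmp >>= 8) {
      marker--; // one step per magnitude byte
    }
    buffer.write(marker);
    int count = (marker < -120) ? -(marker + 120) : -(marker + 112);
    for (int idx = count; idx != 0; idx--) {
      int shift = (idx - 1) * 8;
      buffer.write((int) ((value >> shift) & 0xFF));
    }
    return buffer.toByteArray();
  }

  // Inverse of encode: reads the marker, then rebuilds the magnitude.
  static long decode(byte[] data) {
    byte first = data[0];
    if (first >= -112) {
      return first;
    }
    boolean negative = first < -120;
    int count = negative ? -(first + 120) : -(first + 112);
    long value = 0;
    for (int i = 1; i <= count; i++) {
      value = (value << 8) | (data[i] & 0xFF);
    }
    return negative ? ~value : value;
  }

  // Frames a string as a length prefix followed by its UTF-8 bytes.
  static byte[] frame(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] prefix = encode(utf8.length);
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    buffer.write(prefix, 0, prefix.length);
    buffer.write(utf8, 0, utf8.length);
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(encode(5)));   // prints [5]
    System.out.println(Arrays.toString(encode(300))); // prints [-114, 1, 44]
    System.out.println(frame("hello").length);        // prints 6
  }
}
```

Under this scheme any length up to 127 costs a single byte, which is why short Text values carry only one byte of overhead.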
WritableComparable was covered above, so now let us look at the BinaryComparable base class, which is defined as follows:
package org.apache.hadoop.io;

public abstract class BinaryComparable implements Comparable<BinaryComparable> {
  public abstract int getLength();
  public abstract byte[] getBytes();

  public int compareTo(BinaryComparable other) {
    if (this == other)
      return 0;
    return WritableComparator.compareBytes(getBytes(), 0, getLength(),
        other.getBytes(), 0, other.getLength());
  }

  public int compareTo(byte[] other, int off, int len) {
    return WritableComparator.compareBytes(getBytes(), 0, getLength(),
        other, off, len);
  }

  public boolean equals(Object other) {
    if (!(other instanceof BinaryComparable))
      return false;
    BinaryComparable that = (BinaryComparable)other;
    if (this.getLength() != that.getLength())
      return false;
    return this.compareTo(that) == 0;
  }

  public int hashCode() {
    return WritableComparator.hashBytes(getBytes(), getLength());
  }
}
BinaryComparable is an abstract class whose main purpose is to let two objects be compared directly at the level of their bytes.
Here,
WritableComparator.compareBytes(getBytes(), 0, getLength(), other.getBytes(), 0, other.getLength());
compares the two byte sequences in lexicographic order and returns the result,
while
WritableComparator.hashBytes(getBytes(), getLength());
returns the hash code of the byte sequence.
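To show what a lexicographic byte comparison and a byte-sequence hash involve, here is a minimal JDK-only sketch. ByteOrderSketch and its two methods are hypothetical and only illustrate the technique; they are not Hadoop's WritableComparator implementation, and in particular the exact hash formula Hadoop uses is an assumption here.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderSketch {
  // Lexicographic comparison that treats every byte as unsigned (0..255).
  static int compareBytes(byte[] a, int aOff, int aLen,
                          byte[] b, int bOff, int bLen) {
    int common = Math.min(aLen, bLen);
    for (int i = 0; i < common; i++) {
      int x = a[aOff + i] & 0xFF;
      int y = b[bOff + i] & 0xFF;
      if (x != y) {
        return x - y; // first differing byte decides the order
      }
    }
    return aLen - bLen; // a prefix sorts before the longer sequence
  }

  // Polynomial hash over the first len bytes (assumed formula, illustrative only).
  static int hashBytes(byte[] bytes, int len) {
    int hash = 1;
    for (int i = 0; i < len; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  public static void main(String[] args) {
    byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
    byte[] apricot = "apricot".getBytes(StandardCharsets.UTF_8);
    int order = compareBytes(apple, 0, apple.length, apricot, 0, apricot.length);
    System.out.println(order < 0 ? "apple sorts first" : "apricot sorts first");
    System.out.println(hashBytes(apple, apple.length));
  }
}
```

Masking with 0xFF matters because Java bytes are signed: without it, every UTF-8 lead or trail byte (0x80 and above) would compare as smaller than any ASCII byte.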
Now let us take an overall look at Text and its methods.
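Text is a Writable for UTF-8 sequences and can generally be regarded as the Writable equivalent of java.lang.String. It should not be confused with the string methods of DataInput and DataOutput (readUTF and writeUTF), which use Java's modified UTF-8. Modified UTF-8 differs from standard UTF-8 in that U+0000 is encoded in two bytes and supplementary characters are encoded as two three-byte surrogates. Text, as its Javadoc quoted above states, stores standard UTF-8: its write method emits a zero-compressed length followed by the raw UTF-8 bytes and never calls writeUTF.
Because Text operates on the UTF-8 encoded bytes, the Text class and Java's String differ in several ways.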
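The difference between standard UTF-8 and the modified UTF-8 produced by DataOutputStream.writeUTF can be observed with the JDK alone. The sketch below compares the encoded sizes of a supplementary character and of U+0000; the class ModifiedUtf8Demo and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
  // Size of the string in standard UTF-8.
  static int standardUtf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Size of the payload written by writeUTF, excluding its 2-byte length prefix.
  static int modifiedUtf8Length(String s) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      new DataOutputStream(buffer).writeUTF(s);
      return buffer.size() - 2;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String supplementary = "\uD801\uDC00"; // U+10400 as a surrogate pair
    String nul = "\0";                     // U+0000
    System.out.println(standardUtf8Length(supplementary)); // prints 4
    System.out.println(modifiedUtf8Length(supplementary)); // prints 6
    System.out.println(standardUtf8Length(nul));           // prints 1
    System.out.println(modifiedUtf8Length(nul));           // prints 2
  }
}
```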
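1. Indexing. Text is indexed by position in the encoded byte sequence, not by Unicode character in the string and not by Java char code unit (as String is). For ASCII strings these three notions of index coincide, because every ASCII character occupies exactly one unit in each of them: one byte, one code point, one char.
As soon as a character needs more than one byte in UTF-8, however, Text and String diverge.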
import org.apache.hadoop.io.Text;

public class TextExample {
  public static void main(String[] args) {
    // U+0041 (1 byte), U+00DF (2 bytes), U+6771 (3 bytes),
    // and U+10400 written as a surrogate pair (4 bytes)
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    Text t = new Text(str);
    System.out.println(str.length());   // number of char code units
    System.out.println(t.getLength());  // number of UTF-8 bytes
  }
}
Output:
5
10
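This confirms that the length of a String is the number of char code units it contains, whereas the length of a Text is the number of bytes in its UTF-8 encoding (10 = 1 + 2 + 3 + 4). How are these per-character sizes worked out?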
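The arithmetic can be checked without Hadoop. The sketch below applies the standard UTF-8 size rule to each code point of the same string and prints the byte offset at which each character starts, which is the kind of position Text's byte-based indexing works with. Utf8LengthDemo and its methods are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
  // Number of UTF-8 bytes needed for one Unicode code point.
  static int utf8Bytes(int codePoint) {
    if (codePoint < 0x80) return 1;     // ASCII
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                           // supplementary characters
  }

  // UTF-8 size of every code point in the string, in order.
  static int[] perCodePointBytes(String s) {
    return s.codePoints().map(Utf8LengthDemo::utf8Bytes).toArray();
  }

  public static void main(String[] args) {
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    int offset = 0;
    for (int size : perCodePointBytes(str)) {
      System.out.println("byte offset " + offset + ": " + size + " byte(s)");
      offset += size;
    }
    System.out.println("chars=" + str.length()
        + ", code points=" + str.codePointCount(0, str.length())
        + ", UTF-8 bytes=" + str.getBytes(StandardCharsets.UTF_8).length);
    // last line printed: chars=5, code points=4, UTF-8 bytes=10
  }
}
```

The four characters therefore start at byte offsets 0, 1, 3, and 6, while String sees five char code units because the last character is stored as a surrogate pair.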
To be continued.