HDFS Read/Write Process Explained
1. Opening a File
1.1 The Client
To open a file in HDFS, the client calls DistributedFileSystem.open(Path f, int bufferSize), implemented as follows:
```java
public FSDataInputStream open(Path f, int bufferSize) throws IOException {
  return new DFSClient.DFSDataInputStream(
      dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
}
```
Here dfs is the DistributedFileSystem member variable of type DFSClient. Its open function is invoked, which creates and returns a DFSInputStream(src, buffersize, verifyChecksum).
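For orientation, here is a minimal client-side usage sketch of this read path, using the standard Hadoop FileSystem API (the path is invented):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenDemo {
  public static void main(String[] args) throws Exception {
    // assuming fs.defaultFS points at an HDFS cluster, FileSystem.get
    // returns a DistributedFileSystem, so open() is the method shown above
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/user/example/data.txt"));
    byte[] buf = new byte[4096];
    int n = in.read(0L, buf, 0, buf.length); // positioned read, analyzed in part 2
    in.close();
  }
}
```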
The DFSInputStream constructor calls openInfo, which obtains from the NameNode the information about the blocks of the file being opened:
```java
synchronized void openInfo() throws IOException {
  LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
  this.locatedBlocks = newInfo;
  this.currentNode = null;
}
```
```java
private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length) throws IOException {
  return namenode.getBlockLocations(src, start, length);
}
```
LocatedBlocks mainly holds a list, List&lt;LocatedBlock&gt; blocks, where each LocatedBlock contains the following fields (pictured in the sketch after the list):
- Block b: the information about this block
- long offset: the offset of this block within the file
- DatanodeInfo[] locs: the DataNodes on which this block resides
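Put together, a LocatedBlock can be pictured as the following stand-in (an illustration of the fields above, not the real class):

```java
// illustrative stand-in mirroring the fields listed above
class LocatedBlockSketch {
  Block b;              // the block itself: id, length, generation stamp
  long offset;          // where this block starts within the file
  DatanodeInfo[] locs;  // the DataNodes holding a replica of this block
}
```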
The namenode.getBlockLocations call above is an RPC; it ultimately invokes the getBlockLocations function of the NameNode class.
1.2 The NameNode
NameNode.getBlockLocations is implemented as follows:
```java
public LocatedBlocks getBlockLocations(String src,
                                       long offset,
                                       long length) throws IOException {
  return namesystem.getBlockLocations(getClientMachine(),
                                      src, offset, length);
}
```
namesystem is a member variable of NameNode of type FSNamesystem; it holds the NameNode's namespace tree, and one of its important members is FSDirectory dir.
FSDirectory is unrelated to Lucene's FSDirectory. Its main members are FSImage fsImage, used to read and write the on-disk fsimage file; FSImage in turn contains FSEditLog editLog, used to read and write the on-disk edits file. The relationship between these two files was explained in the previous article.
FSDirectory also has an important member INodeDirectoryWithQuota rootDir, whose parent class INodeDirectory is implemented as follows:
```java
public class INodeDirectory extends INode {
  ......
  private List<INode> children;
  ......
}
```
So INodeDirectory is itself an INode that holds a list of INodes: an entry that is a directory is again an INodeDirectory, while a file is an INodeFile, which carries a member BlockInfo blocks[] describing the file's blocks. Together these clearly form a tree.
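To make the tree concrete, here is a hypothetical, simplified sketch of resolving a path against such a tree; the getChild helper is an invented stand-in, not the real FSDirectory API:

```java
// hypothetical sketch: walk "/a/b/file" down the INode tree described above
static INode resolve(INodeDirectory root, String path) {
  INode cur = root;
  for (String name : path.split("/")) {
    if (name.isEmpty()) continue;                       // skip the leading "/"
    if (!(cur instanceof INodeDirectory)) return null;  // a file cannot have children
    cur = ((INodeDirectory) cur).getChild(name);        // invented helper: scan the children list
    if (cur == null) return null;                       // component not found
  }
  return cur; // an INodeFile for a file, an INodeDirectory for a directory
}
```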
FSNamesystem.getBlockLocations is as follows:
```java
public LocatedBlocks getBlockLocations(String src, long offset, long length,
    boolean doAccessTime) throws IOException {
  final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src),
      offset, length, Integer.MAX_VALUE, doAccessTime);
  return ret;
}
```
dir.getFileINode(src) resolves the path through the filesystem tree to the INodeFile holding the INode information of the file being opened.
getBlockLocationsInternal is implemented as follows:
```java
private synchronized LocatedBlocks getBlockLocationsInternal(String src,
                                                    INodeFile inode,
                                                    long offset,
                                                    long length,
                                                    int nrBlocksToReturn,
                                                    boolean doAccessTime)
                                                    throws IOException {
  // get this file's block information
  Block[] blocks = inode.getBlocks();
  List<LocatedBlock> results = new ArrayList<LocatedBlock>(blocks.length);
  // compute which blocks are covered by [offset, offset + length)
  int curBlk = 0;
  long curPos = 0, blkSize = 0;
  int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length;
  for (curBlk = 0; curBlk < nrBlocks; curBlk++) {
    blkSize = blocks[curBlk].getNumBytes();
    if (curPos + blkSize > offset) {
      // offset falls between curPos and curPos + blkSize, so curBlk
      // points at the block containing offset
      break;
    }
    curPos += blkSize;
  }
  long endOff = offset + length;
  // iterate over the blocks starting at curBlk until curPos passes endOff
  do {
    int numNodes = blocksMap.numNodes(blocks[curBlk]);
    int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas();
    int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]);
    boolean blockCorrupt = (numCorruptNodes == numNodes);
    int numMachineSet = blockCorrupt ? numNodes :
                            (numNodes - numCorruptNodes);
    // find the DataNodes holding this block; collect the uncorrupted replicas into machineSet
    DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet];
    if (numMachineSet > 0) {
      numNodes = 0;
      for (Iterator<DatanodeDescriptor> it =
          blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) {
        DatanodeDescriptor dn = it.next();
        boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn);
        if (blockCorrupt || (!blockCorrupt && !replicaCorrupt))
          machineSet[numNodes++] = dn;
      }
    }
    // build a LocatedBlock from this machineSet and the current block
    results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos,
                blockCorrupt));
    curPos += blocks[curBlk].getNumBytes();
    curBlk++;
  } while (curPos < endOff
        && curBlk < blocks.length
        && results.size() < nrBlocksToReturn);
  // wrap the LocatedBlock list in a LocatedBlocks object and return it
  return inode.createLocatedBlocks(results);
}
```
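The first loop above is easy to check in isolation; a self-contained sketch of the same offset-to-block scan:

```java
// given only the block sizes, find the index of the block containing `offset`
static int findStartBlock(long[] blockSizes, long offset) {
  long curPos = 0;
  for (int curBlk = 0; curBlk < blockSizes.length; curBlk++) {
    if (curPos + blockSizes[curBlk] > offset) {
      return curBlk;             // offset falls inside this block
    }
    curPos += blockSizes[curBlk];
  }
  return blockSizes.length;      // offset is at or beyond the end of the file
}
// e.g. findStartBlock(new long[]{64, 64, 64}, 100) == 1 (sizes and offset in MB)
```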
1.3 The Client
The LocatedBlocks object obtained from the NameNode via the RPC call is used as a member variable to construct a DFSInputStream, which is finally wrapped in an FSDataInputStream and returned to the user.
2. Reading a File
2.1 The Client
To read the file, the client uses the read(long position, byte[] buffer, int offset, int length) method of the FSDataInputStream obtained when the file was opened.
FSDataInputStream delegates to read(long position, byte[] buffer, int offset, int length) of the DFSInputStream it wraps, implemented as follows:
```java
public int read(long position, byte[] buffer, int offset, int length)
    throws IOException {
  long filelen = getFileLength();
  int realLen = length;
  if ((position + length) > filelen) {
    realLen = (int)(filelen - position);
  }
  // first get the list of blocks covering [position, position + length)
  // e.g. with 64 MB blocks, reading 128 MB starting at 100 MB touches blocks 2, 3 and 4
  List<LocatedBlock> blockRange = getBlockRange(position, realLen);
  int remaining = realLen;
  // read the needed range from each block
  // in the example above: block 2 from 36 MB for 28 MB, block 3 in full (64 MB),
  // block 4 from 0 for 36 MB -- 128 MB in total
  for (LocatedBlock blk : blockRange) {
    long targetStart = position - blk.getStartOffset();
    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
    fetchBlockByteRange(blk, targetStart,
                        targetStart + bytesToRead - 1, buffer, offset);
    remaining -= bytesToRead;
    position += bytesToRead;
    offset += bytesToRead;
  }
  assert remaining == 0 : "Wrong number of bytes read.";
  if (stats != null) {
    stats.incrementBytesRead(realLen);
  }
  return realLen;
}
```
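The per-block arithmetic in the loop can be replayed with the numbers from the comments (64 MB blocks, a 128 MB read starting at 100 MB); a self-contained sketch:

```java
public class ReadSplitDemo {
  public static void main(String[] args) {
    final long MB = 1024L * 1024;
    long blockSize = 64 * MB, position = 100 * MB, remaining = 128 * MB;
    while (remaining > 0) {
      long blockStart = (position / blockSize) * blockSize; // stands in for blk.getStartOffset()
      long targetStart = position - blockStart;             // offset within the block
      long bytesToRead = Math.min(remaining, blockSize - targetStart);
      System.out.printf("block %d: start %d MB, read %d MB%n",
          position / blockSize + 1, targetStart / MB, bytesToRead / MB);
      remaining -= bytesToRead;
      position += bytesToRead;
    }
    // prints: block 2: start 36 MB, read 28 MB
    //         block 3: start 0 MB, read 64 MB
    //         block 4: start 0 MB, read 36 MB
  }
}
```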
The getBlockRange function is as follows:
```java
private synchronized List<LocatedBlock> getBlockRange(long offset,
                                                      long length)
                                                    throws IOException {
  List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();
  // first look up, in the cached locatedBlocks, the position of the block containing offset
  int blockIdx = locatedBlocks.findBlock(offset);
  if (blockIdx < 0) { // block is not cached
    blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
  }
  long remaining = length;
  long curOff = offset;
  while (remaining > 0) {
    LocatedBlock blk = null;
    // fetch the block at position blockIdx
    if (blockIdx < locatedBlocks.locatedBlockCount())
      blk = locatedBlocks.get(blockIdx);
    // if blk is null the block is not cached: ask the NameNode directly and add the result to the cache
    if (blk == null || curOff < blk.getStartOffset()) {
      LocatedBlocks newBlocks;
      newBlocks = callGetBlockLocations(namenode, src, curOff, remaining);
      locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
      continue;
    }
    // the block was found: add it to the result set
    blockRange.add(blk);
    long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
    remaining -= bytesRead;
    curOff += bytesRead;
    // move on to the next block
    blockIdx++;
  }
  return blockRange;
}
```
fetchBlockByteRange is implemented as follows:
```java
private void fetchBlockByteRange(LocatedBlock block, long start,
                                 long end, byte[] buf, int offset) throws IOException {
  Socket dn = null;
  int numAttempts = block.getLocations().length;
  // this loop retries on read failure, at most once per replica location
  while (dn == null && numAttempts-- > 0) {
    // pick a DataNode to read from
    DNAddrPair retval = chooseDataNode(block);
    DatanodeInfo chosenNode = retval.info;
    InetSocketAddress targetAddr = retval.addr;
    BlockReader reader = null;
    try {
      // open a socket connection to the DataNode
      dn = socketFactory.createSocket();
      dn.connect(targetAddr, socketTimeout);
      dn.setSoTimeout(socketTimeout);
      int len = (int) (end - start + 1);
      // over the established socket, build a reader responsible for fetching the data
      reader = BlockReader.newBlockReader(dn, src,
                                          block.getBlock().getBlockId(),
                                          block.getBlock().getGenerationStamp(),
                                          start, len, buffersize,
                                          verifyChecksum, clientName);
      // read the data
      int nread = reader.readAll(buf, offset, len);
      return;
    } finally {
      IOUtils.closeStream(reader);
      IOUtils.closeSocket(dn);
      dn = null;
    }
    // if the read failed, mark this DataNode as dead
    // (the catch block is elided in this excerpt)
    addToDeadNodes(chosenNode);
  }
}
```
BlockReader.newBlockReader is implemented as follows:
```java
public static BlockReader newBlockReader( Socket sock, String file,
                                   long blockId,
                                   long genStamp,
                                   long startOffset, long len,
                                   int bufferSize, boolean verifyChecksum,
                                   String clientName)
                                   throws IOException {
  // build an output stream over the socket and send the read request to the DataNode
  DataOutputStream out = new DataOutputStream(
    new BufferedOutputStream(NetUtils.getOutputStream(sock, HdfsConstants.WRITE_TIMEOUT)));
  out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
  out.write( DataTransferProtocol.OP_READ_BLOCK );
  out.writeLong( blockId );
  out.writeLong( genStamp );
  out.writeLong( startOffset );
  out.writeLong( len );
  Text.writeString(out, clientName);
  out.flush();
  // build an input stream over the socket for reading the data back from the DataNode
  DataInputStream in = new DataInputStream(
      new BufferedInputStream(NetUtils.getInputStream(sock),
                              bufferSize));
  DataChecksum checksum = DataChecksum.newDataChecksum( in );
  long firstChunkOffset = in.readLong();
  // build a reader, wrapping the input stream, that will do the actual reading
  return new BlockReader( file, blockId, in, checksum, verifyChecksum,
                          startOffset, firstChunkOffset, sock );
}
```
BlockReader's readAll function then reads the data from the DataInputStream created above.
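readAll is essentially a read-until-full loop over that stream. A hedged, self-contained sketch of the idea (illustration only, not the actual BlockReader code):

```java
import java.io.IOException;
import java.io.InputStream;

final class ReadAllSketch {
  // keep calling read() until len bytes have arrived or the stream ends
  static int readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total); // a single read() may return fewer bytes
      if (n < 0) break;                               // end of stream
      total += n;
    }
    return total;
  }
}
```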
2.2 The DataNode
When a DataNode starts, startDataNode is called; the logic related to serving data reads is:
```java
void startDataNode(Configuration conf,
                   AbstractList<File> dataDirs
                   ) throws IOException {
  ......
  // create a ServerSocket and a DataXceiverServer to listen for client connections
  ServerSocket ss = (socketWriteTimeout > 0) ?
        ServerSocketChannel.open().socket() : new ServerSocket();
  Server.bind(ss, socAddr, 0);
  ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  // adjust machine name with the actual port
  tmpPort = ss.getLocalPort();
  selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),
                                   tmpPort);
  this.dnRegistration.setName(machineName + ":" + tmpPort);
  this.threadGroup = new ThreadGroup("dataXceiverServer");
  this.dataXceiverServer = new Daemon(threadGroup,
      new DataXceiverServer(ss, conf, this));
  this.threadGroup.setDaemon(true); // auto destroy when empty
  ......
}
```
DataXceiverServer.run() is as follows:
```java
public void run() {
  while (datanode.shouldRun) {
      // accept a client connection
      Socket s = ss.accept();
      s.setTcpNoDelay(true);
      // spawn a DataXceiver thread to serve the new connection
      new Daemon(datanode.threadGroup,
          new DataXceiver(s, datanode, this)).start();
  }
  try {
    ss.close();
  } catch (IOException ie) {
    LOG.warn(datanode.dnRegistration + ":DataXceiveServer: "
                            + StringUtils.stringifyException(ie));
  }
}
```
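Stripped of Hadoop specifics, this is the classic thread-per-connection accept loop; a minimal sketch of the same pattern:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

final class AcceptLoopSketch {
  public static void main(String[] args) throws IOException {
    try (ServerSocket ss = new ServerSocket(0)) { // bind an ephemeral port
      while (true) {
        Socket s = ss.accept();                   // block until a client connects
        new Thread(() -> handle(s)).start();      // one service thread per connection
      }
    }
  }

  static void handle(Socket s) {
    // read an opcode and dispatch -- the role DataXceiver.run() plays below
    try { s.close(); } catch (IOException ignored) {}
  }
}
```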
DataXceiver.run() is as follows:
```java
public void run() {
  DataInputStream in = null;
  try {
    // build an input stream to read the opcode sent by the client
    in = new DataInputStream(
        new BufferedInputStream(NetUtils.getInputStream(s),
                                SMALL_BUFFER_SIZE));
    short version = in.readShort();
    boolean local = s.getInetAddress().equals(s.getLocalAddress());
    byte op = in.readByte();
    // Make sure the xciver count is not exceeded
    int curXceiverCount = datanode.getXceiverCount();
    long startTime = DataNode.now();
    switch ( op ) {
    // read
    case DataTransferProtocol.OP_READ_BLOCK:
      // do the actual read
      readBlock( in );
      datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.readsFromLocalClient.inc();
      else
        datanode.myMetrics.readsFromRemoteClient.inc();
      break;
    // write
    case DataTransferProtocol.OP_WRITE_BLOCK:
      // do the actual write
      writeBlock( in );
      datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.writesFromLocalClient.inc();
      else
        datanode.myMetrics.writesFromRemoteClient.inc();
      break;
    // other opcodes
    ......
    }
  } catch (Throwable t) {
    LOG.error(datanode.dnRegistration + ":DataXceiver", t);
  } finally {
    IOUtils.closeStream(in);
    IOUtils.closeSocket(s);
    dataXceiverServer.childSockets.remove(s);
  }
}
```
```java
private void readBlock(DataInputStream in) throws IOException {
  // read the request fields
  long blockId = in.readLong();
  Block block = new Block( blockId, 0 , in.readLong());
  long startOffset = in.readLong();
  long length = in.readLong();
  String clientName = Text.readString(in);
  // build an output stream for sending data back to the client
  OutputStream baseStream = NetUtils.getOutputStream(s,
      datanode.socketWriteTimeout);
  DataOutputStream out = new DataOutputStream(
               new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));
  // build a BlockSender to read the local block data and send it to the client;
  // BlockSender has a member InputStream blockIn for reading the local block
  BlockSender blockSender = new BlockSender(block, startOffset, length,
          true, true, false, datanode, clientTraceFmt);
  try {
    out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
    // stream the data to the client
    long read = blockSender.sendBlock(out, baseStream, null);
    ......
  } finally {
    IOUtils.closeStream(out);
    IOUtils.closeStream(blockSender);
  }
}
```
3. Writing a File
This part walks through the process of uploading a file to HDFS.
3.1 The Client
Uploading a file to HDFS normally goes through DistributedFileSystem.create, implemented as follows:
```java
public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite,
    int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
  return new FSDataOutputStream
     (dfs.create(getPathName(f), permission,
                 overwrite, replication, blockSize, progress, bufferSize),
      statistics);
}
```
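As with reading, a minimal usage sketch of the write path using the standard FileSystem API (path and payload are invented):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/user/example/out.txt")); // the create() shown above
    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    out.close(); // flushes the remaining packets and completes the file on the NameNode
  }
}
```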
This ultimately yields an FSDataOutputStream for writing data into the newly created file. The member variable dfs has type DFSClient, whose create function is as follows:
```java
public OutputStream create(String src,
                           FsPermission permission,
                           boolean overwrite,
                           short replication,
                           long blockSize,
                           Progressable progress,
                           int buffersize
                           ) throws IOException {
  checkOpen();
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
  OutputStream result = new DFSOutputStream(src, masked,
      overwrite, replication, blockSize, progress, buffersize,
      conf.getInt("io.bytes.per.checksum", 512));
  leasechecker.put(src, result);
  return result;
}
```
It constructs a DFSOutputStream whose constructor creates the file on the NameNode through an RPC call (namenode.create).
The constructor also does one more important thing: it calls streamer.start(), which starts the pipeline used to write the data. We will analyze it closely when we get to the write path.
```java
DFSOutputStream(String src, FsPermission masked, boolean overwrite,
    short replication, long blockSize, Progressable progress,
    int buffersize, int bytesPerChecksum) throws IOException {
  this(src, blockSize, progress, bytesPerChecksum);
  computePacketChunkSize(writePacketSize, bytesPerChecksum);
  try {
    namenode.create(
        src, masked, clientName, overwrite, replication, blockSize);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   QuotaExceededException.class);
  }
  streamer.start();
}
```
3.2 The NameNode
NameNode's create function calls namesystem.startFile, which in turn calls startFileInternal, implemented as follows:
```java
private synchronized void startFileInternal(String src,
                                            PermissionStatus permissions,
                                            String holder,
                                            String clientMachine,
                                            boolean overwrite,
                                            boolean append,
                                            short replication,
                                            long blockSize
                                            ) throws IOException {
  ......
  // create a new file in the "under construction" state, with no data blocks yet
  long genstamp = nextGenerationStamp();
  INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
      replication, blockSize, holder, clientMachine, clientNode, genstamp);
  ......
}
```
3.3 The Client
Now the client writes data into the newly created file, normally through FSDataOutputStream's write function, which eventually reaches DFSOutputStream.writeChunk, shown after the pipeline overview below.
By design, HDFS writes block data in a pipelined fashion: the data is split into packets. If three replicas are required, written to DataNode 1, 2, and 3 respectively, the process runs as follows (a timeline sketch follows the list):
- First, packet 1 is written to DataNode 1.
- Then DataNode 1 forwards packet 1 to DataNode 2, while the client can already send packet 2 to DataNode 1.
- Then DataNode 2 forwards packet 1 to DataNode 3, while the client sends packet 3 to DataNode 1 and DataNode 1 forwards packet 2 to DataNode 2.
- Packets keep flowing down the pipeline in this staggered fashion until all the data has been written and replicated.
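The staggering can be pictured as a timeline (illustrative; one packet per hop per step):

| Step | Client → DataNode 1 | DataNode 1 → DataNode 2 | DataNode 2 → DataNode 3 |
|------|---------------------|-------------------------|-------------------------|
| t1   | packet 1            | —                       | —                       |
| t2   | packet 2            | packet 1                | —                       |
| t3   | packet 3            | packet 2                | packet 1                |

Back to the code: writeChunk builds and queues these packets: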
```java
protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum)
                                                      throws IOException {
  // create a packet and write the chunk into it
  currentPacket = new Packet(packetSize, chunksPerPacket,
                             bytesCurBlock);
  currentPacket.writeChecksum(checksum, 0, cklen);
  currentPacket.writeData(b, offset, len);
  currentPacket.numChunks++;
  bytesCurBlock += len;
  // if the packet is full, queue it for sending
  if (currentPacket.numChunks == currentPacket.maxChunks ||
      bytesCurBlock == blockSize) {
      ......
      dataQueue.addLast(currentPacket);
      // wake the thread waiting on dataQueue, i.e. the DataStreamer
      dataQueue.notifyAll();
      currentPacket = null;
      ......
  }
}
```
DataStreamer's run function is as follows:
```java
public void run() {
  while (!closed && clientRunning) {
    Packet one = null;
    synchronized (dataQueue) {
      // wait while the queue holds no packets
      while ((!closed && !hasError && clientRunning
             && dataQueue.size() == 0) || doSleep) {
        try {
          dataQueue.wait(1000);
        } catch (InterruptedException e) {
        }
        doSleep = false;
      }
      try {
        // take the first packet from the queue
        one = dataQueue.getFirst();
        long offsetInBlock = one.offsetInBlock;
        // have the NameNode allocate a block, and open a write stream to it
        if (blockStream == null) {
          nodes = nextBlockOutputStream(src);
          response = new ResponseProcessor(nodes);
          response.start();
        }
        ByteBuffer buf = one.getBuffer();
        // move the packet from dataQueue to ackQueue, where it waits for acknowledgement
        dataQueue.removeFirst();
        dataQueue.notifyAll();
        synchronized (ackQueue) {
          ackQueue.addLast(one);
          ackQueue.notifyAll();
        }
        // write the data into the DataNode's block over the stream opened above
        blockStream.write(buf.array(), buf.position(), buf.remaining());
        if (one.lastPacketInBlock) {
          blockStream.writeInt(0); // signals that this block is complete
        }
        blockStream.flush();
      } catch (Throwable e) {
      }
    }
    ......
  }
}
```
A key function here is nextBlockOutputStream, implemented as follows:
```java
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
  LocatedBlock lb = null;
  boolean retry = false;
  DatanodeInfo[] nodes;
  int count = conf.getInt("dfs.client.block.write.retries", 3);
  boolean success;
  do {
    ......
    // have the NameNode allocate DataNodes and a block for the file
    lb = locateFollowingBlock(startTime);
    block = lb.getBlock();
    nodes = lb.getLocations();
    // open the write stream to the DataNodes
    success = createBlockOutputStream(nodes, clientName, false);
    ......
  } while (retry && --count >= 0);
  return nodes;
}
```
Inside locateFollowingBlock, the client calls namenode.addBlock(src, clientName) via RPC.
3.4 The NameNode
NameNode's addBlock function is implemented as follows:
```java
public LocatedBlock addBlock(String src,
                             String clientName) throws IOException {
  LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
  return locatedBlock;
}
```
FSNamesystem's getAdditionalBlock is implemented as follows:
```java
public LocatedBlock getAdditionalBlock(String src,
                                       String clientName
                                       ) throws IOException {
  long fileLength, blockSize;
  int replication;
  DatanodeDescriptor clientNode = null;
  Block newBlock = null;
  ......
  // choose DataNodes for the new block
  DatanodeDescriptor targets[] = replicator.chooseTarget(replication,
                                                         clientNode,
                                                         null,
                                                         blockSize);
  ......
  // get the INodes of every path component; the last one is the INode of the
  // newly added file, still in the "under construction" state
  INode[] pathINodes = dir.getExistingPathINodes(src);
  int inodesLen = pathINodes.length;
  INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction)
                                              pathINodes[inodesLen - 1];
  // allocate a block for the file and record which DataNodes it should be written to
  newBlock = allocateBlock(src, pathINodes);
  pendingFile.setTargets(targets);
  ......
  return new LocatedBlock(newBlock, targets, fileLength);
}
```
3.5 The Client
Once the DataNodes and the block have been allocated, createBlockOutputStream sets up the connection for writing the data:
```java
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                                        boolean recoveryFlag) {
  // create a socket connected to the first DataNode
  InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
  s = socketFactory.createSocket();
  int timeoutValue = 3000 * nodes.length + socketTimeout;
  s.connect(target, timeoutValue);
  s.setSoTimeout(timeoutValue);
  s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                      datanodeWriteTimeout;
  DataOutputStream out = new DataOutputStream(
      new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                               DataNode.SMALL_BUFFER_SIZE));
  blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
  // send the write request header
  out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
  out.write( DataTransferProtocol.OP_WRITE_BLOCK );
  out.writeLong( block.getBlockId() );
  out.writeLong( block.getGenerationStamp() );
  out.writeInt( nodes.length );
  out.writeBoolean( recoveryFlag );
  Text.writeString( out, client );
  out.writeBoolean(false);
  out.writeInt( nodes.length - 1 );
  // note: this loop starts at 1, not 0 -- the information about every DataNode
  // except the first is sent to the first DataNode, which uses it to forward
  // the data to the other two
  for (int i = 1; i < nodes.length; i++) {
    nodes[i].write(out);
  }
  checksum.writeHeader( out );
  out.flush();
  firstBadLink = Text.readString(blockReplyStream);
  if (firstBadLink.length() != 0) {
    throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
  }
  blockStream = out;
}
```
Having created the write stream in DataStreamer's run function, the client calls blockStream.write to push the data to the DataNode.
3.6 The DataNode
On the DataNode, when DataXceiver receives the DataTransferProtocol.OP_WRITE_BLOCK opcode it calls the writeBlock function:
```java
private void writeBlock(DataInputStream in) throws IOException {
  DatanodeInfo srcDataNode = null;
  // read the header fields
  Block block = new Block(in.readLong(),
      dataXceiverServer.estimateBlockSize, in.readLong());
  int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
  boolean isRecovery = in.readBoolean(); // is this part of recovery?
  String client = Text.readString(in); // working on behalf of this client
  boolean hasSrcDataNode = in.readBoolean(); // is src node info present
  if (hasSrcDataNode) {
    srcDataNode = new DatanodeInfo();
    srcDataNode.readFields(in);
  }
  int numTargets = in.readInt();
  if (numTargets < 0) {
    throw new IOException("Mislabelled incoming datastream.");
  }
  // read the list of remaining DataNodes: the first DataNode receives the second
  // and third here; the second DataNode receives only the third
  DatanodeInfo targets[] = new DatanodeInfo[numTargets];
  for (int i = 0; i < targets.length; i++) {
    DatanodeInfo tmp = new DatanodeInfo();
    tmp.readFields(in);
    targets[i] = tmp;
  }
  DataOutputStream mirrorOut = null;  // stream to next target
  DataInputStream mirrorIn = null;    // reply from next target
  DataOutputStream replyOut = null;   // stream to prev target
  Socket mirrorSock = null;           // socket to next target
  BlockReceiver blockReceiver = null; // responsible for data handling
  String mirrorNode = null;           // the name:port of next target
  String firstBadLink = "";           // first datanode that failed in connection setup
  try {
    // build a BlockReceiver: its DataInputStream in reads data from the client or
    // the previous DataNode, its DataOutputStream mirrorOut writes to the next
    // DataNode, and its OutputStream out writes the data to local disk
    blockReceiver = new BlockReceiver(block, in,
        s.getRemoteSocketAddress().toString(),
        s.getLocalSocketAddress().toString(),
        isRecovery, client, srcDataNode, datanode);
    // get a connection back to the previous target
    replyOut = new DataOutputStream(
                 NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
    // if this is not the last DataNode, open a socket to the next DataNode
    if (targets.length > 0) {
      InetSocketAddress mirrorTarget = null;
      // Connect to backup machine
      mirrorNode = targets[0].getName();
      mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
      mirrorSock = datanode.newSocket();
      int timeoutValue = numTargets * datanode.socketTimeout;
      int writeTimeout = datanode.socketWriteTimeout +
                         (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
      mirrorSock.connect(mirrorTarget, timeoutValue);
      mirrorSock.setSoTimeout(timeoutValue);
      mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      // build the stream that writes to the next DataNode
      mirrorOut = new DataOutputStream(
           new BufferedOutputStream(
                        NetUtils.getOutputStream(mirrorSock, writeTimeout),
                        SMALL_BUFFER_SIZE));
      mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
      mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
      mirrorOut.writeLong( block.getBlockId() );
      mirrorOut.writeLong( block.getGenerationStamp() );
      mirrorOut.writeInt( pipelineSize );
      mirrorOut.writeBoolean( isRecovery );
      Text.writeString( mirrorOut, client );
      mirrorOut.writeBoolean(hasSrcDataNode);
      if (hasSrcDataNode) { // pass src node information
        srcDataNode.write(mirrorOut);
      }
      mirrorOut.writeInt( targets.length - 1 );
      // here too the loop starts at 1: every DataNode except the next one is
      // forwarded to the next DataNode
      for ( int i = 1; i < targets.length; i++ ) {
        targets[i].write( mirrorOut );
      }
      blockReceiver.writeChecksumHeader(mirrorOut);
      mirrorOut.flush();
    }
    // receive the block with the BlockReceiver
    String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
    blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
                               mirrorAddr, null, targets.length);
    ......
  } finally {
    // close all opened streams
    IOUtils.closeStream(mirrorOut);
    IOUtils.closeStream(mirrorIn);
    IOUtils.closeStream(replyOut);
    IOUtils.closeSocket(mirrorSock);
    IOUtils.closeStream(blockReceiver);
  }
}
```
An important piece of logic in BlockReceiver's receiveBlock function is the following:
```java
void receiveBlock(
    DataOutputStream mirrOut, // output to next datanode
    DataInputStream mirrIn,   // input from next datanode
    DataOutputStream replyOut,  // output to previous datanode
    String mirrAddr, BlockTransferThrottler throttlerArg,
    int numTargets) throws IOException {
  ......
  // keep receiving packets until the end of the block
  while (receivePacket() > 0) {}
  if (mirrorOut != null) {
    try {
      mirrorOut.writeInt(0); // mark the end of the block
      mirrorOut.flush();
    } catch (IOException e) {
      handleMirrorOutError(e);
    }
  }
  ......
}
```
BlockReceiver's receivePacket function is as follows:
```java
private int receivePacket() throws IOException {
  // receive one packet from the client or the previous node
  int payloadLen = readNextPacket();
  buf.mark();
  //read the header
  buf.getInt(); // packet length
  offsetInBlock = buf.getLong(); // get offset of packet in block
  long seqno = buf.getLong();    // get seqno
  boolean lastPacketInBlock = (buf.get() != 0);
  int endOfHeader = buf.position();
  buf.reset();
  setBlockPosition(offsetInBlock);
  // forward the packet to the next DataNode
  if (mirrorOut != null) {
    try {
      mirrorOut.write(buf.array(), buf.position(), buf.remaining());
      mirrorOut.flush();
    } catch (IOException e) {
      handleMirrorOutError(e);
    }
  }
  buf.position(endOfHeader);
  int len = buf.getInt();
  offsetInBlock += len;
  int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*
                                                          checksumSize;
  int checksumOff = buf.position();
  int dataOff = checksumOff + checksumLen;
  byte pktBuf[] = buf.array();
  buf.position(buf.limit()); // move to the end of the data.
  ......
  // write the data into the local block
  out.write(pktBuf, dataOff, len);
  /// flush entire packet before sending ack
  flush();
  // put in queue for pending acks
  if (responder != null) {
    ((PacketResponder)responder.getRunnable()).enqueue(seqno,
                                        lastPacketInBlock);
  }
  return payloadLen;
}
```
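For reference, the packet header fields, in the order receivePacket consumes them above, can be sketched as a small parser (field names mirror the code; this is an illustration, not the real class):

```java
import java.io.DataInputStream;
import java.io.IOException;

final class PacketHeaderSketch {
  static void parse(DataInputStream in) throws IOException {
    int packetLen = in.readInt();                     // total packet length
    long offsetInBlock = in.readLong();               // where this packet's data lands in the block
    long seqno = in.readLong();                       // sequence number, matched against acks
    boolean lastPacketInBlock = (in.readByte() != 0); // marks the final packet of the block
    int dataLen = in.readInt();                       // length of the data section
    // ... followed by the checksums, then dataLen bytes of data
  }
}
```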
Summary
This article traced three code paths through HDFS: opening a file (the client obtains the file's LocatedBlocks from the NameNode and wraps them in a DFSInputStream), reading (the client fetches byte ranges block by block from DataNodes through BlockReader, served on the DataNode side by DataXceiver and BlockSender), and writing (the client's DataStreamer pushes packets down a DataNode pipeline set up by createBlockOutputStream and received by BlockReceiver).