Building a Consistent Hashing Ring, Part 2
****************** Please credit the source when reposting! **********
Last updated: 2011-08-22 22:47:38
In Part 1, we built a prototype of the consistent hashing ring.
But there is a problem. The 100 nodes map to 1000 vnodes, and when the node set changes, the vnode-to-node mapping changes with it. What happens when the 100 nodes grow to 1001?
The new nodes crowd out the original ones! The root cause is that the 1000 vnodes are fixed and never change: once the node count exceeds the vnode count, there are not enough vnodes to go around. And if we respond by adding another 1000 vnodes, rewiring the vnode-to-node mapping, and then moving the data, we are right back to the old ring2_0.py routine.
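A minimal sketch of the crowding-out effect (my own illustration, assuming the modulo-style vnode assignment from Part 1): once the node count exceeds the fixed vnode count, some nodes can never receive a vnode at all.

VNODE_COUNT = 1000

for node_count in (100, 1001):
    # which nodes end up owning at least one of the 1000 fixed vnodes?
    nodes_used = set(vnode % node_count for vnode in xrange(VNODE_COUNT))
    print '%d nodes, %d of them actually hold data' % (node_count,
                                                       len(nodes_used))
# 100 nodes, 100 of them actually hold data
# 1001 nodes, 1000 of them actually hold data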
Here we make a few changes to get closer to the real thing.
First, the vnode, which from now on we call a partition. Because the partition count will rarely change, we need to make a generous estimate of the system's expected scale up front. Suppose it will never exceed 6000 nodes; then the partition count can be set to 100 times the node count. That way, since each node owns about 100 partitions, reassigning a single partition affects at most 1% of a node's data.
In computing, powers of two have nice properties: for example, division by a power of two reduces to a bit shift. So in ring3_0.py the node count is 65536 (2^16) and the partition count is 8388608 (2^23).
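As a minimal sketch of that shift trick (the same computation the scripts below use; the data id here is arbitrary):

from hashlib import md5
from struct import unpack_from

PARTITION_POWER = 23
PARTITION_SHIFT = 32 - PARTITION_POWER

# read the top 32 bits of the md5 digest as a big-endian unsigned int
hsh = unpack_from('>I', md5('some_data_id').digest())[0]
# keeping the high PARTITION_POWER bits equals dividing by
# 2 ** PARTITION_SHIFT, but costs a single shift instead of a division
part = hsh >> PARTITION_SHIFT
assert part == hsh // (2 ** PARTITION_SHIFT)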
ring3_0.py
from array import array
from hashlib import md5
from struct import unpack_from
from time import time

PARTITION_POWER = 23
PARTITION_SHIFT = 32 - PARTITION_POWER
NODE_COUNT = 65536
DATA_ID_COUNT = 100000000

begin = time()
# partition id -> node id, one 16-bit unsigned short per partition
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)
node_counts = [0] * NODE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    # the top PARTITION_POWER bits of the hash select the partition
    part = unpack_from('>I', md5(data_id).digest())[0] >> PARTITION_SHIFT
    node_id = part2node[part]
    node_counts[node_id] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT
print '%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)

Results:

1525: Desired data ids per node
1683: Most data ids on one node, 10.36% over
1360: Least data ids on one node, 10.82% under
234 seconds pass ...

Notes:
The code is straightforward: part2node holds the partition-to-node mapping, and node_counts records how many data ids are mapped onto each node.
gholt specifically mentions the overhead: at 2 bytes per 16-bit node id, the 8-million-plus partitions occupy only 16MB of memory. He also addresses the roughly 10% swing in ring3_0.py's results: the data space (10^8) is simply too small relative to the partition count (about 8*10^6). He tried a larger data space, 10^9 data ids against the same 8*10^6 partitions, and it took 6 hours to run. -_-!
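The 16MB figure is easy to verify; a quick sketch (mine, not gholt's) using the same array('H') layout:

from array import array

PARTITION_POWER = 23
NODE_COUNT = 65536

part2node = array('H')  # 'H' = unsigned short, 2 bytes per entry
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)
print part2node.itemsize * len(part2node)  # 16777216 bytes, i.e. 16MB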
ring3_0.py is getting close to a real consistent hashing ring. ring4_0.py adds the replica feature.
ring4_0.py
from array import array
from struct import unpack_from
from hashlib import md5
from time import time

REPLICAS = 3
PARTITION_POWER = 16
PARTITION_SHIFT = 32 - PARTITION_POWER
PARTITION_MAX = 2 ** PARTITION_POWER - 1
NODE_COUNT = 256
DATA_ID_COUNT = 10000000

begin = time()
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)  # (3)
node_counts = [0] * NODE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    part = unpack_from('>I', md5(data_id).digest())[0] >> PARTITION_SHIFT
    node_ids = [part2node[part]]  # (1)
    node_counts[node_ids[0]] += 1
    for replica in xrange(1, REPLICAS):
        while part2node[part] in node_ids:  # (2) skip already-used nodes
            part += 1
            if part > PARTITION_MAX:
                part = 0  # wrap around the ring
        node_ids.append(part2node[part])
        node_counts[node_ids[-1]] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print '%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)

Results:

117186: Desired data ids per node
118133: Most data ids on one node, 0.81% over
116093: Least data ids on one node, 0.93% under
52 seconds pass ...

Notes:
#(1): node_ids records the node ids where the data's 3 replicas are stored; part2node[part] looks up the node id corresponding to a given partition id.
#(2): the loop runs once for each of the two remaining replicas (the first was placed at #(1)), assigning each replica of the data to the next free consecutive partition, then updates the bookkeeping in node_ids and node_counts.
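To see the scan at #(2) in isolation, here is a hypothetical helper (my own refactoring, not part of ring4_0.py) that returns the node ids holding all replicas of one partition:

def replica_nodes(part, part2node, replicas=3):
    # assumes part2node maps every partition to a node id and contains
    # at least `replicas` distinct nodes; otherwise the scan below
    # would never terminate
    partition_max = len(part2node) - 1
    node_ids = [part2node[part]]
    for replica in xrange(1, replicas):
        while part2node[part] in node_ids:  # skip nodes already holding a copy
            part += 1
            if part > partition_max:
                part = 0  # wrap around the ring
        node_ids.append(part2node[part])
    return node_ids

print replica_nodes(0, [0, 0, 1, 2, 3])  # -> [0, 1, 2]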
ring4_0.py looks decent, with only about a 1% swing. But two problems remain:
1. #(2) assigns consecutive partitions, and the initialization at #(3) is perfectly regular. In some situations this makes certain data behave badly: suppose some data is mapped to partition N on node x, and node x crashes frequently; then it is always the data on partition N that has to be re-replicated. The fix is fairly easy: shuffle part2node at initialization, so that no fixed relationship between partition N and node x remains.
2. Data safety. If the partitions holding all the replicas sit in the same rack, a rack power failure makes every replica unavailable at once. We therefore need a mechanism to isolate faults, which is what the concept of a zone is for.
ring4_1.py
from array import array
from struct import unpack_from
from hashlib import md5
from time import time
from random import shuffle

REPLICAS = 3
PARTITION_POWER = 16
PARTITION_SHIFT = 32 - PARTITION_POWER
PARTITION_MAX = 2 ** PARTITION_POWER - 1
NODE_COUNT = 256
ZONE_COUNT = 16
DATA_ID_COUNT = 10000000

begin = time()
node2zone = []
while len(node2zone) < NODE_COUNT:  # (1) assign nodes to zones round-robin
    zone = 0
    while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:
        node2zone.append(zone)
        zone += 1
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)
shuffle(part2node)  # (2)
node_counts = [0] * NODE_COUNT
zone_counts = [0] * ZONE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    part = unpack_from('>I', md5(data_id).digest())[0] >> PARTITION_SHIFT
    node_ids = [part2node[part]]
    zones = [node2zone[node_ids[0]]]
    node_counts[node_ids[0]] += 1
    zone_counts[zones[0]] += 1
    for replica in xrange(1, REPLICAS):  # (3)
        while part2node[part] in node_ids and \
                node2zone[part2node[part]] in zones:
            part += 1
            if part > PARTITION_MAX:
                part = 0  # wrap around the ring
        node_ids.append(part2node[part])
        zones.append(node2zone[node_ids[-1]])
        node_counts[node_ids[-1]] += 1
        zone_counts[zones[-1]] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print '%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)
desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS
print '%d: Desired data ids per zone' % desired_count
max_count = max(zone_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids in one zone, %.02f%% over' % (max_count, over)
min_count = min(zone_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids in one zone, %.02f%% under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)

Results:

117186: Desired data ids per node
118619: Most data ids on one node, 1.22% over
115810: Least data ids on one node, 1.17% under
1875000: Desired data ids per zone
1879143: Most data ids in one zone, 0.22% over
1871851: Least data ids in one zone, 0.17% under
69 seconds pass ...

Notes:
#(1): node2zone initialization. Zones are introduced here to solve problem 2 of ring4_0.py.
#(2): shuffle part2node to break up its regular order, solving problem 1 of ring4_0.py. This step takes a noticeable amount of time.
#(3): probe partition by partition for a suitable position, one that is not on an already-used node and not in an already-used zone.
At this point the ring has all of its basic features: consistent hashing, replicas, and zones. ring4_2.py gives an implementation along the lines of Figure 2.1, though only as one possible model: swift once used this model but has since abandoned it. Treat it as optional background reading; if you are not interested, skip ahead to the next Part. Part 3 will wrap the features above into classes, add weight, and run some simple tests.
ring4_2.py embodies an "anchor" idea: maintain an anchor ring of nodes, and have each data item search the anchor ring to find the storage node that matches it.
The anchor ring consists of hash2index and index2node in ring4_2.py. Every data item looks up its index at #(2), then finds the corresponding node in index2node. The binary search and hash computations inside the loop at #(1) are fairly expensive. Overhead aside, this implementation does not distribute data evenly. To spread data more evenly, every node would have to maintain anchors, and the anchor ring would have to be traversed again and again to find suitable positions. On top of that, replica management becomes very messy in this scheme.
Space versus time, the old trade-off in computing. ring4_1.py trades space for time and simplifies the mapping; ring4_2.py trades time for space: the idea is simple, but the results are far from satisfactory.
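The contrast shows up in the lookup path itself. A toy sketch with hypothetical data (ring4_1.py actually selects the partition with a bit shift rather than a modulo, but the cost difference is the same):

from bisect import bisect_left

# ring4_2.py style: sorted anchor hashes plus the node behind each anchor
hash2index = [10, 20, 30, 40]
index2node = [3, 1, 2, 0]
hsh = 25
index = bisect_left(hash2index, hsh)  # O(log n) binary search per lookup
if index >= len(hash2index):
    index = 0  # wrap around the ring
print index2node[index]  # -> 2

# ring4_1.py style: the hash indexes straight into part2node, O(1)
part2node = [3, 1, 2, 0]
print part2node[hsh % len(part2node)]  # -> 1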
ring4_2.py
from bisect import bisect_left
from hashlib import md5
from struct import unpack_from
from time import time

REPLICAS = 3
NODE_COUNT = 256
ZONE_COUNT = 16
DATA_ID_COUNT = 10000000
VNODE_COUNT = 100

begin = time()
node2zone = []
while len(node2zone) < NODE_COUNT:
    zone = 0
    while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:
        node2zone.append(zone)
        zone += 1
# build the anchor ring: sorted anchor hashes plus the node at each anchor
hash2index = []
index2node = []
for node in xrange(NODE_COUNT):
    for vnode in xrange(VNODE_COUNT):
        # note: only the node id is hashed, so all VNODE_COUNT anchors
        # of one node collide at the same point on the ring
        hsh = unpack_from('>I', md5(str(node)).digest())[0]
        index = bisect_left(hash2index, hsh)
        if index > len(hash2index):
            index = 0
        hash2index.insert(index, hsh)
        index2node.insert(index, node)
node_counts = [0] * NODE_COUNT
zone_counts = [0] * ZONE_COUNT
for data_id in xrange(DATA_ID_COUNT):  # (1)
    data_id = str(data_id)
    hsh = unpack_from('>I', md5(data_id).digest())[0]
    index = bisect_left(hash2index, hsh)  # (2)
    if index >= len(hash2index):
        index = 0  # wrap around the ring
    node_ids = [index2node[index]]
    zones = [node2zone[node_ids[0]]]
    node_counts[node_ids[0]] += 1
    zone_counts[zones[0]] += 1
    for replica in xrange(1, REPLICAS):
        while index2node[index] in node_ids and \
                node2zone[index2node[index]] in zones:
            index += 1
            if index >= len(hash2index):
                index = 0
        node_ids.append(index2node[index])
        zones.append(node2zone[node_ids[-1]])
        node_counts[node_ids[-1]] += 1
        zone_counts[zones[-1]] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print '%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)
desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS
print '%d: Desired data ids per zone' % desired_count
max_count = max(zone_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one zone, %.02f%% over' % (max_count, over)
min_count = min(zone_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one zone, %.02f%% under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)

Results:

117186: Desired data ids per node
351282: Most data ids on one node, 199.76% over
15965: Least data ids on one node, 86.38% under
1875000: Desired data ids per zone
2248496: Most data ids on one zone, 19.92% over
1378013: Least data ids on one zone, 26.51% under
990 seconds pass ...