Distributed Block Storage QoS Rate Limiting: Algorithms, Practice, and the Impact on Upper-Layer Applications
- QoS Rate-Limiting Algorithms
  - Token Bucket
  - Leaky Bucket
    - Leaky bucket as a meter
    - Leaky bucket as a queue
- Mainstream Block Device Throttling Implementations
  - Qemu
  - librbd
  - spdk
- Impact of Rate Limiting on Block Devices
  - Latency
  - IO util
- Impact on Database Applications
QoS Rate-Limiting Algorithms
There are two main rate-limiting strategies, the token bucket and the leaky bucket, introduced below.
Token Bucket
Wikipedia describes the token bucket algorithm as follows:
- A token is added to the bucket every 1/r seconds.
- The bucket can hold at the most b tokens. If a token arrives when the bucket is full, it is discarded.
- When a packet (network layer PDU) of n bytes arrives,
- if at least n tokens are in the bucket, n tokens are removed from the bucket, and the packet is sent to the network.
- if fewer than n tokens are available, no tokens are removed from the bucket, and the packet is considered to be non-conformant.
A fixed-capacity bucket holds some number of tokens; the capacity is the upper bound on the token count. Tokens are added to the bucket at a fixed interval until it is full. Each IO request consumes one token: if a token is available, the request consumes it and is allowed through; otherwise it cannot proceed (the algorithm may choose whether to drop the request or make it wait). When throttling by bytes, each IO consumes a number of tokens equal to its iosize.
From this description, the token bucket enforces the average rate r over the long term, while still letting a burst of up to b tokens (the bucket capacity) pass immediately when the bucket is full; a minimal sketch of the algorithm follows.
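The sketch below is not code from any of the projects discussed later; the struct and function names and the nanosecond-based lazy refill are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal token bucket: refills at "rate" tokens per second and holds at
 * most "capacity" tokens (the burst allowance). */
typedef struct {
    double rate;      /* refill rate, tokens per second (r) */
    double capacity;  /* maximum tokens the bucket can hold (b) */
    double tokens;    /* current token count */
    uint64_t last_ns; /* timestamp of the last refill, in nanoseconds */
} token_bucket;

/* Refill lazily based on elapsed time, then try to take "need" tokens:
 * 1 token when throttling IOPS, or iosize tokens when throttling bytes.
 * Returns true if the IO conforms and may be submitted immediately. */
static bool token_bucket_allow(token_bucket *tb, uint64_t now_ns, double need)
{
    double elapsed_s = (now_ns - tb->last_ns) / 1e9;

    tb->tokens += elapsed_s * tb->rate;
    if (tb->tokens > tb->capacity) {
        tb->tokens = tb->capacity;  /* tokens arriving at a full bucket are discarded */
    }
    tb->last_ns = now_ns;

    if (tb->tokens >= need) {
        tb->tokens -= need;         /* conforming IO: consume tokens and let it pass */
        return true;
    }
    return false;                   /* non-conforming: the caller queues or drops the IO */
}
```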
Leaky Bucket
Leaky bucket as a meter
Wikipedia defines leaky bucket as a meter as follows:
- A fixed capacity bucket, associated with each virtual connection or user, leaks at a fixed rate.
- If the bucket is empty, it stops leaking.
- For a packet to conform, it has to be possible to add a specific amount of water to the bucket: The specific amount added by a conforming packet can be the same for all packets, or can be proportional to the length of the packet.
- If this amount of water would cause the bucket to exceed its capacity then the packet does not conform and the water in the bucket is left unchanged.
This can be understood as follows:
A bucket leaks water at a fixed rate. Each passing IO request pours water into the bucket, and the amount added corresponds to whatever is being throttled: one unit for IOPS, or the IO's byte count for bandwidth. If adding the water would overflow the bucket, the IO is not allowed through; otherwise it passes.
As the description shows, this algorithm is essentially the mirror image of the token bucket, so leaky bucket as a meter and the token bucket can be regarded as equivalent.
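For comparison, here is a mirror-image sketch of leaky bucket as a meter under the same illustrative assumptions (hypothetical names, nanosecond clock); note that conformance is checked against the remaining room in the bucket rather than the remaining tokens.

```c
#include <stdint.h>
#include <stdbool.h>

/* Leaky bucket as a meter: water drains at a fixed rate, and each IO tries
 * to pour in an amount proportional to what is being throttled. */
typedef struct {
    double leak_rate; /* water drained per second (the target rate) */
    double capacity;  /* bucket size; determines how large a burst is tolerated */
    double level;     /* current amount of water in the bucket */
    uint64_t last_ns; /* timestamp of the last drain, in nanoseconds */
} leaky_bucket;

static bool leaky_bucket_allow(leaky_bucket *lb, uint64_t now_ns, double amount)
{
    double elapsed_s = (now_ns - lb->last_ns) / 1e9;

    lb->level -= elapsed_s * lb->leak_rate; /* the bucket leaks at a fixed rate */
    if (lb->level < 0) {
        lb->level = 0;                      /* an empty bucket stops leaking */
    }
    lb->last_ns = now_ns;

    if (lb->level + amount <= lb->capacity) {
        lb->level += amount;                /* conforming IO adds its water and passes */
        return true;
    }
    return false;                           /* adding the water would overflow: non-conformant */
}
```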
Leaky bucket as a queue
Wikipedia describes this policy as follows: The leaky bucket consists of a finite queue. When a packet arrives, if there is room on the queue it is appended to the queue; otherwise it is discarded. At every clock tick one packet is transmitted (unless the queue is empty).
Leaky bucket as a queue can therefore be viewed as the special case of a token bucket whose bucket size is 1; a minimal sketch follows.
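A minimal sketch of leaky bucket as a queue, with a hypothetical fixed-size ring buffer and a per-tick transmit hook (all names and the queue size are assumptions made for illustration):

```c
#include <stddef.h>
#include <stdbool.h>

#define LBQ_CAPACITY 64          /* finite queue size, chosen arbitrarily here */

/* Leaky bucket as a queue: arrivals are appended if there is room, otherwise
 * discarded; exactly one element is transmitted per clock tick. */
typedef struct {
    void  *slots[LBQ_CAPACITY];  /* queued IO requests (opaque pointers) */
    size_t head;                 /* index of the oldest queued request */
    size_t count;                /* number of queued requests */
} lbq;

/* Called on request arrival: append if there is room, otherwise discard. */
static bool lbq_enqueue(lbq *q, void *req)
{
    if (q->count == LBQ_CAPACITY) {
        return false;                              /* queue full: request is discarded */
    }
    q->slots[(q->head + q->count) % LBQ_CAPACITY] = req;
    q->count++;
    return true;
}

/* Called once per clock tick: pop and return one request, or NULL if empty. */
static void *lbq_on_tick(lbq *q)
{
    void *req;

    if (q->count == 0) {
        return NULL;
    }
    req = q->slots[q->head];
    q->head = (q->head + 1) % LBQ_CAPACITY;
    q->count--;
    return req;
}
```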
Mainstream Block Device Throttling Implementations
Three throttling implementations are mainstream and widely deployed in production: Qemu, librbd, and spdk. They are introduced below.
Qemu
Qemu has supported block device IO throttling since version 1.1, providing six settings that cap the rate separately for six IOPS and bandwidth scenarios. Version 1.7 added burst support, and version 2.6 completed it so that both the burst rate and the burst duration can be controlled. The parameters are as follows:
| Scenario | Base limit | Burst rate | Burst length |
| --- | --- | --- | --- |
| Total IOPS | iops-total | iops-total-max | iops-total-max-length |
| Read IOPS | iops-read | iops-read-max | iops-read-max-length |
| Write IOPS | iops-write | iops-write-max | iops-write-max-length |
| Total bps | bps-total | bps-total-max | bps-total-max-length |
| Read bps | bps-read | bps-read-max | bps-read-max-length |
| Write bps | bps-write | bps-write-max | bps-write-max-length |
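As a hedged usage illustration (the exact option spelling can vary between QEMU versions, and the image name and values below are placeholders), these limits are typically attached to a drive through the throttling.* suboptions of -drive:

```
qemu-system-x86_64 ... \
    -drive file=vm-disk.qcow2,format=qcow2,if=virtio,throttling.iops-total=1000,throttling.iops-total-max=4000,throttling.iops-total-max-length=60
```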
The core data structure of the implementation is described as follows:
```c
typedef struct LeakyBucket {
    uint64_t avg;          /* target rate limit of the IO */
    uint64_t max;          /* burst rate limit of the IO */
    double level;          /* bucket level in units */
    double burst_level;    /* bucket level in units (for computing bursts) */
    uint64_t burst_length; /* burst duration, in seconds by default */
} LeakyBucket;
```

Qemu's throttling algorithm is implemented as a leaky bucket. Its goal is that the user can run at the burst rate bkt.max for bkt.burst_length seconds, after which the rate drops back to bkt.avg.
To achieve this, qemu maintains two buckets: a main bucket (level), whose size is bkt.max * bkt.burst_length, and a burst bucket (burst_level), whose size is bkt.max / 10, as can be seen in the function below.
If the main bucket is already full, the IO must wait for it to leak; if the main bucket is not full and a burst bucket is configured, the burst bucket is also checked before the IO is let through. In this way the burst bucket enforces the burst rate, while the size of the main bucket bounds how long the burst can last.
The key function that decides whether an IO can be let through (and how long it must wait) is:
```c
/* This function compute the wait time in ns that a leaky bucket should trigger
 *
 * @bkt: the leaky bucket we operate on
 * @ret: the resulting wait time in ns or 0 if the operation can go through
 */
int64_t throttle_compute_wait(LeakyBucket *bkt)
{
    double extra;             /* the number of extra units blocking the io */
    double bucket_size;       /* I/O before throttling to bkt->avg */
    double burst_bucket_size; /* Before throttling to bkt->max */

    if (!bkt->avg) {
        return 0;
    }

    if (!bkt->max) {
        /* If bkt->max is 0 we still want to allow short bursts of I/O
         * from the guest, otherwise every other request will be throttled
         * and performance will suffer considerably. */
        bucket_size = (double) bkt->avg / 10;
        burst_bucket_size = 0;
    } else {
        /* If we have a burst limit then we have to wait until all I/O
         * at burst rate has finished before throttling to bkt->avg */
        bucket_size = bkt->max * bkt->burst_length;
        burst_bucket_size = (double) bkt->max / 10;
    }

    /* If the main bucket is full then we have to wait */
    extra = bkt->level - bucket_size;
    if (extra > 0) {
        return throttle_do_compute_wait(bkt->avg, extra);
    }

    /* If the main bucket is not full yet we still have to check the
     * burst bucket in order to enforce the burst limit */
    if (bkt->burst_length > 1) {
        assert(bkt->max > 0); /* see throttle_is_valid() */
        extra = bkt->burst_level - burst_bucket_size;
        if (extra > 0) {
            return throttle_do_compute_wait(bkt->max, extra);
        }
    }

    return 0;
}
```

librbd
Ceph added IO throttling for RBD images in version 13.2.0 (Mimic). That release supports throttling only total IOPS; it supports bursts and a configurable burst rate, but the burst duration cannot be controlled (effectively it is fixed at 1 second). Version 14.2.0 (Nautilus) added the other five scenarios: read IOPS, write IOPS, total bps, read bps, and write bps, with the same burst behaviour.
Librbd's throttling therefore supports bursts with a configurable burst rate but no configurable burst duration, and it is implemented with a token bucket. The rate at which tokens are added can be tuned with the rbd_qos_schedule_tick_min parameter (50 ms by default). The base and burst rates are configured with the following parameters:
| Scenario | Base limit | Burst rate |
| --- | --- | --- |
| Total IOPS | rbd_qos_iops_limit | rbd_qos_iops_burst |
| Read IOPS | rbd_qos_iops_read_limit | rbd_qos_iops_read_burst |
| Write IOPS | rbd_qos_iops_write_limit | rbd_qos_iops_write_burst |
| Total bps | rbd_qos_bps_limit | rbd_qos_bps_burst |
| Read bps | rbd_qos_bps_read_limit | rbd_qos_bps_read_burst |
| Write bps | rbd_qos_bps_write_limit | rbd_qos_bps_write_burst |
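As a usage note (a hedged example; the pool/image name and values are placeholders, and the exact CLI form depends on the Ceph release), these options can be applied per image with rbd config image set:

```
rbd config image set rbd/vol1 rbd_qos_iops_limit 2000
rbd config image set rbd/vol1 rbd_qos_iops_burst 4000
```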
spdk
spdk implements QoS throttling in the bdev layer, also as a token bucket. IOPS and bandwidth can be limited independently, but burst rates are not supported. Configuration is done through the bdev_set_qos_limit RPC, with the following parameters:
| Parameter | Meaning |
| --- | --- |
| rw_ios_per_sec | IOPS limit |
| rw_mbytes_per_sec | Read/write bandwidth limit |
| r_mbytes_per_sec | Read bandwidth limit |
| w_mbytes_per_sec | Write bandwidth limit |
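For example (a hedged illustration; the bdev name and values are placeholders), the RPC can be issued through the rpc.py script shipped with spdk:

```
scripts/rpc.py bdev_set_qos_limit Malloc0 --rw_ios_per_sec 20000 --rw_mbytes_per_sec 100
```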
spdk adds tokens to the buckets from a registered poller, bdev_channel_poll_qos, whose interval is the hard-coded SPDK_BDEV_QOS_TIMESLICE_IN_USEC (1 ms by default). The number of tokens added on each tick is the configured total rate divided by the number of timeslices per second.
An IO must pass through every configured token bucket before it can be submitted. A single IO is allowed to drive a bucket's count negative; once it is negative, no further IO can pass until bdev_channel_poll_qos refills the count back above zero.
```c
static int
bdev_channel_poll_qos(void *arg)
{
    struct spdk_bdev_qos *qos = arg;
    uint64_t now = spdk_get_ticks();
    int i;

    if (now < (qos->last_timeslice + qos->timeslice_size)) {
        /* We received our callback earlier than expected - return
         * immediately and wait to do accounting until at least one
         * timeslice has actually expired. This should never happen
         * with a well-behaved timer implementation. */
        return SPDK_POLLER_IDLE;
    }

    /* Reset for next round of rate limiting */
    for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
        /* We may have allowed the IOs or bytes to slightly overrun in the last
         * timeslice. remaining_this_timeslice is signed, so if it's negative
         * here, we'll account for the overrun so that the next timeslice will
         * be appropriately reduced. */
        if (qos->rate_limits[i].remaining_this_timeslice > 0) {
            qos->rate_limits[i].remaining_this_timeslice = 0;
        }
    }

    while (now >= (qos->last_timeslice + qos->timeslice_size)) {
        qos->last_timeslice += qos->timeslice_size;
        for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
            qos->rate_limits[i].remaining_this_timeslice +=
                qos->rate_limits[i].max_per_timeslice;
        }
    }

    return bdev_qos_io_submit(qos->ch, qos);
}
```

Impact of Rate Limiting on Block Devices
Different QoS policies feel different to the block device above them, mainly in IO latency and %util.
Latency
Latency is determined by two properties of the QoS policy: its burst capability and its refill frequency.
- Burst capability: a relatively large leaky or token bucket, or two buckets as in Qemu, improves the block device's burst capability, so latency stays low when the device absorbs a burst of traffic.
- This can be tested with fio: `fio --group_reporting --rw=randwrite --bs=1M --numjobs=1 --iodepth=64 --ioengine=libaio --direct=1 --name test --size=2000G --filename=/dev/vdb -iodepth_low=0 -iodepth_batch_submit=64 -thinktime=950ms -thinktime_blocks=64`, which submits a batch of 64 outstanding 1M IOs, waits 950 ms after the batch completes, and then repeats. If the block device has poor burst capability, the symptoms are high iowait latency that grows roughly linearly with queue depth while bandwidth does not improve; and because every batch of IO spends a long time waiting, io util is not high either.
- Refill frequency: a low refill frequency causes severe tail latency. As a simple example, if the token bucket is refilled only once per second and the IO issued within the current second exceeds the limit, some of those IOs will inevitably wait more than one second, producing a long latency tail.
IO util
The disk util value is defined as the fraction of time the disk spends servicing IO, i.e. the ratio of the time the disk queue is non-empty to the total time. If the throttling algorithm spreads IO servicing very evenly (as with leaky bucket as a queue, where IOs are processed one by one in a trickle), the disk queue always contains IO and util is naturally high.
Conversely, a block device configured with a large burst allowance can drain even a deep queue quickly, so its util is naturally low.
Impact on Database Applications
Taking MySQL running on top of a distributed block device as an example, let us look at how the throttling policy affects SQL performance.
In MySQL, two kinds of IO have the greatest impact on performance.