PostgreSQL 当有多个索引可选时,优化器如何选择
標簽
PostgreSQL , 索引 , 復合索引 , 選擇 , 成本 , 優化器
背景
當一個表有很多索引時,并且一個QUERY可以使用到其中的多個索引時,數據庫會如何做出選擇?最終選擇哪個,或者哪幾個索引呢?
《PostgreSQL 多查詢條件,多個索引的選擇算法與問題診斷方法》
選擇單個索引時,PATH可以選擇index scan , index only scan, bitmap scan。
選擇多個索引時,PATH可以選擇bitmap scan。
那么優化器會選擇哪個,或者哪幾個索引,又會選擇index scan , index only scan還是 bitmap scan呢?
例子
1、創建4張表,5個字段,分別使用int2,int4,int8,text類型。
create table t1 (c1 int2, c2 int2, c3 int2, c4 int2, c5 int2); create table t2 (c1 int4, c2 int4, c3 int4, c4 int4, c5 int4); create table t3 (c1 int8, c2 int8, c3 int8, c4 int8, c5 int8); create table t4 (c1 text, c2 text, c3 text, c4 text, c5 text);2、寫入測試數據1000萬條,其中c3字段為常量
insert into t1 select random()*32767,random()*32767,random()*32767,1,random()*32767 from generate_series(1,10000000); insert into t2 select random()*2000000000,random()*2000000000,random()*2000000000,1,random()*2000000000 from generate_series(1,10000000); insert into t3 select random()*2000000000,random()*2000000000,random()*2000000000,1,random()*2000000000 from generate_series(1,10000000); insert into t4 select md5(random()::text),md5(random()::text),md5(random()::text),'a',md5(random()::text) from generate_series(1,10000000);3、創建若干索引
create index idx_t1_1 on t1(c1); create index idx_t1_2 on t1(c1,c2); create index idx_t1_3 on t1(c1,c2,c3); create index idx_t1_4 on t1(c4); create index idx_t2_1 on t2(c1); create index idx_t2_2 on t2(c1,c2); create index idx_t2_3 on t2(c1,c2,c3); create index idx_t2_4 on t2(c4); create index idx_t3_1 on t3(c1); create index idx_t3_2 on t3(c1,c2); create index idx_t3_3 on t3(c1,c2,c3); create index idx_t3_4 on t3(c4); create index idx_t4_1 on t4(c1); create index idx_t4_2 on t4(c1,c2); create index idx_t4_3 on t4(c1,c2,c3); create index idx_t4_4 on t4(c4);4、強制收集統計信息
vacuum analyze t1; vacuum analyze t2; vacuum analyze t3; vacuum analyze t4;數據庫如何選擇索引? - 現象
1、輸入c1,c2,c3三個字段的等值過濾,作為條件,t1,t2,t3,t4分別選擇了哪個索引?
t1選擇了c1,c2,c3的合并索引,沒有任何多余的filter
postgres=# explain select * from t1 where c1=1 and c2=1 and c3=1; QUERY PLAN -------------------------------------------------------------------- Index Scan using idx_t1_3 on t1 (cost=0.43..2.66 rows=1 width=10) Index Cond: ((c1 = 1) AND (c2 = 1) AND (c3 = 1)) (2 rows)t2也選擇了c1,c2,c3的合并索引,沒有任何多余的filter
postgres=# explain select * from t2 where c1=1 and c2=1 and c3=1; QUERY PLAN -------------------------------------------------------------------- Index Scan using idx_t2_3 on t2 (cost=0.43..2.66 rows=1 width=20) Index Cond: ((c1 = 1) AND (c2 = 1) AND (c3 = 1)) (2 rows)t3選擇了c1,c2的合并索引,有多余的c3的filter
postgres=# explain select * from t3 where c1=1 and c2=1 and c3=1; QUERY PLAN -------------------------------------------------------------------- Index Scan using idx_t3_2 on t3 (cost=0.43..2.66 rows=1 width=40) Index Cond: ((c1 = 1) AND (c2 = 1)) Filter: (c3 = 1) (3 rows)t4選擇了c1,c2的合并索引,有多余的c3的filter
postgres=# explain select * from t4 where c1='1' and c2='1' and c3='a'; QUERY PLAN --------------------------------------------------------------------- Index Scan using idx_t4_2 on t4 (cost=0.56..2.78 rows=1 width=134) Index Cond: ((c1 = '1'::text) AND (c2 = '1'::text)) Filter: (c3 = 'a'::text) (3 rows)t1,t2,t3,t4的數據記錄數,數據分布,SQL、索引都一樣,為什么會有以上的選擇性差異呢?
數據庫如何選擇索引? - 分析
1 索引大小分析
postgres=# \di+ idx_t* List of relations Schema | Name | Type | Owner | Table | Size | Description --------+----------+-------+----------+-------+---------+------------- public | idx_t1_1 | index | postgres | t1 | 214 MB | public | idx_t1_2 | index | postgres | t1 | 214 MB | public | idx_t1_3 | index | postgres | t1 | 214 MB | public | idx_t1_4 | index | postgres | t1 | 214 MB | public | idx_t2_1 | index | postgres | t2 | 214 MB | public | idx_t2_2 | index | postgres | t2 | 214 MB | public | idx_t2_3 | index | postgres | t2 | 301 MB | public | idx_t2_4 | index | postgres | t2 | 214 MB | public | idx_t3_1 | index | postgres | t3 | 214 MB | public | idx_t3_2 | index | postgres | t3 | 301 MB | public | idx_t3_3 | index | postgres | t3 | 387 MB | public | idx_t3_4 | index | postgres | t3 | 214 MB | public | idx_t4_1 | index | postgres | t4 | 563 MB | public | idx_t4_2 | index | postgres | t4 | 911 MB | public | idx_t4_3 | index | postgres | t4 | 1266 MB | public | idx_t4_4 | index | postgres | t4 | 214 MB | (16 rows)有沒有覺得很奇怪呢?
t1表,從1個字段到3個字段的合并索引,索引大小都一樣!!!!!!!而且都與t3表單個字段一樣大。
t2表,一個字段與2個字段的索引大小一樣大!!!!!2個字段的與T3表的1個字段一樣大。3個字段的與T3表的2個字段一樣大。
t3表,容量方面看起來很正常。
t4表,c4單個字段的索引,與T3表的單個字段的索引一樣大。
看起來好像是這樣的:
INDEX TUPLE占用8個字節的倍數。(可能是對齊的考慮)
2 索引內部結構分析
《深入淺出PostgreSQL B-Tree索引結構》
1、采用pageinspect插件,可以看到索引內部的內容
create extension pageinspect;2、查看索引葉子節點的內容
postgres=# select * from bt_metap('idx_t1_1'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 290 | 2 | 290 | 2 (1 row) postgres=# select * from bt_page_items('idx_t1_1',290) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (289,1) | 16 | f | f | 54 01 00 00 00 00 00 00 3 | (575,1) | 16 | f | f | a9 02 00 00 00 00 00 00 4 | (860,1) | 16 | f | f | fe 03 00 00 00 00 00 00 5 | (1145,1) | 16 | f | f | 52 05 00 00 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t1_1',289) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (572,1) | 16 | f | f | a9 02 00 00 00 00 00 00 2 | (286,1) | 8 | f | f | 3 | (287,1) | 16 | f | f | 55 01 00 00 00 00 00 00 4 | (288,1) | 16 | f | f | 56 01 00 00 00 00 00 00 5 | (291,1) | 16 | f | f | 57 01 00 00 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t1_1',287) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+-------------+---------+-------+------+------------------------- 1 | (38155,112) | 16 | f | f | 56 01 00 00 00 00 00 00 2 | (20034,36) | 16 | f | f | 55 01 00 00 00 00 00 00 3 | (20183,9) | 16 | f | f | 55 01 00 00 00 00 00 00 4 | (20250,124) | 16 | f | f | 55 01 00 00 00 00 00 00 5 | (20957,123) | 16 | f | f | 55 01 00 00 00 00 00 00 (5 rows) postgres=# select * from t1 where ctid='(38155,112)'; c1 | c2 | c3 | c4 | c5 -----+-------+-------+----+------- 342 | 18698 | 24394 | 1 | 27763 (1 row) postgres=# select to_hex(342); to_hex -------- 156 (1 row) 對應56 01 00 00 00 00 00 00 倒過來看 01 56。確實有一個現象和前面的描述一樣: INDEX TUPLE占用8個字節的倍數。
雖然是INT2的類型,本質上只占用2字節,但是實際上INDEX TUPLE占用了8個字節。
3、查看T2表,1個字段的索引?,F象也和前面的描述一樣: INDEX TUPLE占用8個字節的倍數。
postgres=# select * from bt_metap('idx_t2_1'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 290 | 2 | 290 | 2 (1 row) postgres=# select * from bt_page_items('idx_t2_1',290) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (289,1) | 16 | f | f | 0f 5c 3c 01 00 00 00 00 3 | (575,1) | 16 | f | f | 8f 8e 7a 02 00 00 00 00 4 | (860,1) | 16 | f | f | c1 f8 b7 03 00 00 00 00 5 | (1145,1) | 16 | f | f | 46 12 f7 04 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t2_1',289) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (572,1) | 16 | f | f | 8f 8e 7a 02 00 00 00 00 2 | (286,1) | 8 | f | f | 3 | (287,1) | 16 | f | f | 5b 8a 3d 01 00 00 00 00 4 | (288,1) | 16 | f | f | 36 a2 3e 01 00 00 00 00 5 | (291,1) | 16 | f | f | 4b ca 3f 01 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t2_1',572) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (3798,12) | 16 | f | f | 3d b3 7b 02 00 00 00 00 2 | (48744,46) | 16 | f | f | 8f 8e 7a 02 00 00 00 00 3 | (40279,76) | 16 | f | f | 1a 8f 7a 02 00 00 00 00 4 | (53268,39) | 16 | f | f | a9 8f 7a 02 00 00 00 00 5 | (16540,92) | 16 | f | f | 35 90 7a 02 00 00 00 00 (5 rows) postgres=# select * from t2 where ctid='(3798,12)'; c1 | c2 | c3 | c4 | c5 ----------+-----------+----------+----+------------ 41661245 | 940376658 | 41196565 | 1 | 1467443120 (1 row) postgres=# select to_hex(41661245); to_hex --------- 27bb33d (1 row) 對應3d b3 7b 02 00 00 00 00 倒過來看 02 7b b3 3d。4、查看T2表,3個字段的索引。現象也和前面的描述一樣: INDEX TUPLE占用8個字節的倍數。
postgres=# select * from bt_metap('idx_t2_3'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 209 | 2 | 209 | 2 (1 row) postgres=# select * from bt_page_items('idx_t2_3',209) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------------------------------- 1 | (3,1) | 8 | f | f | 2 | (208,1) | 24 | f | f | a8 2c a1 00 78 82 d4 3d 55 1b a6 67 00 00 00 00 3 | (413,1) | 24 | f | f | a8 9d 42 01 3c fe ab 6b e7 69 81 1c 00 00 00 00 4 | (617,1) | 24 | f | f | c8 f7 e5 01 eb cd b6 31 49 90 14 6f 00 00 00 00 5 | (821,1) | 24 | f | f | e7 b6 86 02 9c 1e 63 65 c4 c3 06 48 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t2_3',208) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------------------------------- 1 | (410,1) | 24 | f | f | a8 9d 42 01 3c fe ab 6b e7 69 81 1c 00 00 00 00 2 | (205,1) | 8 | f | f | 3 | (206,1) | 24 | f | f | b5 ef a1 00 bf b0 c5 14 94 4f ae 63 00 00 00 00 4 | (207,1) | 24 | f | f | 1b b1 a2 00 0a f5 c6 33 c4 09 e3 12 00 00 00 00 5 | (210,1) | 24 | f | f | 34 81 a3 00 82 74 a7 42 fe d6 33 06 00 00 00 00 (5 rows) postgres=# select * from bt_page_items('idx_t2_3',410) limit 5; itemoffset | ctid | itemlen | nulls | vars | data ------------+-------------+---------+-------+------+------------------------------------------------- 1 | (61769,131) | 24 | f | f | a6 6b 43 01 83 ab a4 04 0c a4 9e 29 00 00 00 00 2 | (44927,154) | 24 | f | f | a8 9d 42 01 3c fe ab 6b e7 69 81 1c 00 00 00 00 3 | (24587,3) | 24 | f | f | e3 9d 42 01 4a 9f 46 25 b9 d2 8e 47 00 00 00 00 4 | (29996,51) | 24 | f | f | be 9e 42 01 66 33 fd 03 8c 8c 39 2c 00 00 00 00 5 | (4032,73) | 24 | f | f | 5c 9f 42 01 ea 4d b1 01 1d af 88 32 00 00 00 00 (5 rows) postgres=# select * from t2 where ctid='(61769,131)'; c1 | c2 | c3 | c4 | c5 ----------+----------+-----------+----+------------ 21195686 | 77900675 | 698262540 | 1 | 1522991839 (1 row) postgres=# select to_hex(21195686); to_hex --------- 1436ba6 (1 row) postgres=# select to_hex(77900675); to_hex --------- 4a4ab83 (1 row) postgres=# select to_hex(698262540); to_hex ---------- 299ea40c (1 row) 對應a6 6b 43 01 83 ab a4 04 0c a4 9e 29 00 00 00 00 每個字段占用4字節,倒過來看 01 43 6b a6。04 a4 ab 83。29 9e a4 0c。但是這些個索引選擇有什么關系嗎?
本身沒什么關系,但是由于這樣出現了一個問題,索引本身占用的大小,在評估成本時也在成本計算環節之一。
HINT , 成本對比
最后數據庫選什么索引,其實還是取決于COST。
每個PATH的COST都會算一遍,哪個最低選哪個。
使用hint來強制使用某個索引,這樣就能對比出COST到底差多少?
postgres=# set pg_hint_plan.debug_print =on; SET postgres=# set pg_hint_plan.enable_hint=on; SET postgres=# set pg_hint_plan.message_level =log; SET postgres=# set client_min_messages =log; SET以t4表為例
postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(t4 idx_t4_3) */ select * from t4 where c1='1' and c2='1' and c3='a'; LOG: available indexes for IndexScan(t4): idx_t4_3 LOG: pg_hint_plan: used hint: IndexScan(t4 idx_t4_3) not used hint: duplication hint: error hint: QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx_t4_3 on public.t4 (cost=0.69..2.91 rows=1 width=134) (actual time=0.012..0.012 rows=0 loops=1) Output: c1, c2, c3, c4, c5 Index Cond: ((t4.c1 = '1'::text) AND (t4.c2 = '1'::text) AND (t4.c3 = 'a'::text)) Buffers: shared hit=5 Planning time: 0.188 ms Execution time: 0.027 ms (6 rows) postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(t4 idx_t4_2) */ select * from t4 where c1='1' and c2='1' and c3='a'; LOG: available indexes for IndexScan(t4): idx_t4_2 LOG: pg_hint_plan: used hint: IndexScan(t4 idx_t4_2) not used hint: duplication hint: error hint: QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx_t4_2 on public.t4 (cost=0.56..2.78 rows=1 width=134) (actual time=0.012..0.012 rows=0 loops=1) Output: c1, c2, c3, c4, c5 Index Cond: ((t4.c1 = '1'::text) AND (t4.c2 = '1'::text)) Filter: (t4.c3 = 'a'::text) Buffers: shared hit=5 Planning time: 0.212 ms Execution time: 0.027 ms (7 rows)很顯然,數據庫評估出來選擇idx_t4_2的成本為2.78,使用idx_t4_3的成本為2.91。
理所當然,選擇了idx_t4_2。雖然它多了一次filter。
那么為什么idx_t4_2的成本更低呢?
public | idx_t4_2 | index | postgres | t4 | 911 MB | public | idx_t4_3 | index | postgres | t4 | 1266 MB |1、走idx_t4_2索引時,評估出來掃描的數據塊比idx_t4_3更少,雖然它多了filter,但是filter引入的開銷在 RANDOM PAGE SCAN面前小得多(并且實際上這里沒有filter的成本,因為c1,c2兩個條件下去后評估出來的ROWS=0?select * from t4 where c1='1' and c2='1';)。
2、走idx_t4_3索引時,評估出來不需要filter,但是這個索引比idx_t4_2占用的空間大,按比例劃分時,算了更多的random page SCAN 成本。即使沒有額外的filter,成本也比idx_t4_2更高。
PS:
這樣的評估算法應該還有優化的空間,因為實際上兩個索引的LEVEL是一樣的。
postgres=# select * from bt_metap('idx_t2_2'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 290 | 2 | 290 | 2 (1 row) postgres=# select * from bt_metap('idx_t2_3'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 209 | 2 | 209 | 2 (1 row)以上,使用idx_t2_2和idx_t2_3掃描的INDEX PAGE數實際上是一樣的。
小結
PostgreSQL 采用CBO的優化器模型,哪個PATH成本低,就使用哪個PATH。
在有多個索引可以使用時,實際上數據庫會計算每一種的成本,哪種成本低,選擇哪個。
影響成本的因素很多:
算子設置,選擇性,掃描成本,INDEX TUPLE過濾成本,HEAP TUPLE過濾成本。
成本是綜合的結果。
思考:
select * from tbl where c1=1 and c2=1 and c3=1 and c4=1; idx(c1,c2) idx(c2,c3) idx(c4,c3,c1)選哪個索引?需要考慮什么因素?
提示:
多列統計信息,統計信息,多列條件合并選擇性、索引大小、RANDOM SCAN成本,SEQ SCAN成本,FILTER的成本,成本算子參數,優化器開關等。
參考
《深入淺出PostgreSQL B-Tree索引結構》
《PostgreSQL 10 黑科技 - 自定義統計信息》
src/backend/optimizer/path/costsize.c
《PostgreSQL 多查詢條件,多個索引的選擇算法與問題診斷方法》
《PostgreSQL 自定義函數表達式選擇性評估算法 - Statistics, Cardinality, Selectivity, Estimate》
《PostgreSQL 優化器行評估算法》
《關鍵時刻HINT出彩 - PG優化器的參數優化、執行計劃固化CASE》
《優化器成本因子校對 - PostgreSQL explain cost constants alignment to timestamp》
《優化器成本因子校對(disk,ssd,memory IO開銷精算) - PostgreSQL real seq_page_cost & random_page_cost in disks,ssd,memory》
https://www.postgresql.org/docs/11/static/planner-stats-details.html
總結
以上是生活随笔為你收集整理的PostgreSQL 当有多个索引可选时,优化器如何选择的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 程序猿修仙之路--算法之直接插入排序
- 下一篇: 如何优化冒泡排序