生活随笔
收集整理的這篇文章主要介紹了
关于PostgreSQL的GiST索引之四
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
3. hstore的GiST索引
hstore的GiST索引實現和全文檢索的tsvector的GiST索引的實現差不過,都是簽名文件索引。
3.1 GiST索引項的存儲
1)所有索引項都采用簽名向量壓縮
2)hstore中的所有鍵和所有值都進行哈希設置bit位得到一個簽名向量
3)簽名向量的大小為128bit
3.2?性能推論
1)索引尺寸比較小(因為簽名向量的大小是固定的而且一般比原始數據小)
2)每次查詢要掃描的索引頁很多(20%~40%的索引頁)
3)每次鍵查詢(?操作符)至少要recheck 1.6%(2/128)以上的數據行
? 每條記錄包含的鍵值對越多,需要recheck的數據行越多
4)每次鍵值查詢(@>操作符)至少要recheck 0.02%((2/128) * (2/128))以上的數據行
? 每條記錄包含的鍵值對越多,需要recheck的數據行越多
假設每條記錄平均包含10條鍵值對,需要recheck的數據行估算如下鍵查詢(?操作符)匹配:7.5%的數據行
點擊(此處)折疊或打開
testdb=# with recursive tx1(n,sign_bits) as
testdb-# (select 1 n,1.0 sign_bits
testdb(# union all
testdb(# select n+1, sign_bits + 1 - sign_bits/128 from tx1 where n=10
testdb(# )
testdb-# select n,sign_bits,sign_bits/128 scan_probability from tx1 where n in(10);
?n | sign_bits | scan_probability
----+------------------------+------------------------
?10 | 9.65566251563533320389 | 0.07543486340340104066
(1 row) 由此可見對鍵查詢(多個鍵的?&查詢另當別論),hstore的GiST索引的查詢效率是比較低的。不過,從另外一個角度考慮,通常的應用場景下,不同的key值數本來就不會很多,就不是索引的用武之地。
鍵值查詢(@>操作符)匹配:0.57%(0.57%=0.07543486340340104066*0.07543486340340104066)
但是這只是理論上的平均值,由于簽名向量的沖突,個別查詢的效率可能會很低。比如,如果大部分數據記錄都包含某個key,而查詢這個鍵值對的時候,碰巧它的值在簽名向量中對應的bit位和這個很常用的key相同,那么大部分數據記錄都要進行recheck。 3.3?相關代碼
contrib/hstore/hstore_gist.c
點擊(此處)折疊或打開
/* bigint defines */
#define BITBYTE 8
#define SIGLENINT 4????????????/* >122 => key will toast, so very */
#define SIGLEN????( sizeof(int)*SIGLENINT )
#define SIGLENBIT (SIGLEN*BITBYTE)
typedef char BITVEC[SIGLEN];
typedef char *BITVECP;
...
Datum
ghstore_compress(PG_FUNCTION_ARGS)
{
????GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
????GISTENTRY *retval = entry;
????if (entry->leafkey)
????{
????????GISTTYPE *res = (GISTTYPE *) palloc0(CALCGTSIZE(0));
????????HStore???? *val = DatumGetHStoreP(entry->key);
????????HEntry???? *hsent = ARRPTR(val);
????????char???? *ptr = STRPTR(val);
????????int????????????count = HS_COUNT(val);
????????int????????????i;
????????SET_VARSIZE(res, CALCGTSIZE(0));
//遍歷所有KV對
????????for (i = 0; i count; ++i)
????????{
????????????int????????????h;
//對key做哈希
????????????h = crc32_sz((char *) HS_KEY(hsent, ptr, i), HS_KEYLEN(hsent, i));
????????????HASH(GETSIGN(res), h);
????????????if (!HS_VALISNULL(hsent, i))
????????????{
//對value做哈希
????????????????h = crc32_sz((char *) HS_VAL(hsent, ptr, i), HS_VALLEN(hsent, i));
????????????????HASH(GETSIGN(res), h);
????????????}
????????}
????????retval = (GISTENTRY *) palloc(sizeof(GISTENTRY));
????????gistentryinit(*retval, PointerGetDatum(res),
???????????????????? entry->rel, entry->page,
???????????????????? entry->offset,
???????????????????? FALSE);
????}
????else if (!ISALLTRUE(DatumGetPointer(entry->key)))
????{
????????int32????????i;
????????GISTTYPE *res;
????????BITVECP????????sign = GETSIGN(DatumGetPointer(entry->key));
????????LOOPBYTE
????????{
????????????if ((sign[i] & 0xff) != 0xff)
????????????????PG_RETURN_POINTER(retval);
????????}
????????res = (GISTTYPE *) palloc(CALCGTSIZE(ALLISTRUE));
????????SET_VARSIZE(res, CALCGTSIZE(ALLISTRUE));
????????res->flag = ALLISTRUE;
????????retval = (GISTENTRY *) palloc(sizeof(GISTENTRY));
????????gistentryinit(*retval, PointerGetDatum(res),
???????????????????? entry->rel, entry->page,
???????????????????? entry->offset,
???????????????????? FALSE);
????}
????PG_RETURN_POINTER(retval);
}
3.4?性能驗證
1)環境 CentOS 6.5
PostgreSQL 9.4.0
2)測試數據準備點擊(此處)折疊或打開 testdb=# create extension hstore;
CREATE EXTENSION
Time: 235.391 ms
testdb=# create table tbhs(id serial,hs hstore);
CREATE TABLE
Time: 17.459 ms
testdb=# insert into tbhs select id,hstore(id::text, md5(id::text)) from generate_series(1,100000) id;
INSERT 0 100000
Time: 970.613 ms
testdb=# select pg_relation_size('tbhs');
?pg_relation_size
------------------
??????????8445952
(1 row)
Time: 0.643 ms
testdb=# select * from tbhs limit 5;
?id | hs
----+-----------------------------------------
??1 | "1"=>"c4ca4238a0b923820dcc509a6f75849b"
??2 | "2"=>"c81e728d9d4c2f636f067f89cc14862c"
??3 | "3"=>"eccbc87e4b5ce2fe28308fd9f2a7baf3"
??4 | "4"=>"a87ff679a2f3e71d9181a67b7542122c"
??5 | "5"=>"e4da3b7fbbce2345d7772b0674a318d5"
(5 rows)
Time: 1.147 ms 注)上面每個記錄只有一個鍵值對,且每個key都是唯一的,這樣的數據模型就不適合使用hstore存儲,本文故意這么做只是為了驗證性能。
2)無索引的查詢點擊(此處)折疊或打開 testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10'; ?????????????????????????????????????????????QUERY PLAN
-----------------------------------------------------------------------------------------------------
?Seq Scan on tbhs (cost=0.00..2281.00 rows=100 width=53) (actual time=0.025..31.587 rows=1 loops=1)
???Filter: (hs ? '10'::text)
???Rows Removed by Filter: 99999
???Buffers: shared hit=1031
?Planning time: 0.055 ms
?Execution time: 31.619 ms
(6 rows)
Time: 32.150 ms 3)建立GiST索引再查詢 點擊(此處)折疊或打開
testdb=# create index tbhs_idx_gist on tbhs using gist(hs);
CREATE INDEX
Time: 6243.003 ms
testdb=# analyze tbhs;
ANALYZE
Time: 77.956 ms
testdb=# select pg_relation_size('tbhs_idx_gist'),pg_relation_size('tbhs_idx_gist')/8192 index_pages;
?pg_relation_size | index_pages
------------------+-------------
??????????4825088 | 589
(1 row)
Time: 0.456 ms ?查詢掃描了45%(265/589)的索引頁,Recheck了1.6%的數據行。
點擊(此處)折疊或打開
testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10';
????????????????????????????????????????????????????????QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
?Bitmap Heap Scan on tbhs (cost=5.06..302.42 rows=100 width=53) (actual time=3.003..4.637 rows=1 loops=1)
???Recheck Cond: (hs ? '10'::text)
???Rows Removed by Index Recheck: 1558
???Heap Blocks: exact=863
???Buffers: shared hit=1128
???-> Bitmap Index Scan on tbhs_idx_gist (cost=0.00..5.03 rows=100 width=0) (actual time=2.869..2.869 rows=1559 loops=1)
?????????Index Cond: (hs ? '10'::text)
?????????Buffers: shared hit=265
?Planning time: 0.153 ms
?Execution time: 5.073 ms
(10 rows)
Time: 6.209 ms @>查詢掃描了19%(114/589)的索引頁,Recheck了0.012%的數據行。
點擊(此處)折疊或打開
testdb=# explain (analyze,buffers) select * from tbhs where hs @> '1=>c4ca4238a0b923820dcc509a6f75849b';
???????????????????????????????????????????????????????QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
?Bitmap Heap Scan on tbhs (cost=5.06..302.42 rows=100 width=53) (actual time=2.123..2.205 rows=1 loops=1)
???Recheck Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
???Rows Removed by Index Recheck: 11
???Heap Blocks: exact=12
???Buffers: shared hit=126
???-> Bitmap Index Scan on tbhs_idx_gist (cost=0.00..5.03 rows=100 width=0) (actual time=2.097..2.097 rows=12 loops=1)
?????????Index Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
?????????Buffers: shared hit=114
?Planning time: 0.407 ms
?Execution time: 2.294 ms
(10 rows)
Time: 3.332 ms 4)建立GIN索引再查詢建立GIN索引對比一下。 點擊(此處)折疊或打開
testdb=# drop index tbhs_idx_gist;
DROP INDEX
Time: 7.036 ms
testdb=# create index tbhs_idx_gin on tbhs using gin(hs);
CREATE INDEX
Time: 2244.241 ms
testdb=# analyze tbhs;
ANALYZE
Time: 64.608 ms
testdb=# select pg_relation_size('tbhs_idx_gin'),pg_relation_size('tbhs_idx_gin')/8192 index_pages;
?pg_relation_size | index_pages
------------------+-------------
?????????17629184 | 2152
(1 row)
Time: 0.641 ms
testdb=# explain (analyze,buffers) select * from tbhs where hs ? '10';
???????????????????????????????????????????????????????QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
?Bitmap Heap Scan on tbhs (cost=16.77..314.14 rows=100 width=53) (actual time=0.096..0.097 rows=1 loops=1)
???Recheck Cond: (hs ? '10'::text)
???Heap Blocks: exact=1
???Buffers: shared hit=5
???-> Bitmap Index Scan on tbhs_idx_gin (cost=0.00..16.75 rows=100 width=0) (actual time=0.086..0.086 rows=1 loops=1)
?????????Index Cond: (hs ? '10'::text)
?????????Buffers: shared hit=4
?Planning time: 0.349 ms
?Execution time: 0.144 ms
(9 rows)
Time: 1.014 ms
testdb=# explain (analyze,buffers) select * from tbhs where hs @> '1=>c4ca4238a0b923820dcc509a6f75849b';
???????????????????????????????????????????????????????QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
?Bitmap Heap Scan on tbhs (cost=28.77..326.14 rows=100 width=53) (actual time=0.070..0.071 rows=1 loops=1)
???Recheck Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
???Heap Blocks: exact=1
???Buffers: shared hit=8
???-> Bitmap Index Scan on tbhs_idx_gin (cost=0.00..28.75 rows=100 width=0) (actual time=0.047..0.047 rows=1 loops=1)
?????????Index Cond: (hs @> '"1"=>"c4ca4238a0b923820dcc509a6f75849b"'::hstore)
?????????Buffers: shared hit=7
?Planning time: 0.131 ms
?Execution time: 0.105 ms
(9 rows)
Time: 0.661 ms GIN索引的查詢效率相當不錯。GIN的缺點是:索引比較大,對更新速度有一定影響。
3.5 參考
http://blog.chinaunix.net/uid-20726500-id-4884626.html
http://blog.chinaunix.net/uid-20726500-id-4884681.html
http://blog.chinaunix.net/uid-20726500-id-4886571.html
http://blog.chinaunix.net/uid-259788-id-4279627.html
總結
以上是生活随笔為你收集整理的关于PostgreSQL的GiST索引之四的全部內容,希望文章能夠幫你解決所遇到的問題。
如果覺得生活随笔網站內容還不錯,歡迎將生活随笔推薦給好友。