當(dāng)前位置：首頁 > 运维知识 > 数据库 >内容正文

数据库

[译]以PostgreSQL为例，谈join计算的代价

發(fā)布時(shí)間：2025/3/21 数据库 46 豆豆

生活随笔收集整理的這篇文章主要介紹了 [译]以PostgreSQL为例，谈join计算的代价小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

join計(jì)算的代價(jià)很高嗎？
看情況
join的代價(jià)依賴于join的條件，索引是什么樣，依賴于表有多大，相關(guān)信息是否已經(jīng)cache住了，使用的什么硬件，配置參數(shù)的信息，統(tǒng)計(jì)信息是否已經(jīng)更新，同時(shí)是否還有其他運(yùn)行的計(jì)算……
暈了？別急！在以下情景下，我們依然可以找到一些規(guī)律來分析判斷：

隨著join的表的數(shù)量增加
隨著這些表的行數(shù)的增加
有沒有索引

此類情況，在工作中經(jīng)常會碰到，比如：如果有一張產(chǎn)品表 product，但業(yè)務(wù)上需要加入一個(gè)產(chǎn)品的狀態(tài)，包括Active、Discontinued、Recalled等。此時(shí)，我們會有3種不同的做法：

在產(chǎn)品表 product 中，增加一列狀態(tài)號 status_id，同時(shí)增加一個(gè)新的狀態(tài)表 status。

在產(chǎn)品表 product 中，增加一列狀態(tài)號 status_id，同時(shí)讓應(yīng)用來定義每個(gè)狀態(tài)號 status_id 對應(yīng)的含義及顯示。

在產(chǎn)品表 product 中，增加一列文本列，用來描述狀態(tài)信息。

通常，我們會選擇第一個(gè)做法。關(guān)于后兩種的做法，通常的質(zhì)疑在兩個(gè)方面：join的性能和開發(fā)人員的工程化能力。后者通常與個(gè)人喜好有關(guān)，姑且不談，咱們來一起討論一下join的性能問題。

為便于討論，選用PostgreSQL測試數(shù)據(jù)來討論。以等值連接為例，讓我們看看執(zhí)行上面的join時(shí)，性能會有什么變化？我們擔(dān)心的性能變慢，那具體會變成多慢。

以下是用來生成測試用的建表語句。

DROP FUNCTION IF EXISTS create_tables(integer, integer, boolean); CREATE FUNCTION create_tables(num_tables integer, num_rows integer, create_indexes boolean) RETURNS void AS $function_text$ BEGIN-- There's no table before the first one, so this one's a little different. Create it here instead of in our loop. DROP TABLE IF EXISTS table_1 CASCADE; CREATE TABLE table_1 (id serial primary key );-- Populate the first table INSERT INTO table_1 (id) SELECTnextval('table_1_id_seq') FROMgenerate_series(1, num_rows);-- Create and populate all the other tables FOR i IN 2..num_tables LOOPEXECUTE 'DROP TABLE IF EXISTS table_' || i || ' CASCADE;';EXECUTE format($$CREATE TABLE table_%1$s (id serial primary key,table_%2$s_id integer references table_%2$s (id));INSERT INTO table_%1$s (table_%2$s_id)SELECTidFROMtable_%2$sORDER BYrandom();$$, i, i-1);IF create_indexes THENEXECUTE 'CREATE INDEX ON table_' || i || ' (table_' || i - 1 || '_id);';END IF; END LOOP; END; $function_text$ LANGUAGE plpgsql;-- We'll want to make sure PostgreSQL has an idea of what's in these tables DROP FUNCTION IF EXISTS analyze_tables(integer); CREATE FUNCTION analyze_tables(num_tables integer) RETURNS void AS $function_text$ BEGINFOR i IN 1..num_tables LOOPEXECUTE 'ANALYZE table_' || i || ';'; END LOOP; END; $function_text$ LANGUAGE plpgsql;

執(zhí)行建表函數(shù)……

SELECT create_tables(10, 10000, False);SELECT * from table_1 limit 10;id ----12345678910 (10 rows)SELECT * from table_2 limit 10;id | table_1_id ----+------------1 | 8242 | 9733 | 8594 | 7895 | 9016 | 1127 | 1628 | 2129 | 33310 | 577 (10 rows)

OK，現(xiàn)在我們可以任意創(chuàng)建所需要的表了。
我們還需要方法來查詢，以測試join的性能。有一些不錯(cuò)的長查詢，但我們不希望手工來編寫，于是我們創(chuàng)建了另一個(gè)函數(shù)來生成它們。只需要告訴它有多少表參與join，以及where子句中最后一張表的最大的id，它就可以執(zhí)行了。

DROP FUNCTION IF EXISTS get_query(integer, integer); CREATE FUNCTION get_query(num_tables integer, max_id integer) RETURNS text AS $function_text$ DECLAREfirst_part text;second_part text;third_part text;where_clause text; BEGINfirst_part := $query$SELECTcount(*)FROMtable_1 AS t1 INNER JOIN$query$;second_part := '';FOR i IN 2..num_tables-1 LOOPsecond_part := second_part || format($query$table_%1$s AS t%1$s ONt%2$s.id = t%1$s.table_%2$s_id INNER JOIN$query$, i, i-1); END LOOP;third_part := format($query$table_%1$s AS t%1$s ONt%2$s.id = t%1$s.table_%2$s_idWHEREt1.id <= %3$s$query$, num_tables, num_tables-1, max_id);RETURN first_part || second_part || third_part || ';'; END; $function_text$ LANGUAGE plpgsql;

下面是一個(gè)生成查詢的示例。

SELECT get_query(5, 10);get_query --------------------------------------------------+SELECT +count(*) +FROM +table_1 AS t1 INNER JOIN +table_2 AS t2 ON +t1.id = t2.table_1_id INNER JOIN+table_3 AS t3 ON +t2.id = t3.table_2_id INNER JOIN+table_4 AS t4 ON +t3.id = t4.table_3_id INNER JOIN+table_5 AS t5 ON +t4.id = t5.table_4_id +WHERE +t1.id <= 10; (1 row)Time: 1.404 ms

OK，讓我們花一些時(shí)間來思考一下，當(dāng)我們運(yùn)行這條查詢時(shí)，我們實(shí)際讓Postgres做了哪些事情。在這條SQL中，我們在詢問表 table_5 中的 table_4_id 列有多少在表 table_4中，而且表 table_4 中 table_3_id 列有多少在表 table_2 中，而且表 table_2 中 table_1_id 列有多少在表 table_1 中，而且 table_1_id 小于等于10。
我們繼續(xù)運(yùn)行……

SELECTcount(*) FROMtable_1 AS t1 INNER JOINtable_2 AS t2 ONt1.id = t2.table_1_id INNER JOINtable_3 AS t3 ONt2.id = t3.table_2_id INNER JOINtable_4 AS t4 ONt3.id = t4.table_3_id INNER JOINtable_5 AS t5 ONt4.id = t5.table_4_id WHEREt1.id <= 10;count -------10 (1 row)Time: 40.494 ms

我們可以通過拋出 EXPLAIN ANALYZE 來查看進(jìn)展。

EXPLAIN ANALYZE SELECTcount(*) FROMtable_1 AS t1 INNER JOINtable_2 AS t2 ONt1.id = t2.table_1_id INNER JOINtable_3 AS t3 ONt2.id = t3.table_2_id INNER JOINtable_4 AS t4 ONt3.id = t4.table_3_id INNER JOINtable_5 AS t5 ONt4.id = t5.table_4_id WHEREt1.id <= 10;QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=827.93..827.94 rows=1 width=8) (actual time=43.392..43.392 rows=1 loops=1)-> Hash Join (cost=645.31..827.90 rows=9 width=0) (actual time=35.221..43.353 rows=10 loops=1)Hash Cond: (t5.table_4_id = t4.id)-> Seq Scan on table_5 t5 (cost=0.00..145.00 rows=10000 width=4) (actual time=0.024..3.984 rows=10000 loops=1)-> Hash (cost=645.20..645.20 rows=9 width=4) (actual time=34.421..34.421 rows=10 loops=1)Buckets: 1024 Batches: 1 Memory Usage: 9kB-> Hash Join (cost=462.61..645.20 rows=9 width=4) (actual time=25.281..34.357 rows=10 loops=1)Hash Cond: (t4.table_3_id = t3.id)-> Seq Scan on table_4 t4 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.022..4.828 rows=10000 loops=1)-> Hash (cost=462.50..462.50 rows=9 width=4) (actual time=23.519..23.519 rows=10 loops=1)Buckets: 1024 Batches: 1 Memory Usage: 9kB-> Hash Join (cost=279.91..462.50 rows=9 width=4) (actual time=12.617..23.453 rows=10 loops=1)Hash Cond: (t3.table_2_id = t2.id)-> Seq Scan on table_3 t3 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.017..5.065 rows=10000 loops=1)-> Hash (cost=279.80..279.80 rows=9 width=4) (actual time=12.221..12.221 rows=10 loops=1)Buckets: 1024 Batches: 1 Memory Usage: 9kB-> Hash Join (cost=8.55..279.80 rows=9 width=4) (actual time=0.293..12.177 rows=10 loops=1)Hash Cond: (t2.table_1_id = t1.id)-> Seq Scan on table_2 t2 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.017..5.407 rows=10000 loops=1)-> Hash (cost=8.44..8.44 rows=9 width=4) (actual time=0.054..0.054 rows=10 loops=1)Buckets: 1024 Batches: 1 Memory Usage: 9kB-> Index Only Scan using table_1_pkey on table_1 t1 (cost=0.29..8.44 rows=9 width=4) (actual time=0.024..0.035 rows=10 loops=1)Index Cond: (id <= 10)Heap Fetches: 10Planning time: 1.659 msExecution time: 43.585 ms (26 rows)

我們可以看到，除了使用表table_1的主鍵索引外，都是順序掃描。它還可以怎么做呢？因?yàn)槲覀儧]有建任何索引來優(yōu)化它。
如果我們重新做這個(gè)實(shí)驗(yàn)，告訴* create_tables()*去創(chuàng)建索引……

SELECT create_tables(10, 10000, True);

重新運(yùn)行后，我們得到不同的查詢計(jì)劃。

EXPLAIN ANALYZE SELECTcount(*) FROMtable_1 AS t1 INNER JOINtable_2 AS t2 ONt1.id = t2.table_1_id INNER JOINtable_3 AS t3 ONt2.id = t3.table_2_id INNER JOINtable_4 AS t4 ONt3.id = t4.table_3_id INNER JOINtable_5 AS t5 ONt4.id = t5.table_4_id WHEREt1.id <= 10;QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=88.52..88.53 rows=1 width=8) (actual time=0.411..0.411 rows=1 loops=1)-> Nested Loop (cost=1.43..88.50 rows=9 width=0) (actual time=0.067..0.399 rows=10 loops=1)-> Nested Loop (cost=1.14..85.42 rows=9 width=4) (actual time=0.054..0.304 rows=10 loops=1)-> Nested Loop (cost=0.86..82.34 rows=9 width=4) (actual time=0.043..0.214 rows=10 loops=1)-> Nested Loop (cost=0.57..79.25 rows=9 width=4) (actual time=0.032..0.113 rows=10 loops=1)-> Index Only Scan using table_1_pkey on table_1 t1 (cost=0.29..8.44 rows=9 width=4) (actual time=0.015..0.023 rows=10 loops=1)Index Cond: (id <= 10)Heap Fetches: 10-> Index Scan using table_2_table_1_id_idx on table_2 t2 (cost=0.29..7.86 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=10)Index Cond: (table_1_id = t1.id)-> Index Scan using table_3_table_2_id_idx on table_3 t3 (cost=0.29..0.33 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=10)Index Cond: (table_2_id = t2.id)-> Index Scan using table_4_table_3_id_idx on table_4 t4 (cost=0.29..0.33 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=10)Index Cond: (table_3_id = t3.id)-> Index Only Scan using table_5_table_4_id_idx on table_5 t5 (cost=0.29..0.33 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=10)Index Cond: (table_4_id = t4.id)Heap Fetches: 10Planning time: 2.287 msExecution time: 0.546 ms (19 rows)

結(jié)果是使用來索引，速度快了很多。而且這與我們預(yù)期的一致。我們現(xiàn)在準(zhǔn)備將這些混到一起，看看隨著表和列數(shù)量的增加時(shí)會有什么變化？
需要說明一下，這些測試是運(yùn)行在AWS RDS db.m4.large實(shí)例上。這是最便宜的實(shí)例，而且也不會在性能上自動(dòng)擴(kuò)容，所以可以作為基準(zhǔn)。我們共運(yùn)行10次，取平均值。
最初的查詢，join涉及的表數(shù)量從2到200，包含3種行設(shè)置，而且沒有索引。

db.m4.large
性能還是不錯(cuò)的。更為重要的是，我們可以計(jì)算每個(gè)增加的join的成本！對于100行的表，下面的數(shù)據(jù)顯示了每次增加表所帶來的執(zhí)行時(shí)間的增加：

表的數(shù)量平均增加量時(shí)間（ms）

2-50	(0.012738 - 0.000327) / (50 - 2) =	0.259
50-100	(0.0353395 - 0.012738) / (100 - 50) =	0.452
100-150	(0.0762056 - 0.0353395) / (150 - 100) =	0.817
150-200	(0.1211591 - 0.0762056) / (200 - 150) =	0.899

甚至在參與join的表數(shù)量已經(jīng)接近200時(shí)，每增加一個(gè)表只增加少于1ms的執(zhí)行時(shí)間。
在討論增加一個(gè)表用來存儲產(chǎn)品的狀態(tài)時(shí)，表增加至1000行時(shí)就過載了。所以，不考慮索引的情況下，增加一個(gè)引用的表并不會對性能產(chǎn)生實(shí)質(zhì)性影響。
由于之前所做的基準(zhǔn)測試的數(shù)據(jù)量不大，如果碰到大表呢？我們重新做了相同的測試，但這次每張表都包含一百萬行。這次只增加到50張表。為什么？運(yùn)行它需要一段時(shí)間，我只有有限都預(yù)算和耐心

百萬級大表時(shí)的性能

這次運(yùn)行結(jié)果曲線就沒有重疊了。請看最大的1百萬行的表，每增加1張join的表，它需要多運(yùn)行93ms。

表的數(shù)量平均增加量時(shí)間（ms）

2-10	(0.8428495 - 0.0924571) / (10 - 2) =	93.799
10-20	(1.781959 - 0.8428495) / (20 - 10) =	93.911
20-30	(2.708342 - 1.781959) / (30 - 20) =	92.638
30-40	(3.649164 - 2.708342) / (40 - 30) =	94.082
40-50	(4.565644 - 3.649164) / (50 - 40) =	91.648

此次都是順序掃描，因此我們增加索引，看看會有什么變化？

增加索引后的性能

增加索引后，性能影響很顯著。當(dāng)能使用到索引時(shí)，不管表中有多少行，測試結(jié)果都差不多。針對10萬行的表的查詢一般最慢，但并總是最慢。

表的數(shù)量平均增加量時(shí)間（ms）

2-50	(0.0119917 - 0.000265) / (50 - 2) =	0.244
50-100	(0.035345 - 0.0119917) / (100 - 50) =	0.467
100-150	(0.0759236 - 0.035345) / (150 - 100) = 92.638	0.811
150-200	(0.1378461 - 0.0759236) / (200 - 150) =	1.238

即使查詢已經(jīng)涉及到150張表，此時(shí)增加1一張表只會增加1.2ms。
最后一個(gè)測試場景，由于歷時(shí)太長，等不及生成200個(gè)1百萬行的表及建立索引，但又想觀察性能的變化，于是選擇測試50張表時(shí)的結(jié)果……

加大數(shù)據(jù)量的測試表的數(shù)量平均增加量時(shí)間（ms）

2-10	(0.0016811 - 0.000276) / (10 - 2) =	0.176
10-20	(0.003771 - 0.0016811) / (20 - 10) =	0.209
20-30	(0.0062328 - 0.003771) / (30 - 20) =	0.246
30-40	(0.0088621 - 0.0062328) / (40 - 30) =	0.263
40-50	(0.0120818 - 0.0088621) / (50 - 40) =	0.322

基于之前增加索引帶來的性能改進(jìn)結(jié)果，這次并沒有帶來太多的性能驚喜。50張1百萬行的表做join，只需要12ms。Cool！
也許這些額外的join操作的成本比我們預(yù)想的要低一些，但有件事需要我們?nèi)タ紤]，雖然每個(gè)增加的join運(yùn)算所占用的時(shí)間很小，但越多的表意味著越多的查詢計(jì)劃需要考慮，這很可能會導(dǎo)致很難找到最佳的查詢計(jì)劃。例如，當(dāng)join的數(shù)量超過 geqo_threshold 時(shí)（默認(rèn)為12），postgres 會停止參考所有可能的查詢計(jì)劃，改為使用通用算法。這會改變查詢計(jì)劃，引起對性能的負(fù)面影響。
由于每個(gè)系統(tǒng)的業(yè)務(wù)千差萬別，一定要基于你的數(shù)據(jù)來測試你的查詢。雖然我們看到增加join的成本很低，但仍然非常有必要去規(guī)范你的數(shù)據(jù)。

[原文] Cost of a Join
非直譯，僅為增加樂趣，向作者的嚴(yán)謹(jǐn)性致敬。

總結(jié)

以上是生活随笔為你收集整理的[译]以PostgreSQL为例，谈join计算的代价的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

数据库

[译]以PostgreSQL为例，谈join计算的代价

總結(jié)