Indexes in PostgreSQL: 9 (BRIN)

Published: 2023/12/16 | Category: Database | 豆豆

Earlier articles in this series covered the PostgreSQL indexing engine, the interface of access methods, and the following methods: hash indexes, B-trees, GiST, SP-GiST, GIN, and RUM. The topic of this article is BRIN indexes.

BRIN

General concept

Unlike the indexes we have already become acquainted with, the idea of BRIN is to avoid looking through rows that definitely do not match, rather than to quickly find the matching ones. It is always an inaccurate index: it does not contain TIDs of table rows at all.

Put simply, BRIN works well for columns whose values correlate with their physical location in the table. In other words, it suits columns for which a query without an ORDER BY clause returns the values in virtually increasing or decreasing order (assuming there are no indexes on that column).

This access method was created within Axle, the European project for extremely large analytical databases, with an eye on tables that are several terabytes or dozens of terabytes in size. An important feature of BRIN that makes it practical to create indexes on such tables is its small size and minimal maintenance overhead.

This works as follows. The table is split into ranges that are several pages large (or several blocks large, which is the same), hence the name: Block Range Index, BRIN. The index stores summary information on the data in each range. As a rule, this is the minimal and maximal values, but it can be something else, as shown later. Suppose a query contains a condition on the column: if the sought values fall outside the interval, the whole range can be skipped; if they fall inside it, all rows in all blocks of the range have to be looked through to choose the matching ones.

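The summary idea can be emulated with plain SQL. The following is only a sketch under stated assumptions: the table name brin_sketch and its contents are hypothetical, the range size is taken as 128 pages, and the query reads the whole table, which is exactly what a real BRIN index avoids.

```sql
-- Hypothetical demo table: values grow together with the physical position.
create table brin_sketch as
select g as val from generate_series(1, 1000000) as g;

-- Emulate BRIN summaries: one (min, max) pair per range of 128 pages.
-- (ctid::text::point)[0] extracts the page number from the row's TID.
select (ctid::text::point)[0]::bigint / 128 as range_no,
       min(val) as min_val,
       max(val) as max_val
from brin_sketch
group by range_no
order by range_no
limit 5;
```

Because the values were inserted in increasing order, the intervals of neighboring ranges do not overlap, which is the situation where skipping ranges pays off most.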

It would not be a mistake to treat BRIN not as an index but as an accelerator of sequential scan. We can also regard BRIN as an alternative to partitioning if we consider each range a "virtual" partition.

Now let's discuss the structure of the index in more detail.

Structure

The first (more exactly, zero) page contains the metadata.

Pages with the summary information are located at a certain offset from the metadata. Each index row on those pages contains summary information on one range.

Between the meta page and the summary data are the pages with the reverse range map (abbreviated as "revmap"). In effect, this is an array of pointers (TIDs) to the corresponding index rows.

For some ranges, the pointer in "revmap" may lead to no index row (one such range is marked in gray in the figure). In that case, the range is considered to have no summary information yet.

Scanning the index

How is the index used if it does not contain references to table rows? This access method certainly cannot return rows TID by TID, but it can build a bitmap. Bitmap pages come in two kinds: accurate (to the row) and inaccurate (to the page). BRIN always produces an inaccurate bitmap.

The algorithm is simple. The map of ranges is scanned sequentially (that is, the ranges are traversed in the order of their location in the table). The pointers are used to find the index rows with summary information on each range. If a range cannot contain the value sought, it is skipped; if it can contain the value (or summary information is unavailable), all pages of the range are added to the bitmap. The resulting bitmap is then used as usual.

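The decision made for each range can be sketched in SQL. This is an illustration, not PostgreSQL's implementation: the summaries, the extra unsummarized range, and the sought value 4242 are all made up.

```sql
-- Hypothetical summaries: range n holds values n*1000+1 .. (n+1)*1000.
with summaries(range_no, lo, hi) as (
  select n, n * 1000 + 1, (n + 1) * 1000
  from generate_series(0, 9) as n
  union all
  select 10, null::int, null::int   -- a range that has not been summarized yet
)
select range_no,
       case
         when lo is null then 'no summary: add all pages to the bitmap'
         when 4242 between lo and hi then 'may match: add all pages to the bitmap'
         else 'skip'
       end as action
from summaries
order by range_no;
```

Only range 4 (4001 to 5000) and the unsummarized range 10 end up in the bitmap; every other range is skipped without touching the table.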

Updating the index

How the index is updated when the table changes is more interesting.

When a new version of a row is added to a table page, we determine which range the page belongs to and use the map of ranges to find the index row with the summary information. These are all simple arithmetic operations. Suppose, for instance, that the size of a range is four pages and a row version with the value 42 appears on page 13. The number of the range (starting with zero) is 13 / 4 = 3 (integer division), so in "revmap" we take the pointer at offset 3 (the fourth one in order).

The minimal value for this range is 31, and the maximal one is 40. Since the new value of 42 is out of the interval, we update the maximal value (see the figure). But if the new value is still within the stored limits, the index does not need to be updated.

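The arithmetic of this example can be checked directly in psql. The numbers are the ones used in the text; least and greatest merely mimic how a minmax summary is widened.

```sql
select 13 / 4           as range_no,        -- integer division: 3
       13 / 4 + 1       as ordinal_number,  -- 4
       least(31, 42)    as new_min,         -- 31, unchanged
       greatest(40, 42) as new_max;         -- 42, the interval is widened
```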

All this relates to the situation when the new row version lands in a range for which summary information is available. When the index is created, summary information is computed for all existing ranges, but as the table grows, new pages can appear that fall outside the summarized ranges. Two options are available here:

Usually the index is not updated immediately. This is not a big deal: as already mentioned, when the index is scanned, the whole unsummarized range will be looked through. The actual update is done during vacuum, or it can be done manually by calling the "brin_summarize_new_values" function.

If we create the index with the "autosummarize" parameter, the update will be done immediately. But while pages of the range are being populated with new values, updates can happen too often, so this parameter is turned off by default.

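Both options can be sketched as follows. The first index name is the one that appears in the example later in this article; the second index name is hypothetical. The "autosummarize" storage parameter assumes PostgreSQL 10 or later.

```sql
-- Option 1: summarize all ranges that have no summary yet;
-- the function returns the number of ranges it summarized.
select brin_summarize_new_values('flights_bi_scheduled_time_idx');

-- Option 2: request automatic summarization when creating an index
-- (the index name here is hypothetical).
create index flights_bi_actual_time_brin
  on flights_bi using brin (actual_time)
  with (autosummarize = on);
```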

When new ranges appear, the size of "revmap" can increase. Whenever the map, located between the meta page and the summary data, needs to be extended by another page, the row versions on that page are moved to other pages. So the map of ranges is always located between the meta page and the summary data.

When a row is deleted... nothing happens. Sometimes the deleted value is the minimal or maximal one for its range, in which case the interval could be narrowed. But to detect this, we would have to read all values in the range, and this is costly.

The correctness of the index is not affected, but a search may have to look through more ranges than is actually needed. In principle, summary information can be recalculated manually for such a range (by calling the "brin_desummarize_range" and "brin_summarize_new_values" functions), but how can we detect the need? In any case, no standard procedure is available for this.

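A manual recalculation for one range might look like this. It is a sketch: the index name comes from the example later in this article, the block number 1280 is hypothetical, and "brin_desummarize_range" assumes PostgreSQL 10 or later.

```sql
-- Drop the summary of the range that contains heap block 1280
-- (the block number is a hypothetical example).
select brin_desummarize_range('flights_bi_scheduled_time_idx', 1280);

-- Recompute summaries for every range that currently has none,
-- including the one just desummarized.
select brin_summarize_new_values('flights_bi_scheduled_time_idx');
```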

Finally, updating a row is just a deletion of the outdated version and addition of a new one.

Example

Let's try to build our own mini data warehouse from the tables of the demo database. Assume that for BI reporting we need a denormalized table that reflects the flights that departed from or landed at an airport, down to the level of a seat in the cabin. The data for each airport is added to the table once a day, when it is midnight in the corresponding time zone. The data is neither updated nor deleted.

The table will look as follows:

demo=# create table flights_bi(
  airport_code char(3),
  airport_coord point,          -- geo coordinates of airport
  airport_utc_offset interval,  -- time zone
  flight_no char(6),            -- flight number
  flight_type text,             -- flight type: departure / arrival
  scheduled_time timestamptz,   -- scheduled departure/arrival time of flight
  actual_time timestamptz,      -- actual time of flight
  aircraft_code char(3),
  seat_no varchar(4),           -- seat number
  fare_conditions varchar(10),  -- travel class
  passenger_id varchar(20),
  passenger_name text
);

We can simulate the loading procedure with nested loops: an outer loop by days (we will consider a large database, and therefore 365 days) and an inner loop by time zones (from UTC+02 to UTC+12). The query is pretty long and not of particular interest, so I'll hide it under the spoiler.

Simulation of loading the data to the storage

DO $$
<<local>>
DECLARE
  curdate date := (SELECT min(scheduled_departure) FROM flights);
  utc_offset interval;
BEGIN
  WHILE (curdate <= bookings.now()::date) LOOP
    utc_offset := interval '12 hours';
    WHILE (utc_offset >= interval '2 hours') LOOP
      INSERT INTO flights_bi
        WITH flight (
          airport_code, airport_coord, flight_id, flight_no,
          scheduled_time, actual_time, aircraft_code, flight_type
        ) AS (
          -- departures
          SELECT a.airport_code, a.coordinates, f.flight_id, f.flight_no,
                 f.scheduled_departure, f.actual_departure, f.aircraft_code,
                 'departure'
          FROM airports a, flights f, pg_timezone_names tzn
          WHERE a.airport_code = f.departure_airport
            AND f.actual_departure IS NOT NULL
            AND tzn.name = a.timezone
            AND tzn.utc_offset = local.utc_offset
            AND timezone(a.timezone, f.actual_departure)::date = curdate
          UNION ALL
          -- arrivals
          SELECT a.airport_code, a.coordinates, f.flight_id, f.flight_no,
                 f.scheduled_arrival, f.actual_arrival, f.aircraft_code,
                 'arrival'
          FROM airports a, flights f, pg_timezone_names tzn
          WHERE a.airport_code = f.arrival_airport
            AND f.actual_arrival IS NOT NULL
            AND tzn.name = a.timezone
            AND tzn.utc_offset = local.utc_offset
            AND timezone(a.timezone, f.actual_arrival)::date = curdate
        )
        SELECT f.airport_code, f.airport_coord, local.utc_offset,
               f.flight_no, f.flight_type, f.scheduled_time, f.actual_time,
               f.aircraft_code, s.seat_no, s.fare_conditions,
               t.passenger_id, t.passenger_name
        FROM flight f
          JOIN seats s
            ON s.aircraft_code = f.aircraft_code
          LEFT JOIN boarding_passes bp
            ON bp.flight_id = f.flight_id AND bp.seat_no = s.seat_no
          LEFT JOIN ticket_flights tf
            ON tf.ticket_no = bp.ticket_no AND tf.flight_id = bp.flight_id
          LEFT JOIN tickets t
            ON t.ticket_no = tf.ticket_no;
      RAISE NOTICE '%, %', curdate, utc_offset;
      utc_offset := utc_offset - interval '1 hour';
    END LOOP;
    curdate := curdate + 1;
  END LOOP;
END;
$$;

demo=# select count(*) from flights_bi;
  count
----------
 30517076
(1 row)

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi'));
 pg_size_pretty
----------------
 4127 MB
(1 row)

We get 30 million rows and 4 GB. That is not very large, but good enough for a laptop: a sequential scan took me about 10 seconds.

On what columns should we create the index?

Since BRIN indexes are small, have moderate overhead costs, and are updated infrequently, if at all, a rare opportunity arises to build many indexes "just in case", for example, on all fields that analysts might use in their ad hoc queries. If an index turns out not to be useful, never mind, and even a not very efficient index will almost certainly work better than a sequential scan. Of course, there are fields on which building an index is absolutely useless; plain common sense will point them out.

But it would be odd to limit ourselves to this piece of advice, so let's try to state a more accurate criterion.

We've already mentioned that the data must somewhat correlate with its physical location. Here it makes sense to remember that PostgreSQL gathers table column statistics, which include the correlation value. The planner uses this value to select between a regular index scan and bitmap scan, and we can use it to estimate the applicability of BRIN index.

In the above example, the data is evidently ordered by days (by "scheduled_time" as well as by "actual_time"; there is not much difference). This is because when rows are added to the table (without deletions and updates), they are laid out in the file one after another. In the simulation of data loading we did not even use an ORDER BY clause, so timestamps within a day can, in general, be mixed up arbitrarily, but the ordering by days must be in place. We can check this using the correlation statistics.

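A query along these lines lists the columns by the strength of their correlation. It is a sketch, assuming the flights_bi table from this example; the figures it returns depend on the loaded data and are not reproduced here.

```sql
-- Gather fresh statistics first; pg_stats is only as current as the last ANALYZE.
analyze flights_bi;

-- Columns with |correlation| close to 1 are laid out almost in order on disk.
select attname, correlation
from pg_stats
where tablename = 'flights_bi'
order by abs(correlation) desc nulls last;
```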

A value that is not too close to zero (ideally, near plus or minus one, as in this case) tells us that a BRIN index will be appropriate.

The travel class "fare_conditions" (the column contains three unique values) and the flight type "flight_type" (two unique values) unexpectedly turned out to be in second and third place. This is an illusion: formally the correlation is high, but in practice all possible values are certain to occur within any few successive pages, which means that BRIN won't do any good.

The time zone "airport_utc_offset" goes next: in our example, within a day cycle, airports are ordered by time zone "by construction".

It's these two fields, time and time zone, that we will further experiment with.

Possible weakening of the correlation

The correlation that is in place "by construction" can easily be weakened when the data is changed. The issue here is not a change to a particular value but the structure of multiversion concurrency control: the outdated row version is deleted on one page, but the new version may be inserted wherever free space is available. Because of this, whole rows get mixed up during updates.

We can partially control this effect by reducing the value of the "fillfactor" storage parameter, thus leaving free space on a page for future updates. But do we want to increase the size of an already huge table? Besides, this does not resolve the issue of deletions: they also "set traps" for new rows by freeing space somewhere inside existing pages. Because of this, rows that would otherwise go to the end of the file will be inserted at some arbitrary place.

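For reference, the storage parameter is set like this. The value 90 is an arbitrary illustration, not a recommendation.

```sql
-- Leave about 10% of each newly filled page free for future row versions.
-- This affects only pages written afterwards; existing pages keep their
-- layout until the table is rewritten.
alter table flights_bi set (fillfactor = 90);
```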

By the way, here is a curious fact. Since a BRIN index does not contain references to table rows, its presence should not hinder HOT updates at all, but it does. (This holds for the PostgreSQL versions current when this was written; starting with PostgreSQL 16, an update that changes only BRIN-indexed columns can be HOT.)

So, BRIN is mainly designed for large and even huge tables that are either not updated at all or updated very little. However, it copes perfectly with the addition of new rows (to the end of the table). This is not surprising, since this access method was created with data warehouses and analytical reporting in mind.

What size of a range do we need to select?

If we deal with a terabyte table, our main concern when selecting the size of a range will probably be not to make the BRIN index too large. In our situation, however, we can afford to analyze the data more accurately.

To do this, we can select the unique values of a column and see on how many pages each of them occurs. Localization of the values increases the chances that a BRIN index will be useful. Moreover, the number of pages found will suggest the size of a range. But if a value is "spread" over all pages, BRIN is useless.

Of course, we should apply this technique while keeping a watchful eye on the internal structure of the data. For example, it makes no sense to treat each date (more exactly, each timestamp, which also includes the time) as a unique value; we need to round it to days.

Technically, this analysis can be done by looking at the value of the hidden "ctid" column, which provides the pointer to a row version (TID): the number of the page and the number of the row inside the page. Unfortunately, there is no conventional way to decompose a TID into its two components, so we have to cast types through the text representation:

demo=# select min(numblk), round(avg(numblk)) avg, max(numblk)
from (
  select count(distinct (ctid::text::point)[0]) numblk
  from flights_bi
  group by scheduled_time::date
) t;
 min  | avg  | max
------+------+------
 1192 | 1500 | 1796
(1 row)

demo=# select relpages from pg_class where relname = 'flights_bi';
 relpages
----------
   528172
(1 row)

We can see that each day is distributed across pages pretty evenly, and days are only slightly mixed up with each other (1500 × 365 = 547500, which is only a little larger than the number of pages in the table, 528172). This is clear "by construction" anyway.

The valuable information here is the specific number of pages. With the default range size of 128 pages, each day will populate 9 to 14 ranges. This seems realistic: with a query for a specific day, we can expect an error of around 10%.

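The range counts follow from dividing the per-day page counts found above by the range size; a quick check in psql:

```sql
-- Pages per day (min, avg, max from the query above) divided by 128 pages per range.
select round(1192 / 128.0, 1) as min_ranges,  -- 9.3
       round(1500 / 128.0, 1) as avg_ranges,  -- 11.7
       round(1796 / 128.0, 1) as max_ranges;  -- 14.0
```

With roughly a dozen ranges per day, about one extra range read at the boundaries of a day gives an error on the order of 10%.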

Let's try:

demo=# create index on flights_bi using brin(scheduled_time);

The size of the index is as small as 184 KB:

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_scheduled_time_idx'));
 pg_size_pretty
----------------
 184 kB
(1 row)

In this case, it hardly makes sense to increase the size of a range at the cost of losing the accuracy. But we can reduce the size if required, and the accuracy will, on the contrary, increase (along with the size of the index).

Now let's look at time zones. Here we cannot use a brute-force approach either. Since the distribution repeats within each day, all values should instead be divided by the number of day cycles. Besides, since there are only a few time zones, we can look at the entire distribution:

demo=# select airport_utc_offset, count(distinct (ctid::text::point)[0])/365 numblk
from flights_bi
group by airport_utc_offset
order by 2;
 airport_utc_offset | numblk
--------------------+--------
 12:00:00           |      6
 06:00:00           |      8
 02:00:00           |     10
 11:00:00           |     13
 08:00:00           |     28
 09:00:00           |     29
 10:00:00           |     40
 04:00:00           |     47
 07:00:00           |    110
 05:00:00           |    231
 03:00:00           |    932
(11 rows)

On average, the data for each time zone populates 133 pages a day, but the distribution is highly non-uniform: Petropavlovsk-Kamchatskiy and Anadyr fit in as few as six pages, while Moscow and its neighborhood require hundreds. The default range size is no good here; let's set it to, for example, four pages.

demo=# create index on flights_bi using brin(airport_utc_offset) with (pages_per_range=4);
demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_airport_utc_offset_idx'));
 pg_size_pretty
----------------
 6528 kB
(1 row)

Execution plan

Let's look at how our indexes work. Let's select some day, say, a week ago (in the demo database, "today" is determined by the "bookings.now" function):

demo=# \set d 'bookings.now()::date - interval \'7 days\''
demo=# explain (costs off,analyze)
select *
from flights_bi
where scheduled_time >= :d and scheduled_time < :d + interval '1 day';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=10.282..94.328 rows=83954 loops=1)
   Recheck Cond: ...
   Rows Removed by Index Recheck: 12045
   Heap Blocks: lossy=1664
   ->  Bitmap Index Scan on flights_bi_scheduled_time_idx
       (actual time=3.013..3.013 rows=16640 loops=1)
         Index Cond: ...
 Planning time: 0.375 ms
 Execution time: 97.805 ms

As we can see, the planner used the index we created. How accurate is it? The ratio of the number of rows that meet the query conditions ("rows" of the Bitmap Heap Scan node) to the total number of rows returned using the index (the same value plus Rows Removed by Index Recheck) tells us this. In this case it is 83954 / (83954 + 12045), which is about 87%, close to the expected 90% (this value will change from one day to another).

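The ratio can be computed from the two numbers in the plan above:

```sql
-- Matching rows divided by all rows fetched through the lossy bitmap.
select round(100.0 * 83954 / (83954 + 12045), 1) as accuracy_pct;  -- 87.5
```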

Where does the number 16640 in the actual "rows" of the Bitmap Index Scan node come from? The thing is that this node of the plan builds an inaccurate (page-by-page) bitmap and is completely unaware of how many rows the bitmap will touch, while something needs to be shown. Therefore, in desperation, one page is assumed to contain 10 rows. The bitmap contains 1664 pages in total (this value is shown in "Heap Blocks: lossy=1664"), so we get exactly 16640. All in all, this is a meaningless number that we should not pay attention to.

How about airports? For example, let's take the time zone of Vladivostok, which populates 28 pages a day:

demo=# explain (costs off,analyze)
select *
from flights_bi
where airport_utc_offset = interval '8 hours';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=75.151..192.210 rows=587353 loops=1)
   Recheck Cond: (airport_utc_offset = '08:00:00'::interval)
   Rows Removed by Index Recheck: 191318
   Heap Blocks: lossy=13380
   ->  Bitmap Index Scan on flights_bi_airport_utc_offset_idx
       (actual time=74.999..74.999 rows=133800 loops=1)
         Index Cond: (airport_utc_offset = '08:00:00'::interval)
 Planning time: 0.168 ms
 Execution time: 212.278 ms

The planner again uses the BRIN index created. The accuracy is worse (about 75% in this case), but this is expected since the correlation is lower.

Several BRIN indexes (just like any others) can certainly be combined at the bitmap level. For example, the following is the data on the selected time zone for a month (note the "BitmapAnd" node):

demo=# \set d 'bookings.now()::date - interval \'60 days\''
demo=# explain (costs off,analyze)
select *
from flights_bi
where scheduled_time >= :d and scheduled_time < :d + interval '30 days'
  and airport_utc_offset = interval '8 hours';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=62.046..113.849 rows=48154 loops=1)
   Recheck Cond: ...
   Rows Removed by Index Recheck: 18856
   Heap Blocks: lossy=1152
   ->  BitmapAnd (actual time=61.777..61.777 rows=0 loops=1)
         ->  Bitmap Index Scan on flights_bi_scheduled_time_idx
             (actual time=5.490..5.490 rows=435200 loops=1)
               Index Cond: ...
         ->  Bitmap Index Scan on flights_bi_airport_utc_offset_idx
             (actual time=55.068..55.068 rows=133800 loops=1)
               Index Cond: ...
 Planning time: 0.408 ms
 Execution time: 115.475 ms

Comparison with B-tree

What if we create a regular B-tree index on the same field as BRIN?

demo=# create index flights_bi_scheduled_time_btree on flights_bi(scheduled_time);
demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_scheduled_time_btree'));
 pg_size_pretty
----------------
 654 MB
(1 row)

It turned out to be several thousand times larger than our BRIN! However, the query is performed a little faster: the planner used statistics to figure out that the data is physically ordered, so there is no need to build a bitmap and, most importantly, the index condition does not need to be rechecked:

demo=# explain (costs off,analyze)
select *
from flights_bi
where scheduled_time >= :d and scheduled_time < :d + interval '1 day';
                           QUERY PLAN
----------------------------------------------------------------
 Index Scan using flights_bi_scheduled_time_btree on flights_bi
 (actual time=0.099..79.416 rows=83954 loops=1)
   Index Cond: ...
 Planning time: 0.500 ms
 Execution time: 85.044 ms

That's what is so wonderful about BRIN: we sacrifice some efficiency but save a great deal of space.

Operator classes

minmax

For data types whose values can be compared with one another, summary information consists of the minimal and maximal values. Names of the corresponding operator classes contain "minmax", for example, "date_minmax_ops". These are the data types we have been considering so far, and most types are of this kind.

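The operator classes available for BRIN in a given installation can be listed from the system catalogs. This is a sketch; the set of rows returned depends on the PostgreSQL version and installed extensions.

```sql
-- All operator classes registered for the brin access method,
-- together with the data type each one indexes.
select opcname, opcintype::regtype as indexed_type
from pg_opclass
where opcmethod = (select oid from pg_am where amname = 'brin')
order by opcname;
```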

inclusive

Comparison operators are not defined for all data types. For example, they are not defined for points (the "point" type), which represent the geographical coordinates of airports. By the way, this is why the statistics do not show a correlation for this column.

demo=# select attname, correlation
from pg_stats
where tablename='flights_bi' and attname = 'airport_coord';
    attname    | correlation
---------------+-------------
 airport_coord |
(1 row)

But many such types allow us to introduce the concept of a "bounding area", for example, a bounding rectangle for geometric shapes. We discussed in detail how the GiST index uses this feature. Similarly, BRIN can gather summary information on columns of such data types: the bounding area for all values inside a range is simply the summary value.

Unlike for GiST, the summary value for BRIN must be of the same type as the values being indexed. Therefore, we cannot build the index for points, although it is clear that the coordinates could work in BRIN: the longitude is closely connected with the time zone. Fortunately, nothing hinders creation of the index on an expression after transforming points into degenerate rectangles. At the same time, we will set the size of a range to one page, just to show the limit case:

demo=# create index on flights_bi using brin (box(airport_coord)) with (pages_per_range=1);

The size of the index is as small as 30 MB even in such an extreme situation:

demo=# select pg_size_pretty(pg_total_relation_size('flights_bi_box_idx'));
 pg_size_pretty
----------------
 30 MB
(1 row)

Now we can make up queries that limit the airports by coordinates. For example:

demo=# select airport_code, airport_name
from airports
where box(coordinates) <@ box '120,40,140,50';
 airport_code |  airport_name
--------------+------------------
 KHV          | Khabarovsk-Novyi
 VVO          | Vladivostok
(2 rows)

The planner will, however, refuse to use our index.

demo=# analyze flights_bi;
demo=# explain select * from flights_bi
where box(airport_coord) <@ box '120,40,140,50';
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on flights_bi (cost=0.00..985928.14 rows=30517 width=111)
   Filter: (box(airport_coord) <@ '(140,50),(120,40)'::box)

Why? Let's disable sequential scan and see what happens:

demo=# set enable_seqscan = off;
demo=# explain select * from flights_bi
where box(airport_coord) <@ box '120,40,140,50';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (cost=14079.67..1000007.81 rows=30517 width=111)
   Recheck Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
   ->  Bitmap Index Scan on flights_bi_box_idx
       (cost=0.00..14072.04 rows=30517076 width=0)
         Index Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)

It appears that the index can be used, but the planner assumes that the bitmap will have to be built on the whole table (look at "rows" in the Bitmap Index Scan node: 30517076), so it is no wonder that it chooses sequential scan in this case. The issue here is that PostgreSQL does not gather any statistics for geometric types, and the planner has to work blindly.
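One way to see what the planner has to work with is to look at the statistics stored for the index expression. The following is only a sketch, assuming the index keeps the name flights_bi_box_idx shown above; for expression indexes, the pg_stats view lists entries under the index name rather than the table name.

```sql
-- Statistics gathered by ANALYZE for the expression box(airport_coord).
-- Entries for expression indexes appear in pg_stats with tablename = index name.
select attname, null_frac, avg_width, n_distinct,
       most_common_vals, histogram_bounds
from pg_stats
where tablename = 'flights_bi_box_idx';
```

If the value-distribution columns (most_common_vals, histogram_bounds) come back empty, the planner cannot estimate the selectivity of <@ from the data. The estimate of 30517 rows in the plans above is 0.1% of the 30517076 rows in the table, which is consistent with a fixed default guess.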

Alas. But there are no complaints about the index itself: it does work, and works fine:

demo=# explain (costs off,analyze) select * from flights_bi where box(airport_coord) <@ box '120,40,140,50';
                                   QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=158.142..315.445 rows=781790 loops=1)
   Recheck Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
   Rows Removed by Index Recheck: 70726
   Heap Blocks: lossy=14772
   -> Bitmap Index Scan on flights_bi_box_idx (actual time=158.083..158.083 rows=147720 loops=1)
         Index Cond: (box(airport_coord) <@ '(140,50),(120,40)'::box)
 Planning time: 0.137 ms
 Execution time: 340.593 ms

The conclusion is this: PostGIS is needed if anything nontrivial is required of geometry. At the very least, it can gather statistics.

Internals

The standard extension "pageinspect" enables us to look inside a BRIN index.
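If the extension is not yet installed in the database, it has to be created first. A minimal sketch, assuming sufficient privileges (pageinspect functions are normally restricted to superusers):

```sql
-- Installs the page-inspection functions used below
-- (get_raw_page, brin_metapage_info, brin_revmap_data, brin_page_items).
create extension if not exists pageinspect;
```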

First, the metainformation tells us the size of a range and how many pages are allocated for "revmap":

demo=# select * from brin_metapage_info(get_raw_page('flights_bi_scheduled_time_idx',0));
   magic    | version | pagesperrange | lastrevmappage
------------+---------+---------------+----------------
 0xA8109CFA |       1 |           128 |              3
(1 row)

Pages 1 through 3 here are allocated for "revmap", while the rest contain summary data. From "revmap" we can get references to the summary data for each range. For example, the information on the first range, which covers the first 128 pages, is located here:

demo=# select * from brin_revmap_data(get_raw_page('flights_bi_scheduled_time_idx',1)) limit 1;
  pages
---------
 (6,197)
(1 row)

And this is the summary data itself:


demo=# select allnulls, hasnulls, value
from brin_page_items(
  get_raw_page('flights_bi_scheduled_time_idx',6),
  'flights_bi_scheduled_time_idx'
)
where itemoffset = 197;
 allnulls | hasnulls |                       value
----------+----------+----------------------------------------------------
 f        | f        | {2016-08-15 02:45:00+03 .. 2016-08-15 17:15:00+03}
(1 row)

Next range:


demo=# select * from brin_revmap_data(get_raw_page('flights_bi_scheduled_time_idx',1)) offset 1 limit 1;
  pages
---------
 (6,198)
(1 row)

demo=# select allnulls, hasnulls, value
from brin_page_items(
  get_raw_page('flights_bi_scheduled_time_idx',6),
  'flights_bi_scheduled_time_idx'
)
where itemoffset = 198;
 allnulls | hasnulls |                       value
----------+----------+----------------------------------------------------
 f        | f        | {2016-08-15 06:00:00+03 .. 2016-08-15 18:55:00+03}
(1 row)

And so on.

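Instead of following the revmap pointers one by one, several summaries can be listed at once. This is a sketch, assuming page 6 is a regular (summary) page as in the output above; the blknum column returned by brin_page_items gives the first table block of the range that each entry describes.

```sql
-- List the first few range summaries stored on index page 6,
-- ordered by the starting block of the range they cover.
select itemoffset, blknum, allnulls, hasnulls, value
from brin_page_items(
       get_raw_page('flights_bi_scheduled_time_idx', 6),
       'flights_bi_scheduled_time_idx')
order by blknum
limit 5;
```

With a range size of 128 pages, consecutive ranges should show blknum values that are 128 apart.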

For "inclusion" classes, the "value" field will display something like

{(94.4005966186523,69.3110961914062),(77.6600036621,51.6693992614746) .. f .. f}

The first value is the bounding rectangle, and the "f" letters at the end indicate that there are no empty elements (the first one) and no unmergeable values (the second one). Actually, the only values that cannot be merged with each other are IPv4 and IPv6 addresses ("inet" data type).
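To look at such summaries for the index on box(airport_coord) built earlier, the same pageinspect functions can be combined. This is a sketch under two assumptions: the index is still named flights_bi_box_idx, and the page right after the last revmap page is a regular page holding summaries (brin_page_items raises an error otherwise).

```sql
-- Read lastrevmappage from the metapage, then list a few summaries
-- from the page that immediately follows the revmap.
select i.blknum, i.value
from brin_metapage_info(get_raw_page('flights_bi_box_idx', 0)) as m
cross join lateral brin_page_items(
       get_raw_page('flights_bi_box_idx', (m.lastrevmappage + 1)::int),
       'flights_bi_box_idx') as i
order by i.blknum
limit 3;
```

Because that index was created with pages_per_range=1, each row returned should describe a single table page.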

Properties

As a reminder, the queries for these properties have already been provided.

The following are the properties of the access method:

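A query of roughly the following shape, built on the pg_indexam_has_property function (available since PostgreSQL 9.6), reports them for brin. It is a sketch, not necessarily the exact query the author ran.

```sql
-- Access-method-level properties of BRIN.
select a.amname, p.name, pg_indexam_has_property(a.oid, p.name)
from pg_am a,
     unnest(array['can_order', 'can_unique', 'can_multi_col', 'can_exclude']) p(name)
where a.amname = 'brin'
order by p.name;
```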

Indexes can be created on several columns. In this case, separate summary statistics are gathered for each column, but they are stored together for each range. Of course, such an index makes sense only if one and the same range size suits all the columns.
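A multicolumn BRIN index could be declared as sketched below. The column scheduled_time comes from the index name used above; actual_time is an assumed second column chosen only for illustration.

```sql
-- actual_time is an assumption; substitute any column whose values follow
-- the physical row order about as well as scheduled_time does.
-- One pages_per_range setting applies to every column of the index.
create index on flights_bi using brin (scheduled_time, actual_time)
  with (pages_per_range = 128);
```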

The following index-layer properties are available:

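They can be checked with pg_index_has_property against any existing BRIN index. The sketch below assumes the flights_bi_scheduled_time_idx index inspected earlier.

```sql
-- Index-level properties of a particular BRIN index.
select p.name,
       pg_index_has_property('flights_bi_scheduled_time_idx'::regclass, p.name)
from unnest(array['clusterable', 'index_scan', 'bitmap_scan', 'backward_scan']) p(name);
```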

Evidently, only bitmap scan is supported.


However, the lack of clustering may seem confusing. Since a BRIN index is sensitive to the physical order of rows, it would seem logical to be able to cluster data according to the index. But this is not so. We can only create a "regular" index (B-tree or GiST, depending on the data type) and cluster according to it. By the way, would you really want to cluster a supposedly huge table, given the exclusive lock, the execution time, and the disk space consumed during rebuilding?
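For a table small enough to afford it, the workaround could look like the sketch below. The B-tree index name is hypothetical and introduced only for this example.

```sql
-- Hypothetical index name; any B-tree on the ordering column would do.
create index flights_bi_scheduled_time_btree on flights_bi (scheduled_time);

-- Rewrites the whole table in index order while holding an ACCESS EXCLUSIVE lock.
cluster flights_bi using flights_bi_scheduled_time_btree;

-- Refresh planner statistics after the rewrite.
analyze flights_bi;
```

CLUSTER is a one-time operation: rows inserted or updated afterwards are not kept in order, so the physical ordering that BRIN relies on can degrade again over time.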

The following are the column-layer properties:

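They can be queried with pg_index_column_has_property. The sketch below again assumes the flights_bi_scheduled_time_idx index and looks at its first (and only) column.

```sql
-- Column-level properties for column 1 of a BRIN index.
select p.name,
       pg_index_column_has_property('flights_bi_scheduled_time_idx'::regclass, 1, p.name)
from unnest(array['asc', 'desc', 'nulls_first', 'nulls_last', 'orderable',
                  'distance_orderable', 'returnable', 'search_array',
                  'search_nulls']) p(name);
```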

The only available property is the ability to manipulate NULLs.


Read on.

Translated from: https://habr.com/en/company/postgrespro/blog/452900/