MySQL 8.0 新特性之统计直方图
概覽
MySQL8.0實現了統計直方圖。利用直方圖,用戶可以對一張表的一列做數據分布的統計,特別是針對沒有索引的字段。這可以幫助查詢優化器找到更優的執行計劃。統計直方圖的主要使用場景是用來計算字段選擇性,即過濾效率。
可以通過以下方式來創建或者刪除直方圖:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS; ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];buckets默認是100。統計直方圖的信息存儲在數據字典表"column_statistcs"中,可以通過視圖information_schema.COLUMN_STATISTICS訪問。直方圖以靈活的JSON的格式存儲。ANALYZE TABLE會基于表大小自動判斷是否要進行取樣操作。ANALYZE TABLE也會基于表中列的數據分布情況以及bucket的數量來決定是否要建立等寬直方圖(singleton)還是等高直方圖(equi-height)。
什么是直方圖
數據庫中,查詢優化器負責將SQL轉換成最有效的執行計劃。有時候,查詢優化器會走不到最優的執行計劃,導致花費了更多不必要的時間。造成這種情況的主要原因是,查詢優化器有時無法準確的知道以下幾個問題的答案:
?●??每個表有多少行??●??每一列有多少不同的值?
?●??每一列的數據分布情況?
舉例說明:一張簡單的表,兩個字段,一個字段是person_id,另一個字段是time_of_day,表示睡覺時間
CREATE TABLE bedtime ( person_id INT, time_of_day TIME);對于time_of_day列,大部分人上床時間會在晚上11:00左右。所以下面第一個查詢會比第二個查詢返回更多的行數:
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00" 2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"如果沒有統計數據,優化器會假設time_of_day的值是均勻分配的,即一個人的上床時間在下午3點和晚上11點的概率差不多。如何才能使查詢優化器知道數據的分布情況?一個解決方法就是在列上建立統計直方圖。
直方圖能近似獲得一列的數據分布情況,從而讓數據庫知道它含有哪些數據。直方圖有多種形式,MySQL支持了兩種:等寬直方圖(singleton)、等高直方圖(equi-height)。直方圖的共同點是,它們都將數據分到了一系列的buckets中去。MySQL會自動將數據劃到不同的buckets中,也會自動決定創建哪種類型的直方圖。
如何創建和刪除統計直方圖
為了管理統計直方圖,ANALYZE TABLE命令新增了兩個子句:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS; ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];第一個表示一次可以為一個或多個列創建統計直方圖:
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | +----------------+-----------+----------+---------------------------------------------------+ 1 row in set (0.27 sec) mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | | sakila.payment | histogram | status | Histogram statistics created for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+buckets的值必須指定,可以設置為1到1024,默認值是100。
對于不同的數據集合,buckets的值取決于以下幾個因素:
?●??這列有多少不同的值?●??數據的分布情況
?●??需要多高的準確性
但是,某些buckets的值能提升的關于數據分布情況的準確性相當低。所以,建議的做法是,開始的時候將buckets的值設的低一點,比如32,然后如果沒有滿足期望,再往上增大。
上面這個例子中,我們對于amount列建立了兩次直方圖。第一個語句,建立了一個新的直方圖;第二個語句,amount列的直方圖被重寫了。
如果需要刪除已經創建的直方圖,用DROP HISTOGRAM就可以實現:
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics removed for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+UPDATE HISTOGRAM可以一次性為多個列創建直方圖。如果命令中間寫錯,ANALYZE TABLE仍然會起作用。比如,你指定了三列,但第二列不存在。MySQL仍然會為第一列和第三列創建直方圖。
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_month'. | | tpcds.customer | histogram | Error | The column 'c_foobar' does not exist. | +----------------+-----------+----------+----------------------------------------------------------+ 3 rows in set (0.15 sec)數據庫內部發生了什么
當你讀過MySQL手冊,你可能已經注意到新的系統變量histogram_generation_max_mem_size。當用戶建立統計直方圖,這個值是用來控制大約多少內存能允許被使用。那么,為什么要控制這個呢?
當你在建立直方圖的時候,MySQL server會將所有數據讀到內存中,然后在內存中進行操作,包括排序。如果對一個很大的表建立直方圖,可能會有風險將幾百M的數據都讀到內存中,但這是不明智的。為了規避這個風險,MySQL會根據給定的histogram_generation_max_mem_size的值計算該將多少行數據讀到內存中。如果根據當前histogram_generation_max_mem_size的限制,MySQL認為只能讀一部分數據,那么MySQL會進行取樣。通過“sampling-rate”屬性,可以觀察到取樣比率。
mysql> SET histogram_generation_max_mem_size = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS; +----------------+-----------+----------+------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+------------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_country'. | +----------------+-----------+----------+------------------------------------------------------------+ 1 row in set (0.22 sec) mysql> SELECT histogram->>'$."sampling-rate"' -> FROM information_schema.column_statistics -> WHERE table_name = "customer" -> AND column_name = "c_birth_country"; +---------------------------------+ | histogram->>'$."sampling-rate"' | +---------------------------------+ | 0.048743243211626014 | +---------------------------------+ 1 row in set (0.00 sec)優化器創建了一個直方圖,大約讀了c_birth_country列4.8%的數據。取樣是不確定的,因此意義不大。同樣的數據,同樣的兩條語句‘‘ANALYZE TABLE tbl UPDATE HISTOGRAM …’’,如果用了取樣,得到的直方圖可能就不一樣。
查詢案例
統計直方圖可以帶來些什么?我們可以看個例子,這個例子中用了直方圖,在執行時間上會有很大的不同。
環境:
?●??TPC-DS Benchmark with scale factor of 1?●??Intel Core i7-4770
?●??Debian Stretch
?●??MySQL 8.0 RC1
?●??innodb_buffer_pool_size = 2G
?●??optimizer_switch = "condition_fanout_filter=on"
Query 90
查詢如下:上午售賣的數量與晚上售賣的數量的比率。
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio -> FROM (SELECT COUNT(*) amc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 9 AND 9 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) at, -> (SELECT COUNT(*) pmc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 15 AND 15 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt -> ORDER BY am_pm_ratio -> LIMIT 100; +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (1.48 sec)可以看到,查詢花費了1.5秒左右。看起來不算多,但是通過在一列上建立直方圖,可以讓執行速度快三倍。
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.web_page | histogram | status | Histogram statistics created for column 'wp_char_count'. | +----------------+-----------+----------+----------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (0.50 sec)通過這個直方圖,查詢花費了0.5秒左右。原因呢?主要的原因是,查詢語句中的謂詞“web_page.wp_char_count BETWEEN 5000 AND 5200”。沒有直方圖的時候,優化器會假設web_page表中符合謂詞“web_page.wp_char_count BETWEEN 5000 AND 5200”的數據占到總數據11.11%左右。但,這是錯誤的。用下面的查詢語句,可以看到實際上滿足條件的數據只有1.6%。
mysql> SELECT -> (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200) -> / -> (SELECT COUNT(*) FROM web_page) AS ratio; +--------+ | ratio | +--------+ | 0.0167 | +--------+ 1 row in set (0.00 sec)通過直方圖,優化器會知道這個信息,并且更早進行表join,因此執行時間快了三倍。
Query 61
查詢如下:在給定的年份和月份,有和沒有廣告宣傳的情況下貨物的售賣比率。
mysql> SELECT promotions, -> total, -> CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 -> FROM (SELECT SUM(ss_ext_sales_price) promotions -> FROM store_sales, -> store, -> promotion, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_promo_sk = p_promo_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND ( p_channel_dmail = 'Y' -> OR p_channel_email = 'Y' -> OR p_channel_tv = 'Y' ) -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) promotional_sales, -> (SELECT SUM(ss_ext_sales_price) total -> FROM store_sales, -> store, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) all_sales -> ORDER BY promotions, -> total -> LIMIT 100; +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (2.78 sec)可以看到,查詢花費了2.8秒左右。但是,查詢優化器不知道s_gmt_offset列只有一個不同的值。沒有統計數據的情況下,優化器會用所謂的“hard-coded guesstimates”,會假設10%的數據符合條件“ca_gmt_offset = -5“。如果在這個列上增加一個直方圖,優化器會知道所有的數據都符合條件,因此會走一個更好的執行計劃。
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS; +-------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+-----------+----------+---------------------------------------------------------+ | tpcds.store | histogram | status | Histogram statistics created for column 's_gmt_offset'. | +-------------+-----------+----------+---------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.37 sec)有了直方圖,查詢花了不到1.4秒,差不多提升了2倍。原因是:
?●??第一個執行計劃,優化器選擇了第一個派生表在store表上做了全表掃描,然后對表item, store_sales, date_dim, customer,customer_address分別做了主鍵查找。?●??但是,當MySQL意識到store表會比它猜測的返回更多的數據時,優化器會在item表上做全表掃描,然后對store_sales, store, date_dim, customer,customer_address 分別做主鍵查找。
為什么不用索引?
索引往往也能做上述工作,比如:
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset); Query OK, 0 rows affected (0.53 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.41 sec)但是,用直方圖而不是索引有以下兩個原因:
?●??維護一個索引有代價。每一次的insert、update、delete都會需要更新索引,會對性能有一定的影響。而直方圖一次創建永不更新,除非明確去更新它。所以不會影響insert、update、delete的性能。?●??如果有索引,優化器用使用index dives技術來估算符合條件范圍的記錄數量。這種方式也是有代價的,特別是查詢語句條件中有很長的IN列表。直方圖相對而言代價小,因此可能更合適。
檢索統計直方圖
統計直方圖以JSON的形式存在數據字典中。可以用內建的JSON函數built-in JSON functions從直方圖獲取一些信息。舉例來說,如果需要知道amount列的直方圖的創建或者更新時間,可以用JSON unquoting extraction operator來獲取信息:
mysql> SELECT -> HISTOGRAM->>'$."last-updated"' AS last_updated -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila" -> AND TABLE_NAME = "payment" -> AND COLUMN_NAME = "amount"; +----------------------------+ | last_updated | +----------------------------+ | 2017-09-15 11:54:25.000000 | +----------------------------+如果要查找實際有多少個buckets,以及用analyze table時指定了多少個buckets,可以如下:
mysql> SELECT -> TABLE_NAME, -> COLUMN_NAME, -> HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified, -> JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila"; +------------+--------------+-----------------------+---------------------+ | TABLE_NAME | COLUMN_NAME | num_buckets_specified | num_buckets_created | +------------+--------------+-----------------------+---------------------+ | payment | amount | 32 | 19 | | payment | payment_date | 32 | 32 | +------------+--------------+-----------------------+---------------------+經測試,num_buckets_created與字段的distinct值很接近,近似相等;但是num_buckets_created不會大于num_buckets_specified。如果num_buckets_created與num_buckets_specified相等,那么存在可能,在創建直方圖的時候指定的buckets不夠多,那么此時可以通過增加buckets的數量,來提高直方圖的準確性。
buckets可以設置為1到1024
優化器trace
如果你想要知道直方圖做了什么,最簡單的方式就是看一下執行計劃:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 11.11 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS; +----------------+-----------+----------+--------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+--------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | +----------------+-----------+----------+--------------------------------------------------------+ 1 row in set (0.10 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 32.12 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)可以看到filtered列,從默認的11.11%變成了更精確的32.12%。但是,如果有多個條件,有些有直方圖,有些沒有,就比較難判斷優化器做了什么改進:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 6.38 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)如果想要知道更多關于直方圖統計的細節,可以使用trace:
mysql> SET OPTIMIZER_TRACE = "enabled=on"; Query OK, 0 rows affected (0.00 sec) mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE; +----------------------------------------------------------------------------------------+ | JSON_EXTRACT(TRACE, "$**.filtering_effect") | +----------------------------------------------------------------------------------------+ | [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]] | +----------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)這里用了JSON_EXTRACT從trace里取出相關的部分。對于每個條件,直方圖被使用的話,就會看到估算過的字段的選擇性。在這個例子里,通過直方圖,對“c_birth_day <= 20”條件,估算出63.76%的數據滿足條件。事實上,與實際的數據分布情況基本一致:
mysql> SELECT -> (SELECT count(*) FROM customer WHERE c_birth_day <= 20) -> / -> (SELECT COUNT(*) FROM customer) AS ratio; +--------+ | ratio | +--------+ | 0.6376 | +--------+ 1 row in set (0.03 sec)原文發布時間為:2018-09-18 本文作者:韓杰·沃趣科技 本文來自云棲社區合作伙伴“老葉茶館”,了解相關信息可以關注“老葉茶館”。
總結
以上是生活随笔為你收集整理的MySQL 8.0 新特性之统计直方图的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 03-高级选择器,属性选择器,伪类选择器
- 下一篇: 每天坚持一个CSS——社会人