SSB数据集导入ClickHouse
目錄
- 1.SSB模型介紹
- 2.安裝ssb-dbgen
- 3.在ClickHouse中創建表
- 4.向ClickHouse中導入數據
- 5.將star schema轉換為flat schema
- 6.查詢
- 7.注意
1.SSB模型介紹
SSB(Star Schema Benchmark)是麻省州立大學波士頓校區的研究人員定義的基于現實商業應用的數據模型,用來評價決策支持技術方面應用的性能。本文用它來測試云原生ClickHouse性能。
SSB基準測試包括:
1個事實表:lineorder
4個維度表:customer,part,date,supplier
13條標準SQL查詢測試語句:統計查詢、多表關聯、sum、復雜條件、group by、order by等組合方式。
2.安裝ssb-dbgen
$ git clone http://github.com/vadimtk/ssb-dbgen.git $ cd ssb-dbgen $ make$ ./dbgen -s 10 -T c $ ./dbgen -s 10 -T l (數據量較大,謹慎設置-s參數) $ ./dbgen -s 1000 -T p $ ./dbgen -s 1000 -T s $ ./dbgen -s 10000 -T d3.在ClickHouse中創建表
首先解決ClickHouse建表不支持換行問題:
進入clickhouse-client時加一個-m的參數即可。
注意:創建表時要選擇集群,每次連接ClickHouse時都會隨機進入一個Pod,數據庫表如果沒建在集群上,建表語句不會自動同步到其他Pod上。退出后pod不會自動銷毀,除非手動刪除。
#查看所有的pod: kubectl get pod -n ckk8s #K8s 進入特定的pod: kubectl exec -it -n ckk8s podName /bin/bash創建表:
create database ssb on cluster cluster1; use ssb;CREATE TABLE customer on cluster cluster1 (C_CUSTKEY UInt32,C_NAME String,C_ADDRESS String,C_CITY LowCardinality(String),C_NATION LowCardinality(String),C_REGION LowCardinality(String),C_PHONE String,C_MKTSEGMENT LowCardinality(String) ) ENGINE = MergeTree ORDER BY (C_CUSTKEY);CREATE TABLE lineorder on cluster cluster1 (LO_ORDERKEY UInt32,LO_LINENUMBER UInt8,LO_CUSTKEY UInt32,LO_PARTKEY UInt32,LO_SUPPKEY UInt32,LO_ORDERDATE Date,LO_ORDERPRIORITY LowCardinality(String),LO_SHIPPRIORITY UInt8,LO_QUANTITY UInt8,LO_EXTENDEDPRICE UInt32,LO_ORDTOTALPRICE UInt32,LO_DISCOUNT UInt8,LO_REVENUE UInt32,LO_SUPPLYCOST UInt32,LO_TAX UInt8,LO_COMMITDATE Date,LO_SHIPMODE LowCardinality(String) ) ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);CREATE TABLE part on cluster cluster1 (P_PARTKEY UInt32,P_NAME String,P_MFGR LowCardinality(String),P_CATEGORY LowCardinality(String),P_BRAND LowCardinality(String),P_COLOR LowCardinality(String),P_TYPE LowCardinality(String),P_SIZE UInt8,P_CONTAINER LowCardinality(String) ) ENGINE = MergeTree ORDER BY P_PARTKEY;CREATE TABLE supplier on cluster cluster1 (S_SUPPKEY UInt32,S_NAME String,S_ADDRESS String,S_CITY LowCardinality(String),S_NATION LowCardinality(String),S_REGION LowCardinality(String),S_PHONE String ) ENGINE = MergeTree ORDER BY S_SUPPKEY;CREATE TABLE date on cluster cluster1 (D_DATEKEY date32,D_DATE String,D_DAYOFWEEK String,D_MONTH String,D_YEAR UInt32,D_YEARMONTHNUM UInt32,D_YEARMONTH String,D_DAYNUMINWEEK UInt8,D_DAYNUMINMONTH UInt8,D_DAYNUMINYEAR UInt32,D_MONTHNUMINYEAR UInt8,D_WEEKNUMINYEAR UInt8,D_SELLINGSEASON String,D_LASTDAYINWEEKFL String,D_LASTDAYINMONTHFL String,D_HOLIDAYFL String,D_WEEKDAYFL String) ENGINE = MergeTree ORDER BY D_DATEKEY;4.向ClickHouse中導入數據
$ clickhouse-client -m -h 10.43.237.127 -u clickhouse_operator --password clickhouse_operator_password --query "INSERT INTO ssb.customer FORMAT CSV" < customer.tbl $ clickhouse-client -m -h 10.43.237.127 -u clickhouse_operator --password clickhouse_operator_password --query "INSERT INTO ssb.part FORMAT CSV" < part.tbl $ clickhouse-client -m -h 10.43.237.127 -u clickhouse_operator --password clickhouse_operator_password --query "INSERT INTO ssb.supplier FORMAT CSV" < supplier.tbl $ clickhouse-client -m -h 10.43.237.127 -u clickhouse_operator --password clickhouse_operator_password --query "INSERT INTO ssb.lineorder FORMAT CSV" < lineorder.tbl文件大小:
[root@p64001v data]# ll 總用量 10404124 -rw-r--r-- 1 root root 3345252364 10月 21 20:11 customer.tbl -rw-r--r-- 1 root root 6910512767 10月 21 20:12 lineorder.tbl -rw-r--r-- 1 root root 201396298 10月 21 20:12 part.tbl -rw-r--r-- 1 root root 196652980 10月 21 20:12 supplier.tbl [root@p64001v data]# ls -lh 總用量 10G -rw-r--r-- 1 root root 3.2G 10月 21 20:11 customer.tbl -rw-r--r-- 1 root root 6.5G 10月 21 20:12 lineorder.tbl -rw-r--r-- 1 root root 193M 10月 21 20:12 part.tbl -rw-r--r-- 1 root root 188M 10月 21 20:12 supplier.tbl數據量(條):
customer:30000000 lineorder:59986052 part:2000000 supplier:20000005.將star schema轉換為flat schema
SET max_memory_usage = 20000000000;CREATE TABLE lineorder_flat on cluster cluster1 ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY) AS SELECTl.LO_ORDERKEY AS LO_ORDERKEY,l.LO_LINENUMBER AS LO_LINENUMBER,l.LO_CUSTKEY AS LO_CUSTKEY,l.LO_PARTKEY AS LO_PARTKEY,l.LO_SUPPKEY AS LO_SUPPKEY,l.LO_ORDERDATE AS LO_ORDERDATE,l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,l.LO_QUANTITY AS LO_QUANTITY,l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,l.LO_DISCOUNT AS LO_DISCOUNT,l.LO_REVENUE AS LO_REVENUE,l.LO_SUPPLYCOST AS LO_SUPPLYCOST,l.LO_TAX AS LO_TAX,l.LO_COMMITDATE AS LO_COMMITDATE,l.LO_SHIPMODE AS LO_SHIPMODE,c.C_NAME AS C_NAME,c.C_ADDRESS AS C_ADDRESS,c.C_CITY AS C_CITY,c.C_NATION AS C_NATION,c.C_REGION AS C_REGION,c.C_PHONE AS C_PHONE,c.C_MKTSEGMENT AS C_MKTSEGMENT,s.S_NAME AS S_NAME,s.S_ADDRESS AS S_ADDRESS,s.S_CITY AS S_CITY,s.S_NATION AS S_NATION,s.S_REGION AS S_REGION,s.S_PHONE AS S_PHONE,p.P_NAME AS P_NAME,p.P_MFGR AS P_MFGR,p.P_CATEGORY AS P_CATEGORY,p.P_BRAND AS P_BRAND,p.P_COLOR AS P_COLOR,p.P_TYPE AS P_TYPE,p.P_SIZE AS P_SIZE,p.P_CONTAINER AS P_CONTAINER FROM ssb.lineorder AS l INNER JOIN ssb.customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY INNER JOIN ssb.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY INNER JOIN ssb.part AS p ON p.P_PARTKEY = l.LO_PARTKEY;6.查詢
SELECTtoYear(LO_ORDERDATE) AS year,S_NATION,P_CATEGORY,sum(LO_REVENUE - LO_SUPPLYCOST) AS profit FROM lineorder_flat WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2') GROUP BYyear,S_NATION,P_CATEGORY ORDER BYyear ASC,S_NATION ASC,P_CATEGORY ASC;7.注意
由于我的CK是部署在K8S上,所以在CK上創建表最好是創建分布式表和本地表兩張表。如果只創建本地表,查詢本地表時數據會不全。
- 創建本地表:
其中if not exists fy.test_table與/{shard}/fy.test_table中的fy.test_table建議保持一致。其實沒有必要一致,但/{shard}/fy.test_table處的必須保證每個表都不一樣,因此將此處的值和表名一樣即可。
- 創建分布式表:
Distributed中的四項分別代表集群名,數據庫名,本地表名,分片方法。
- 查詢:
我這里是在集群內節點進行的。一般使用分布式表進行查詢,使用本地表查詢只返回當前節點的數據,使用分布式表查詢返回所有節點上符合要求的數據。
- 刪除:
clickhouse一般不刪除數據,刪除的成本太高,都是直接增加數據。一般也不更新數據,alter table update/delete不支持分布式DDL,在分布式環境中需要手動在每個節點上更新/刪除數據。
總結
以上是生活随笔為你收集整理的SSB数据集导入ClickHouse的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 【Python报错】MemoryErro
- 下一篇: 5G NR SSB 学习