数据库内核月报

背景

车联网场景，一个行程由若干个点组成，业务需要按行程查询，按。

但是问题来了：

一个行程包含多个点，那么一次查询意味着需要查询并返回多个点（多条记录）。

由于有多个设备（例如汽车），大概率可能同时活跃，同时上传点，因此以堆表存储的话，一个行程中的多条记录，实际上在数据库的表里面，存储在多个BLOCK里面。

如上分析，性能问题：IO放大。

如何优化？

1、行程记录，按行聚集存储。

类似cluster操作。

2、行程记录，合并到单条，聚集存储。

类似将一个行程多条记录聚合。

例子

下面分别测试几种优化方法带来的性能优化效果。

1、cluster

2、array 聚合带压缩

3、array 聚合不带压缩

4、jsonb 聚合带压缩

5、jsonb 聚合不带压缩

6、text 聚合带压缩

7、text 聚合不带压缩

1 原始状态

create unlogged table t_sensor(id serial8 primary key, sid int8, att text, crt_time timestamp);  
create index idx_t_sensor_1 on t_sensor (sid, crt_time);  

vi test.sql  
  
\set sid random(1,10000)  
insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());  
  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 2000000  

查看一个行程，需要访问11227个数据块。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=1 order by crt_time;  
                                                                 QUERY PLAN                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_t_sensor_1 on public.t_sensor  (cost=0.49..253.36 rows=11121 width=57) (actual time=0.020..47.591 rows=11189 loops=1)  
   Output: id, sid, att, crt_time  
   Index Cond: (t_sensor.sid = 1)  
   Buffers: shared hit=3406 read=7821  
 Planning Time: 0.092 ms  
 Execution Time: 48.303 ms  
(6 rows)  
  
postgres=# select 119*11227*8/1024.0;  
      ?column?        
--------------------  
 10437.601562500000  
(1 row)  

压测性能

vi test.sql  
  
\set sid random(1,10000)  
select * from t_sensor where sid=:sid order by crt_time ;  
  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 120

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 120 s  
number of transactions actually processed: 14307  
latency average = 470.175 ms  
latency stddev = 43.500 ms  
tps = 119.037842 (including connections establishing)  
tps = 119.098221 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.003  \set sid random(1,10000)  
       470.178  select * from t_sensor where sid=:sid order by crt_time ;  

2 cluster

按行程，时间索引聚集。一次查询返回一条聚合后的记录。

postgres=# cluster t_sensor USING idx_t_sensor_1 ;

查询一个行程，扫描174个数据块。IO 骤降。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=1 order by crt_time;  
                                                                QUERY PLAN                                                                   
-------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_t_sensor_1 on public.t_sensor  (cost=0.49..253.39 rows=11149 width=57) (actual time=0.018..2.212 rows=11189 loops=1)  
   Output: id, sid, att, crt_time  
   Index Cond: (t_sensor.sid = 1)  
   Buffers: shared hit=174  
 Planning Time: 0.094 ms  
 Execution Time: 2.816 ms  
(6 rows)  

压测性能，见末尾。

3 array 带压缩

create unlogged table t_sensor_agg(sid int8 primary key, agg t_sensor[]);  
insert into t_sensor_agg select sid,array_agg(t_sensor order by crt_time) from t_sensor group by sid;  

瓶颈：array类型的INPUT OUTPUT接口，以及压缩解压缩接口

Samples: 3M of event 'cpu-clock', Event count (approx.): 559749813667                   
Overhead  Shared Object                           Symbol                                
59%  postgres                                [.] array_out                         
70%  postgres                                [.] record_out                        
74%  postgres                                [.] pglz_decompress                   
78%  libc-2.17.so                            [.] __memcpy_ssse3_back               
72%  postgres                                [.] pg_ltostr_zeropad                 
34%  [kernel]                                [k] run_timer_softirq                 
23%  postgres                                [.] pg_lltoa                          
13%  postgres                                [.] 0x000000000035c350                
76%  postgres                                [.] heap_deform_tuple                 
49%  libc-2.17.so                            [.] __strlen_sse2_pminub              
36%  postgres                                [.] appendStringInfoChar              
36%  [kernel]                                [k] copy_user_enhanced_fast_string    
29%  postgres                                [.] 0x000000000035c36c                
28%  postgres                                [.] 0x000000000035c362                
17%  postgres                                [.] FunctionCall1Coll                 
92%  postgres                                [.] hash_search_with_hash_value       
86%  [kernel]                                [k] _raw_spin_unlock_irqrestore       
84%  postgres                                [.] j2date                            
82%  postgres                                [.] 0x000000000035c357                
76%  postgres                                [.] palloc                            
76%  postgres                                [.] lookup_type_cache                 
67%  postgres                                [.] 0x000000000035c360                
66%  postgres                                [.] timestamp2tm                      
64%  [kernel]                                [k] rcu_process_callbacks             
64%  [kernel]                                [k] __do_softirq                     

vi test.sql  
  
\set sid random(1,10000)  
select * from t_sensor_agg4 where sid=:sid ;  

4 array 不带压缩

create unlogged table t_sensor_agg1(sid int8 primary key, agg t_sensor[]);  
alter table t_sensor_agg1 alter column agg set storage external;  
insert into t_sensor_agg1 select sid,array_agg(t_sensor order by crt_time) from t_sensor group by sid;  

瓶颈：array类型的INPUT OUTPUT接口

Samples: 1M of event 'cpu-clock', Event count (approx.): 310127790569                  
Overhead  Shared Object                           Symbol                               
06%  postgres                                [.] array_out                        
44%  postgres                                [.] record_out                       
20%  libc-2.17.so                            [.] __memcpy_ssse3_back              
86%  [kernel]                                [k] run_timer_softirq                
75%  postgres                                [.] pg_ltostr_zeropad                
65%  postgres                                [.] heap_deform_tuple                
28%  postgres                                [.] pg_lltoa                         
14%  postgres                                [.] 0x000000000035c350               
87%  [kernel]                                [k] copy_user_enhanced_fast_string   
52%  libc-2.17.so                            [.] __strlen_sse2_pminub             
47%  postgres                                [.] appendStringInfoChar             
32%  postgres                                [.] 0x000000000035c36c               
30%  postgres                                [.] 0x000000000035c362               
20%  postgres                                [.] FunctionCall1Coll                
11%  postgres                                [.] hash_search_with_hash_value      
87%  postgres                                [.] j2date                           
85%  postgres                                [.] 0x000000000035c357               
81%  [kernel]                                [k] _raw_spin_unlock_irqrestore      
76%  postgres                                [.] lookup_type_cache                
75%  postgres                                [.] 0x000000000046d33b               
74%  postgres                                [.] palloc                           
72%  [kernel]                                [k] rcu_process_callbacks            
68%  postgres                                [.] timestamp2tm                     
68%  postgres                                [.] pfree                         

5 jsonb 带压缩

create unlogged table t_sensor_agg2(sid int8 primary key, agg jsonb);  
insert into t_sensor_agg2 select sid,jsonb_agg(t_sensor order by crt_time) from t_sensor group by sid;  

6 jsonb 不带压缩

create unlogged table t_sensor_agg3(sid int8 primary key, agg jsonb);  
alter table t_sensor_agg3 alter column agg set storage external;  
insert into t_sensor_agg3 select sid,jsonb_agg(t_sensor order by crt_time) from t_sensor group by sid;  

7 text 带压缩

create unlogged table t_sensor_agg4(sid int8 primary key, agg text);  
insert into t_sensor_agg4 select sid,string_agg(t_sensor::text, '|' order by crt_time) from t_sensor group by sid;  

8 text 不带压缩

create unlogged table t_sensor_agg5(sid int8 primary key, agg text);  
alter table t_sensor_agg5 alter column agg set storage external;  
insert into t_sensor_agg5 select sid,string_agg(t_sensor::text, '|' order by crt_time) from t_sensor group by sid;  

9 index only scan 类似聚集表效果

所有内容作为INDEX的KEY，类似聚集表的效果(相邻内容在同一个INDEX PAGE里面)。查询时走INDEX ONLY SCAN扫描方法，扫描的BLOCK最少。

注意：btree 索引内容不能超过1/3 PAGE (因为BTREE是双向链表，一个PAGE至少要有一条有效记录，所以有这个限制。)。

写入数据

create table t_sensor (id serial8 primary key, sid int8, att text, crt_time timestamp);  
  
create index idx_t_sensor_1 on t_sensor (sid, crt_time, att, id);  
  
  
vi test.sql  
\set sid random(1,10000)    
insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());    
  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 2000000  
  
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 50  
number of threads: 50  
number of transactions per client: 2000000  
number of transactions actually processed: 100000000/100000000  
latency average = 0.193 ms  
latency stddev = 0.461 ms  
tps = 257995.418591 (including connections establishing)  
tps = 258024.212148 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.001  \set sid random(1,10000)    
         0.192  insert into t_sensor(sid,att,crt_time) values (:sid, md5(random()::text), now());  

生成VM文件（autovacuum触发时会自动生成，但是为了立马看效果，手工执行一下。）

vacuum analyze t_sensor;

INDEX ONLY SCAN, IO减少效果如下：

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_sensor where sid=2 order by crt_time;  
                                                                 QUERY PLAN                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx_t_sensor_1 on public.t_sensor  (cost=0.60..70.41 rows=9960 width=57) (actual time=0.019..2.109 rows=9978 loops=1)  
   Output: id, sid, att, crt_time  
   Index Cond: (t_sensor.sid = 2)  
   Heap Fetches: 0  
   Buffers: shared hit=235  
 Planning Time: 0.090 ms  
 Execution Time: 2.652 ms  
(7 rows)  

查询性能：

vi test.sql  
  
\set sid random(1,10000)  
select * from t_sensor where sid=:sid order by crt_time;  
  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120  
  
  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 28  
number of threads: 28  
duration: 120 s  
number of transactions actually processed: 283638  
latency average = 11.844 ms  
latency stddev = 1.931 ms  
tps = 2363.410561 (including connections establishing)  
tps = 2363.913145 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.002  \set sid random(1,10000)    
        11.842  select * from t_sensor where sid=:sid order by crt_time;  

小结

目标数据分散在多个BLOCK中，引起IO放大的问题，通过聚集存储，或者通过聚合存储，可以很好的解决这个问题。

聚合后，另一个瓶颈则是聚合后的类型（array,jsonb,text）的IN OUT接口。

/	表存储	行程查询 qps
原始(IO 放大)	8880 MB	119
顺序(无IO 放大)	8880 MB	2057
index only scan(类似聚集表)(无IO 放大)	8880 MB	2363
聚合array(压缩)	4523 MB	2362
聚合array(不压缩)	8714 MB	2515
聚合json(压缩)	5052 MB	3102
聚合json(不压缩)	13 GB	3184
聚合text(压缩)	4969 MB	6057
聚合text(不压缩)	7692 MB	5997

从上面的测试，可以看到IN OUT函数接口的开销，text<jsonb<array(composite array)。

实际的优化例子，可参考末尾的几篇文章。例如：

1、按时间分区，旧的分区使用cluster，按行程整理数据，使用AB表切换，解决IO放大的问题。

2、异步聚合，将点数据准实时按行程ID，聚合到聚合后的行程表。

3、使用INDEX ONLY SCAN, 达到聚集表效果。对业务无任何侵入性。(例如按天分区，加全量(业务需要查询的字段)索引。)，相当于两倍存储空间(一份在堆表，一份在索引中)。

参考

《PostgreSQL IoT，车联网 - 实时轨迹、行程实践》

《PostgreSQL AB表切换最佳实践 - 提高切换成功率，杜绝雪崩 - 珍藏级》

《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》

数据库内核月报－ 2018 / 12

PgSQL · 应用案例 · PostgreSQL IoT，车联网 - 实时轨迹、行程实践

背景

例子