Author: digoal
PostgreSQL 可靠性与大多数关系数据库一样,都是通过REDO来保障的。
群里有位童鞋问了一个问题,为什么PostgreSQL的REDO块大小默认是8K的,不是512字节。
这位童鞋提问的理由是,大多数的块设备扇区大小是512字节的,512字节可以保证原子写,而如果REDO的块大于512字节,可能会出现partial write。
那么PostgreSQL的redo(wal) 块大小设置为8KB时,靠谱吗?本文将给大家分析一下。
1. 当开启了易失缓存时,如果写数据的块大小大于磁盘原子写的大小(通常为512字节),掉电则可能出现partial write。
例如disk cache,没有掉电保护,而且操作系统的fsync接口不感知disk cache,如果你调用了fsync,即使返回成功,数据其实可能还在disk cache里面。
当发生掉电时,在disk cache里的数据会丢失掉,如果程序写一个8K的数据,因为磁盘的原子写小于8K,则可能出现8K里有些写成功了,有些没有写成功,即partial write。
(ps: 某些企业级SSD可以通过电容残余的电量,将DISK CACHE里的数据持久化下来,但是请不要相信所有磁盘都有这个功能)
2. 当开启了易失缓存时,如果写数据的块大小小于或等于磁盘原子写的大小(即”原子写”),掉电时也可能出现partial write。
对于MySQL来说,REDO的写为512字节的,其中包含12个字节的头信息,4个字节的校验信息。
这个怎么理解呢,为什么没有对齐则可能出现。
1. 前面提到了,如果没有对齐,并且开启了易失缓存,原子写是没有用的,同样会出现partial write。
2. 如果没有对齐,会造成写放大,本来写512字节的,磁盘上会造成写1024字节(将两个扇区数据读出来再与要写的数据合并, 分成两个扇区回写)。
1. 开启易失缓存时,原子写一样会丢失易失缓存中的数据。
2. 当未对齐时,原子写并不是真的原子写。
数据库只靠REDO的原子写,如果不考虑以上两个因素,起不到保证数据可靠性和一致性的作用。
1. shared buffer 中的dirty page在write前,必须要保证对应的redo已经持久化(指已经落到非易失存储介质)。
2. 在检查点后出现的脏页,必须要在redo中写dirty page的full page。
这2条保证的是数据文件的一致性。
3. 在不考虑standby的情况下,当设置为同步提交的事务在事务提交时,必须等待事务产生的REDO已持久化才返回(指已经落到非易失存储介质)。
参考
《PostgreSQL 9.6 同步多副本 与 remote_apply事务同步级别》
4. 当设置为异步提交的事务在事务提交时,不需要等待事务产生的REDO持久化。
由于有第一条的保护,所以即使使用异步事务,丢失REDO buffer中的数据后,也不会出现不一致(比如一半提交,一半未提交)的情况,仅仅丢失redo buffer中未提交的事务而已。
一致性由PostgreSQL MVCC的机制来保证,不会读到脏数据。
1. 在使用COW的文件系统(如btrfs, zfs)时,可以关闭full page write,因为这种文件系统可以保证不会出现partial write。
2. 对齐,可以避免写放大的问题。
3. 不要使用易失缓存,但是可以使用有掉电保护的易失缓存。
PostgreSQL认为系统提供的fsync调用是可靠的,即写到了持久化的存储。
如果连fsync都不可靠了,管它是不是原子写,都是不可靠的。
包括DirectIO在内(PostgreSQL支持REDO使用DirectIO),也无法感知disk cache,所以请慎重。
首先,前面已经分析了,原子写并不能抵御易失存储导致的丢数据。
1. PostgreSQL redo block是有checksum的,可以保证块的一致性,不会APPLY不一致的块。
2. 事务提交时,返回给用户前,一定会保证REDO已持久化。
所以用户收到反馈的事务,一定是持久化的,不可能存在partial write。
而没有收到反馈或未结束的事务,才有可能包含partial write,那么问题就简化了:
这些没有收到反馈或未结束的事务产生的REDO 出现partial write会不会导致数据不一致?
回答是不会,参考前面 “PostgreSQL如何保证数据库可靠性”,MVCC机制可以保证这些 。
数据库参数
wal_level = logical
便于观察,验证
产生测试数据
pgbench -i -s 100
模拟压力测试
pgbench -M prepared -n -r -P 2 -c 32 -j 32 -T 1000
观测到产生了一些XLOG,约200秒后,测试过程中强制停库,下次启动会进入恢复状态
pg_ctl stop -m immediate
记录接下来要纂改的REDO文件以及之前的文件最后的内容
纂改的前一个文件的末尾的一些内容,用于判断已持久化的记录
能看到几笔commit rec就行了
pg_xlogdump -b 0000000100000116000000F7 0000000100000116000000F7 | tail -n 20
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFA80, prev 116/F7FFFA50, desc: CLEAN remxid 772066680
blkref #0: rel 1663/13241/38254 fork main blk 90969
772208346已持久化
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208346, lsn: 116/F7FFFAC0, prev 116/F7FFFA80, desc: COMMIT 2016-10-11 15:04:16.395000 CST
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208353, lsn: 116/F7FFFAF0, prev 116/F7FFFAC0, desc: INSERT off 130
blkref #0: rel 1663/13241/38242 fork main blk 17723
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208368, lsn: 116/F7FFFB40, prev 116/F7FFFAF0, desc: HOT_UPDATE off 71 xmax 772208368 ; new off 76 xmax 0
blkref #0: rel 1663/13241/38254 fork main blk 90969
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFBE8, prev 116/F7FFFB40, desc: CLEAN remxid 772061924
blkref #0: rel 1663/13241/38254 fork main blk 37123
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208358, lsn: 116/F7FFFC28, prev 116/F7FFFBE8, desc: HOT_UPDATE off 22 xmax 772208358 ; new off 25 xmax 0
blkref #0: rel 1663/13241/38251 fork main blk 34
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208360, lsn: 116/F7FFFC78, prev 116/F7FFFC28, desc: HOT_UPDATE off 121 xmax 772208360 ; new off 123 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 124
772208344已持久化
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208344, lsn: 116/F7FFFCC8, prev 116/F7FFFC78, desc: COMMIT 2016-10-11 15:04:16.395018 CST
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208369, lsn: 116/F7FFFCF8, prev 116/F7FFFCC8, desc: HOT_UPDATE off 67 xmax 772208369 ; new off 73 xmax 0
blkref #0: rel 1663/13241/38254 fork main blk 37123
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208355, lsn: 116/F7FFFDA0, prev 116/F7FFFCF8, desc: HOT_UPDATE off 97 xmax 772208355 ; new off 110 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 988
772208351,772208352已持久化
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208352, lsn: 116/F7FFFE20, prev 116/F7FFFDF0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208354, lsn: 116/F7FFFE50, prev 116/F7FFFE20, desc: INSERT off 117
blkref #0: rel 1663/13241/38242 fork main blk 17727
rmgr: Heap len (rec/tot): 7/ 53, tx: 772208357, lsn: 116/F7FFFEA0, prev 116/F7FFFE50, desc: LOCK off 133: xid 772208357 LOCK_ONLY EXCL_LOCK
blkref #0: rel 1663/13241/38251 fork main blk 42
已持久化
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208353, lsn: 116/F7FFFED8, prev 116/F7FFFEA0, desc: COMMIT 2016-10-11 15:04:16.395037 CST
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208363, lsn: 116/F7FFFF08, prev 116/F7FFFED8, desc: HOT_UPDATE off 127 xmax 772208363 ; new off 186 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 79
已持久化
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208345, lsn: 116/F7FFFF58, prev 116/F7FFFF08, desc: COMMIT 2016-10-11 15:04:16.395040 CST
rmgr: Heap len (rec/tot): 7/ 53, tx: 772208349, lsn: 116/F7FFFF88, prev 116/F7FFFF58, desc: LOCK off 154: xid 772208349 LOCK_ONLY EXCL_LOCK
blkref #0: rel 1663/13241/38251 fork main blk 38
查看某事务的REDO
pg_xlogdump -x 772208351 0000000100000116000000F7 0000000100000116000000F7
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208351, lsn: 116/F7FFD3B8, prev 116/F7FFD378, desc: HOT_UPDATE off 8 xmax 772208351 ; new off 73 xmax 0, blkref #0: rel 1663/13241/38254 blk 69436
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208351, lsn: 116/F7FFE6A0, prev 116/F7FFE660, desc: HOT_UPDATE off 17 xmax 772208351 ; new off 40 xmax 0, blkref #0: rel 1663/13241/38245 blk 117
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208351, lsn: 116/F7FFF048, prev 116/F7FFEFA0, desc: HOT_UPDATE off 165 xmax 772208351 ; new off 166 xmax 0, blkref #0: rel 1663/13241/38251 blk 35
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208351, lsn: 116/F7FFF7D8, prev 116/F7FFF788, desc: INSERT off 66, blkref #0: rel 1663/13241/38242 blk 17736
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
被纂改的文件的头部的内容,用于判断未持久化的记录
这里显示的都是将要纂改掉,对PG来说就是未持久化的事务,数据库恢复后是不会显示的.
pg_xlogdump -b -n 20 0000000100000116000000F8 0000000100000116000000F9
纂改后,772208342这个事务将不可见
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208342, lsn: 116/F8000038, prev 116/F7FFFFC0, desc: COMMIT 2016-10-11 15:04:16.395055 CST
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208362, lsn: 116/F8000068, prev 116/F8000038, desc: HOT_UPDATE off 148 xmax 772208362 ; new off 154 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 90
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208365, lsn: 116/F80000B8, prev 116/F8000068, desc: HOT_UPDATE off 85 xmax 772208365 ; new off 89 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 68
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000108, prev 116/F80000B8, desc: CLEAN remxid 772208308
blkref #0: rel 1663/13241/38254 fork main blk 146480
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208349, lsn: 116/F8000148, prev 116/F8000108, desc: HOT_UPDATE off 154 xmax 772208349 ; new off 155 xmax 772208349
blkref #0: rel 1663/13241/38251 fork main blk 38
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208358, lsn: 116/F8000198, prev 116/F8000148, desc: INSERT off 101
blkref #0: rel 1663/13241/38242 fork main blk 17730
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208359, lsn: 116/F80001E8, prev 116/F8000198, desc: HOT_UPDATE off 78 xmax 772208359 ; new off 85 xmax 0
blkref #0: rel 1663/13241/38251 fork main blk 31
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208370, lsn: 116/F8000238, prev 116/F80001E8, desc: HOT_UPDATE off 25 xmax 772208370 ; new off 71 xmax 0
blkref #0: rel 1663/13241/38254 fork main blk 146480
纂改后,772208354这个事务将不可见
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208354, lsn: 116/F80002E0, prev 116/F8000238, desc: COMMIT 2016-10-11 15:04:16.395071 CST
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000310, prev 116/F80002E0, desc: CLEAN remxid 772112027
blkref #0: rel 1663/13241/38254 fork main blk 121847
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208355, lsn: 116/F8000350, prev 116/F8000310, desc: HOT_UPDATE off 82 xmax 772208355 ; new off 86 xmax 0
blkref #0: rel 1663/13241/38251 fork main blk 31
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208366, lsn: 116/F80003A0, prev 116/F8000350, desc: HOT_UPDATE off 73 xmax 772208366 ; new off 104 xmax 0
blkref #0: rel 1663/13241/38245 fork main blk 86
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80003F0, prev 116/F80003A0, desc: CLEAN remxid 772176420
blkref #0: rel 1663/13241/38254 fork main blk 162972
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208363, lsn: 116/F8000430, prev 116/F80003F0, desc: HOT_UPDATE off 23 xmax 772208363 ; new off 26 xmax 0
blkref #0: rel 1663/13241/38251 fork main blk 30
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208360, lsn: 116/F8000480, prev 116/F8000430, desc: HOT_UPDATE off 164 xmax 772208360 ; new off 167 xmax 0
blkref #0: rel 1663/13241/38251 fork main blk 35
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208371, lsn: 116/F80004D0, prev 116/F8000480, desc: HOT_UPDATE off 2 xmax 772208371 ; new off 72 xmax 0
blkref #0: rel 1663/13241/38254 fork main blk 121847
纂改后,772208358这个事务将不可见
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208358, lsn: 116/F8000578, prev 116/F80004D0, desc: COMMIT 2016-10-11 15:04:16.395090 CST
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80005A8, prev 116/F8000578, desc: CLEAN remxid 772172802
blkref #0: rel 1663/13241/38254 fork main blk 120028
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208372, lsn: 116/F80005E8, prev 116/F80005A8, desc: HOT_UPDATE off 57 xmax 772208372 ; new off 71 xmax 0
blkref #0: rel 1663/13241/38254 fork main blk 162972
纂改后,772208350这个事务将不可见
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208350, lsn: 116/F8000690, prev 116/F80005E8, desc: COMMIT 2016-10-11 15:04:16.395095 CST
...
纂改redo
cd $PGDATA/pg_xlog
找到最后两个xlog文件,纂改一下offset 100字节后的中间位置,模拟partial write。
dd if=/dev/zero of=./0000000100000116000000F8 bs=1 count=10000 skip=100
dd if=/dev/zero of=./0000000100000116000000F9 bs=1 count=10000 skip=100
启动数据库,进入恢复状态,当读到checksum不一致的block,停止继续往前,也就是说数据库恢复到这里截至。
未恢复的事务造成的变更,对用户不可见。
2016-10-11 15:10:49.909 CST,,,15039,,57fc9076.3abf,1,,2016-10-11 15:10:46 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,"PostmasterMain, postmaster.c:1223",""
由于强制停库,数据库进入恢复状态
2016-10-11 15:10:49.910 CST,,,15042,,57fc9079.3ac2,1,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was interrupted; last known up at 2016-10-11 15:03:14 CST",,,,,,,,"StartupXLOG, xlog.c:5934",""
2016-10-11 15:10:49.991 CST,,,15042,,57fc9079.3ac2,2,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6414",""
2016-10-11 15:10:49.992 CST,,,15042,,57fc9079.3ac2,3,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo starts at 116/9D8E4600",,,,,,,,"StartupXLOG, xlog.c:6669",""
读到被纂改的REDO时,停止恢复
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,4,,2016-10-11 15:10:49 CST,,0,LOG,00000,"invalid magic number 0000 in log segment 0000000100000116000000F8, offset 0",,,,,,,,"ReadRecord, xlog.c:3942",""
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,5,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo done at 116/F7FFFF88",,,,,,,,"StartupXLOG, xlog.c:6921",""
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,6,,2016-10-11 15:10:49 CST,,0,LOG,00000,"last completed transaction was at log time 2016-10-11 15:04:16.39504+08",,,,,,,,"StartupXLOG, xlog.c:6926",""
2016-10-11 15:11:21.216 CST,,,15042,,57fc9079.3ac2,7,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,"LogCheckpointStart, xlog.c:7949",""
2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,8,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint complete: wrote 215999 buffers (1.3%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=1.598 s, sync=0.405 s, total=2.006 s; sync files=20, longest=0.207 s, average=0.020 s; distance=1481838 kB, estimate=1481838 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8031",""
2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,9,,2016-10-11 15:10:49 CST,,0,LOG,00000,"MultiXact member wraparound protections are now enabled",,,,,,,,"SetOffsetVacuumLimit, multixact.c:2628",""
2016-10-11 15:11:23.405 CST,,,15039,,57fc9076.3abf,2,,2016-10-11 15:10:46 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,"reaper, postmaster.c:2792",""
2016-10-11 15:11:23.405 CST,,,15083,,57fc909b.3aeb,1,,2016-10-11 15:11:23 CST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,"AutoVacLauncherMain, autovacuum.c:416",""
验证
纂改前在REDO中显示提交的事务,验证确实已提交。
postgres=# select xmin,* from pgbench_history where xmin in (772208346,772208344,772208351,772208352,772208353,772208345);
xmin | tid | bid | aid | delta | mtime | filler
-----------+-----+-----+---------+-------+----------------------------+--------
772208345 | 109 | 76 | 96685 | 4792 | 2016-10-11 15:04:16.394519 |
772208353 | 657 | 1 | 7473886 | 1540 | 2016-10-11 15:04:16.394708 |
772208344 | 146 | 58 | 2671263 | -2297 | 2016-10-11 15:04:16.394504 |
772208352 | 55 | 57 | 9608997 | 2862 | 2016-10-11 15:04:16.39463 |
772208351 | 531 | 8 | 4235604 | 1582 | 2016-10-11 15:04:16.394601 |
772208346 | 105 | 83 | 5770382 | 590 | 2016-10-11 15:04:16.394542 |
(6 rows)
纂改后在REDO中显示已提交的事务,显示未提交,所以partial write没有影响数据库的一致性。
postgres=# select * from pgbench_history where xmin in (772208342,772208354,772208358,772208350);
tid | bid | aid | delta | mtime | filler
-----+-----+-----+-------+-------+--------
(0 rows)
通过检验。
./configure --with-wal-blocksize=?
Allowed values are 1,2,4,8,16,32,64.
如果每产生一笔redo都要fsync,性能就差了,所以FSYNC实际上是有调度的。
redo buffer的作用就是减少FSYNC的次数。
1. 当wal writer sleep超过设置的sleep时间(通常设置为10毫秒)时,触发fsync,将redo buffer中已写完整的BLOCK持久化到REDO FILE。
2. 当wal writer write(异步写)的字节数超过配置的阈值(wal_writer_flush_after)时,触发fsync,将redo buffer中已写完整的BLOCK持久化到REDO FILE。
3. 当事务结束时,检查wal write全局变量,LSN是否已FLUSH,如果没有落盘,则触发fsync。
4. 第三种情况,如果开启了分组提交,则多个正在提交的事务只会请求一次fsync。
5. 当redo 日志文件发生切换时,会触发fsync,确保文件持久化。
src/backend/postmaster/walwriter.c
* The WAL writer background process is new as of Postgres 8.3. It attempts
* to keep regular backends from having to write out (and fsync) WAL pages.
* Also, it guarantees that transaction commit records that weren't synced
* to disk immediately upon commit (ie, were "asynchronously committed")
* will reach disk within a knowable time --- which, as it happens, is at
* most three times the wal_writer_delay cycle time.
*
* Note that as with the bgwriter for shared buffers, regular backends are
* still empowered to issue WAL writes and fsyncs when the walwriter doesn't
* keep up. This means that the WALWriter is not an essential process and
* can shutdown quickly when requested.
*
* Because the walwriter's cycle is directly linked to the maximum delay
* before async-commit transactions are guaranteed committed, it's probably
* unwise to load additional functionality onto it. For instance, if you've
* got a yen to create xlog segments further in advance, that'd be better done
* in bgwriter than in walwriter.
*
* The walwriter is started by the postmaster as soon as the startup subprocess
* finishes. It remains alive until the postmaster commands it to terminate.
* Normal termination is by SIGTERM, which instructs the walwriter to exit(0).
* Emergency termination is by SIGQUIT; like any backend, the walwriter will
* simply abort and exit on SIGQUIT.
*
* If the walwriter exits unexpectedly, the postmaster treats that the same
* as a backend crash: shared memory may be corrupted, so remaining backends
* should be killed by SIGQUIT and then a recovery cycle started.
......
/*
* Loop forever
*/
for (;;)
{
......
/*
* Do what we're here for; then, if XLogBackgroundFlush() found useful
* work to do, reset hibernation counter.
*/
if (XLogBackgroundFlush())
left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
else if (left_till_hibernate > 0)
left_till_hibernate--;
......
src/backend/access/transam/xlog.c
/*
* Write & flush xlog, but without specifying exactly where to.
*
* We normally write only completed blocks; but if there is nothing to do on
* that basis, we check for unwritten async commits in the current incomplete
* block, and write through the latest one of those. Thus, if async commits
* are not being used, we will write complete blocks only.
*
* If, based on the above, there's anything to write we do so immediately. But
* to avoid calling fsync, fdatasync et. al. at a rate that'd impact
* concurrent IO, we only flush WAL every wal_writer_delay ms, or if there's
* more than wal_writer_flush_after unflushed blocks.
*
* We can guarantee that async commits reach disk after at most three
* wal_writer_delay cycles. (When flushing complete blocks, we allow XLogWrite
* to write "flexibly", meaning it can stop at the end of the buffer ring;
* this makes a difference only with very high load or long wal_writer_delay,
* but imposes one extra cycle for the worst case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
*
* Returns TRUE if there was any work to do, even if we skipped flushing due
* to wal_writer_delay/wal_flush_after.
*/
bool
XLogBackgroundFlush(void)
{
XLogwrtRqst WriteRqst;
bool flexible = true;
static TimestampTz lastflush;
TimestampTz now;
int flushbytes;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
return false;
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
WriteRqst = XLogCtl->LogwrtRqst;
SpinLockRelease(&XLogCtl->info_lck);
/* back off to last completed page boundary */
WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
/* if we have already flushed that far, consider async commit records */
if (WriteRqst.Write <= LogwrtResult.Flush)
{
SpinLockAcquire(&XLogCtl->info_lck);
WriteRqst.Write = XLogCtl->asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
flexible = false; /* ensure it all gets written */
}
/*
* If already known flushed, we're done. Just need to check if we are
* holding an open file handle to a logfile that's no longer in use,
* preventing the file from being deleted.
*/
if (WriteRqst.Write <= LogwrtResult.Flush)
{
if (openLogFile >= 0)
{
if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
{
XLogFileClose();
}
}
return false;
}
/*
* Determine how far to flush WAL, based on the wal_writer_delay and
* wal_writer_flush_after GUCs.
*/
now = GetCurrentTimestamp();
flushbytes =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
if (WalWriterFlushAfter == 0 || lastflush == 0)
{
/* first call, or block based limits disabled */
WriteRqst.Flush = WriteRqst.Write;
lastflush = now;
}
// sleep时间调度,更新Flush位点
else if (TimestampDifferenceExceeds(lastflush, now, WalWriterDelay))
{
/*
* Flush the writes at least every WalWriteDelay ms. This is important
* to bound the amount of time it takes for an asynchronous commit to
* hit disk.
*/
WriteRqst.Flush = WriteRqst.Write;
lastflush = now;
}
// wal writer write(异步写)累计调度,更新Flush位点
else if (flushbytes >= WalWriterFlushAfter)
{
/* exceeded wal_writer_flush_after blocks, flush */
WriteRqst.Flush = WriteRqst.Write;
lastflush = now;
}
// 否则不执行fsync
else
{
/* no flushing, this time round */
WriteRqst.Flush = 0;
}
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
elog(LOG, "xlog bg flush request write %X/%X; flush: %X/%X, current is write %X/%X; flush %X/%X",
(uint32) (WriteRqst.Write >> 32), (uint32) WriteRqst.Write,
(uint32) (WriteRqst.Flush >> 32), (uint32) WriteRqst.Flush,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
/* now wait for any in-progress insertions to finish and get write lock */
WaitXLogInsertionsToFinish(WriteRqst.Write);
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
LogwrtResult = XLogCtl->LogwrtResult;
if (WriteRqst.Write > LogwrtResult.Write ||
WriteRqst.Flush > LogwrtResult.Flush)
{
XLogWrite(WriteRqst, flexible);
}
LWLockRelease(WALWriteLock);
END_CRIT_SECTION();
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
/*
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
/*
* If we determined that we need to write data, but somebody else
* wrote/flushed already, it should be considered as being active, to
* avoid hibernating too early.
*/
return true;
}
/*
* Write and/or fsync the log at least as far as WriteRqst indicates.
*
* If flexible == TRUE, we don't have to write as far as WriteRqst, but
* may stop at any convenient boundary (such as a cache or logfile boundary).
* This option allows us to avoid uselessly issuing multiple writes when a
* single one would do.
*
* Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
* must be called before grabbing the lock, to make sure the data is ready to
* write.
*/
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
......
日志切换时,触发fsync
if (finishing_seg)
{
issue_xlog_fsync(openLogFile, openLogSegNo);
......
根据LogwrtResult.Flush位点与请求Flush位点的对比,判断是否需要调用fsync
即前面的调度
/*
* If asked to flush, do so
*/
if (LogwrtResult.Flush < WriteRqst.Flush &&
LogwrtResult.Flush < LogwrtResult.Write)
{
/*
* Could get here without iterating above loop, in which case we might
* have no open file or the wrong one. However, we do not need to
* fsync more than one file.
*/
if (sync_method != SYNC_METHOD_OPEN &&
sync_method != SYNC_METHOD_OPEN_DSYNC)
{
if (openLogFile >= 0 &&
!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
XLogFileClose();
if (openLogFile < 0)
{
XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
openLogFile = XLogFileOpen(openLogSegNo);
openLogOff = 0;
}
issue_xlog_fsync(openLogFile, openLogSegNo);
}
/* signal that we need to wakeup walsenders later */
WalSndWakeupRequest();
LogwrtResult.Flush = LogwrtResult.Write;
}
......
1. https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf
如果要深入了解PostgreSQL redo的内部机制,可以参考以上文档以及源码。