数据库内核月报

背景

当有写IO等待问题时，有一些调整参数的方法可以像打了鸡血一样提速，但是安全性如何？

1 危险系数最高

关闭fsync。本该持久化同步写的(例如checkpoint，wal write)变成了异步，可能丢数据，同时会导致数据库的数据不一致，极度危险。

任何时候都不建议关闭fsync。

2 危险系数高

关闭full_page_write。

full page write用来防止data block的partial write。原理：每次checkpoint后第一次被修改（或新增）的数据块，需要COPY完整的数据块内容到wal，当数据库CRASH后，从最后一次成功的checkpoint开始采用WAL进行恢复，因此可以修复partial write的data block。

例如pg_basebackup， start_backup等物理备份操作，会强制开启full page write，因为我们COPY datafile时，数据库可以处于读写状态，可能会拷贝到PARTIAL BLOCK，有full page 的话，在恢复时，可以将这些partial block恢复到一致的状态。

什么时候可以没有风险的关闭full page write？

1、如果块设备对齐，并支持原子写（原子写大于或等于一个DATA FILE数据页的大小），可以关闭full_page_write。

postgres=# show block_size;      
 block_size       
------------      
 8192      
(1 row)      

2、如果文件系统是COW模式，可以关闭。例如zfs文件系统可以关闭full_page_write。

3 危险系数次高

打开data_sync_retry。

data_sync_retry (boolean)

When set to off, which is the default, PostgreSQL will raise a PANIC-level error on failure to flush modified data files to the filesystem. This causes the database server to crash. This parameter can only be set at server start.

On some operating systems, the status of data in the kernel’s page cache is unknown after a write-back failure. In some cases it might have been entirely forgotten, making it unsafe to retry; the second attempt may be reported as successful, when in fact the data has been lost. In these circumstances, the only way to avoid data loss is to recover from the WAL after any failure is reported, preferably after investigating the root cause of the failure and replacing any faulty hardware.

If set to on, PostgreSQL will instead report an error but continue to run so that the data flushing operation can be retried in a later checkpoint. Only set it to on after investigating the operating system’s treatment of buffered data in case of write-back failure.

默认值是安全的。

如果你要设置为ON，务必确保OS层的fsync是可以retry并且可靠的。

《PostgreSQL 12 preview - 可靠性提升 - data_sync_retry 消除os层write back failed status不可靠的问题》

4 危险系数次高

基于异步流复制standby 的 HA。在发生failover时，可能有丢失WAL日志的风险。

PostgreSQL可以通过流复制建立STANDBY，流复制基于WAL的变化，在STANDBY回放，生成与主节点一模一样（BLOCK级别一致）的数据库。

流复制支持异步模式、同步模式。

异步模式下，事务结束时不需要等待STANDBY已接收事务对应的事务结束时写的commit

rollback wal record。

同步模式下，事务结束时需要等待STANDBY已接收事务对应的事务结束时写的commit

rollback wal record。

因此使用异步STANDBY来实施HA，可能有丢失事务的风险。

1、使用异步STANDBY HA架构，在failover后可能丢多少事务？通常取决于网络。一般在上游写入量特别大时，WAL的产生可以达到100MB/s以上。如果网络带宽不是问题的话，基本上延迟在10毫秒内。

2、使用异步STANDBY HA架构，在failover后会不会导致数据库数据不一致？不会，只有可能丢事务，但是不会出现数据库数据不一致的情况（事务原子性没有问题，事务提交先后没有问题。）

使用心跳，可以将RPO控制在业务可接受范围内。

《PostgreSQL 三种心跳(keepalive)指标的应用 - 时间戳、redo(wal)位点、事务号》

如果需要RPO=0，请使用

1、同步复制或多副本（quorum based sync replication）的standby 。

2、存储层面多副本。

5 危险系数低

关闭synchronous_commit。

关闭synchronous_commit，与使用异步流复制standby 的 HA类似。但是要低于异步STANDBY发生FAILOVER带来的风险。

关闭synchronous_commit。数据库崩溃时不会导致数据丢失。仅当操作系统崩溃时才会有数据丢失的风险，风险多大取决于以下参数。

建议设置：

synchronous_commit=off

wal_writer_delay=10ms # 每隔 10 ms ，主动持久化wal buffer到DISK（持久化存储）

wal_writer_flush_after=1MB # 每隔1MB WAL，主动持久化wal buffer到DISK（持久化存储）

以上设置，操作系统崩溃时数据库最多丢30ms (wal_writer_delay乘以3)日志。但是不会导致数据库状态不一致（事务原子性、事务先后），可以确保没有提交的事务被完全回滚、已提交的事务保证完整。

synchronous_commit (enum)

Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client. Valid values are on, remote_apply, remote_write, local, and off. The default, and safe, setting is on.

When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash.
(The maximum delay is three times wal_writer_delay.)

Unlike fsync, setting this parameter to off does not create any risk of database inconsistency:

an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly.

So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3.

standby相关：

If synchronous_standby_names is non-empty, this parameter also controls whether or not transaction commits will wait for their WAL records to be replicated to the standby server(s).

When set to on, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to disk. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage.

When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s).

When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it out to their operating system. This setting is sufficient to ensure data preservation even if a standby instance of PostgreSQL were to crash, but not if the standby suffers an operating-system-level crash, since the data has not necessarily reached stable storage on the standby.

Finally, the setting local causes commits to wait for local flush to disk, but not for replication. This is not usually desirable when synchronous replication is in use, but is provided for completeness.

If synchronous_standby_names is empty, the settings on, remote_apply, remote_write and local all provide the same synchronization level: transaction commits only wait for local flush to disk.

任何时候都可以修改synchronous_commit，支持事务级、会话级设置，支持USER，DB级设置，支持集群级设置。

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.

wal_writer_delay (integer)

Specifies how often the WAL writer flushes WAL. After flushing WAL it sleeps for wal_writer_delay milliseconds, unless woken up by an asynchronously committing transaction.

If the last flush happened less than wal_writer_delay milliseconds ago and less than wal_writer_flush_after bytes of WAL have been produced since, then WAL is only written to the operating system, not flushed to disk.

The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

wal_writer_flush_after (integer)

Specifies how often the WAL writer flushes WAL. If the last flush happened less than wal_writer_delay milliseconds ago and less than wal_writer_flush_after bytes of WAL have been produced since, then WAL is only written to the operating system, not flushed to disk.

If wal_writer_flush_after is set to 0 then WAL data is flushed immediately.

The default is 1MB. This parameter can only be set in the postgresql.conf file or on the server command line.

6 无危险

分组提交

commit_delay=10ms

commit_siblings=5

commit_delay (integer)

commit_delay adds a time delay, measured in microseconds, before a WAL flush is initiated. This can improve group commit throughput by allowing a larger number of transactions to commit via a single WAL flush, if system load is high enough that additional transactions become ready to commit within the given interval. However, it also increases latency by up to commit_delay microseconds for each WAL flush. Because the delay is just wasted if no other transactions become ready to commit, a delay is only performed if at least commit_siblings other transactions are active when a flush is about to be initiated. Also, no delays are performed if fsync is disabled. The default commit_delay is zero (no delay). Only superusers can change this setting.

In PostgreSQL releases prior to 9.3, commit_delay behaved differently and was much less effective: it affected only commits, rather than all WAL flushes, and waited for the entire configured delay even if the WAL flush was completed sooner. Beginning in PostgreSQL 9.3, the first process that becomes ready to flush waits for the configured interval, while subsequent processes wait only until the leader completes the flush operation.

commit_siblings (integer)

Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five transactions.

如何选择，总结

取决于业务对写入性能、数据RPO的要求。

1、永远不要关闭fsync。

不安全的关闭fsync，会导致数据丢失，数据状态不一致等各种问题。

2、如果块设备不能保证大于或等于DATA BLOCK SIZE的原子性，或者文件系统不支持COW，请千万不要关闭full_page_write。

不安全的关闭full_page_write，会导致数据库状态不一致。

3、如果有RPO=0的要求，请不要使用异步STANDBY模式的HA。

建议可以使用同步流复制、quorum based sync replication、存储级同步多副本等。

4、如果有RPO=0的要求，建议不要关闭synchronous_commit。

如果对RPO的要求大于或等于30毫秒，可以关闭synchronous_commit来提高写性能（机械盘效果立竿见影，SSD亦有提高）。

又或者说如果你的环境已经使用了异步流复制的HA或者其他环节有丢失数据的风险时，关闭synchronous_commit引入的操作系统崩溃时30ms WAL事务日志丢失可能性已不足为虑。

PS: 关闭synchronous_commit 并不会导致数据库状态不一致（事务原子性、事务先后都没有问题），所以状态不一致的没有这类风险。

5、data_sync_retry=off # 默认值是安全的。