Database Kernel Monthly Report - 2018 / 01

MySQL · MyRocks · An Introduction to MyRocks Parameters

Author: 张远

The following parameters are DB-level and take effect globally.

Parameter
Description
Notes
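rocksdb_block_cache_size
Caches uncompressed blocks. The cache is sharded; the number of shards is controlled by table_cache_numshardbits, which defaults to 6, i.e. 64 shards. Each shard is at least 512 KB (rocksdb::LRUCache::LRUCache)
Default: 512M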
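rocksdb_max_total_wal_size
If the total WAL size exceeds rocksdb_max_total_wal_size, RocksDB switches the memtable and flushes it
Default: 0, meaning the WAL size may not exceed 4 times the write buffer size of all column families
rocksdb_wal_size_limit_mb
Maximum total size of WAL files that may be retained when purging WAL (maps to DBOptions::WAL_size_limit_MB)
Default: 0, meaning the amount of retained WAL is not controlled; a WAL file can be purged as soon as its memtable has been flushed
rocksdb_wal_ttl_seconds
Controls how often WAL is purged: once every rocksdb_wal_ttl_seconds/2 seconds. If rocksdb_wal_size_limit_mb > 0, WAL is purged every 600s instead (kDefaultIntervalToDeleteObsoleteWAL)
Default: 0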
rocksdb_manual_wal_flush
If true WAL is not flushed automatically after each write. Instead it relies on manual invocation of FlushWAL to write the WAL buffer to its file.
Default: true
rocksdb_deadlock_detect
Whether to enable deadlock detection
Disabled by default
rocksdb_wal_bytes_per_sync
Sync the WAL once every rocksdb_wal_bytes_per_sync bytes (WritableFileWriter::Flush)
Default: 0, which turns this incremental sync off
rocksdb_wal_recovery_mode
Recovery mode on restart
0: If the last entry is corrupted, truncate it and start
1: Fail to start, do not recover
2: Truncate everything after the corrupted entry, even entries that are not corrupted. Acceptable on slaves
3: Truncate only the corrupted entry. Most dangerous option
rocksdb_strict_collation_exceptions
Tables that are allowed to use collations that are not memcomparable
The value is a regular expression, e.g. "t1,t2*"
rpl_skip_tx_api
Use write batches for replication thread instead of tx api
Applies to the slave
rocksdb_master_skip_tx_api
Disables the Transaction API and enables the WriteBatch API. There are no row locks, so UPDATEs and DELETEs are faster
You must ensure that no concurrent operations are running
rocksdb_read_free_rpl_tables
Regular expression selecting the databases and tables that use read-free replication, e.g. .* or t.*
Default: empty
rocksdb_info_log_level
Log level; the lower the value, the more verbose the log
0:debug_level 1:info_level 2:warn_level 3:error_level 4:fatal_level 5:header_level
rocksdb_perf_context_level
Sets the perf context level
0,1: disable
2: enable only count stats
3: Other than count stats, also enable time stats except for mutexes
4: enable count and time stats
Default: 0
rocksdb_max_background_jobs
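Number of background worker threads
Older versions had separate rocksdb_max_background_flushes and rocksdb_max_background_compactions settings; newer versions merge them into one and divide the threads between the two automatically. https://github.com/facebook/rocksdb/wiki/Thread-Pool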
rocksdb_commit_in_the_middle
Commit rows implicitly every rocksdb_bulk_load_size rows. Commit in the middle is enabled automatically when rocksdb_bulk_load is on
Default: OFF
Not recommended as a global setting; set it at the session level
rocksdb_blind_delete_primary_key
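When deleting by primary key from a table that has only a primary key index, the row is not read first; it is deleted directly by the given primary key
Default: OFF
DELETEs by primary key work:
DELETE FROM t WHERE id IN (1, 2, 3, 4, 5, 6, ...., 10000)
Does not work:
DELETE .. WHERE id < 10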
rocksdb_use_direct_reads
Use O_DIRECT for reading data
Default: OFF
rocksdb_use_direct_io_for_flush_and_compaction
Use O_DIRECT for flush and compaction
Default: OFF
rocksdb_skip_fill_cache
Skip filling block cache on read requests
Default: OFF
Used during DDL and data loads
gap_lock_raise_error
Using Gap Lock without full unique key in multi-table or multi-statement transactions is not allowed.
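Uses of gap lock that violate the rule above are recorded in the error log
Default: false
gap_lock_write_log
Using Gap Lock without full unique key in multi-table or multi-statement transactions is not allowed.
Uses of gap lock that violate the rule above are recorded in the file named by gap_lock_log_file
Default: false
gap_lock_log_file
File in which gap lock usage is recorded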
rocksdb_stats_dump_period_sec
Controls how often statistics are written to the LOG (DBImpl::PrintStatistics)
Default: 600
Note that currently it is only dumped after a compaction. So if the database doesn't serve any write for a long time, statistics may not be dumped, despite options.stats_dump_period_sec.
rocksdb_compaction_readahead_size
If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. That way RocksDB's compaction is doing sequential instead of random reads.
Default: 0
rocksdb_advise_random_on_open
If set to true, hints to the underlying file system that the file access pattern is random when an SST file is opened.
Default: ON
rocksdb_max_row_locks
Maximum number of locks a transaction can hold
Default: 1M
rocksdb_bytes_per_sync
Sync SST files once every rocksdb_bytes_per_sync bytes (WritableFileWriter::Flush)
Default: 0, which turns this incremental sync off
You may consider using rate_limiter to regulate the write rate to the device. When the rate limiter is enabled, it automatically sets bytes_per_sync to 1MB.
rocksdb_enable_ttl
Enable expired TTL records to be dropped during compaction
Default: ON
rocksdb_enable_ttl_read_filtering
For tables with TTL, expired records are skipped/filtered out during processing and in query results. Disabling this will allow these records to be seen, but as a result rows may disappear in the middle of transactions as they are dropped during compaction. Use with caution.
Default: ON
rocksdb_bulk_load
Switch for bulk load
Default: OFF
rocksdb_bulk_load_allow_unsorted
Allows bulk load of data that is not sorted by primary key
Default: OFF
rocksdb_bulk_load_size
One bulk load is performed every rocksdb_bulk_load_size writes
Default: 1000 writes
rocksdb_enable_bulk_load_api
Enables using SstFileWriter for bulk loading
Default: ON
rocksdb_enable_2pc
Whether to enable two-phase commit (2PC)
Default: ON
rocksdb_rate_limiter_bytes_per_sec
Limits the rate at which SST files are read and written
DBOptions::rate_limiter bytes_per_sec for RocksDB
Default: 0
rocksdb_sst_mgr_rate_bytes_per_sec
Limits the rate at which SST files are deleted
DBOptions::sst_file_manager rate_bytes_per_sec for RocksDB
Default: 0
rocksdb_delayed_write_rate
Write rate that writes are limited to while they are delayed during a write stall, in bytes per second (DBOptions::delayed_write_rate)
Default: 0
rocksdb_write_disable_wal
Whether to disable the WAL
Default: OFF
rocksdb_flush_log_at_trx_commit
Sync wal on transaction commit
Similar to innodb_flush_log_at_trx_commit.
1: sync on commit,
0,2: not sync on commit
Default: 1
rocksdb_cache_index_and_filter_blocks
Whether index and filter blocks are cached in the block cache
Default: ON
rocksdb_pin_l0_filter_and_index_blocks_in_cache
if cache_index_and_filter_blocks is true and the below is true, then filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed.
Default: ON

The parameters above can be viewed with show variables.
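
As a rough illustration of the block cache sharding described in the table (6 shard bits giving 64 shards), the following sketch computes the shard count and the capacity each shard would receive. It is a minimal sketch, not RocksDB code: the function name `block_cache_shards` is hypothetical, and splitting the capacity evenly with ceiling division is an assumption about how a sharded LRU cache divides its space.

```python
from typing import Tuple

# The 512 KB per-shard lower bound mentioned in the table.
MIN_SHARD_BYTES = 512 * 1024


def block_cache_shards(capacity_bytes: int, num_shard_bits: int = 6) -> Tuple[int, int]:
    """Return (shard_count, bytes_per_shard) for a sharded cache.

    Hypothetical helper: assumes the capacity is split evenly across
    2 ** num_shard_bits shards, rounding up.
    """
    shard_count = 1 << num_shard_bits
    # Ceiling division so that the shards together cover the full capacity.
    per_shard = (capacity_bytes + shard_count - 1) // shard_count
    return shard_count, per_shard


if __name__ == "__main__":
    shards, per_shard = block_cache_shards(512 * 1024 * 1024)
    print(shards, per_shard, per_shard >= MIN_SHARD_BYTES)
```

With the 512M default, each of the 64 shards gets 8M, comfortably above the 512 KB minimum.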

For more detail, see db_options_type_info in the code:

include/rocksdb/options.h 
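
The interaction between rocksdb_wal_ttl_seconds and rocksdb_wal_size_limit_mb described in the table can be summarized in a few lines. This is a sketch of the rules as the table states them, not of RocksDB's actual implementation; the function name `wal_purge_interval` is hypothetical, and returning `None` is just this sketch's way of saying "no periodic purge, WAL is removed once its memtable is flushed".

```python
from typing import Optional

# Value of kDefaultIntervalToDeleteObsoleteWAL quoted in the table, in seconds.
DEFAULT_PURGE_INTERVAL_S = 600


def wal_purge_interval(wal_ttl_seconds: int, wal_size_limit_mb: int) -> Optional[int]:
    """Return the WAL purge interval in seconds, following the table's rules."""
    if wal_ttl_seconds == 0 and wal_size_limit_mb == 0:
        # Both defaults: nothing is retained once the memtable is flushed.
        return None
    if wal_size_limit_mb > 0:
        # A size limit is set: purge on the fixed default interval.
        return DEFAULT_PURGE_INTERVAL_S
    # Only the TTL is set: purge every ttl / 2 seconds.
    return wal_ttl_seconds // 2


if __name__ == "__main__":
    for ttl, limit in [(0, 0), (3600, 0), (3600, 100), (0, 100)]:
        print(ttl, limit, wal_purge_interval(ttl, limit))
```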

The following parameters are column-family-level and can be set separately for each column family.

Parameter
Description
Notes
write_buffer_size
Memory size of a memtable
Default
max_write_buffer_number
Maximum number of memtables
Default: 2
min_write_buffer_number_to_merge
it is the minimum number of memtables to be merged before flushing to storage. For example, if this option is set to 2, immutable memtables are only flushed when there are two of them
Default: 1
target_file_size_base
SST file size at level 1
Default: 64M
target_file_size_multiplier
SST file size at level L (L>1):
target_file_size_base * (target_file_size_multiplier ^ (L-1))
Default: 1

For example, if target_file_size_base is 2MB and
target_file_size_multiplier is 10, then each file on level-1 will
be 2MB, and each file on level-2 will be 20MB,
and each file on level-3 will be 200MB
max_bytes_for_level_base
Total size of the SST files at level 1
Default: 256M
max_bytes_for_level_multiplier
Total size of the SST files at level L: max_bytes_for_level_base * (max_bytes_for_level_multiplier ^ (L-1)) * max_bytes_for_level_multiplier_additional(L-1)
(VersionStorageInfo::CalculateBaseBytes)
Default: 10
max_bytes_for_level_multiplier_additional
Different max-size multipliers for different levels.
(VersionStorageInfo::CalculateBaseBytes)
Default: 1:1:1:1:1:1:1
num_levels
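Number of levels
Default: 7
level0_file_num_compaction_trigger
Triggers a level-0 compaction when the number of level-0 files exceeds this value
Default: 4
level0_slowdown_writes_trigger
Triggers write delays when the number of level-0 files exceeds this value
Default: 20
level0_stop_writes_trigger
Stops writes when the number of level-0 files exceeds this value
Default: 36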
pin_l0_filter_and_index_blocks_in_cache
if cache_index_and_filter_blocks is true and the below is true, then filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed.
Default: 1
A per-column-family setting overrides rocksdb_pin_l0_filter_and_index_blocks_in_cache
cache_index_and_filter_blocks
Whether index and filter blocks are cached in the block cache
Default: 1
A per-column-family setting overrides rocksdb_cache_index_and_filter_blocks
optimize_filters_for_hits
When set to true, the last level stores no filter information, since a bloom filter on the last level is of no practical use
Default: OFF
filter_policy
Specifies the filter policy
filter_policy=bloomfilter:10:false
means a bloom filter is used,
bits_per_key_=10, so the number of hash functions is 10*ln2,
false: use_block_based_builder_=false, meaning a full filter is used
prefix_extractor
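Specifies the key prefix that the filter uses
prefix_extractor=capped:24 takes at most the first 24 bytes as the prefix. The fixed:n form takes exactly the first n bytes and ignores keys shorter than n bytes. See CappedPrefixTransform and FixedPrefixTransform for details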
partition_filters
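Whether to use a partitioned filter
Default: false
Filter settings take precedence in the order block-based > partitioned > full. For example, if use_block_based_builder_=true and partition_filters=true are both set, the block-based filter is the one actually used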
whole_key_filtering
If true, place whole keys in the filter (not just prefixes)
Default: 1
level_compaction_dynamic_level_bytes
In this mode, size target of levels are changed dynamically based on size of the last level.
Reduces write amplification
memtable
Specifies the memtable type (skiplist/vector/hash_linkedlist/prefix_hash/cuckoo)
Default: skiplist
compaction_pri
Policy for choosing which files to compact
kByCompensatedSize
Slightly prioritize larger files by size compensated by #deletes
kOldestLargestSeqFirst
First compact files whose data's latest update time is oldest
kOldestSmallestSeqFirst
First compact files whose range hasn't been compacted to the next level for the longest
kMinOverlappingRatio
First compact files whose ratio between overlapping size in next level and its size is the smallest
Default: kByCompensatedSize
compression_per_level
Specifies the compression policy for each level
It usually makes sense to avoid compressing levels 0 and 1 and to compress data only in higher levels. You can even set slower compression in highest level and faster compression in lower levels (by highest we mean Lmax).
bottommost_compression
Specifies the compression policy for the bottommost level
arena_block_size
RocksDB's memory allocation unit, kBlockSize, is set by arena_block_size
When arena_block_size is not specified, it defaults to 1/8 of write_buffer_size.
soft_pending_compaction_bytes_limit
All writes will be slowed down to at least delayed_write_rate if the estimated bytes needing compaction exceed this threshold
Default: 64G
hard_pending_compaction_bytes_limit
All writes are stopped if the estimated bytes needing compaction exceed this threshold.
Default: 256G

The parameters above can be viewed with select * from information_schema.rocksdb_cf_options.
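
To make the per-level size formulas for target_file_size_multiplier and max_bytes_for_level_multiplier concrete, here is a small sketch that evaluates them exactly as written in the table. The function names are hypothetical, and this follows the simplified formulas given here rather than the full logic of VersionStorageInfo::CalculateBaseBytes.

```python
from typing import Optional, Sequence

MB = 1024 * 1024


def target_file_size(level: int, base: int, multiplier: int = 1) -> int:
    """SST file size at `level` (level >= 1), per the table's formula."""
    return base * multiplier ** (level - 1)


def max_bytes_for_level(
    level: int,
    base: int,
    multiplier: int = 10,
    additional: Optional[Sequence[int]] = None,
) -> int:
    """Total SST size allowed at `level` (level >= 1), per the table's formula.

    `additional` stands in for max_bytes_for_level_multiplier_additional;
    when omitted, every entry is treated as 1 (the default).
    """
    extra = 1 if additional is None else additional[level - 1]
    return base * multiplier ** (level - 1) * extra


if __name__ == "__main__":
    # The 2MB / multiplier 10 example from the table: 2MB, 20MB, 200MB.
    print([target_file_size(lvl, 2 * MB, 10) // MB for lvl in (1, 2, 3)])
    # Default base 256M and multiplier 10 for the first three levels.
    print([max_bytes_for_level(lvl, 256 * MB) // MB for lvl in (1, 2, 3)])
```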

For more detail, see ParseColumnFamilyOption and cf_options_type_info in the code:

include/rocksdb/table.h
rocksdb/util/options_helper.h
rocksdb/options/options_helper.cc
include/rocksdb/advanced_options.h
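
The filter_policy entry above states that 10 bits per key corresponds to 10*ln2 hash functions. The sketch below works through that arithmetic using the standard Bloom filter formulas. It is textbook theory, not RocksDB's implementation (which may round or cap the probe count), and the function names are hypothetical.

```python
import math


def optimal_hash_count(bits_per_key: float) -> float:
    """Hash count that minimizes the false positive rate: bits_per_key * ln 2."""
    return bits_per_key * math.log(2)


def false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Approximate false positive rate of a Bloom filter: (1 - e^(-k/b))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


if __name__ == "__main__":
    k = optimal_hash_count(10)
    print(f"optimal hash count for 10 bits/key: {k:.2f}")
    for probes in (6, 7):
        print(probes, f"{false_positive_rate(10, probes):.4%}")
```

Under these formulas, 10 bits per key with 6 or 7 probes gives a false positive rate a little under 1%.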

Configuration example

rocksdb_default_cf_options=memtable=vector;
arena_block_size=10M;
disable_auto_compactions=1;
min_write_buffer_number_to_merge=1;
write_buffer_size=100000m;
target_file_size_base=32m;
max_bytes_for_level_base=512m;
level0_file_num_compaction_trigger=20;
level0_slowdown_writes_trigger=30;
level0_stop_writes_trigger=30;
max_write_buffer_number=5;
compression_per_level=kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression:kNoCompression;
bottommost_compression=kNoCompression;
block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};
level_compaction_dynamic_level_bytes=false;
optimize_filters_for_hits=true
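
When reviewing a long option string like the one above, it can help to break it into key/value pairs. The following is a minimal sketch of such a splitter, assuming the syntax is `key=value` pairs separated by `;` with `{...}` marking a nested group. The function `parse_cf_options` is hypothetical and is not MyRocks' own parser.

```python
def parse_cf_options(text: str) -> dict:
    """Parse 'k=v;k={k=v;...}' into a (possibly nested) dict of strings."""
    pieces, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == ";" and depth == 0:
            # Only split on semicolons that are outside any braces.
            pieces.append(text[start:i])
            start = i + 1
    pieces.append(text[start:])

    options = {}
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        key, _, value = piece.partition("=")
        value = value.strip()
        if value.startswith("{") and value.endswith("}"):
            # Nested group such as block_based_table_factory={...}.
            value = parse_cf_options(value[1:-1])
        options[key.strip()] = value
    return options


if __name__ == "__main__":
    sample = (
        "write_buffer_size=100000m;\n"
        "block_based_table_factory={cache_index_and_filter_blocks=1;"
        "filter_policy=bloomfilter:10:false;whole_key_filtering=1};\n"
        "optimize_filters_for_hits=true"
    )
    for name, value in parse_cf_options(sample).items():
        print(name, "=", value)
```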

Example of changing parameters

SET @@global.rocksdb_update_cf_options='cf1={write_buffer_size=8m;target_file_size_base=2m};cf2={write_buffer_size =16m;max_bytes_for_level_multiplier=8};cf3={target_file_size_base=4m};';

Note: options can be changed dynamically this way, but the change is not persisted to the OPTIONS file; the OPTIONS file must be edited by hand.
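
If such statements are generated by a script, the value for rocksdb_update_cf_options can be assembled from a mapping of column family names to options. This is a minimal sketch that only mirrors the format of the example statement above; the helper `build_update_cf_options` is hypothetical, and no escaping or validation of option names is attempted.

```python
def build_update_cf_options(changes: dict) -> str:
    """Build 'cf1={k=v;k=v};cf2={k=v};' from {cf_name: {option: value}}."""
    parts = []
    for cf_name, options in changes.items():
        body = ";".join(f"{key}={value}" for key, value in options.items())
        parts.append(f"{cf_name}={{{body}}}")
    return ";".join(parts) + ";"


if __name__ == "__main__":
    value = build_update_cf_options(
        {
            "cf1": {"write_buffer_size": "8m", "target_file_size_base": "2m"},
            "cf3": {"target_file_size_base": "4m"},
        }
    )
    # The resulting statement still has to be run against the server.
    print(f"SET @@global.rocksdb_update_cf_options='{value}';")
```

As the note above says, a change made this way is not written to the OPTIONS file, so the same values also need to be added there by hand.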