Author: zhuyan
主要的代码和流程在参考文档 3,4 这种已经讲解的比较清楚了,一个 Coordinator 线程负责处理刷脏请求,计算刷脏的量,然后分配给几个 Worker 线程去刷不同的 Buffer Pool Instance, 完成刷脏后,Coordinator 线程进入下一轮刷脏。
Coordinator 和 Worker 之间通过 page_cleaner->slots[i]->state 来协同,page_cleaner_state_t 有四种状态,代码注释说明了状态之间的迁移。
/** State for page cleaner array slot */
enum page_cleaner_state_t {
/** Not requested any yet.
Moved from FINISHED by the coordinator. */
PAGE_CLEANER_STATE_NONE = 0,
/** Requested but not started flushing.
Moved from NONE by the coordinator. */
PAGE_CLEANER_STATE_REQUESTED,
/** Flushing is on going.
Moved from REQUESTED by the worker. */
PAGE_CLEANER_STATE_FLUSHING,
/** Flushing was finished.
Moved from FLUSHING by the worker. */
PAGE_CLEANER_STATE_FINISHED
};
Coordinate 入口函数 buf_flush_page_coordinator_thread,主循环刷脏逻辑:
page cleaner 的循环刷脏周期是 1s,如果不足 1s 就需要 sleep,超过 1s 可能是刷脏太慢,不足 1s 可能是被其它线程唤醒的。
/* The page_cleaner skips sleep if the server is
idle and there are no pending IOs in the buffer pool
and there is work to do. */
if (srv_check_activity(last_activity) /*和上次循环对比,有没有新增的 activity */
|| buf_get_n_pending_read_ios() || n_flushed == 0) { /* 有没有 pending io */
ret_sleep = pc_sleep_if_needed(next_loop_time, sig_count);
if (srv_shutdown_state != SRV_SHUTDOWN_NONE) {
break;
}
} else if (ut_time_ms() > next_loop_time) {
ret_sleep = OS_SYNC_TIME_EXCEEDED;
} else {
ret_sleep = 0;
}
错误日志里有时候会看到这样的日志:
Page cleaner took xx ms to flush xx and evict xx pages
这个表示上一轮刷脏进行的比较缓慢,首先 ret_sleep == OS_SYNC_TIME_EXCEEDED, 并且本轮刷脏和上一轮刷脏超过 3s,warn_interval 控制输出日志的频率,如果持续打日志,就要看看 IO 延迟了。
Sync flush 不受 io_capacity/io_capacity_max 的限制,所以会对性能产生比较大的影响。
/* Note that the buf_flush_sync_lsn which is the maximum lsn that
* primary must flush to disk, so if it greater than the oldest_lsn,
* we still need to wake up page cleaner thread to flush. */
oldest_lsn = buf_pool_get_oldest_modification_lwm();
if (ret_sleep != OS_SYNC_TIME_EXCEEDED && srv_flush_sync &&
oldest_lsn < buf_flush_sync_lsn) {
/* Request flushing for threads */
pc_request(ULINT_MAX, buf_flush_sync_lsn);
/* Coordinator also treats requests */
while (pc_flush_slot() > 0) {
}
pc_wait_finished(&n_flushed_lru, &n_flushed_list);
}
pc_request 是 Coordinate 分发的入口,有两个限制参数,page 数量或者 lsn,sync flush 只有对 lsn 的限制。 pc_flush_slot 和 pc_wait_finished 是刷脏和等待 worker 线程返回。
TIPS: pc_ 前缀是 page cleaner 的缩写
当系统有负载的时候,为了避免频繁刷脏影响用户,会计算出每次刷脏的 page 数量
else if (srv_check_activity(last_activity)) {
ulint n_to_flush;
lsn_t lsn_limit = 0;
/* Estimate pages from flush_list to be flushed */
if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
last_activity = srv_get_activity_count();
n_to_flush =
page_cleaner_flush_pages_recommendation(&lsn_limit, last_pages);
} else {
n_to_flush = 0;
}
/* Request flushing for threads */
pc_request(n_to_flush, lsn_limit);
/* Coordinator also treats requests */
while (pc_flush_slot() > 0) {
}
pc_wait_finished(&n_flushed_lru, &n_flushed_list);
}
系统空闲的时候不用担心刷脏影响用户线程,可以使用最大的 io_capacity 刷脏。RDS 有参数 srv_idle_flush_pct 控制刷脏比例,默认是 100%。
} else if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
/* no activity, slept enough */
buf_flush_lists(PCT_IO(100), LSN_MAX, &n_flushed);
...
}
在这篇文章 中已经把刷脏算法讲解的非常清楚了,这块就把公式列一下。
/* 总的计算公式,n_pages 是本轮尝试刷脏的量,是三个值的平均 */
#define PCT_IO(p) ((ulong)(innodb_io_capacity * ((double)(p) / 100.0)))
n_pages = (PCT_IO(pct_total) + avg_page_rate + pages_for_lsn) / 3;
if (n_pages > innodb_max_io_capacity) {
n_pages = innodb_max_io_capacity;
}
page_rate = sum_pages / time_elapsed; // 一个计算周期内的刷脏速度
avg_page_rate = (avg_page_rate + page_rate) / 2; // 平均速度
其中 page_rate 和 lsn_rate 都是 srv_flushing_avg_loops 秒去尝试更新一次,避免刷脏抖动太快。avg_page_rate 加入计算,也是为了平缓刷脏。
F(avg_page_rate) = F(page_rate, srv_flushing_avg_loops);
lsn_rate = cur_lsn - prev_lsn / time_elapsed; // 一个计算周期内的lsn产生速度
lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2; // 平均速度
// lsn_avg_rate转换为脏页数
lsn_t target_lsn = oldest_lsn + lsn_avg_rate * buf_flush_lsn_scan_factor;
sum_pages_for_lsn = 计算flush list中所有小于targe_lsn的脏页数
sum_pages_for_lsn /= buf_flush_lsn_scan_factor;
pages_for_lsn = min(sum_pages_for_lsn, innodb_max_io_capacity * 2);
LSN 的平均产生速度包含了多少个脏页,这个参考因素可以快速 Get 到流量的变化,一定程度上增大或者减缓刷脏。
F(pages_for_lsn) =
F(**lsn_rate**, srv_flushing_avg_loops, buf_flush_lsn_scan_factor, innodb_max_io_capacity)
Note: 这部分扫描每个 buffer pool instance 找脏页数量的时候,5.7.6 做了优化(参考文档2),每一批刷的脏页数,在各个 buffer pool instance 中根据里面脏页数量的比列分配,这样就可以做到均衡刷脏。因为各个 buffer pool instance 中的脏页比例可能是不一样的。
pct_total = max(pct_for_dirty, pct_for_lsn);
因为 Redo Log 的空间是有限的,Buffer Pool 的资源是有限的,并且 Buffer Pool 中的脏页 oldest_modification_lsn 限制了 checkpoint lsn, 间接的限制了 Redo 空间的使用。所以脏页的推进会释放 buffer pool 和 redo 的可使用空间,因此在刷脏的时候也需要参考当前脏页的比例和 Redo log 的 ‘age’。
double dirty_pct = buf_get_modified_ratio_pct();
pct_for_dirty = (dirty_pct * 100) /
(srv_max_buf_pool_modified_pct + 1)
除了 dirty_pct 之外,srv_max_dirty_pages_pct_lwm 和 srv_max_buf_pool_modified_pct 也影响着 pct_for_dirty 的值。具体逻辑:
if (srv_max_dirty_pages_pct_lwm == 0) {
/* The user has not set the option to preflush dirty
pages as we approach the high water mark. */
if (dirty_pct >= srv_max_buf_pool_modified_pct) {
/* We have crossed the high water mark of dirty
pages In this case we start flushing at 100% of
innodb_io_capacity. */
return (100);
}
} else if (dirty_pct >= srv_max_dirty_pages_pct_lwm) {
/* We should start flushing pages gradually. */
return (static_cast<ulint>((dirty_pct * 100) /
(srv_max_buf_pool_modified_pct + 1)));
}
return (0);
F(pct_for_dirty) = F(dirty_pct, srv_max_dirty_pages_pct_lwm, srv_max_buf_pool_modified_pct);
#define PCT_IO(p) ((ulong)(srv_io_capacity * ((double)(p) / 100.0)))
age = cur_lsn > adjusted_oldest_lsn ? cur_lsn - adjusted_oldest_lsn : 0;
auto limit_for_age = log_get_max_modified_age_async();
lsn_age_factor = (age * 100) / limit_for_age;
pct_for_lsn = (srv_max_io_capacity / srv_io_capacity) *
(lsn_age_factor * sqrt((double)lsn_age_factor)) /
7.5
n_pages = PCT_IO(pct_for_lsn)
= srv_io_capacity *
(srv_max_io_capacity / srv_io_capacity) *
(lsn_age_factor * sqrt((double)lsn_age_factor)) /
7.5 / 100
= srv_max_io_capacity *
(lsn_age_factor * sqrt((double)lsn_age_factor)) /
7.5 / 100
自适应刷脏主要影响的值是 pct_for_lsn,由开关 srv_adaptive_flushing 控制,但也不完全由开关控制。完整的逻辑,还是看代码比较直观:
static ulint af_get_pct_for_lsn(lsn_t age) /*!< in: current age of LSN. */
{
const lsn_t log_margin =
log_translate_sn_to_lsn(log_free_check_margin(*log_sys));
ut_a(log_sys->lsn_capacity_for_free_check > log_margin);
const lsn_t log_capacity = log_sys->lsn_capacity_for_free_check - log_margin;
lsn_t lsn_age_factor;
lsn_t af_lwm = (srv_adaptive_flushing_lwm * log_capacity) / 100;
if (age < af_lwm) {
/* No adaptive flushing. */
return (0);
}
auto limit_for_age = log_get_max_modified_age_async();
ut_a(limit_for_age >= log_margin);
limit_for_age -= log_margin;
if (age < limit_for_age && !srv_adaptive_flushing) {
/* We have still not reached the max_async point and
the user has disabled adaptive flushing. */
return (0);
}
/* If we are here then we know that either:
1) User has enabled adaptive flushing
2) User may have disabled adaptive flushing but we have reached
max_async_age. */
lsn_age_factor = (age * 100) / limit_for_age;
ut_ad(srv_max_io_capacity >= srv_io_capacity);
return (static_cast<ulint>(((srv_max_io_capacity / srv_io_capacity) *
(lsn_age_factor * sqrt((double)lsn_age_factor))) /
7.5));
}
F(pct_for_lsn) = F(**age**, log_capacity, srv_adaptive_flushing_lwm,
log_sys->max_modified_age_async, srv_adaptive_flushing, srv_max_io_capacity);
如果最终选择了 pct_for_lsn, 那么公式中带入会把 srv_io_capacity 约掉。
同步刷脏的触发主要在 checkpoint 线程中,函数:log_consider_sync_flush
lsn_t flush_up_to = oldest_lsn;
/* Redo 的 age 超过 log.max_modified_age_sync 触发 sync flush */
if (current_lsn - oldest_lsn > log.max_modified_age_sync) {
ut_a(current_lsn > log.max_modified_age_sync || in_recover_mode());
flush_up_to = current_lsn - log.max_modified_age_sync;
}
/* 或者其他线程显示的请求到某个 LSN */
const lsn_t requested_checkpoint_lsn = log.requested_checkpoint_lsn;
if (requested_checkpoint_lsn > flush_up_to) {
flush_up_to = requested_checkpoint_lsn;
}
if (flush_up_to > oldest_lsn) {
log_preflush_pool_modified_pages(log, flush_up_to);
}
F(flush_up_to) = F(**age**, log.max_modified_age_sync)
开关控制 srv_flush_sync 在 log_preflush_pool_modified_pages 决定是否做真正的 sync_flush.
innodb_page_cleaners page cleaner 线程的数量,因为每一个 Buffer Pool Instance 同时只会有一个 pager cleaner 线程处理,所以配置的线程数不能超过 innodb_buffer_pool_instances 大小,超过就配置相同大小。
innodb_max_dirty_pages_pct_lwm 代码中对应变量:srv_max_dirty_pages_pct_lwm, 如果系统中脏页比例超过这个值, 将会计算 pct_for_dirty 纳入到 PCT_IO(pct_total) 中。
innodb_max_dirty_pages_pct 代码中对应变量:srv_max_buf_pool_modified_pct, 系统中最大脏页比例,和 srv_max_dirty_pages_pct_lwm 一起,影响 pct_for_dirty 的计算结果。
innodb_adaptive_flushing_lwm 代码中对应变量:srv_adaptive_flushing_lwm,当 age (所有脏页占用的 lsn 大小) 小于 log_capacity 的srv_adaptive_flushing_lwm 比例,pc_for_lsn 为 0,也就是不启用 redo 自适应模式刷脏。
innodb_adaptive_flushing 代码中对应变量:srv_adaptive_flushing,是否使用 redo 自适应模式刷脏,如果为 OFF, 只有 age 大于 log_sys->max_modified_age_async 才会采用 redo 自适应模式刷脏,如果为 ON, 满足 srv_adaptive_flushing_lwm 条件就采用 redo 自适应模式刷脏。
#define PCT_IO(p) ((ulong)(srv_io_capacity * ((double)(p) / 100.0)))
innodb_io_capacity_max 代码中对应变量:srv_max_io_capacity, 表示系统中每次能刷的最大值。会影响 pc_for_lsn 的算法。
innodb_flushing_avg_loops 代码中对应变量:srv_flushing_avg_loops,计算 lsn_avg_rate 和 avg_page_rate 的频率,为了让刷脏尽可能的平缓,默认 30s 更新一次。lsn_avg_rate 将会影响 pages_for_lsn 的计算,avg_page_rate 直接参数最终的 n_pages 计算。
innodb_flush_sync 代码中对应变量:srv_flush_sync,是否触发激烈刷脏,如果是 sync_flush 的话,系统刷脏不受 srv_io_capacity 和 srv_max_io_capacity 控制,而是刷脏页到一个指定的 lsn。 checkpoint 线程会不断检测是否需要 sync_flush, 如果当前的 lsn 和 log.available_for_checkpoint_lsn 差距超过 log.max_modified_age_sync 或者有其它指定刷脏的请求(requested_checkpoint_lsn),就尝试激烈刷脏。