数据库内核月报

数据库内核月报 - 2022 / 01

MySQL · 源码分析 · TABLE信息的生命周期

Author: 桐铭

MySQL通过TABLE对象进行表的读写等操作,对于构建TABLE对象所需的表定义相关信息,MySQL会通过Dictionary_client与DD模块进行交互。DD模块通过多级缓存的结构提供了高效而安全的DD信息访问方式,具体介绍见http://mysql.taobao.org/monthly/2021/08/02/。本文旨在讨论TABLE对象及其用到的DD模块中TableImpl信息的生命周期,包括他们的构建、缓存、清理和移除。 ​

##相关数据结构: TABLE结构描述表的相关信息,其中表定义相关的一些DD信息,如包含的字段等,记录在TABLE_SHARE中,handler是该表所使用的存储引擎接口,两个TABLE指针串起了正在操作这个TABLE对象的THD所控制的所有TABLE对象,即THD::opened_tables。另外,TABLE结构中还有字段信息、保存操作过程中读写的数据的内存区域等,与本文主题无关。

struct TABLE {
  TABLE_SHARE *s{nullptr};
  handler *file{nullptr};
  TABLE *next{nullptr}, *prev{nullptr};

 private:
  TABLE *cache_next{nullptr}, **cache_prev{nullptr};

  /*
    Give table_cache_element access to the above two members to allow
    using them for linking TABLE objects in a list.
  */
  friend class table_cache_element;

 public:
 
  THD *in_use{nullptr};   /* Which thread uses this */
  Field **field{nullptr}; /* Pointer to fields */

  bool m_needs_reopen{false};

 }

table_cache_element是table cache中的一个元素,它表示了一张表在线程对应的table cache中构建的所有TABLE实例,其中有一些位于thd->opened_tables中正在使用(used_tables),也有一些已经被释放,缓存供后续使用(free_tables)。这些表共用一个TABLE_SHARE对象维护dd信息等。

/**
  Element that represents the table in the specific table cache.
  Plays for table cache instance role similar to role of TABLE_SHARE
  for table definition cache.

  It is an implementation detail of table_cache and is present
  in the header file only to allow inlining of some methods.
*/

class table_cache_element {
  TABLE_list used_tables;
  TABLE_list free_tables;
  TABLE_SHARE *share;
};

table_cache维护了相关的几个THD所使用的所有table_cache_element,即所有正在使用或曾经打开过的TABLE对象。m_unused_tables是所有table_cache_element中free_tables的并集。

class table_cache {
  /**
    The hash of table_cache_element objects, each table/table share that
    has any TABLE object in the table_cache has a table_cache_element from
    which the list of free TABLE objects in this table cache AND the list
    of used TABLE objects in this table cache is stored.
    We use table_cache_element::share::table_cache_key as key for this hash.
  */
  std::unordered_map<std::string, std::unique_ptr<table_cache_element>> m_cache;

  /**
    List that contains all TABLE instances for tables in this particular
    table cache that are in not use by any thread. Recently used TABLE
    instances are appended to the end of the list. Thus the beginning of
    the list contains which have been least recently used.
  */
  TABLE *m_unused_tables;

  /**
    Total number of TABLE instances for tables in this particular table
    cache (both in use by threads and not in use).
    This value summed over all table caches is accessible to users as
    Open_tables status variable.
  */
  uint m_table_count;
};

table_cache_manager是所有table cache的集合,该对象有一个全局单例。需要操作table cache保存或者借用空闲TABLE对象时,根据THD ID访问对应的table cache并进行相应操作。

/**
  Container class for all table cache instances in the system.
*/

class table_cache_manager {
  /**
    An array of table_cache instances.
    Only the first table_cache_instances elements in it are used.
  */
  table_cache m_table_cache[MAX_table_cacheS];
};

extern table_cache_manager table_cache_manager;

##DD信息创建:create table create table时会通过open_table调用check_if_table_exists判断表是否存在,以便在表存在时报错。这里会产生一次空读,这次空读会完整读穿DD模块直至底层存储引擎,不过由于此时相关DD信息尚未构建,暂时略去。之后在rea_create_base_table时才会构建DD信息并保存到存储引擎InnoDB。 ​

static bool rea_create_base_table(...) {

// 构造dd::Table对象,并用create_info中的信息填充
  std::unique_ptr<dd::Table> table_def_res =
      dd::create_table(thd, sch_obj, table_name, create_info, create_fields,
                       key_info, keys, keys_onoff, fk_key_info, fk_keys, file);

  if (do_not_store_in_dd) {
    ......
  } else {
  
  	// Storage_adapter::store(m_thd, object) 保存在Storage_adapter(innoDB)中。然后调用register_uncommitted_object将table_def_res的clone()保存在Dictionary_client的m_registry_uncommitted中。
    bool result = thd->dd_client()->store(table_def_res.get());

    ......

		// 获取Dictionary_client的m_registry_uncommitted中dd::Table的clone() 【为啥不直接用table_def_res,反正存进去的也是clone】
    if (thd->dd_client()->acquire_for_modification(db, table_name, &table_def))
      DBUG_RETURN(true);
  }

...... (根据前面的table_def信息构造TABLE_SHARE并让存储引擎完成实际的table create,与本文主题无关)

}

前面的rea_create_base_table中,表定义信息被保存到了Dictionary_clientm_registry_uncommitted中,即保存在多级缓存中的局部缓存中,并保存在了存储引擎,但是并不会放置到共享缓存Shared_dictionary_cache层中。接下来,事务提交过程会在将dd表信息落盘的同时清理之前放入缓存的表信息:

bool trans_commit_implicit(THD *thd, bool ignore_global_read_lock) {

	......
  
  // 调用各个dd数据类型的remove_uncommitted_objects清理dd cache中的Abstract_table、Schema、Tablespace等各种缓存
  thd->dd_client()->commit_modified_objects();
}

void Dictionary_client::remove_uncommitted_objects(
    bool commit_to_shared_cache) {

  if (commit_to_shared_cache) {
    typename Multi_map_base<typename T::Cache_partition>::Const_iterator it;
    
    // 防止Shared_dictionary_cache中残留有新建的表的信息,为m_registry_uncommitted中的每一个表信息调用invalidate,以清理Shared_dictionary_cache中相应的表信息(如果有)
    for (it = m_registry_uncommitted.begin<typename T::Cache_partition>();
         it != m_registry_uncommitted.end<typename T::Cache_partition>();
         it++) {
      typename T::Cache_partition *uncommitted_object =
          const_cast<typename T::Cache_partition *>(it->second->object());
      DBUG_ASSERT(uncommitted_object != nullptr);

      // 防止Shared_dictionary_cache中残留有新建的表的信息
      invalidate(uncommitted_object);
    }

    // 仅对于bootstrap过程中initialize的dd表(mysql.tables等),将uncommitted cache中的表信息转移到Shared_dictionary_cache和m_registry_committed中
    if (m_thd->is_dd_system_thread() &&
        bootstrap::DD_bootstrap_ctx::instance().get_stage() <
            bootstrap::Stage::FINISHED) {
      // We must do this in two iterations to handle situations where two
      // uncommitted objects swap names.
      for (it = m_registry_uncommitted.begin<typename T::Cache_partition>();
           it != m_registry_uncommitted.end<typename T::Cache_partition>();
           it++) {
        typename T::Cache_partition *uncommitted_object =
            const_cast<typename T::Cache_partition *>(it->second->object());
        DBUG_ASSERT(uncommitted_object != nullptr);

        Cache_element<typename T::Cache_partition> *element = NULL;

        // In put, the reference counter is stepped up, so this is safe.
        Shared_dictionary_cache::instance()->put(
            static_cast<const typename T::Cache_partition *>(
                uncommitted_object->clone()),
            &element);

        m_registry_committed.put(element);
        // Sign up for auto release.
        m_current_releaser->auto_release(element);
      }
    }
  }  // commit_to_shared_cache

	// 清理需要清理的m_registry_uncommitted和m_registry_dropped中的信息
  m_registry_uncommitted.erase<typename T::Cache_partition>();
  m_registry_dropped.erase<typename T::Cache_partition>();
}

以上,create user table的过程中不会遗留任何类型的DD cache信息,因为对所有类型的cache均调用了remove_uncommitted_objects函数,保证临时产生的dd::Table对象不会残留在缓存中,而仅仅是进行了落盘。但这并不代表建表过程完全不会污染内存缓存信息,比如,建表过程中创建tablespace所调用的fil_space_create会将表空间信息保存在全局变量fil_system对应的Fil_shard中。 ​

##DD信息读取和TABLE对象构建:open_table

所有的DML、DQL在访问表数据之前都要先访问表的Table_Impl等DD信息,以便读取、解析或写入相关的数据。这一访问过程在开表过程中构造TABLE_SHARE时实现。

TABLE_SHARE *get_TABLE_SHARE(THD *thd, const char *db, const char *table_name,
                             const char *key, size_t key_length, bool open_view,
                             bool open_secondary) {
  ......
  /*
    Read table definition from the cache. If the share is being opened,
    wait for the appropriate condition. The share may be destroyed if
    open fails, so after cond_wait, we must repeat searching the
    hash table.
  */
  for (;;) {
    auto it = table_def_cache->find(string(key, key_length));
    // table_def_cache中找不到,确认加了schema的MDL锁之后建立新的TABLE_SHARE
    if (it == table_def_cache->end()) {
      if (thd->mdl_context.owns_equal_or_stronger_lock(
              MDL_key::SCHEMA, db, "", MDL_INTENTION_EXCLUSIVE)) {
        break;
      }
      mysql_mutex_unlock(&LOCK_open);

      if (dd::mdl_lock_schema(thd, db, MDL_TRANSACTION)) {
        // Lock LOCK_open again to preserve function contract
        mysql_mutex_lock(&LOCK_open);
        DBUG_RETURN(nullptr);
      }

      mysql_mutex_lock(&LOCK_open);
      // Need to re-try the find after getting the mutex again
      continue;
    }
    share = it->second.get();
    // m_open_in_progress说明有其他线程创建了TABLE_SHARE并放进table_def_cache,但此时尚未完成信息填充,于是等待创建线程填充信息完成后通过COND_open通知,然后再次进入循环查询。
    if (!share->m_open_in_progress)
      DBUG_RETURN(process_found_TABLE_SHARE(thd, share, open_view));

    DEBUG_SYNC(thd, "get_share_before_COND_open_wait");
    mysql_cond_wait(&COND_open, &LOCK_open);
  }

  /*
    申请新的TABLE_SHARE
  */
  if (!(share = alloc_TABLE_SHARE(db, table_name, key, key_length,
                                  open_secondary))) {
    DBUG_RETURN(NULL);
  }

  /*
    We assign a new table id under the protection of LOCK_open.
    We do this instead of creating a new mutex
    and using it for the sole purpose of serializing accesses to a
    static variable, we assign the table id here. We assign it to the
    share before inserting it into the table_def_cache to be really
    sure that it cannot be read from the cache without having a table
    id assigned.

    CAVEAT. This means that the table cannot be used for
    binlogging/replication purposes, unless get_TABLE_SHARE() has been
    called directly or indirectly.
  */
  assign_new_table_id(share);

	// 将TABLE_SHARE放入table_def_cache,此时另一个线程如果进入前面的循环,他就已经能读到了,但由于信息还未完成填充,所以还不能用。
  table_def_cache->emplace(to_string(share->table_cache_key),
                           unique_ptr<TABLE_SHARE, TABLE_SHARE_deleter>(share));

  /*
    We must increase ref_count prior to releasing LOCK_open
    to keep the share from being deleted in tdc_remove_table()
    and TABLE_SHARE::wait_for_old_version. We must also set
    m_open_in_progress to indicate allocated but incomplete share.
  */
  share->increment_ref_count();      // Mark in use
  share->m_open_in_progress = true;  // Mark being opened

  /*
    Temporarily release LOCK_open before opening the table definition,
    which can be done without mutex protection.
  */
  mysql_mutex_unlock(&LOCK_open);

#if defined(ENABLED_DEBUG_SYNC)
  if (!thd->is_attachable_ro_transaction_active())
    DEBUG_SYNC(thd, "get_share_before_open");
#endif

  {
    // We must make sure the schema is released and unlocked in the right order.
    dd::cache::Dictionary_client::Auto_releaser releaser(thd->dd_client());
    const dd::Schema *sch = nullptr;
    const dd::Abstract_table *abstract_table = nullptr;
    
    if (thd->dd_client()->acquire(share->db.str, &sch) ||
        thd->dd_client()->acquire(share->db.str, share->table_name.str,
                                  &abstract_table)) {
    }
    ......
    // 通过刚刚读到的abstract_table 填充TABLE_SHARE中的一些细节信息
  }

  /*
    Get back LOCK_open before continuing. Notify all waiters that the
    opening is finished, even if there was a failure while opening.
  */
  
  // 设置m_open_in_progress表示信息已完成填充,并通过COND_open通知可能在等待的其他线程。
  mysql_mutex_lock(&LOCK_open);
  share->m_open_in_progress = false;
  mysql_cond_broadcast(&COND_open);

......
  DBUG_RETURN(share);
}

以上,get_TABLE_SHARE过程中会通过thd->dd_client()->acquire()获取Table_Impl信息并保存到abstract_table中。同时TABLE_SHARE会被缓存在table_def_cache中,以避免重复构建TABLE_SHARE的开销。TABLE_SHARE为所有TABLE对象所共有,直到TABLE被删除或因开表过多需要关掉时才会释放。

bool Dictionary_client::acquire(const K &key, const T **object,
                                bool *local_committed,
                                bool *local_uncommitted) {
  ......
  
  // 首先尝试是否有线程本地的uncommited cache,一般存在于create过程中
  acquire_uncommitted(key, &uncommitted_object, &dropped);
  if (uncommitted_object || dropped) {
   ......
    return false;
  }
  
  // 本地的uncommited cache中没有时,访问本地commited cache,一般存在于当前THD正在处理表操作时
  m_registry_committed.get(key, &element);
  if (element) {
    ......
    return false;
  }

  // The element is not present locally.
  *local_committed = false;


  // 查找共享缓存或存储系统中是否有表信息
  if (Shared_dictionary_cache::instance()->get(m_thd, key, &element)) {
    DBUG_ASSERT(m_thd->is_system_thread() || m_thd->killed ||
                m_thd->is_error());
    return true;
  }

  // Add the element to the local registry and assign the output object.
  if (element) {
    ......
  }
  return false;
}


// 查找共享缓存对应dd信息类型的字典,找不到时会查找存储引擎,并保存在相应字典中。
bool Shared_dictionary_cache::get(THD *thd, const K &key,
                                  Cache_element<T> **element) {
  bool error = false;
  DBUG_ASSERT(element);
  if (m_map<T>()->get(key, element)) {
    // Handle cache miss.
    const T *new_object = NULL;
    error = get_uncached(thd, key, ISO_READ_COMMITTED, &new_object);

    // Add the new object, and assign the output element, even in the case of
    // a miss error (needed to remove the missed key).
    m_map<T>()->put(&key, new_object, element);
  }
  return error;
}


// 访问存储引擎查找dd信息
bool Shared_dictionary_cache::get_uncached(THD *thd, const K &key,
                                           enum_tx_isolation isolation,
                                           const T **object) const {
  DBUG_ASSERT(object);
  bool error = Storage_adapter::get(thd, key, isolation, false, object);
  DBUG_ASSERT(!error || thd->is_system_thread() || thd->killed ||
              thd->is_error());

  return error;
}

一般而言,在第一次访问表的Table_Impl信息时(包括create table后、宕机重启后等情况下),Table_Impl信息不在dd cache中,于是acquire函数会通过访问Storage_adapter获取并保存到Shared_multi_mapcommitted registry 中。这里还会有set_missed机制防止多线程同时访问存储引擎 前面提到的月报文章中已有介绍。之后自己或其他工作线程再需要访问dd信息时(如新的SQL需要开表或show create时),就可以从共享缓存Shared_multi_map中得到TableImpl信息。 ​

获取TableImpl并构建和缓存TABLE_SHARE后,mysql需要获取TABLE对象以完成开表操作。这一获取操作有两种方式: 1、该TABLE对象已经在thd->open_tables中,根据匹配的TABLE对象持有的锁等级选择best_table

for (table = thd->open_tables; table; table = table->next) {
	// find best_table
}

table_list->table = best_table

2、thd->open_tables中不存在时,相应TABLE对象已经在thd对应的table_cache中一定不存在used_tables中,但当free_tables中存在时:从table_cache中读取,并加入el->used_tables

table = tc->get_table(thd, key, key_length, &share);

TABLE *tableTABLE\_SHAREcache::get_table(THD *thd, const char *key, size_t key_length,
                              TABLE_SHARE **share) {
  ......
  const auto el_it = m_cache.find(key_str);
  if (el_it == m_cache.end()) return NULL;
  tableTABLE\_SHAREcache_element *el = el_it->second.get();

  if ((table = el->free_tables.front())) {
    DBUG_ASSERT(!table->in_use);

    el->free_tables.remove(table);

    el->used_tables.push_front(table);

    table->in_use = thd;
  }

  return table;
}

3、get_table发现该TABLE对象不在table_cache中,通过open_table_from_share构建对象,然后tc->add_used_table放入used table cache ​

获取到TABLE对象后,该对象也已经缓存在table_cache的used_tables中,接下来通过将thd->set_open_tables(table)对象记录在thd->open_tables中,保证前面提到的,thd->open_tables中不存在时,相应TABLE对象已经在thd对应的table_cache中一定不存在于used_tables中 这一约束成立。 ​

##清理:close_table 关表操作一般发生于完成一些操作后需要释放资源时,mysql_execute_command结束后close_thread_tables(thd);关闭所有thd->open_tables,然后清理相应的table cache:一般而言都是把table cache中in use的TABLE对象置为free,当缓存对象数过多时,也会进行清理(free_unused_tables_if_necessary)。 ​

##删除:drop table drop table时会清空前面提到的一切数据,包括清空表相关的table cache、释放TABLE_SHARE并从table_def_cache中移除、删除DD信息并清理dd信息的多级缓存等。 ​

首先是清空dd cache。相关函数为table_cache_manager::free_table。该函数会遍历所有table_cache并删除相匹配的item,同时降低TABLE_SHARE的引用计数。其调用栈如下:

#0  tableTABLE\_SHAREcache_manager::free_table
#1  0x0000000003338b39 in remove_table
#2  0x0000000003338dc0 in tdc_remove_table
#3  0x00000000034b3d32 in drop_base_table
#4  0x00000000034b523d in mysql_rm_table_no_locks
#5  0x00000000034b105f in mysql_rm_table

TABLE_SHARE的引用计数降为0(即所有TABLE对象释放完毕)后就可以释放了。 ​ tdc_remove_table函数清理完table_cache和table_def_cache后,dd::drop_table函数负责调用dd::cache::Dictionary_client::drop<dd::Table>清理dd cache和删除DD信息,主要分为两步操作: 1、Storage_adapter::drop从存储引擎删除持久化的DD信息 2、invalidate(object);将DD信息标记为已删除,这样本线程一旦再需要访问或验证成功删除时,就不会产生读穿透至存储引擎的情况。

以上,给予MySQL 8.0.13我们分析的TABLE信息和相关的DD信息的正常流程下的生命周期,至于其他流程,包括超过容量之后的淘汰、TABLE的reopen机制等,我们以后再行研究。