数据库内核月报

数据库内核月报 - 2018 / 08

MySQL · RocksDB · MemTable的写入逻辑

Author: diaoliang

简介

在之前的文章中我们知道RocksDB每一次写入,都是先写WAL,然后写Memtable,这次我们就来分析下MemTable的实现.

在RocksDB中,每个ColumnFamily都有自己的Memtable,互不影响.而在RocksDB中Memtable有多种实现(SkipList/HashSkipList/HashLinkList/Vector),具体的区别可以看这里,我们这次主要来分析默认的实现skiplist(只有skiplist是可以并发插入的).

实现

首先从创建Memtable开始,Memtable的创建(ColumnFamilyData::CreateNewMemtable)是在创建ColumnFamily(VersionSet::CreateColumnFamily)的时候创建的.这里就是创建memtable,然后设置到ColumnFamilyData的mem_域中.

MemTable* ColumnFamilyData::ConstructNewMemtable(
    const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) {
  return new MemTable(internal_comparator_, ioptions_, mutable_cf_options,
                      write_buffer_manager_, earliest_seq, id_);
}
void ColumnFamilyData::CreateNewMemtable(
    const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) {
  if (mem_ != nullptr) {
    delete mem_->Unref();
  }
  SetMemtable(ConstructNewMemtable(mutable_cf_options, earliest_seq));
  mem_->Ref();
}

上面所提及的,RocksDB有多种MemTable的实现,那么它是如何来做的呢,RocksDB通过memtable_factory来根据用户的设置来创建不同的memtable.这里要注意的是核心的memtable实现是在MemTable这个类的table_域中.

MemTable::MemTable:
      table_(ioptions.memtable_factory->CreateMemTableRep(
          comparator_, &arena_, ioptions.prefix_extractor, ioptions.info_log,
          column_family_id)),


class MemTableRepFactory {
 public:
  virtual ~MemTableRepFactory() {}

  virtual MemTableRep* CreateMemTableRep(const MemTableRep::KeyComparator&,
                                         Allocator*, const SliceTransform*,
                                         Logger* logger) = 0;
  virtual MemTableRep* CreateMemTableRep(
      const MemTableRep::KeyComparator& key_cmp, Allocator* allocator,
      const SliceTransform* slice_transform, Logger* logger,
      uint32_t /* column_family_id */) {
    return CreateMemTableRep(key_cmp, allocator, slice_transform, logger);
  }
........................

然后最后会调用对应的实现的CreateMemTableRep方法,这里我们就来看SkipList的实现.

MemTableRep* SkipListFactory::CreateMemTableRep(
    const MemTableRep::KeyComparator& compare, Allocator* allocator,
    const SliceTransform* transform, Logger* /*logger*/) {
  return new SkipListRep(compare, allocator, transform, lookahead_);
}

最终就是创建SkipListRep对象,在这个对象里面会创建SkipList(class InlineSkipList).

class SkipListRep : public MemTableRep {
  InlineSkipList<const MemTableRep::KeyComparator&> skip_list_;
...................................
public:
 explicit SkipListRep(const MemTableRep::KeyComparator& compare,
                      Allocator* allocator, const SliceTransform* transform,
                      const size_t lookahead)
     : MemTableRep(allocator),
       skip_list_(compare, allocator),
       cmp_(compare),
       transform_(transform),
       lookahead_(lookahead) {}

这里SkipList就不分析实现了,如果对这个数据结构不了解的,可以去WIKI看一下.这里我们只需要知道最终所有的memtable数据都是保存在SkipList中就可以了.

在之前的分析中我们知道Memtable的插入是通过WriteBatch然后遍历ColumnFamily来插入的,而最终则是会调用MemTable::Add这个函数.

bool MemTable::Add(SequenceNumber s, ValueType type,
                   const Slice& key, /* user key */
                   const Slice& value, bool allow_concurrent,
                   MemTablePostProcessInfo* post_process_info) {
bool res = table->InsertKeyConcurrently(handle);
    if (UNLIKELY(!res)) {
      return res;
    }
..............................
                   }

最终会调用InlineSkipList来对数据进行插入.

template <class Comparator>
bool InlineSkipList<Comparator>::InsertConcurrently(const char* key) {
  Node* prev[kMaxPossibleHeight];
  Node* next[kMaxPossibleHeight];
  Splice splice;
  splice.prev_ = prev;
  splice.next_ = next;
  return Insert<true>(key, &splice, false);
}

看到这里或许会有疑问了,那就是skiplist里面只有key,而RocksDB是一个KV存储,那么这个KV是如何存储的呢,这里是这样的,RocksDB会将KV打包成一个key传递给SkipList, 对应的KEY的结构是这样的.

  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]

而数据的格式化就在之前的MemTable::Add中实现的.

uint32_t key_size = static_cast<uint32_t>(key.size());
  uint32_t val_size = static_cast<uint32_t>(value.size());
  uint32_t internal_key_size = key_size + 8;
  const uint32_t encoded_len = VarintLength(internal_key_size) +
                               internal_key_size + VarintLength(val_size) +
                               val_size;
  char* buf = nullptr;
  std::unique_ptr<MemTableRep>& table =
      type == kTypeRangeDeletion ? range_del_table_ : table_;
  KeyHandle handle = table->Allocate(encoded_len, &buf);

  char* p = EncodeVarint32(buf, internal_key_size);
  memcpy(p, key.data(), key_size);
  Slice key_slice(p, key_size);
  p += key_size;
  uint64_t packed = PackSequenceAndType(s, type);
  EncodeFixed64(p, packed);
  p += 8;
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);

而对于真正的KEY的解析是在SkipList的Comparator中实现的(compare_).下面的代码片段可以看到会解析出来真正的key,然后再进行查找以及插入.

bool InlineSkipList<Comparator>::Insert(const char* key, Splice* splice,
                                        bool allow_partial_splice_fix) {
  Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;
  const DecodedKey key_decoded = compare_.decode_key(key);
...............................
                                        }