数据库内核月报

数据库内核月报 - 2020 / 04

MySQL · 引擎特性 · 手动分析InnoDB B+Tree结构

Author: Ming Lin

说明

本文用查找一条数据库表记录的例子来分析InnoDB B+Tree的结构

先用sysbench插入一千万条记录:

sysbench --db-driver=mysql --mysql-user=username --mysql-password=password --mysql-db=sbtest \
	--table_size=10000000 --tables=1 oltp_read_write  --mysql-host=127.0.0.1 prepare

生成的sbtest1.ibd大小为2.3G,用hexdump来查看内容

随意选一条记录,比如id=3905000,通过手动找到这条记录来分析B+Tree的结构。

mysql> select * from sbtest1 where id=3905000 \G
*************************** 1. row ***************************
 id: 3905000
  k: 7152454
  c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
pad: 93314475890-72810819110-74153294523-75348581725-15287112137
1 row in set (0.00 sec)

Page format

详细请查看: https://dev.mysql.com/doc/internals/en/innodb-page-overview.html

Name Size(bytes)
File Header 38
Page Header 56
Infimum + Supremum Records 16
User Records 不定
Free Space 不定
Page Directory 不定
Fil Trailer 8

查看B+Tree root page

page 3是root page,先用hexdump查看file header

page 3的offset是3*16*1024 = 49152,file header size是38 bytes

$hexdump -s 49152 -n 38 -C sbtest1.ibd
0000c000  58 0f 5c bd 00 00 00 03  ff ff ff ff ff ff ff ff  |X.\.............|
0000c010  00 00 00 00 99 4a f4 5a  45 bf 00 00 00 00 00 00  |.....J.ZE.......|
0000c020  00 00 00 00 00 20                                 |..... |

page type是0x45bf,表示B+Tree节点

再看page header,offset是3161024 + 38 = 49190,page header size是56 bytes

$hexdump -s 49190 -n 56 sbtest1.ibd
0000c026  00 1d 06 4f 80 75 00 00  00 00 06 47 00 02 00 72  |...O.u.....G...r|
0000c036  00 73 00 00 00 00 00 00  00 00 00 02 00 00 00 00  |.s..............|
0000c046  00 00 00 38 00 00 00 20  00 00 00 02 00 f2 00 00  |...8... ........|
0000c056  00 20 00 00 00 02 00 32                           |. .....2|

page directory slots数目为29(0x001d)

page level为2(0x0002),这个是根节点。叶子节点level总是为0,所以这棵B+Tree深度为3。

root page directory

每个slot占2 bytes,29个slot共58 bytes。 所以root page directory offset为 3(161024) - 58 - 8(trailer) = 65470

$hexdump -s 65470 -n 58 -C sbtest1.ibd
0000ffbe  00 70 05 ec 05 b8 05 84  05 50 05 1c 04 e8 04 b4  |.p.......P......|
0000ffce  04 80 04 4c 04 18 03 e4  03 b0 03 7c 03 48 03 14  |...L.......|.H..|
0000ffde  02 e0 02 ac 02 78 02 44  02 10 01 dc 01 a8 01 74  |.....x.D.......t|
0000ffee  01 40 01 0c 00 d8 00 a4  00 63                    |.@.......c|

page directory按primary key逆序排列,0x0063是infimum record offset,0x0070是supremum record offset

$hexdump -s 49246 -n 12 -C sbtest1.ibd
0000c05e  01 00 02 00 1a 69 6e 66  69 6d 75 6d              |.....infimum|

“01 00 02 00 1a” 是record header,最后1个byte 0x1a表示next record offset

3161024+0x63+0x1a=49277,找到第1条记录

$hexdump -s 49277 -n 8 sbtest1.ibd
0000c07d  80 00 00 01 00 00 00 24                           |.......$|
slot offset record(hex) primary key page no
0 0x0063 69 6e 66 69 6d 75 6d 00 1 36
1 0x00a4 80 03 59 53 00 00 00 27 219475 39
2 0x00d8 80 08 b5 7f 00 00 00 2b 570751 43
3 0x010c 80 0e 11 ab 00 00 00 2f 922027 47
4 0x0140 80 13 6d d7 00 00 00 33 1273303 51
5 0x0174 80 18 ca 03 00 00 00 37 1624579 55
6 0x01a8 80 1e 26 2f 00 00 00 3b 1975855 59
7 0x01dc 80 23 82 5b 00 00 00 3f 2327131 63
8 0x0210 80 28 de 87 00 00 90 00 2678407 36864
9 0x0244 80 2e 3a b3 00 00 90 04 3029683 36868
10 0x0278 80 33 96 df 00 00 90 08 3380959 36872
11 0x02ac 80 38 f3 0b 00 00 90 0c 3732235 36876
12 0x02e0 80 3e 4f 37 00 00 90 10 4083511 36880
13 0x0314 80 43 ab 63 00 00 90 14 4434787 36884
14 0x0348 80 49 07 8f 00 00 90 18 4786063 36888
15 0x037c 80 4e 63 bb 00 00 90 1c 5137339 36892
16 0x03b0 80 53 bf e7 00 00 90 20 5488615 36896
17 0x03e4 80 59 1c 13 00 00 90 24 5839891 36900
18 0x0418 80 5e 78 3f 00 00 90 28 6191167 36904
19 0x044c 80 63 d4 6b 00 00 90 2c 6542443 36908
20 0x0480 80 69 30 97 00 00 90 30 6893719 36912
21 0x04b4 80 6e 8c c3 00 00 90 34 7244995 36916
22 0x04e8 80 73 e8 ef 00 00 90 38 7596271 36920
23 0x051c 80 79 45 1b 00 00 90 3c 7947547 36924
24 0x0550 80 7e a1 47 00 01 be 00 8298823 114176
25 0x0584 80 83 fd 73 00 01 be 04 8650099 114180
26 0x05b8 80 89 59 9f 00 01 be 08 9001375 114184
27 0x05ec 80 8e b5 cb 00 01 be 0c 9352651 114188
28 0x0070 73 75 70 72 65 6d 75 6d    

我们要找的记录primary key为3905000,在以上29个slots中做binary search,发现可能在slot 11和slot 12之间 slot 11的primary key为3732235,并不是我们要找的3905000,需要继续在slot 11的record list中查找。

查看page 0 slot 11的record list

record header共5个bytes,第5个byte记录了next record的相对于current record的offset, 即next record offset = current record offset + current record header的第5个byte 非叶子节点record body为8 bytes

offset record(hex) primary key page no
684 80 38 f3 0b 00 00 90 0c 3732235 36876
697 80 3a 4a 16 00 00 90 0d 3820054 36877
710 80 3b a1 21 00 00 90 0e 3907873 36878
723 80 3c f8 2c 00 00 90 0f 3995692 36879
736 80 3e 4f 37 00 00 90 10 4083511 36880
1529 80 90 0c d6 00 01 be 0d 9440470 114189
1542 80 91 63 e1 00 01 be 0e 9528289 114190
1555 80 92 ba ec 00 01 be 0f 9616108 114191
1568 80 94 11 f7 00 01 be 10 9703927 114192
1581 80 95 69 02 00 01 be 11 9791746 114193
1594 80 96 c0 0d 00 01 be 12 9879565 114194
1607 80 98 17 18 00 01 be 13 9967384 114195

到此发现我们要找的记录(primary key 3905000)可能在offset 697所指向的page 36877

查看page 36877

先看file header, offset = 36877 * (16*1024) = 604192768, size为38 bytes

$hexdump -s 604192768 -n 38 -C sbtest1.ibd
24034000  d1 d6 61 2a 00 00 90 0d  00 00 90 0c 00 00 90 0e  |..a*............|
24034010  00 00 00 00 3b ff 8b 03  45 bf 00 00 00 00 00 00  |....;...E.......|
24034020  00 00 00 00 00 20                                 |..... |

page type是0x45bf,确认了是B+Tree节点

再看page header, offset = 36877 * (16*1024) + 38 = 604192806, size为56 bytes

$hexdump -s 604192806 -n 56 -C sbtest1.ibd
24034026  01 2d 3d 8f 84 b5 00 00  00 00 3d 87 00 02 04 b2  |.-=.......=.....|
24034036  04 b3 00 00 00 00 00 00  00 00 00 01 00 00 00 00  |................|
24034046  00 00 00 38 00 00 00 00  00 00 00 00 00 00 00 00  |...8............|
24034056  00 00 00 00 00 00 00 00                           |........|

page directory slots数目为301(0x012d)

page level为1(0x0001),还不是叶子节点。

再看page diretory 每个slot占2 bytes,301个slot共602 bytes。 所以page directory offset为 36877(161024) - 602 - 8(trailer) = 604208542

$hexdump -s 604208542 -n 602 -C sbtest1.ibd
24037d9e  00 70 3d 2c 3c f8 3c c4  3c 90 3c 5c 3c 28 3b f4  |.p=,<.<.<.<\<(;.|
24037dae  3b c0 3b 8c 3b 58 3b 24  3a f0 3a bc 3a 88 3a 54  |;.;.;X;$:.:.:.:T|
24037dbe  3a 20 39 ec 39 b8 39 84  39 50 39 1c 38 e8 38 b4  |: 9.9.9.9P9.8.8.|
24037dce  38 80 38 4c 38 18 37 e4  37 b0 37 7c 37 48 37 14  |8.8L8.7.7.7|7H7.|
24037dde  36 e0 36 ac 36 78 36 44  36 10 35 dc 35 a8 35 74  |6.6.6x6D6.5.5.5t|
24037dee  35 40 35 0c 34 d8 34 a4  34 70 34 3c 34 08 33 d4  |5@5.4.4.4p4<4.3.|
24037dfe  33 a0 33 6c 33 38 33 04  32 d0 32 9c 32 68 32 34  |3.3l383.2.2.2h24|
24037e0e  32 00 31 cc 31 98 31 64  31 30 30 fc 30 c8 30 94  |2.1.1.1d100.0.0.|
24037e1e  30 60 30 2c 2f f8 2f c4  2f 90 2f 5c 2f 28 2e f4  |0`0,/./././\/(..|
24037e2e  2e c0 2e 8c 2e 58 2e 24  2d f0 2d bc 2d 88 2d 54  |.....X.$-.-.-.-T|
24037e3e  2d 20 2c ec 2c b8 2c 84  2c 50 2c 1c 2b e8 2b b4  |- ,.,.,.,P,.+.+.|
24037e4e  2b 80 2b 4c 2b 18 2a e4  2a b0 2a 7c 2a 48 2a 14  |+.+L+.*.*.*|*H*.|
24037e5e  29 e0 29 ac 29 78 29 44  29 10 28 dc 28 a8 28 74  |).).)x)D).(.(.(t|
24037e6e  28 40 28 0c 27 d8 27 a4  27 70 27 3c 27 08 26 d4  |(@(.'.'.'p'<'.&.|
24037e7e  26 a0 26 6c 26 38 26 04  25 d0 25 9c 25 68 25 34  |&.&l&8&.%.%.%h%4|
24037e8e  25 00 24 cc 24 98 24 64  24 30 23 fc 23 c8 23 94  |%.$.$.$d$0#.#.#.|
24037e9e  23 60 23 2c 22 f8 22 c4  22 90 22 5c 22 28 21 f4  |#`#,"."."."\"(!.|
24037eae  21 c0 21 8c 21 58 21 24  20 f0 20 bc 20 88 20 54  |!.!.!X!$ . . . T|
24037ebe  20 20 1f ec 1f b8 1f 84  1f 50 1f 1c 1e e8 1e b4  |  .......P......|
24037ece  1e 80 1e 4c 1e 18 1d e4  1d b0 1d 7c 1d 48 1d 14  |...L.......|.H..|
24037ede  1c e0 1c ac 1c 78 1c 44  1c 10 1b dc 1b a8 1b 74  |.....x.D.......t|
24037eee  1b 40 1b 0c 1a d8 1a a4  1a 70 1a 3c 1a 08 19 d4  |.@.......p.<....|
24037efe  19 a0 19 6c 19 38 19 04  18 d0 18 9c 18 68 18 34  |...l.8.......h.4|
24037f0e  18 00 17 cc 17 98 17 64  17 30 16 fc 16 c8 16 94  |.......d.0......|
24037f1e  16 60 16 2c 15 f8 15 c4  15 90 15 5c 15 28 14 f4  |.`.,.......\.(..|
24037f2e  14 c0 14 8c 14 58 14 24  13 f0 13 bc 13 88 13 54  |.....X.$.......T|
24037f3e  13 20 12 ec 12 b8 12 84  12 50 12 1c 11 e8 11 b4  |. .......P......|
24037f4e  11 80 11 4c 11 18 10 e4  10 b0 10 7c 10 48 10 14  |...L.......|.H..|
24037f5e  0f e0 0f ac 0f 78 0f 44  0f 10 0e dc 0e a8 0e 74  |.....x.D.......t|
24037f6e  0e 40 0e 0c 0d d8 0d a4  0d 70 0d 3c 0d 08 0c d4  |.@.......p.<....|
24037f7e  0c a0 0c 6c 0c 38 0c 04  0b d0 0b 9c 0b 68 0b 34  |...l.8.......h.4|
24037f8e  0b 00 0a cc 0a 98 0a 64  0a 30 09 fc 09 c8 09 94  |.......d.0......|
24037f9e  09 60 09 2c 08 f8 08 c4  08 90 08 5c 08 28 07 f4  |.`.,.......\.(..|
24037fae  07 c0 07 8c 07 58 07 24  06 f0 06 bc 06 88 06 54  |.....X.$.......T|
24037fbe  06 20 05 ec 05 b8 05 84  05 50 05 1c 04 e8 04 b4  |. .......P......|
24037fce  04 80 04 4c 04 18 03 e4  03 b0 03 7c 03 48 03 14  |...L.......|.H..|
24037fde  02 e0 02 ac 02 78 02 44  02 10 01 dc 01 a8 01 74  |.....x.D.......t|
24037fee  01 40 01 0c 00 d8 00 a4  00 63                    |.@.......c|

解析page directory

slot offset record(hex) primary key page no
1 0x00a4 80 3a 4a f1 00 00 cd 8d 3820273 52621
2 0x00d8 80 3a 4c 15 00 00 cd 91 3820565 52625
3 0x010c 80 3a 4d 39 00 00 cd 95 3820857 52629
4 0x0140 80 3a 4e 5d 00 00 cd 99 3821149 52633
5 0x0174 80 3a 4f 81 00 00 cd 9d 3821441 52637
6 0x01a8 80 3a 50 a5 00 00 cd a1 3821733 52641
7 0x01dc 80 3a 51 c9 00 00 cd a5 3822025 52645
8 0x0210 80 3a 52 ed 00 00 cd a9 3822317 52649
9 0x0244 80 3a 54 11 00 00 cd ad 3822609 52653
10 0x0278 80 3a 55 35 00 00 cd b1 3822901 52657
288 0x3af0 80 3b 92 4d 00 00 d2 09 3904077 53769
289 0x3b24 80 3b 93 71 00 00 d2 0d 3904369 53773
290 0x3b58 80 3b 94 95 00 00 d2 11 3904661 53777
291 0x3b8c 80 3b 95 b9 00 00 d2 15 3904953 53781
292 0x3bc0 80 3b 96 dd 00 00 d2 19 3905245 53785
293 0x3bf4 80 3b 98 01 00 00 d2 1d 3905537 53789
294 0x3c28 80 3b 99 25 00 00 d2 21 3905829 53793
295 0x3c5c 80 3b 9a 49 00 00 d2 25 3906121 53797
296 0x3c90 80 3b 9b 6d 00 00 d2 29 3906413 53801
297 0x3cc4 80 3b 9c 91 00 00 d2 2d 3906705 53805
298 0x3cf8 80 3b 9d b5 00 00 d2 31 3906997 53809
299 0x3d2c 80 3b 9e d9 00 00 d2 35 3907289 53813

在以上slots中做binary search,发现我们要找的记录(primary key 3905000)可能在slot 291和slot 292之间 继续查看slot 291的第1个record所指向的page 53781

查看page 53781

先看file header, offset = 53781 * (16*1024) = 881147904, size为38 bytes

$hexdump -s 881147904 -n 38 -C sbtest1.ibd
34854000  09 d0 e4 a7 00 00 d2 15  00 00 d2 14 00 00 d2 16  |................|
34854010  00 00 00 00 3b f4 ac 4b  45 bf 00 00 00 00 00 00  |....;..KE.......|
34854020  00 00 00 00 00 20                                 |..... |

page type是0x45bf,确认了是B+Tree节点

再看page header, offset = 53781 * (16*1024) + 38 = 881147942, size为56 bytes

$hexdump -s 881147942 -n 56 -C sbtest1.ibd
34854026  00 13 3b c8 80 4b 00 00  00 00 3a ff 00 02 00 48  |..;..K....:....H|
34854036  00 49 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |.I..............|
34854046  00 00 00 38 00 00 00 00  00 00 00 00 00 00 00 00  |...8............|
34854056  00 00 00 00 00 00 00 00                           |........|

page directory slots数目为19(0x0013)

page level为0(0x0000),这是叶子节点了,是真正存数据的page

再看page diretory 每个slot占2 bytes,19个slot共38 bytes。 所以page directory offset为 53781(161024) - 38 - 8(trailer) = 881164242

$hexdump -s 881164242 -n 38 -C sbtest1.ibd
34857fd2  00 70 36 ef 33 af 30 6f  2d 2f 29 ef 26 af 23 6f  |.p6.3.0o-/).&.#o|
34857fe2  20 2f 1c ef 19 af 16 6f  13 2f 0f ef 0c af 09 6f  | /.....o./.....o|
34857ff2  06 2f 02 ef 00 63                                 |./...c|

解析page directory

slot | offset | record(hex) | primary key | page no
----- | ------ | ------ | -------- | --------
1 | 0x02ef | 80 3b 95 bc 00 00 00 00 | 3904956 | 0
2 | 0x062f | 80 3b 95 c0 00 00 00 00 | 3904960 | 0
3 | 0x096f | 80 3b 95 c4 00 00 00 00 | 3904964 | 0
4 | 0x0caf | 80 3b 95 c8 00 00 00 00 | 3904968 | 0
5 | 0x0fef | 80 3b 95 cc 00 00 00 00 | 3904972 | 0
6 | 0x132f | 80 3b 95 d0 00 00 00 00 | 3904976 | 0
7 | 0x166f | 80 3b 95 d4 00 00 00 00 | 3904980 | 0
8 | 0x19af | 80 3b 95 d8 00 00 00 00 | 3904984 | 0
9 | 0x1cef | 80 3b 95 dc 00 00 00 00 | 3904988 | 0
10 | 0x202f | 80 3b 95 e0 00 00 00 00 | 3904992 | 0
11 | 0x236f | 80 3b 95 e4 00 00 00 00 | 3904996 | 0
12 | 0x26af | 80 3b 95 e8 00 00 00 00 | 3905000 | 0
13 | 0x29ef | 80 3b 95 ec 00 00 00 00 | 3905004 | 0
14 | 0x2d2f | 80 3b 95 f0 00 00 00 00 | 3905008 | 0
15 | 0x306f | 80 3b 95 f4 00 00 00 00 | 3905012 | 0
16 | 0x33af | 80 3b 95 f8 00 00 00 00 | 3905016 | 0
17 | 0x36ef | 80 3b 95 fc 00 00 00 00 | 3905020 | 0

OK,在slot 12找到了primary key 3905000

查看primary key 3905000的记录

mysql> select * from sbtest1 where id=3905000 \G
*************************** 1. row ***************************
 id: 3905000
  k: 7152454
  c: 33061989913-01978996152-96897051302-66804054532-36658200903-75265952777-90162670547-62113775555-84037309450-68725639441
pad: 93314475890-72810819110-74153294523-75348581725-15287112137
1 row in set (0.00 sec)

打印256 bytes验证下是不是我们要找的记录 offset为 53781(161024)+0x26af=881157807 id = 3905000 = 0x3b95e8 k = 7152454 = 0x6d2346 剩下的是c和pad的字符串,内容是对的。

hexdump -s 881157807 -n 256 -C /home/ming.lin/one_key_env_57/run_primary/dbs/testdb/sbtest1.ibd
348566af  80 3b 95 e8 00 00 00 00  0e d3 cb 00 00 00 0f 24  |.;.............$|
348566bf  ec 80 6d 23 46 33 33 30  36 31 39 38 39 39 31 33  |..m#F33061989913|
348566cf  2d 30 31 39 37 38 39 39  36 31 35 32 2d 39 36 38  |-01978996152-968|
348566df  39 37 30 35 31 33 30 32  2d 36 36 38 30 34 30 35  |97051302-6680405|
348566ef  34 35 33 32 2d 33 36 36  35 38 32 30 30 39 30 33  |4532-36658200903|
348566ff  2d 37 35 32 36 35 39 35  32 37 37 37 2d 39 30 31  |-75265952777-901|
3485670f  36 32 36 37 30 35 34 37  2d 36 32 31 31 33 37 37  |62670547-6211377|
3485671f  35 35 35 35 2d 38 34 30  33 37 33 30 39 34 35 30  |5555-84037309450|
3485672f  2d 36 38 37 32 35 36 33  39 34 34 31 20 39 33 33  |-68725639441 933|
3485673f  31 34 34 37 35 38 39 30  2d 37 32 38 31 30 38 31  |14475890-7281081|
3485674f  39 31 31 30 2d 37 34 31  35 33 32 39 34 35 32 33  |9110-74153294523|
3485675f  2d 37 35 33 34 38 35 38  31 37 32 35 2d 31 35 32  |-75348581725-152|
3485676f  38 37 31 31 32 31 33 37  20 3c 78 00 01 90 00 d0  |87112137 <x.....|
3485677f  80 3b 95 e9 00 00 00 00  0e d3 cb 00 00 00 0f 24  |.;.............$|
3485678f  f9 80 57 82 ca 37 38 33  32 36 37 32 37 31 32 39  |..W..78326727129|
3485679f  2d 32 30 30 39 36 39 37  37 37 38 30 2d 38 34 31  |-20096977780-841

至此我们手动找到了primary key为3905000的记录。

小结

page directory很关键,先在page directory中做binary search,定位到某个slot,再遍历slot上的record list