[MDEV-33067]  SCN(Sequence Commit Number) based MVCC Created: 2023-12-19  Updated: 2024-01-08

Status: Open
Project: MariaDB Server
Component/s: None
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: zhangyuan Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MDEV-33066 Tencent requested features Open
Relates
relates to MDEV-12288 Reset DB_TRX_ID when the history is r... Closed
relates to MDEV-15104 Remove trx_sys_t::rw_trx_ids and trx_... Closed
relates to MDEV-15158 On commit, do not write to the TRX_SY... Closed
relates to MDEV-17598 InnoDB index option for per-record tr... Open
relates to MDEV-20630 lf_hash get performance regression si... Confirmed
relates to MDEV-21423 lock-free trx_sys get performance reg... Stalled

 Description   

SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN.

origin commit from zhai weixiang: https://bugs.mysql.com/bug.php?id=109599



 Comments   
Comment by Marko Mäkelä [ 2023-12-19 ]

In MariaDB Server 10.3, the MVCC implementation as well as the data structures for transactions were changed in MDEV-12288, MDEV-15104, and MDEV-15158.

Which types of operations is this expected to improve and how? I think that currently the largest problem with MVCC are secondary indexes. That would be fixed by MDEV-17598.

We already have a monotonically increasing sequence number that is stored in each data page at FIL_PAGE_LSN. I have previously encountered the SCN (System Change Number) in Oracle documentation, and I was under the impression that it corresponds to the InnoDB Log Sequence Number (LSN).

Comment by Yin Peng [ 2023-12-20 ]

I have created a pull request on github: https://github.com/MariaDB/server/pull/2955

The purpose of this patch is to reduce the overhead of creating a snapshot(readview) by avoiding the traversal of active transaction IDs. In our testing scenario, we observed an approximately 5% improvement in QPS for read-write workload. However, there was a decrease of approximately 3% in the write workload.

Comment by Marko Mäkelä [ 2024-01-02 ]

zhangyuan, thank you. I posted some initial review comments.

I would like to know more about the benchmarks that you ran. The read performance bug could be what we already know as MDEV-20630 and MDEV-21423. I have not studied the lock-free hash table code in deep enough detail to be able to do something to it. There could be some low-hanging fruit in that code, such as replacing some of the sequential consistent memory ordering with release-acquire.

Comment by Yin Peng [ 2024-01-03 ]

Hello Marko, thanks for your reply. Our test configurations are:

sysbench

tables=250
table_size=25000

mariadb server

max_connections=2048
max_prepared_stmt_count=1638200000
loose-innodb_data_file_path=ibdata2:12M:autoextend
loose-innodb_buffer_pool_size=16G
loose-innodb_write_io_threads=12
loose-innodb_read_io_threads=12
loose-innodb_log_buffer_size=1G
loose-innodb_log_file_size=1G
loose-innodb_log_files_in_group=4
default-storage-engine=innodb
loose-sync_binlog=0
loose-innodb_flush_log_at_trx_commit=0
loose-innodb_use_scn=ON

The result of oltp_read_only:

threads 11.4 11.4 + scn ON percentage
256 790449 795916 0.69%
512 811825 817796 0.74%

The result of oltp_read_write:

threads 11.4 11.4 + scn ON percentage
128 678023 680049 0.30%
256 760962 774024 1.72%
512 747245 762342 2.02%
1024 618596 682468 10.33%

The result of oltp_write_only:

threads 11.4 11.4 + scn ON percentage
32 397994 395770 -0.56%
64 454040 456447 0.53%
96 428168 424995 -0.74%
128 398931 399046 0.03%
256 333296 335572 0.68%

The result of oltp_update_index:

threads 11.4 11.4 + scn ON percentage
32 208738 211640 1.39%
64 195723 191863 -1.97%
96 183197 184787 0.87%

We also conducted some mixed tests using two instances of sysbench.
oltp_update_non_index(128 threads) + oltp_point_select(128 threads):

bench_type 11.4 11.4 + scn ON percentage
oltp_update_non_index 263333 264891 0.59%
oltp_point_select 269005 557427 107.22%

oltp_update_non_index(256 threads) + oltp_point_select(256 threads):

bench_type 11.4 11.4 + scn ON percentage
oltp_update_non_index 229164 229081 -0.04%
oltp_point_select 247809 621881 150.95%

oltp_write_only(128 threads) + oltp_read_only(128 threads):

bench_type 11.4 11.4 + scn ON percentage
oltp_write_only 391662 356476 -8.98%
oltp_read_only 387675 383129 -1.17%

oltp_write_only(128 threads) + oltp_read_only(128 threads):

bench_type 11.4 11.4 + scn ON percentage
oltp_write_only 342268 322135 -5.88%
oltp_read_only 393421 391040 -0.61%

Note: All tests were conducted with the patch in MDEV-19749 applied.

I will make the modifications according to your review comments and then do the performance testing again.

Comment by Marko Mäkelä [ 2024-01-04 ]

Thank you, zhangyuan. For the oltp_read_only test, if the history list was empty at the start of the test, all DB_TRX_ID should be 0 and any transaction ID lookups should be avoided. In this type of scenario I would expect no big difference in throughput.

I have filed https://github.com/akopytov/sysbench/pull/419 for enabling the use of MDEV-515 and MDEV-24621 (and disabling row-level undo logging) during sysbench prepare. It could be useful to apply that patch, to keep the history list small. Starting with MDEV-12288 in MariaDB Server 10.3, also INSERT operations will result in a purge of history (resetting the DB_TRX_ID).

It would be good to specify the exact commit hash when not testing a tagged release. The pull request appears to be based on 11.4 as of this commit, which does not include a merge of MDEV-32050. The current head of 11.4 should be more interesting to compare. The last big InnoDB change was MDEV-29694 in 11.0; since then we had some fairly small changes such as MDEV-14795 and MDEV-32452. There have been no InnoDB changes in the 11.3 or 11.4 branches.

Comment by Yin Peng [ 2024-01-08 ]

Hello Marko, I have made some modifications according to your review comments. Here are some answers to the questions mentioned in the pull request.

1 How older versions would react to undo logs that exist in this format?
Older versions can't not recognize the undo log generated WITH innodb_use_scn=ON.

2. Did all test suites pass when running ./mtr -mysqld=-loose-innodb-use-scn=on?
Mostly test cases passed with -mysqld=-loose-innodb-use-scn=on except for: innodb.instant_alter_debug, innodb.temp_truncate, innodb.sys_truncate_debug and innodb.sys_truncate_shutdown_debug
I don't think these failed test cases are significant, as there is a reported mismatch:

@@ -29,7 +29,7 @@
SET GLOBAL INNODB_TRUNCATE_TEMPORARY_TABLESPACE_NOW= 1;
SELECT NAME, FILE_SIZE FROM INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES WHERE SPACE= 4294967294;
NAME FILE_SIZE
-innodb_temporary 7340032
+innodb_temporary 8388608
connection default;
COMMIT;

3. What is the test suite execution time with both values of the parameter?
innodb_use_scn=OFF: Spent 13604.633 of 2496 seconds executing testcases
innodb_use_scn=ON: Spent 13438.079 of 2480 seconds executing testcases

4. Has this been tested with Galera or with replication? Do all nodes have to use the same value of the parameter? How does the SCN affect checks for implicit locks?
All replication test cases have passed. We did not test it with Galera, but in theory, SCN does not have any side effects on logical replication. It is not necessary for different nodes to use the same parameter. Additionally, SCN does not affect implicit locks. If a transaction is active, the DB_TRX_ID field in the record inserted by the transaction represents the transaction's ID. Write-back operation occures only when the transaction has been committed.

5. All builders (including ASAN, MSAN, UBSAN) need to pass all tests in both modes.
When building the current version 11.4 (commit id: 3fad2b115569864d8c1b7ea90ce92aa895cfef08) with options -DWITH_ASAN=ON, -DWITH_MSAN=ON, -DWITH_UBSAN=ON, the mtr encountered a failure during bootstrap:

==6377==ERROR: AddressSanitizer: use-after-poison on address 0x629000034e60 at pc 0x0000016cbcff bp 0x7ffe8a43a9a0 sp 0x7ffe8a43a990
READ of size 8 at 0x629000034e60 thread T0
#0 0x16cbcfe in MYSQL_DML_DONE /data/mariadb-server/sql/sql_select.cc:33425
#1 0x16d0650 in Sql_cmd_dml::execute(THD*) /data//mariadb-server/sql/sql_select.cc:33599
#2 0x1230496 in mysql_execute_command(THD*, bool) /data/mariadb-server/sql/sql_parse.cc:4375
#3 0x1275b37 in mysql_parse(THD*, char*, unsigned int, Parser_state*) /data/mariadb-server/sql/sql_parse.cc:7798
#4 0x11ea5fe in bootstrap(st_mysql_file*) /data//mariadb-server/sql/sql_parse.cc:1080
#5 0x83ddbf in mysqld_main(int, char**) /data/mariadb-server/sql/mysqld.cc:5925
#6 0x818531 in main /data/mariadb-server/sql/main.cc:34
#7 0x7f26c89c8554 in __libc_start_main (/lib64/libc.so.6+0x22554)
#8 0x81846e (/data/mariadb-server/bld-debug-asan/sql/mariadbd+0x81846e)

0x629000034e60 is located 11360 bytes inside of 16516-byte region [0x629000032200,0x629000036284)
allocated by thread T0 here:
#0 0x7f26caa44747 in malloc (/lib64/libasan.so.6+0xb4747)
#1 0x566da15 in sf_malloc /data/mariadb-server/mysys/safemalloc.c:126
#2 0x561e402 in my_malloc /data/mariadb-server/mysys/my_malloc.c:93
#3 0x55e666d in root_alloc /data/mariadb-server/mysys/my_alloc.c:66
#4 0x55e828c in reset_root_defaults /data/mariadb-server/mysys/my_alloc.c:244
#5 0xe410ab in THD::init_for_queries() /data/mariadb-server/sql/sql_class.cc:1394
#6 0x11e8d83 in bootstrap(st_mysql_file*) /data/mariadb-server/sql/sql_parse.cc:1006
#7 0x83ddbf in mysqld_main(int, char**) /data/mariadb-server/sql/mysqld.cc:5925
#8 0x818531 in main /data/mariadb-server/sql/main.cc:34
#9 0x7f26c89c8554 in __libc_start_main (/lib64/libc.so.6+0x22554)

6. Could we have a cmake option for enabling this feature, similar to PLUGIN_PERFSCHEMA, WITH_WSREP, or WITH_INNODB_AHI?
We added a cmake option: WITH_INNODB_SCN.

I have rebased the code to current 11.4(commit id: 7ee16b1e2978d6b73377d9a5bf7d9fd83c841f92), but following tests are done based on commit 3fad2b115569864d8c1b7ea90ce92aa895cfef08.

The test result of oltp_point_select:

threads 11.4 11.4 + scn on percentage 11.4 + scn off percentage
128 1007549 1000366 -0.71% 984842 -2.25%
256 1177827 1162645 -1.29% 1110086 -5.75%
512 1223943 1210108 -1.13% 1164133 -4.89%

The test result of oltp_read_only:

threads 11.4 11.4 + scn on percentage 11.4 + scn off percentage
256 800743 789337 -1.42% 787678 -1.63%
512 821005 819189 -0.22% 801275 -2.40%

The test result of oltp_read_write:

threads 11.4 11.4 + scn on percentage 11.4 + scn off percentage
128 660901 651862 -1.37% 661919 0.15%
256 724236 719448 -0.66% 729603 0.74%
512 708273 708818 0.08% 703146 -0.72%
1024 584583 627046 7.26% 567876 -2.86%

The test result of oltp_write_only:

threads 11.4 11.4 + scn on percentage 11.4 + scn off percentage
32 328882 327872 -0.31% 300172 -8.73%
64 284478 282361 -0.74% 270617 -4.87%
96 271194 270448 -0.28% 256589 -5.39%
128 265477 264196 -0.48% 250707 -5.56%
256 249800 249494 -0.12% 236870 -5.18%

The test result of oltp_update_index:

threads 11.4 11.4 + scn on percentage 11.4 + scn off percentage
32 146077 139539 -4.48% 137842 -5.64%
64 122567 121918 -0.53% 118597 -3.24%
96 119691 119429 -0.22% 112644 -5.89%

oltp_update_non_index(128 threads) + oltp_point_select(128 threads):

bench_type 11.4 11.4 + scn on percentage 11.4 + scn off percentage
oltp_update_non_index 201938 188100 -6.85% 177045 -12.33%
oltp_point_select 196865 575019 192.09% 216122 9.78%

oltp_update_non_index(256 threads) + oltp_point_select(256 threads):

bench_type 11.4 11.4 + scn on percentage 11.4 + scn off percentage
oltp_update_non_index 174668 181753 4.06% 162980 -6.69%
oltp_point_select 169829 648777 282.02% 279503 64.58%

oltp_write_only(128 threads) + oltp_read_only(128 threads):

bench_type 11.4 11.4 + scn on percentage 11.4 + scn off percentage
oltp_write_only 174668 247804 -7.19% 254360 -4.74%
oltp_read_only 169829 373375 5.93% 371413 5.37%

oltp_write_only(128 threads) + oltp_read_only(128 threads):

bench_type 11.4 11.4 + scn on percentage 11.4 + scn off percentage
oltp_write_only 244829 227786 -6.96% 237613 -2.95%
oltp_read_only 363598 397488 9.32% 393230 8.15%

According to these tests for mariadb, SCN doesn't make much sense for most sysbench scenarios. However, it does show promising improvements for the oltp_point_select workload when combined with other write operations.

Generated at Thu Feb 08 10:36:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.