[MDEV-32014] When binlogging enabled, committing a large transaction will freeze all other transactions until completed Created: 2023-08-25  Updated: 2024-01-24

Status: Confirmed
Project: MariaDB Server
Component/s: Replication, Server
Affects Version/s: 10.6.14
Fix Version/s: 10.6, 10.11, 11.0, 11.1

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Andrei Elkin
Resolution: Unresolved Votes: 7
Labels: None

Attachments: PNG File screenshot-1.png    
Issue Links:
Blocks
is blocked by MDEV-32020 XA transaction replicates incorrectly... In Progress

 Description   

When a large transaction is committed (e.g. an INSERT INTO ... SELECT) and binlogging is enabled, the entire workload is frozen until the large transaction has finished committing.

This was noticed on a production server where running
insert into <empty table> select * from <unused table>
created an outage by locking the entire workload for a long time.

At first glance, this seems due to the large transaction holding an exclusive lock on the binary log for writing; hence, no other transaction is able to proceed until it finishes.

Even in the absence of large transactions, this seems like quite a performance and scalability killer.

To reproduce, on a server with binlogging enabled, create a large table, then insert into a new table selecting from it. When the INSERT ... SELECT begins committing, you will see that no other DML is able to proceed.
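
The suspected serialization can be illustrated with a small simulation (Python, not server code; the lock merely stands in for the binlog's LOCK_log mutex, and all timings are illustrative):

```python
import threading
import time

binlog_lock = threading.Lock()  # stands in for MYSQL_BIN_LOG::LOCK_log
events = []

def commit(name, write_seconds):
    # Writing the transaction cache to the binlog happens under one
    # exclusive lock, so a large cache blocks every other committer.
    with binlog_lock:
        events.append((name, "start"))
        time.sleep(write_seconds)   # simulate writing the trx cache
        events.append((name, "end"))

big = threading.Thread(target=commit, args=("INSERT..SELECT", 0.5))
big.start()
time.sleep(0.1)                     # let the big commit take the lock

t0 = time.time()
commit("small INSERT", 0.01)        # this is the "frozen" small trx
waited = time.time() - t0
big.join()

print(f"small commit waited ~{waited:.2f}s for the big one")
```

The small commit's latency is dominated by the big transaction's write time, which is the stall described in this report.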

Confirmed on both community and enterprise 10.6 (info below is from the enterprise version).

Processlist showing a 2nd transaction hanging:

MariaDB [(none)]> show processlist;
+-----+------+-----------+------+---------+------+----------+----------------------------------------------------+----------+
| Id  | User | Host      | db   | Command | Time | State    | Info                                               | Progress |
+-----+------+-----------+------+---------+------+----------+----------------------------------------------------+----------+
| 142 | root | localhost | rick | Query   |  165 | Commit   | insert into bucket (junk) select junk from bucket1 |    0.000 |
| 234 | root | localhost | NULL | Query   |    0 | starting | show processlist                                   |    0.000 |
| 349 | root | localhost | NULL | Query   |   24 | Commit   | insert into rick.junk values()                     |    0.000 |
+-----+------+-----------+------+---------+------+----------+----------------------------------------------------+----------+
3 rows in set (0.000 sec)

Relevant threads at the time of the commit:

Thread 15 (Thread 0x7f08c807b700 (LWP 375122)):
#0  0x00007f08ccc5d82d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f08ccc56ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055fd5509d1d7 in inline_mysql_mutex_lock (src_file=0x55fd556e7a00 "/home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/sql/log.cc", src_line=8282, that=0x55fd55ddd2d0 <mysql_bin_log+16>) at /usr/src/debug/MariaDB-/src_0/include/mysql/psi/mysql_thread.h:752
#3  MYSQL_BIN_LOG::trx_group_commit_leader (this=0x55fd55ddd2c0 <mysql_bin_log>, leader=0x7f08c80791b0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:8282
#4  0x000055fd5509d924 in MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x55fd55ddd2c0 <mysql_bin_log>, entry=0x7f08c80791b0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:8145
#5  0x000055fd5509de52 in MYSQL_BIN_LOG::write_transaction_to_binlog (this=0x55fd55ddd2c0 <mysql_bin_log>, thd=0x7f0870000c58, cache_mngr=0x7f087002d958, end_ev=<optimized out>, all=<optimized out>, using_stmt_cache=<optimized out>, using_trx_cache=true, is_ro_1pc=false) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:7742
#6  0x000055fd5509f2b7 in binlog_flush_cache (is_ro_1pc=false, using_trx=true, using_stmt=true, all=false, end_ev=0x7f08c80792d0, cache_mngr=0x7f087002d958, thd=0x7f0870000c58) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:1780
#7  binlog_commit_flush_xid_caches (xid=295, all=false, cache_mngr=0x7f087002d958, thd=0x7f0870000c58) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:1930
#8  MYSQL_BIN_LOG::log_and_order (this=<optimized out>, thd=0x7f0870000c58, xid=295, all=<optimized out>, need_prepare_ordered=<optimized out>, need_commit_ordered=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:10334
#9  0x000055fd54f9277e in ha_commit_trans (thd=thd@entry=0x7f0870000c58, all=all@entry=false) at /usr/src/debug/MariaDB-/src_0/sql/handler.cc:1926
#10 0x000055fd54e78f13 in trans_commit_stmt (thd=0x7f0870000c58) at /usr/src/debug/MariaDB-/src_0/sql/transaction.cc:472
#11 0x000055fd54d3f729 in mysql_execute_command(THD*, bool) () at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:6090
#12 0x000055fd54d44b91 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:8063
#13 0x000055fd54d47c75 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) () at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:1391
#14 0x000055fd54d49ae2 in do_command (thd=0x7f0870000c58, blocking=blocking@entry=true) at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:1416
#15 0x000055fd54e68437 in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at /usr/src/debug/MariaDB-/src_0/sql/sql_connect.cc:1416
#16 0x000055fd54e6877d in handle_one_connection (arg=arg@entry=0x55fd56efb188) at /usr/src/debug/MariaDB-/src_0/sql/sql_connect.cc:1318
#17 0x000055fd551d1842 in pfs_spawn_thread (arg=0x55fd56e6f6f8) at /usr/src/debug/MariaDB-/src_0/storage/perfschema/pfs.cc:2201
#18 0x00007f08ccc541cf in start_thread () from /lib64/libpthread.so.0
#19 0x00007f08cbfa4e73 in clone () from /lib64/libc.so.6
 
Thread 14 (Thread 0x7f08c80c6700 (LWP 375105)):
#0  0x00007f08ccc5da17 in write () from /lib64/libpthread.so.0
#1  0x000055fd554c814d in my_write (Filedes=Filedes@entry=23, Buffer=Buffer@entry=0x55fd56d6b6d8 "567890123456789012345678901234567890123456789012345678901234567890123456789\370\023\323k", Count=Count@entry=65042, MyFlags=MyFlags@entry=564) at /usr/src/debug/MariaDB-/src_0/mysys/my_write.c:49
#2  0x000055fd554abe17 in inline_mysql_file_write (src_file=0x55fd5577a0b8 "/home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/mysys/mf_iocache.c", flags=<optimized out>, count=65042, buffer=0x55fd56d6b6d8 "567890123456789012345678901234567890123456789012345678901234567890123456789\370\023\323k", file=23, src_line=1526) at /usr/src/debug/MariaDB-/src_0/include/mysql/psi/mysql_file.h:1176
#3  _my_b_cache_write (info=0x55fd55ddd5e0 <mysql_bin_log+800>, Buffer=0x55fd56d6b6d8 "567890123456789012345678901234567890123456789012345678901234567890123456789\370\023\323k", Count=65042) at /usr/src/debug/MariaDB-/src_0/mysys/mf_iocache.c:1526
#4  0x000055fd554ada82 in my_b_flush_io_cache (info=0x55fd55ddd5e0 <mysql_bin_log+800>, need_append_buffer_lock=0) at /usr/src/debug/MariaDB-/src_0/mysys/mf_iocache.c:1727
#5  0x000055fd554ae242 in _my_b_write (info=0x55fd55ddd5e0 <mysql_bin_log+800>, Buffer=0x7f086c06e90a "67890123456789012345678901234567890123456789012345678901234567890123456789\370F\324k", Count=3239) at /usr/src/debug/MariaDB-/src_0/mysys/mf_iocache.c:559
#6  0x000055fd550ad506 in Log_event_writer::write_internal (this=0x7f08c80c3fd0, pos=<optimized out>, len=8047) at /usr/src/debug/MariaDB-/src_0/sql/log_event_server.cc:820
#7  0x000055fd55092cae in CacheWriter::write (len=8047, pos=<optimized out>, this=0x7f08c80c3fd0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:7316
#8  MYSQL_BIN_LOG::write_cache (this=<optimized out>, thd=<optimized out>, cache=0x7f086c05e4d0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:7480
#9  0x000055fd5509341d in MYSQL_BIN_LOG::write_transaction_or_stmt (this=this@entry=0x55fd55ddd2c0 <mysql_bin_log>, entry=entry@entry=0x7f08c80c41b0, commit_id=commit_id@entry=0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:8633
#10 0x000055fd5509d310 in MYSQL_BIN_LOG::trx_group_commit_leader (this=0x55fd55ddd2c0 <mysql_bin_log>, leader=0x7f08c80c41b0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:8353
#11 0x000055fd5509d924 in MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x55fd55ddd2c0 <mysql_bin_log>, entry=0x7f08c80c41b0) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:8145
#12 0x000055fd5509de52 in MYSQL_BIN_LOG::write_transaction_to_binlog (this=0x55fd55ddd2c0 <mysql_bin_log>, thd=0x7f086c000c58, cache_mngr=0x7f086c05e3a8, end_ev=<optimized out>, all=<optimized out>, using_stmt_cache=<optimized out>, using_trx_cache=true, is_ro_1pc=false) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:7742
#13 0x000055fd5509f2b7 in binlog_flush_cache (is_ro_1pc=false, using_trx=true, using_stmt=true, all=false, end_ev=0x7f08c80c42d0, cache_mngr=0x7f086c05e3a8, thd=0x7f086c000c58) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:1780
#14 binlog_commit_flush_xid_caches (xid=14, all=false, cache_mngr=0x7f086c05e3a8, thd=0x7f086c000c58) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:1930
#15 MYSQL_BIN_LOG::log_and_order (this=<optimized out>, thd=0x7f086c000c58, xid=14, all=<optimized out>, need_prepare_ordered=<optimized out>, need_commit_ordered=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/log.cc:10334
#16 0x000055fd54f9277e in ha_commit_trans (thd=thd@entry=0x7f086c000c58, all=all@entry=false) at /usr/src/debug/MariaDB-/src_0/sql/handler.cc:1926
#17 0x000055fd54e78f13 in trans_commit_stmt (thd=0x7f086c000c58) at /usr/src/debug/MariaDB-/src_0/sql/transaction.cc:472
#18 0x000055fd54d3f729 in mysql_execute_command(THD*, bool) () at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:6090
#19 0x000055fd54d44b91 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:8063
#20 0x000055fd54d47c75 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) () at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:1391
#21 0x000055fd54d49ae2 in do_command (thd=0x7f086c000c58, blocking=blocking@entry=true) at /usr/src/debug/MariaDB-/src_0/sql/sql_parse.cc:1416
#22 0x000055fd54e68437 in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at /usr/src/debug/MariaDB-/src_0/sql/sql_connect.cc:1416
#23 0x000055fd54e6877d in handle_one_connection (arg=arg@entry=0x55fd56edcbb8) at /usr/src/debug/MariaDB-/src_0/sql/sql_connect.cc:1318
#24 0x000055fd551d1842 in pfs_spawn_thread (arg=0x55fd56e6f1b8) at /usr/src/debug/MariaDB-/src_0/storage/perfschema/pfs.cc:2201
#25 0x00007f08ccc541cf in start_thread () from /lib64/libpthread.so.0
#26 0x00007f08cbfa4e73 in clone () from /lib64/libc.so.6



 Comments   
Comment by Andrei Elkin [ 2023-08-31 ]

rpizzi salve!
Of the two thread stacks provided, the 2nd does not really hang at

#0  0x00007f08ccc5da17 in write () from /lib64/libpthread.so.0

while most probably the 1st just waits for it (which is normal).

So this is not conclusive; hopefully you can provide all thread stacks.

Cheers!

Comment by Andrei Elkin [ 2023-08-31 ]

I've not tried reproducing it yet. 5 mins sounds pretty encouraging! Thanks for pointing it out!
In the dependency I assumed there is no ordering implied. It could just as well be that 'the 2nd' waits for 'the 1st' (as it is supposed to).

Comment by Andrei Elkin [ 2023-08-31 ]

Roel, could you please try to confirm this description? The (possibly long-running, big-table) SELECT could just sleep for a little time per record, e.g. SELECT sleep(1), a-col-from-table, to emulate the big table.

Comment by Roel Van de Paar [ 2023-09-02 ]

rpizzi Can you provide a testcase please? I tried the following without success:

# mysqld options required for replay:  --log_bin
-- session 1
CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB; 
INSERT INTO t1 SELECT 0 FROM seq_1_to_10000000;  # ~27 Seconds
INSERT INTO t1 SELECT 0 FROM seq_1_to_15000000;  # While this statement is processing (~40 sec runtime), open session 2 and execute:
-- session 2
CREATE TABLE t2 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB; 
INSERT INTO t2 SELECT * FROM t1;
-- session 3
SHOW PROCESSLIST; 

On both optimized and debug builds of CS, in 10.6 and in 11.3. I also tested ES, 10.6.14, optimized.
The SHOW PROCESSLIST only shows 'Sending data' state.
I also tried using a transaction:

# mysqld options required for replay:  --log_bin
-- session 1
CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB; 
INSERT INTO t1 SELECT 0 FROM seq_1_to_10000000;  # ~27 Seconds, wait
-- session 2
CREATE TABLE t2 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB; 
BEGIN;
INSERT INTO t2 SELECT * FROM t1;  # While this statement is processing (~27 sec runtime), go to session 1 and execute below SQL
COMMIT;  
-- session 1 
INSERT INTO t1 SELECT 0 FROM seq_1_to_10000000;  # ~27 Seconds, check processlist in session 3:
-- session 3  # Check processlist regularly
SHOW PROCESSLIST; 

However, the commit is immediate and the INSERT again only shows a 'Sending data' state.
I never saw a 'Commit' state (only 'Sending data' for one or two threads), so an essential element must be missing.

Comment by Roel Van de Paar [ 2023-09-12 ]

rpizzi In my case the INSERT returns immediately, irrespective of what order I follow:

10.6.16 961b96a5e0dd40512b8fff77dcec273187ccc9fd (Debug)

10.6.16-dbg>INSERT INTO t2 VALUES();
Query OK, 1 row affected (0.010 sec)

I also tested with 10.6.14, dbg, commit c271057288f71746d1816824f338f2d9c47f67c1, with the same result.

Can you please provide a detailed step-by-step testcase listing all steps required, as well as the exact commit (or binary link) you tested with? Thank you

Comment by Rick Pizzi [ 2023-09-12 ]

Maybe I was not clear enough in explaining the issue.
If you run the INSERT INTO t2 while the INSERT ... SELECT is still running, it will return immediately; the issue only happens while the INSERT ... SELECT is committing, locking the binlog in the process. So you have to wait for the commit moment.

You can use this bash one-liner to continuously insert into t2 at 1-second interval:

while true; do echo "select now() as started; insert into t2 values (); select now() as completed;" ; sleep 1; done | mysql -A 

You will see that when the INSERT ... SELECT commits, the INSERT INTO t2 will freeze until the commit completes.
I just retested in my lab and it froze for about 5 seconds.

The insert... select:

MariaDB [rick]> select now() as started; insert into bucket select * from bucket1; select now() as completed;
+---------------------+
| started             |
+---------------------+
| 2023-09-12 09:51:44 |
+---------------------+
1 row in set (0.000 sec)
 
Query OK, 15632316 rows affected (2 min 32.083 sec)
Records: 15632316  Duplicates: 0  Warnings: 0
 
+---------------------+
| completed           |
+---------------------+
| 2023-09-12 09:54:16 |
+---------------------+
1 row in set (0.000 sec)

The insert into t2 (bash one liner above running in different window):

started
2023-09-12 09:54:09
completed
2023-09-12 09:54:09
started
2023-09-12 09:54:10
completed
2023-09-12 09:54:10
started
2023-09-12 09:54:11
completed
2023-09-12 09:54:16

Comment by Rick Pizzi [ 2023-09-12 ]

see comment

Comment by Rick Pizzi [ 2023-09-19 ]

We just had another occurrence of this bug.
Not only does it freeze other DMLs, but also any binlog-related command such as SHOW MASTER STATUS.

Comment by Marko Mäkelä [ 2023-10-02 ]

For the record, MDEV-27774 allows multiple InnoDB mtr_t::commit() to concurrently write to the shared log_sys.buf, which will then be durably written to the persistent log. Already before this change, MDEV-21534 implemented a “group commit” logic that would optimize pwrite() and fdatasync() from the log_sys.buf. I think that the binlog writes should borrow some ideas from InnoDB. I also think that we should seriously consider implementing an option where the binlog is being stored in InnoDB tables.

Comment by Rick Pizzi [ 2023-10-02 ]

I'm all for this idea. This is what Xpand is doing already.

Comment by Andrei Elkin [ 2023-10-02 ]

In the description we have only two thread stacks, one of which is writing to the binlog as a group-commit leader. The other is waiting, apparently as a member (the leader?) of a following group, and apparently at

mysql_mutex_lock(&LOCK_log);

rpizzi, I think we need SHOW CREATE TABLE in addition to the bash scripts that run on it.
Then I'll have a look.
As to the 45166 seconds of Commit, it's a pity you did not have an all-threads backtrace snapshot of the app.
Without it that SHOW PROCESSLIST report is all but meaningless.

serg, but why MDEV-32020? I did not see the piecewise keyword at first, sorry. Now I see what you meant.

Comment by Rick Pizzi [ 2023-10-02 ]

Elkin, yes, the other thread is waiting. That is what this bug is about: we are serializing commits due to binlog access. This is a scalability killer.
In the case of large transactions committing, it will also cause outages.

Comment by Andrei Elkin [ 2023-10-02 ]

rpizzi, serg: another method to improve this is to write to the binlog in parallel (maybe that is what marko means by allowing multiple InnoDB mtr_t::commit()). I mean a few group-commit leaders could write at a time; a binlog reader (SHOW BINLOG EVENTS included) would see only the total work done, or a valid old state of it.
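
A toy sketch of that idea (Python; the class, names, and structure are my own illustration, not MariaDB code): each committer reserves space in the log under a short mutex, copies its payload with no lock held, and readers only ever see a fully written prefix:

```python
import threading

class ParallelBinlog:
    """Toy model: reserve space under a short lock, write the payload
    outside it, and expose only a contiguous fully-written prefix."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.lock = threading.Lock()
        self.reserved = 0          # next free offset
        self.done = []             # (start, end) of finished writes
        self.visible = 0           # readers may see buf[:visible]

    def reserve(self, n):
        with self.lock:            # held only to bump a counter
            start = self.reserved
            self.reserved += n
        return start

    def write(self, payload):
        start = self.reserve(len(payload))
        self.buf[start:start + len(payload)] = payload  # no lock held
        with self.lock:
            self.done.append((start, start + len(payload)))
            self.done.sort()
            # advance the watermark over a contiguous finished prefix
            while self.done and self.done[0][0] == self.visible:
                self.visible = self.done.pop(0)[1]

log = ParallelBinlog(64)
threads = [threading.Thread(target=log.write, args=(p,))
           for p in (b"AAAA", b"BB", b"CCCCCC")]
for t in threads: t.start()
for t in threads: t.join()
print(bytes(log.buf[:log.visible]))
```

The mutex is held only to bump a counter and to advance the watermark, so a large payload no longer blocks other writers for the duration of its copy. Recovery and reader-consistency concerns (mentioned later in this thread) are exactly what such a design has to work out.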

Comment by Andrei Elkin [ 2023-10-02 ]

marko, as to storing the binlog in InnoDB tables, there is Brandon's benchmarking exposing a rather great cost when it is done the table way.

For this bug, parallel write to the binlog is the first thing that comes to my mind.

Comment by Marko Mäkelä [ 2023-10-02 ]

Elkin, I only remember a performance test where something resembling binlog events was being written to a single InnoDB table. That would cause obvious bottlenecks on InnoDB locks and buffer page latches. To be scalable, the binlog would obviously have to be written into a number of table partitions or shards.

Comment by Roel Van de Paar [ 2023-10-14 ]

rpizzi I tried again using the information provided, and I remain unable to reproduce this on my end. I tested 10.6.16(opt) and 11.3.0(dbg).
Here is the testcase I used:

mariadb CLI:

# mariadbd options required for replay:  --log_bin
CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB; 
CREATE TABLE t2 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB;
CREATE TABLE t3 (c1 INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (c1)) AUTO_INCREMENT=1 ENGINE=InnoDB;    
INSERT INTO t1 SELECT 0 FROM seq_1_to_15000000;  # ~40 Seconds
INSERT INTO t3 SELECT * FROM t1;  # ~2-5 Minutes. While this statement is processing, execute in a Bash CLI:

Bash CLI:

sudo apt install bc
while true; do START="$(date +'%s.%N')"; echo "INSERT INTO t2 VALUES ();" ; echo "Duration in sec.ns: $(echo "$(date +'%s.%N') - ${START}" | bc) " >&2; sleep 1; done | ./bin/mariadb -A -uroot -S./socket.sock test
# The output will be in seconds.nanoseconds. When the result is <1 second, the output will be .nanosecond only. 

The duration of the repeated INSERT INTO t2 never went above ~0.0065 seconds (i.e. readout: .006502037), including at and after the end of the INSERT INTO t3, even though the INSERT INTO t3 took 5 minutes.

If you still run into this, please provide a detailed testcase with all steps required. Please ensure to use a fresh MariaDB server instance which is not otherwise busy with any other work, is not running monyog and has a just-created data directory. Please also provide the exact version and commit used and any configuration parameters set. Thank you

Comment by Andrei Elkin [ 2023-10-14 ]

The case is determined to have been caused by the design: serialization of writes to the binlog. A big transaction's write can take a long time, delaying the next transaction's write and commit.
The parallel write mentioned above needs a completed design that addresses all recovery concerns before/if we commit to its implementation.

Comment by Andrei Elkin [ 2023-10-14 ]

Roel, rpizzi, thanks guys for discussing it. This helped me to understand what it is and where to go.

Comment by Rick Pizzi [ 2023-10-14 ]

Elkin thank you for acknowledging that there's a design problem here.

Comment by VAROQUI Stephane [ 2023-10-20 ]

To reproduce with the employees DB:

mysql -hdb2.bench.svc.cloud18 -uroot -pmariadb employees -e'
  drop table if exists stop;
  create table stop(id serial) engine=innodb;
  drop table if exists salaries_copy;
  create table salaries_copy like salaries;
  SET binlog_format="ROW";
  BEGIN;
  insert into salaries_copy select * from salaries;
  insert into stop select null;
  COMMIT;' &

while true; do
  mysql -B -N -hdb2.bench.svc.cloud18 -uroot -pmariadb employees < update.sql > res.log &
  test=$(head -n 1 res.log | grep 1)
  if [ "$test" = "1" ]; then break; fi
  sleep 0.2
  mysql -hdb2.bench.svc.cloud18 -uroot -pmariadb -B -N \
    -e 'select * from information_schema.processlist where INFO like "update salaries%" and TIME_MS > 500' >> ps.log
done

using update.sql

set @stop:=0;
select @stop:=id from stop; 
execute immediate if(@stop=1, 'select @stop' , 'update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"')

cat ps.log

13549	root	10.8.0.50:50977	employees	Query	2	Waiting for semi-sync ACK from slave	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	2986.002	0	0.000	53336	53336	1	931364	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	164257

Comment by VAROQUI Stephane [ 2023-10-20 ]

MariaDB [(none)]> select version();
+-----------------------------------------+
| version()                               |
+-----------------------------------------+
| 10.8.0-MariaDB-1:10.8.0+maria~focal-log |
+-----------------------------------------+
1 row in set (0,008 sec)

Given that execution of the small trx is conditional on the commit of the long trx, the binlog should not have been written in this trx order.

long trx

###   @3='2000:11:29' /* DATE meta=0 nullable=0 is_null=0 */
###   @4='2001:11:29' /* DATE meta=0 nullable=0 is_null=0 */
### INSERT INTO `employees`.`salaries_copy`
### SET
###   @1=499999 /* INT meta=0 nullable=0 is_null=0 */
###   @2=77303 /* INT meta=0 nullable=0 is_null=0 */
###   @3='2001:11:29' /* DATE meta=0 nullable=0 is_null=0 */
###   @4='9999:01:01' /* DATE meta=0 nullable=0 is_null=0 */
# Number of rows: 2844047
# at 42858560
# at 42858612
#231020 16:05:00 server id 52428799  end_log_pos 42858612 CRC32 0xec4cf7b6 	Annotate_rows:
#Q> insert  into stop select null
#231020 16:05:00 server id 52428799  end_log_pos 42858664 CRC32 0xee1dd467 	Table_map: `employees`.`stop` mapped to number 137
# at 42858664
#231020 16:05:00 server id 52428799  end_log_pos 42858706 CRC32 0x4ff8e487 	Write_rows: table id 137 flags: STMT_END_F
### INSERT INTO `employees`.`stop`
### SET
###   @1=1 /* LONGINT meta=0 nullable=0 is_null=0 */
# Number of rows: 1
# at 42858706
#231020 16:05:00 server id 52428799  end_log_pos 42858737 CRC32 0xe24798ba 	Xid = 935778
COMMIT/*!*/;

The concurrent small trx, which should go before according to commit order under READ COMMITTED, as the EXECUTE IMMEDIATE only runs the update until the insert into the stop table from the previous long trx becomes visible:

# at 42858737
#231020 16:05:00 server id 52428799  end_log_pos 42858779 CRC32 0x88bf42c6 	GTID 1-52428799-2877 trans
/*!100001 SET @@session.gtid_seq_no=2877*//*!*/;
START TRANSACTION
/*!*/;
# at 42858779
# at 42858908
#231020 16:05:00 server id 52428799  end_log_pos 42858908 CRC32 0x30bae1ef 	Annotate_rows:
#Q> update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"
#231020 16:05:00 server id 52428799  end_log_pos 42858967 CRC32 0x6cc1b3f0 	Table_map: `employees`.`salaries` mapped to number 38
# at 42858967
#231020 16:05:00 server id 52428799  end_log_pos 42859031 CRC32 0xbb7752d5 	Update_rows: table id 38 flags: STMT_END_F
### UPDATE `employees`.`salaries`
### WHERE
###   @1=10001 /* INT meta=0 nullable=0 is_null=0 */
###   @2=69119 /* INT meta=0 nullable=0 is_null=0 */
###   @3='1989:06:25' /* DATE meta=0 nullable=0 is_null=0 */
###   @4='1990:06:25' /* DATE meta=0 nullable=0 is_null=0 */
### SET
###   @1=10001 /* INT meta=0 nullable=0 is_null=0 */
###   @2=69120 /* INT meta=0 nullable=0 is_null=0 */
###   @3='1989:06:25' /* DATE meta=0 nullable=0 is_null=0 */
###   @4='1990:06:25' /* DATE meta=0 nullable=0 is_null=0 */
# Number of rows: 1
# at 42859031
#231020 16:05:00 server id 52428799  end_log_pos 42859062 CRC32 0x43f13db9 	Xid = 936235
COMMIT/*!*/;
DELIMITER ;

Comment by VAROQUI Stephane [ 2023-10-20 ]

Reproduced with latest 10.11.

115	root	10.8.0.50:52981	employees	Query	0	Waiting for semi-sync ACK from slave	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	612.957	0	0	0.000	111008	111008	1	342	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	0
115	root	10.8.0.50:52981	employees	Query	0	Waiting for semi-sync ACK from slave	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	908.703	0	0	0.000	111008	111008	1	342	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	0
115	root	10.8.0.50:52981	employees	Query	1	Waiting for semi-sync ACK from slave	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	1675.116	0	0.000	111008	111008	1	342	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	0
115	root	10.8.0.50:52981	employees	Query	2	Waiting for semi-sync ACK from slave	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	2048.023	0	0.000	111008	111008	1	342	update salaries set salary=salary+1 where emp_no=10001 and from_date="1989-06-25" and to_date="1990-06-25"	0

Comment by VAROQUI Stephane [ 2023-10-20 ]

Could not reproduce in MIXED mode

Comment by Kristian Nielsen [ 2023-10-20 ]

stephane@skysql.com , I don't understand why it is wrong that the short transaction is binlogged after the long one.
On the contrary, it looks like the long transaction must come first, since the short transaction can only run after the stop.id gets set to 1 by the long-running transaction?

Comment by VAROQUI Stephane [ 2023-10-20 ]

Because the long trx is still running while the short one executes, the short one should not appear after the big one's commit; see select @stop:=id from stop; where stop is triggered by the big trx.

Comment by VAROQUI Stephane [ 2023-10-20 ]

@*Kristian Nielsen* you raised a good point. A long transaction TRX1 enters commit while multiple small ones, running concurrently, began before TRX1 reached its commit point; they are then blocked until the end of TRX1's commit inside the binlog. This only makes sense in READ COMMITTED and STATEMENT-based replication. In ROW-based replication, the result of the INSERT ... SELECT is self-contained inside the row-based view of TRX1, so it could be interleaved with the domain that modifies the read view: at commit time of TRX1 the read view is already persisted inside the binlog cache and no longer depends on records changed by the small trx, regardless of whether other transactions committed after or before TRX1.

Comment by VAROQUI Stephane [ 2023-10-20 ]

Thanks, I now understand why artificially changing the commit order in the binary log file could be a bad idea for fixing this issue, and that it could be fixed in some future release based on some protocol extension.

Comment by Kristian Nielsen [ 2024-01-03 ]

In general, transactions that modify a large number of rows and binlog in row-mode will cause stalls for other (also small) concurrent transactions. This is because of the limitation in the current binlog format, where the entire transaction is binlogged as a single continuous event group that cannot be interleaved with other transactions.

This limitation would be good to eliminate in new evolutions of the binlog format, eg. when we integrate the binlog into the storage layer of engines such as innodb. The new binlog format should be able to partially binlog row events from transactions exceeding a certain size. Then at commit, most of the event data will have been already binlogged, and there will not be a large stall for concurrent commits. A disadvantage will be that large rollbacks will then incur a cost in binlog size.
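
The partial-binlogging idea above can be sketched as follows (Python; a toy model, not actual binlog code — THRESHOLD, Trx, and the event strings are all hypothetical):

```python
THRESHOLD = 1000  # flush the trx cache to the binlog once it exceeds this

class Trx:
    def __init__(self, binlog):
        self.binlog = binlog
        self.cache = []
        self.cached_bytes = 0
        self.spilled = False

    def row_event(self, ev):
        self.cache.append(ev)
        self.cached_bytes += len(ev)
        if self.cached_bytes > THRESHOLD:
            # pre-flush row events outside the commit path
            self.binlog.extend(self.cache)
            self.cache.clear()
            self.cached_bytes = 0
            self.spilled = True

    def commit(self):
        # only the remaining tail is written under the (serialized)
        # commit lock -- the stall no longer grows with trx size
        under_lock = self.cached_bytes
        self.binlog.extend(self.cache + ["COMMIT"])
        self.cache.clear()
        return under_lock

    def rollback(self):
        # the cost: already-flushed events stay in the binlog and must
        # be annulled with an explicit rollback record
        if self.spilled:
            self.binlog.append("ROLLBACK")
        self.cache.clear()

binlog = []
big = Trx(binlog)
for _ in range(50):
    big.row_event("x" * 100)      # 5000 bytes of row events
tail = big.commit()
print(tail)                        # bytes written under the commit lock
```

The amount written at commit is bounded by the threshold rather than the transaction size, which is the point of the proposal; the rollback branch shows the binlog-size cost mentioned above.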

Comment by VAROQUI Stephane [ 2024-01-03 ]

My proposal was to introduce a double binlog rotation every time a transaction overflows binlog_cache_size. The new big transaction gets fully logged in binlog +1, marked as a prepared step; the following transactions get persisted in binlog +2, which will also store a commit step for the prepared big transaction. One direct bonus of this is that any sysadmin running ls on the binlog directory can instantly identify that some big transactions have been running in the system, and try to improve the workload.

I can't count the number of times I have identified a crash or restart of the server just by looking at the binlog directory, without having to dig into the error logs.

Comment by Libing Song [ 2024-01-24 ]

We already implemented a rename solution on MySQL 8.0. We are looking forward to merging the patch into MariaDB. I am discussing the solution with Andrei to check whether it is suitable for MariaDB.

Comment by Andrei Elkin [ 2024-01-24 ]

libing, thanks for this piece! Indeed we should import this method!

Generated at Thu Feb 08 10:28:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.