[MDEV-32014] When binlogging enabled, committing a large transaction will freeze all other transactions until completed Created: 2023-08-25 Updated: 2024-01-24 |
|
| Status: | Confirmed |
| Project: | MariaDB Server |
| Component/s: | Replication, Server |
| Affects Version/s: | 10.6.14 |
| Fix Version/s: | 10.6, 10.11, 11.0, 11.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Rick Pizzi | Assignee: | Andrei Elkin |
| Resolution: | Unresolved | Votes: | 7 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
When a large transaction is committed (e.g. an INSERT INTO ... SELECT) and binlogging is enabled, the entire workload is frozen until the large transaction has finished committing. This was noticed on a production server. At first glance, this seems to be due to the large transaction holding an exclusive lock on the binary log for writing; hence, no other transaction is able to proceed until it finishes. Even in the absence of large transactions, this seems like quite a performance and scalability killer. To reproduce, on a server with binlogging enabled, just create a large table, then try to insert into a new table selecting from this one. When the INSERT ... SELECT begins committing, you will see that no other DML is able to proceed. Confirmed on both community and enterprise 10.6 (info below is from the enterprise version). Processlist showing a 2nd transaction hanging:
Relevant threads at the time of the commit:
|
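The stall described above can be illustrated with a toy simulation (plain Python, not MariaDB code; all names here are made up): a single lock stands in for the binlog's exclusive writer, so a commit that takes a long time to write queues every later committer behind it.

```python
import threading
import time

binlog_lock = threading.Lock()   # models the exclusive binlog write lock
binlog = []                      # models the binary log file
stalls = {}                      # txn name -> seconds spent waiting for the lock

def commit(name, events, write_cost_per_event):
    """Write all of a transaction's events under the binlog lock (toy model)."""
    t0 = time.monotonic()
    with binlog_lock:
        stalls[name] = time.monotonic() - t0   # time spent queued behind others
        for i in range(events):
            time.sleep(write_cost_per_event)   # models writing one event
            binlog.append((name, i))

big = threading.Thread(target=commit, args=("big_insert_select", 50, 0.01))
big.start()
time.sleep(0.05)  # the small txn tries to commit while the big one is writing
small = threading.Thread(target=commit, args=("small_insert", 1, 0.01))
small.start()
big.join()
small.join()

# The small transaction stalled for roughly the remainder of the big
# transaction's binlog write, even though its own write is tiny.
print(f"small txn stalled {stalls['small_insert']:.2f}s behind the big one")
```

The small transaction's own write costs ~10 ms, yet its commit latency is dominated by the big transaction's write, which matches the behavior seen in the processlist above.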
| Comments |
| Comment by Andrei Elkin [ 2023-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rpizzi salve!
while most probably the 1st just waits for it (which is normal). So this is not descriptive enough; hopefully you can come up with all thread stacks. Cheers!
| Comment by Andrei Elkin [ 2023-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've not tried reproducing it yet. 5 mins sounds pretty encouraging! Thanks for pointing it out!
| Comment by Andrei Elkin [ 2023-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel, could you please confirm this description? To emulate the big table, the (possibly long-running, because of the big table) SELECT can just sleep for a little time per record, e.g. SELECT sleep(1), a-col-from-table.
| Comment by Roel Van de Paar [ 2023-09-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rpizzi Can you provide a testcase please? I tried the following without success:
On both optimized and debug builds of CS, in 10.6 and in 11.3. I also tested ES, 10.6.14, optimized.
However, the commit is immediate and the INSERT again only shows a 'Sending data' state. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2023-09-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rpizzi In my case the INSERT returns immediately, irrespective of what order I follow:
I also tested with 10.6.14, dbg, commit c271057288f71746d1816824f338f2d9c47f67c1, with the same result. Can you please provide a detailed step-by-step testcase listing all steps required, as well as the exact commit (or binary link) you tested with? Thank you | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-09-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Maybe I was not clear enough in explaining the issue. You can use this bash one-liner to continuously insert into t2 at a 1-second interval:
You will see that when the insert... select commits, the insert into t2 will freeze until commit completes. The insert... select:
The insert into t2 (bash one liner above running in different window):
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-09-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
see comment | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-09-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We just had another occurrence of this bug. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
For the record, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm all in for this idea. This is what xpand is doing already. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In the description we have only two thread states, one of which is writing to the binlog as a group-commit leader. The other is waiting, apparently as a member (the leader?) of a following group, and apparently at
rpizzi, I think we need SHOW CREATE TABLE in addition to the bash scripts that run on it. serg,
| Comment by Rick Pizzi [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Elkin, yes, the other thread is waiting. This is what this bug is about: we are serializing commits due to binlog access. This is a scalability killer.
| Comment by Andrei Elkin [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rpizzi, serg: another method to improve this is to write to the binlog in parallel (maybe that is what marko means by allowing multiple InnoDB mtr_t::commit()). I mean, a few group-commit leaders could write at a time; a binlog reader (SHOW BINLOG EVENTS included) would see only the total work done, or a valid old state of it.
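One way to read this parallel-write suggestion (a toy sketch of mine, not MariaDB code; all names are made up): each group-commit leader atomically reserves a byte range in the log under a short lock, writers then fill their ranges concurrently without the lock, and readers only see the prefix up to the first range that is not yet fully written, so they always observe "the total work done or a valid old state".

```python
import threading

class ParallelBinlog:
    """Toy model: writers reserve disjoint ranges, then fill them in parallel.
    Readers see only the prefix where every reserved range is complete."""
    def __init__(self):
        self.lock = threading.Lock()   # guards reservation bookkeeping only
        self.buf = bytearray()
        self.pending = []              # [start, end) ranges not yet complete

    def reserve(self, size):
        with self.lock:                # short critical section: bump the offset
            start = len(self.buf)
            self.buf.extend(b"\x00" * size)
            rng = [start, start + size]
            self.pending.append(rng)
            return start, rng

    def write(self, start, data, rng):
        self.buf[start:start + len(data)] = data   # no lock: range is exclusive
        with self.lock:
            self.pending.remove(rng)

    def readable_prefix(self):
        with self.lock:                # readers stop at the first unfinished range
            end = min((r[0] for r in self.pending), default=len(self.buf))
            return bytes(self.buf[:end])

log = ParallelBinlog()
s1, r1 = log.reserve(5)               # leader 1 reserves the first range
s2, r2 = log.reserve(5)               # leader 2 reserves right behind it
log.write(s2, b"txn-2", r2)           # leader 2 finishes first...
assert log.readable_prefix() == b""   # ...but readers wait for leader 1
log.write(s1, b"txn-1", r1)
assert log.readable_prefix() == b"txn-1txn-2"
```

The design choice here is that only the offset reservation is serialized, so the long payload writes (the expensive part for a big transaction) no longer block one another.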
| Comment by Andrei Elkin [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko, as to storing the binlog in InnoDB tables, there is benchmarking by Brandon exposing a rather great cost when it is done the table way. For this bug, parallel writes to the binlog are the first thing that comes to my mind.
| Comment by Marko Mäkelä [ 2023-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Elkin, I only remember a performance test where something resembling binlog events was being written to a single InnoDB table. That would cause obvious bottlenecks on InnoDB locks and buffer page latches. To be scalable, the binlog would obviously have to be written into a number of table partitions or shards. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2023-10-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rpizzi I tried again using the information provided, and I remain unable to reproduce this on my end. I tested 10.6.16(opt) and 11.3.0(dbg). mariadb CLI:
Bash CLI:
The output of the final INSERT INTO t3 never went above ~0.0065 seconds (i.e. readout: .006502037), including at and after the end of the INSERT INTO t3, even though that INSERT INTO t3 took 5 minutes. If you still run into this, please provide a detailed testcase listing all steps required. Please make sure to use a fresh MariaDB server instance which is not otherwise busy with any other work, is not running monyog, and has a just-created data directory. Please also provide the exact version and commit used, and any configuration parameters set. Thank you
| Comment by Andrei Elkin [ 2023-10-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The case is determined to be caused by the serialized-by-design writes to the binlog. A big transaction's write can take a long time, delaying the next transaction's write and commit.
| Comment by Andrei Elkin [ 2023-10-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel, rpizzi, thanks guys for discussing it. This helped me to understand what it is and where to go. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2023-10-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Elkin thank you for acknowledging that there's a design problem here. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To reproduce with the employees DB:
using update.sql
cat ps.log
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Given that execution of the small trx is conditional on the commit of the long trx, the binlog should not have been written in this trx order. Long trx:
Concurrent small trx that should go before, according to commit order; using READ COMMITTED, the execute immediate only runs after the insert into the stop table from the previous long trx:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Reproduced with latest 10.11.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Could not reproduce in MIXED mode | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
stephane@skysql.com, I don't understand why it is wrong that the short transaction is binlogged after the long one.
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Because the long trx, while still running, is the one that allows execution of the short one afterwards, the short one should not appear after the big one has committed. See select @stop:=id from stop; where stop is triggered by the big trx.
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@*Kristian Nielsen* you raised a good point. A long transaction TRX1 enters commit, and multiple small concurrent transactions that began before TRX1 reach their commit points, but they are blocked until the end of TRX1's commit into the binlog. This only makes sense with READ COMMITTED and STATEMENT-based logging. In ROW-based logging, since the result of the INSERT ... SELECT is self-contained inside the row-based view of TRX1, it could be interleaved with the domain that modifies the read view: at commit time of TRX1 the read view is already persisted inside the binlog cache and no longer depends on records changed by the small transactions, regardless of whether other transactions committed after or before TRX1.
| Comment by VAROQUI Stephane [ 2023-10-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks, I now understand why artificially changing the commit order in the binary log file could be a bad idea for fixing this issue, and that it could be fixed in some future release based on a protocol extension.
| Comment by Kristian Nielsen [ 2024-01-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In general, transactions that modify a large number of rows and binlog in row mode will cause stalls for other (also small) concurrent transactions. This is because of a limitation in the current binlog format, where the entire transaction is binlogged as a single continuous event group that cannot be interleaved with other transactions. This limitation would be good to eliminate in new evolutions of the binlog format, e.g. when we integrate the binlog into the storage layer of engines such as InnoDB. The new binlog format should be able to partially binlog row events from transactions exceeding a certain size. Then at commit, most of the event data will already have been binlogged, and there will not be a large stall for concurrent commits. A disadvantage is that large rollbacks will then incur a cost in binlog size.
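The partial-binlogging idea can be sketched with a toy model (my illustration, with made-up names, not MariaDB code): row events beyond a threshold are flushed to the binlog as they are produced, so the serialized commit step only writes the small remaining tail plus a commit marker, and other transactions can interleave between the chunks.

```python
# Flush to the binlog whenever this many row events are buffered; in a real
# implementation this would be a size threshold, not an event count.
CHUNK_THRESHOLD = 3

class Txn:
    def __init__(self, name, binlog):
        self.name, self.binlog, self.cache = name, binlog, []

    def add_row_event(self, ev):
        self.cache.append(ev)
        if len(self.cache) >= CHUNK_THRESHOLD:      # early, pre-commit flush
            self.binlog.append((self.name, "partial", list(self.cache)))
            self.cache.clear()

    def commit(self):
        # only the tail is written inside the serialized commit section
        self.binlog.append((self.name, "commit", list(self.cache)))
        self.cache.clear()

binlog = []
big = Txn("big", binlog)
small = Txn("small", binlog)
for i in range(7):
    big.add_row_event(f"row{i}")      # two partial chunks are flushed early
small.add_row_event("s0")
small.commit()                        # small txn interleaves with big's chunks
big.commit()                          # big's commit writes only 1 leftover event

tail = next(ev for ev in binlog if ev[0] == "big" and ev[1] == "commit")
print("events in big's commit-time write:", len(tail[2]))
```

The model also shows the stated disadvantage: if `big` rolled back after the partial flushes, those chunks would already occupy binlog space.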
| Comment by VAROQUI Stephane [ 2024-01-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My proposal was to introduce a double binlog rotation every time a transaction overflows binlog_cache_size. The big transaction gets fully logged in binlog N+1, marked as a prepare step; the following transactions get persisted in binlog N+2, which will also store a commit step for the prepared big transaction. One direct bonus of this is that any sysadmin with an ls command on the binlog directory can instantly identify that some big transactions have been running in the system and try to improve the workload. I can't count the number of times I have identified a crash or restart of the server just by looking at the binlog directory, without having to dig into the error logs.
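This double-rotation proposal could be modeled like so (a sketch of my reading of it; the marker names and class are made up, not an actual format): an overflowing transaction triggers two rotations, landing alone in file N+1 with a prepare marker, while later traffic and the big transaction's commit marker go to file N+2.

```python
BINLOG_CACHE_SIZE = 4  # toy threshold: txns above this overflow the cache

class BinlogDir:
    def __init__(self):
        self.files = {1: []}          # binlog file number -> list of entries
        self.current = 1

    def rotate(self):
        self.current += 1
        self.files[self.current] = []

    def log_txn(self, name, n_events):
        if n_events > BINLOG_CACHE_SIZE:       # overflow: double rotation
            self.rotate()                      # binlog N+1: the big txn alone
            self.files[self.current].append((name, "XA-PREPARE", n_events))
            big_file = self.current
            self.rotate()                      # binlog N+2: everything after
            self.files[self.current].append((name, "XA-COMMIT", big_file))
        else:
            self.files[self.current].append((name, "COMMIT", n_events))

d = BinlogDir()
d.log_txn("t1", 2)        # small txn stays in binlog 1
d.log_txn("huge", 100)    # overflow: lands alone in binlog 2
d.log_txn("t2", 1)        # continues in binlog 3, after huge's commit marker

# An `ls` of the directory immediately reveals that a big txn happened:
print(sorted(d.files))
print(d.files[2])
```

The "sysadmin bonus" falls out of the model: a file holding exactly one transaction signals an oversized write without reading the error log.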
| Comment by Libing Song [ 2024-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We have already implemented a rename solution on MySQL 8.0. We look forward to merging the patch into MariaDB. I am discussing the solution with Andrei to check whether it is a good fit for MariaDB.
| Comment by Andrei Elkin [ 2024-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
libing, thanks for this piece! Indeed we should import this method! |