[MDEV-28681] bin log interrupted and hung for minutes after transaction commit Created: 2022-05-27 Updated: 2022-08-29 Resolved: 2022-08-29 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Manipulation - Update, Platform RedHat, Storage Engine - InnoDB |
| Affects Version/s: | 10.4.21 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andy Liu | Assignee: | Unassigned |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
*Hi, In our production lab we found one issue that the bin log update got interrupted and hung there for minutes after C++ API mysql_query with "commit" param returned success. Timeline:* 1. Transaction committed by calling of mysql_query with "commit" param which returned success (product log which confirmed successful return of mysql_query): +++ 2022/05/26 11:32:46.065 DB MEDIUM remotedb:11310 E:105110 S:2657 (RemotedbNotify.cpp 1026 v-0001 169.254.64.1 oam-a) 2. bin log started to be updated, but only updated the first 3 records:
*3. About >6m later, it started to write bin log for the remaining records for that transaction (Note in-between there were bin logs writes for other transactions): *
Then finally the DB contents were overwritten incorrectly by the delayed bin log updates causing product failures. *Mariadb Server version: 10.4.21-MariaDB-log Source distribution Attached config file my.cnf and bin log as well as decoded bin log. Your advice is highly appreciated.* |
| Comments |
| Comment by Andrei Elkin [ 2022-05-27 ] |
|
> it started to write bin log for the remaining records for that transaction I don't see what you call p.3 binlog snippet relevant to p.2 fully logged trx. Could you please explain what do you mean? |
| Comment by Andrei Elkin [ 2022-05-27 ] |
|
> , it started to write bin log for the remaining records for that transaction 'That' transaction is fully committed in binlog, having thread_id 11783. In the follow up that the thread id is different. |
| Comment by Andy Liu [ 2022-05-30 ] |
|
Thanks for looking at this. So what does thread_id in binlog signify and under what condition should it be changed? Does it mean new connection initiated from client? Here are details of what was going on (note they were all running under single product thread context): 1. Start one transaction and loop reading "db01host.notify_remdb_operation" table by calling C++ mysql_real_query with statement param "select * from db01host.notify_remdb_operation where remote_ip = x'00a9fe4001000000000000000000000000000000' and process_name = 'SSPcntl_agt'" 2. Also start another transaction to handle all tuples read from p.1. For each tuple, we just reset its retry_count field to 0 and write back to DB via calling mysql_real_query with update statement. 3. After all tuples matching specified "remote_ip" and "process_name" were handled, the first transaction got "rollback" and the second transaction got "commit" via mysql_query which returned success as confirmed from product logs. The binlog snippets (both p.2 and p.3 in description) should all belong to the same transaction (the second) because they all have the same process_id = 11312 which was the old one. In between p.2 and p.3 snippets, the process_id had been updated to the new one "process_id = 64602". |
| Comment by Andrei Elkin [ 2022-05-30 ] |
|
> The binlog snippets (both p.2 and p.3 in description) should all belong to the same transaction (the second) because they all have the same process_id = 11312 which was the old one. This assumption bases on your application logics. But indeed we see in the database different connection identifiers (that's what thread_id is) for blocks of sql code that you think one transaction. A transaction can be associated with at most one connection. |
| Comment by Elena Stepanova [ 2022-07-25 ] |
|
One way to reconcile expectations and reality here would be to temporarily enable the general log and then to match it with the binary log, to see which updates come from where. |
| Comment by Andy Liu [ 2022-07-26 ] |
|
Thanks for the advice. |
| Comment by Elena Stepanova [ 2022-08-24 ] |
|
liberiul, any new discoveries? |
| Comment by Andy Liu [ 2022-08-25 ] |
|
It's found that the replication got delayed significantly under extreme high CPU usage. The statement executed on master host was not applied to slave host until minutes later. The delayed replication events finally overrode the statements queried to slave and caused the issue. Would you advise: |
| Comment by Andrei Elkin [ 2022-08-25 ] |
|
liberiul, to the 2nd of the two, it looks like you'd be interested in master_wait_gtid. That is the B's updater would invoke the function prior to submit the update. That obviously requires A and B updaters coordination on the matter of the sequence number of replication events. (B must be aware of the last A's update's gtid). To the 1st question, it looks like the request is about the slave threads scheduling prioritization (which we don't provide unfortunately) unless your whole slave server was busy. |
| Comment by Andy Liu [ 2022-08-29 ] |
|
Thanks for the explanation. This ticket can be closed. |