[MDEV-26473] mysqld got exception 0xc0000005 (rpl_slave_state/rpl_load_gtid_slave_state) Created: 2021-08-24 Updated: 2022-05-20 Resolved: 2022-04-25
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4.18, 10.5.12, 10.5.15, 10.6.7 |
| Fix Version/s: | 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Pat K | Assignee: | Brandon Nesterenko |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | replication |
| Environment: | Windows Server 2012, 2016 |
| Attachments: | |
| Description |
Our custom app went through an install on 2021-07-29 where we dumped the master DB (with master info/pos included), imported it into the new slave, and proceeded to run with replication; this begins at line 87 of the attached file. On 2021-08-04, we upgraded our custom app (which does NOT upgrade MariaDB); the upgrade runs the following commands between the times shown:

2021-08-04 10:01:39  stop slave;  (completes 2021-08-04 10:01:52)

It then stops the service (2021-08-04 10:01:57) and restarts it (2021-08-04 10:02:03). This resulted in the following error, which has NOT been readily reproducible:

ntdll.dll!RtlpUnWaitCriticalSection()
| Comments |
| Comment by Pat K [ 2021-08-25 ] |
It's worth noting that any restart attempt of the affected MariaDB service resulted in the same error.
| Comment by Alice Sherepa [ 2021-08-30 ] |
Would it be possible for you to attach 'master_binlog_Main_gm_a.000399' (from the log: "Slave SQL thread exiting, replication stopped in log 'master_binlog_Main_gm_a.000399' at position 12351739"), so we can see which queries were causing the crash?
| Comment by Pat K [ 2021-08-30 ] |
A snippet of the binlog (run through mysqlbinlog.exe) has been uploaded. The service fails with the same error even when the 'skip-slave-start' flag is specified in the service config file. I can reproduce the error fairly consistently by having the master constantly create/delete tables:

create table if not exists xyz (id int);
drop table if exists xyz;

while the slave loops through the commands in the Description. It appears this file might be part of the problem: mysql/gtid_slave_pos.ibd. If I replace the problematic instance's .ibd file with the default one from the ProgramData/MariaDB folder, the service then starts. I've also attached the problematic 'gtid_slave_pos.ibd' file in case it's needed.
| Comment by Pat K [ 2021-10-07 ] |
Does anyone know the anatomy of 'gtid_slave_pos.ibd' well enough to determine why the attached file would cause the error in the Description? gtid_slave_pos.ibd.6408264d725618fd8dd40a14df42d5ee. Also, would copying the default/pristine 'gtid_slave_pos.ibd' from ProgramData/MariaDB cause issues going forward?
| Comment by Tania S Engel [ 2022-02-18 ] |
VERSION: 10.6.4. We too were not able to resolve it with the skip-slave-start flag. We did the following to resolve it: delete the files (gtid_slave_pos); MariaDB will start, but the logs will complain; then: use mysql; ... However, now we have an empty select @@global.gtid_current_pos. Is there a way to force a write to the GTID so we can have it for configuring our slaves, without having to make a mutation? We really would like to upgrade our MariaDB but had to roll back due to this bug.
| Comment by Pat K [ 2022-02-22 ] |
Right, it may have been introduced in 10.4.18, and possibly any minor versions from the '2021-02-22' update onward may be affected. I just saw this again in 10.4.22, and the 10.4.24 changelog doesn't appear to address anything related to it. For now, copying the default/pristine 'gtid_slave_pos.ibd' from the ProgramData/MariaDB 'mysql' directory into the affected service directory seems to solve the problem, but obviously this is quite the hack.
| Comment by Pat K [ 2022-02-22 ] |
Tania, were you doing anything relating to 'CHANGE MASTER ...' before you experienced the problem? Or was it just a routine service stop/start?
| Comment by Tania S Engel [ 2022-02-22 ] |
Our heavy test automation has one master and 2 slaves, and it frequently changes the master. The crash does happen with a stop/start, though only after running for a bit after changing the master. It is possible we didn't have any MySQL mutations against the replicated database in that time period. We ask the master to generate a dump and get the GTID under the global read-only lock:

FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = ON;

We then have our slaves import that dump and run:

CHANGE MASTER TO
  MASTER_HOST = '{master.IpAddress}',
  ...,
  MASTER_PASSWORD = '{master.SlavePassword}',
  MASTER_USE_GTID = slave_pos;
SET GLOBAL gtid_slave_pos = '{master.Gtid}';

If we hit this bug, once we apply the workaround we are stuck with no knowledge of the GTID, and we need one to be generated in order to retrieve it when this node becomes the master and must generate the dump. This is where we aren't sure how to proceed.
| Comment by Juan [ 2022-03-23 ] |
Hello TSass - to get the data from this table, you can copy the problematic .ibd file to a server not having the issue (take care to use a server with a similar version number, because this won't work when the structure is different, and the structure of this table does change in 10.5) and "select *" from it. The original gtid_slave_pos.ibd attached to this case on August 30, for example, loads fine on 10.4.18 servers running on CentOS and shows the information. To do this without restarting a server, you can use a transportable tablespace to do the import: https://mariadb.com/kb/en/innodb-file-per-table-tablespaces/#copying-transportable-tablespaces
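As a sketch of that KB procedure (the scratch table name test.gsp and the file paths are my own placeholders, not from the KB), reading a problematic gtid_slave_pos.ibd on a spare server of a similar version might look like:

```sql
-- On a spare 10.4-series server: create a scratch table with the same
-- structure as mysql.gtid_slave_pos (the name test.gsp is hypothetical).
CREATE TABLE test.gsp (
  domain_id INT UNSIGNED NOT NULL,
  sub_id    BIGINT UNSIGNED NOT NULL,
  server_id INT UNSIGNED NOT NULL,
  seq_no    BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (domain_id, sub_id)
) ENGINE=InnoDB;

-- Detach the empty tablespace so its .ibd file can be replaced.
ALTER TABLE test.gsp DISCARD TABLESPACE;

-- (Outside SQL: copy the problematic gtid_slave_pos.ibd into the
--  datadir as test/gsp.ibd.)

-- Re-attach and read it; without a matching .cfg file the import may
-- emit a warning but can still succeed on a matching version.
ALTER TABLE test.gsp IMPORT TABLESPACE;
SELECT * FROM test.gsp;
```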
| Comment by Juan [ 2022-03-26 ] |
TSass, here's a better workaround, since this is still happening and appears to affect various versions of MariaDB server on Windows Server 2012, 2016, and 2020:

1. Delete mysql/gtid_slave_pos.* from the replica's data directory.
4. Back on the primary server, run the following command (note that I am using the values retrieved from the replica when running this query on the primary, and getting a GTID position back).
5. Going back to the replica, create the missing table.
6. Set the global slave position from the position retrieved above.
7. If replication was already configured to use slave_pos, you can skip this step. Otherwise, you now need to define the master.
8. You can now start replication.

...and after a second you should see replication progressing normally with "show slave status\G".
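A rough sketch of commands for steps 4-8 above, assuming standard MariaDB syntax (the binlog file name, position, and GTID value are placeholders, not the real values from this case):

```sql
-- Step 4, on the primary: translate the replica's last-applied binlog
-- file/position into a GTID position (placeholder values shown).
SELECT BINLOG_GTID_POS('master_binlog.000399', 12351739);

-- Step 5, on the replica: re-create the missing table
-- (10.4-era definition of mysql.gtid_slave_pos).
CREATE TABLE mysql.gtid_slave_pos (
  domain_id INT UNSIGNED NOT NULL,
  sub_id    BIGINT UNSIGNED NOT NULL,
  server_id INT UNSIGNED NOT NULL,
  seq_no    BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (domain_id, sub_id)
) ENGINE=InnoDB COMMENT='Replication slave GTID position';

-- Step 6: seed the position returned by BINLOG_GTID_POS() above.
SET GLOBAL gtid_slave_pos = '0-1-12345';  -- placeholder value

-- Step 7 (skip if already using slave_pos):
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;

-- Step 8:
START SLAVE;
```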
| Comment by Juan [ 2022-03-26 ] |
Reproduced consistently on 2x Windows Server 2016 Standard Build 14393.4169 VMs, each with 2 i7 cores @ 2.60GHz, 4G RAM, and 128G NVMe, plus CygWin, OpenSSH, and NT Resource Kit tools:

- Configure replication using gtid_slave_pos.
- Tail the error log on the replica.
- Create a simple table.
- Start update streams in screens or separate shells. I tested with 8 simultaneously running the same loop.
- On the replica, in a Windows command shell, create and run onoff.bat.

Within 10-100 iterations you should see the corrupted gtid_slave_pos problem preventing the replica from starting.
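The steps above could be sketched as follows (host names, the test table, and the Windows service name "MariaDB" are assumptions, not values from the original report):

```shell
# On the primary: create a simple table (assumed schema).
mysql -h primary -e "CREATE TABLE IF NOT EXISTS test.t1 (id INT PRIMARY KEY, v INT)"

# Update stream -- run ~8 copies of this loop in separate screens/shells.
while true; do
  mysql -h primary -e "REPLACE INTO test.t1 VALUES (1, FLOOR(RAND()*1000))"
done
```

And onoff.bat on the replica, cycling the service until the crash appears:

```bat
:: onoff.bat -- restart the replica's MariaDB Windows service in a loop
:loop
net stop MariaDB
net start MariaDB
goto loop
```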
| Comment by Andrei Elkin [ 2022-03-29 ] |
juan.vera, salute! Thanks for exploring it! If I got the issue correctly, there's a crash in binlog_background_thread caused by a corrupted gtid_slave_pos. Also, if you still have an error log with the stack (e.g. from MariaDB 10.6.7), it would be helpful to see one. Cheers, Andrei
| Comment by Juan [ 2022-03-30 ] |
Hello Elkin! Attached cs0390600- **Note these were generated with the gtid_slave_pos table converted to MyISAM. I don't know why, but of all the tested engine types only MyISAM successfully extends the interval between these crashes, approximately doubling it, yet ultimately it does not prevent them. Please let me know if this works for you; if not, I'll convert the table back to InnoDB and generate a new error log and dump file for you.
| Comment by Andrei Elkin [ 2022-03-31 ] |
juan.vera, it looks very probable from the source code that raising the gtid-cleanup-batch-size value, say by a factor of two, may eliminate the crash.
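As a sketch: gtid_cleanup_batch_size defaults to 64 in the MariaDB docs for 10.4+, so doubling it on the replica would look something like the following (a hypothetical mitigation, not a confirmed fix):

```ini
# my.ini on the replica
[mysqld]
gtid_cleanup_batch_size = 128
```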
| Comment by Dim [ 2022-04-08 ] |
Hi @Elkin, I also have this issue when stopping/starting MariaDB. I have 3 servers running MariaDB 10.5.8 on Windows 10, set up as a replication ring: SV1 > SV2 > SV3 > SV1. Because of a bug in 10.5.8 I needed to upgrade to 10.5.9 a few weeks ago: first I shut down all three servers, updated using the .MSI, then started all three servers; no error happened.

Then I started partitioning two tables on the three servers. I spent 3 days partitioning the first table by running the partition command on SV1 and letting it replicate to the other servers; this took 3 days to complete across the three servers with no issue. A few days later I partitioned the second table on all three servers at the same time, instead of sequentially like the first one: on each server I logged in via the mysql command line, ran `SET sql_log_bin = OFF`, and started the partition command on all three servers at once, forgetting to turn off replication (STOP SLAVE). It took 1 day to complete. Then I restarted MariaDB on the three servers; SV1 and SV2 started, but SV3 showed the same error as this issue.

I then removed SV3 from replication and set up SV1 and SV2 to replicate each other. This worked fine for 1 or 2 weeks, but this morning SV2 couldn't start, with the same issue as SV3. I applied the workaround by running STOP SLAVE on SV1 and deleting mysql/gtid_slave_pos.* on SV2, starting MariaDB on SV2, running `STOP SLAVE`, then dropping and re-creating the gtid_slave_pos table following @Juan (except steps 6 and 7, because I don't use `master_use_gtid=slave_pos`), then running `START SLAVE` on SV2. Finally I went back to SV1 and ran `CHANGE MASTER TO` to update the `MASTER_LOG_FILE` and `MASTER_LOG_POS` for SV2, so it wouldn't replicate the DROP and re-create of gtid_slave_pos from SV2.

The server-ids follow the server numbers: SV1 has server-id=1, SV2 has server-id=2, and so on (with log_slave_updates = ON), but I don't set gtid_domain_id on any of the three servers; does this also have an effect? I wonder whether this issue sometimes happens because I turn the three servers off/on every day.
| Comment by Andrei Elkin [ 2022-04-08 ] |
The info from Juan should be sufficient for engineering to start on. I am thus setting the ticket to 'providing feedback', to get it onto the weekly priority list.
| Comment by Brandon Nesterenko [ 2022-04-13 ] |
Hi Andrei! This is ready for review. My analysis: the patch forces the mysql handle manager to initialize/start before the binlog background thread. Patch: 95825c5
| Comment by Juan [ 2022-04-14 ] |
Hi Elkin - thank you both!
| Comment by Andrei Elkin [ 2022-04-22 ] |
Review notes are made on GH.
| Comment by Andrei Elkin [ 2022-04-26 ] |
bnestere: I think
| Comment by Dim [ 2022-04-27 ] |
@Andrei, if I set gtid_cleanup_batch_size=1024 like @Juan mentioned, does it prevent the crash completely?
| Comment by Brandon Nesterenko [ 2022-05-05 ] |
Hi juan.vera, that is correct. And for completeness, this bug should also exist in all released versions of 10.6, 10.7, and 10.8. That is, you won't be able to downgrade anything 10.6+ to circumvent this bug. 10.5.8 is the "most recent" unaffected version.