[MDEV-26473] mysqld got exception 0xc0000005 (rpl_slave_state/rpl_load_gtid_slave_state) Created: 2021-08-24  Updated: 2022-05-20  Resolved: 2022-04-25

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.18, 10.5.12, 10.5.15, 10.6.7
Fix Version/s: 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3

Type: Bug Priority: Critical
Reporter: Pat K Assignee: Brandon Nesterenko
Resolution: Fixed Votes: 2
Labels: replication
Environment:

Windows Server 2012, 2016


Attachments: File cs0390600-MDEV-26473-20220329-WinServ2016Std-1607-MDB-10-6-7-W2016-88.dmp     File cs0390600-MDEV-26473-20220329-WinServ2016Std-1607-MDB-10-6-7-W2016-88.err     File gtid_slave_pos.ibd     File gtid_slave_pos.ibd.30d6c68c3137b307b8428df46a6aca07     File gtid_slave_pos.ibd.5d875c467ca8835cc92261e52836fa75     File gtid_slave_pos.ibd.6408264d725618fd8dd40a14df42d5ee     Text File master_binlog_Main_gm_a.000399_short.txt     Text File mysqld_crash_Aug04.txt    

 Description   

Our custom app went through an install on 2021-07-29 where we dumped the master DB (with master info/pos included), imported it into the new slave, and proceeded to run with replication - this started in line 87 of the attached file.

On 2021-08-04, we upgraded our custom app (does NOT upgrade MariaDB) which runs the following commands between the times shown:

2021-08-04 10:01:39

stop slave;
CHANGE MASTER TO MASTER_CONNECT_RETRY = 1, MASTER_HEARTBEAT_PERIOD = 90, MASTER_USER = 'mvp_repl_slave', MASTER_PASSWORD = '####';
start slave;
DELETE FROM user WHERE !((User='root' AND Host='localhost') OR (User='mariadb.sys' AND Host='localhost'));
FLUSH PRIVILEGES;
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE, CREATE, DROP, CREATE VIEW, SHOW VIEW, FILE, SUPER, REPLICATION CLIENT ON *.* TO mvp_local@'localhost' IDENTIFIED BY '####';
GRANT SELECT, EXECUTE, SUPER, REPLICATION CLIENT ON *.* TO mvp_peer@'192.168.2.2' IDENTIFIED BY '####';
GRANT SELECT, EXECUTE, SUPER, REPLICATION CLIENT ON *.* TO mvp_peer@'192.168.2.3' IDENTIFIED BY '####';
FLUSH PRIVILEGES;

2021-08-04 10:01:52

And then stops the service (2021-08-04 10:01:57), and restarts it (2021-08-04 10:02:03).

This resulted in the following error, which has NOT been readily reproducible:

ntdll.dll!RtlpUnWaitCriticalSection()
ntdll.dll!RtlEnterCriticalSection()
ntdll.dll!RtlEnterCriticalSection()
mysqld.exe!mysql_manager_submit()[sql_manager.cc:51]
mysqld.exe!rpl_slave_state::update()[rpl_gtid.cc:358]
mysqld.exe!rpl_load_gtid_slave_state()[rpl_rli.cc:1930]
mysqld.exe!binlog_background_thread()[log.cc:10026]
mysqld.exe!pthread_start()[my_winthread.c:62]
ucrtbase.dll!o_realloc_base()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()



 Comments   
Comment by Pat K [ 2021-08-25 ]

It's worth noting that any restart attempts of the affected MariaDB service resulted in the same error.

Comment by Alice Sherepa [ 2021-08-30 ]

Would it be possible for you to attach 'master_binlog_Main_gm_a.000399' (from the log: "Slave SQL thread exiting, replication stopped in log 'master_binlog_Main_gm_a.000399' at position 12351739"), so we can see which queries were causing the crash?
You could also try to restart the server with the --skip-slave-start option (https://mariadb.com/kb/en/mysqld-options/#-skip-slave-)

Comment by Pat K [ 2021-08-30 ]

A snippet of the binlog (washed thru mysqlbinlog.exe) has been uploaded.

The service fails with the same error when the 'skip-slave-start' flag is specified in the service config file.

I'm reproducing the error fairly consistently when having the master constantly create/delete tables:

create table if not exists xyz (id int); drop table if exists xyz;

And with the slave looping through the commands in the Description.

It appears this file might be part of the problem:

mysql/gtid_slave_pos.ibd

If I replace the problematic instance's ibd file with the default one from the ProgramData/MariaDB folder, the service then starts - also attached the problematic 'gtid_slave_pos.ibd' file if needed.

Comment by Pat K [ 2021-10-07 ]

Does anyone know the anatomy of a 'gtid_slave_pos.ibd' well enough to determine why the attached files would cause the error in the Description?

gtid_slave_pos.ibd.6408264d725618fd8dd40a14df42d5ee
gtid_slave_pos.ibd.5d875c467ca8835cc92261e52836fa75
gtid_slave_pos.ibd.30d6c68c3137b307b8428df46a6aca07

Also, would copying the default/pristine 'gtid_slave_pos.ibd' from ProgramData/MariaDB cause issues moving forward?

Comment by Tania S Engel [ 2022-02-18 ]

VERSION: 10.6.4
OS: Windows 10 Enterprise LTSC and Windows Server 2016
Thank you Pat K for opening this issue! We hit this same exception 4 times across hundreds of master/slave testing scenarios. We never saw this problem in 10.4.12, and given the sheer amount of testing we do, that strongly suggests it was not present there. Given this issue's description, it seems the bug was introduced between 10.4.12 and 10.4.18 and remains an issue in 10.6.4. We always see it on an appliance that was a slave, after stopping MariaDB and then attempting to start it. We stop it using the Windows Service Control Manager, issuing a stop (no kill). The exception stack is the same as shown in this issue.

We too were not able to resolve it with the skip-slave-start flag. We did the following to resolve it:

DELETE THESE FILES
C:\MariaDb\data\mysql\gtid_slave_pos.frm
C:\MariaDb\data\mysql\gtid_slave_pos.ibd

MariaDB will start, but the logs will complain
CONNECT TO MYSQL AND RUN:

use mysql;
drop table gtid_slave_pos;
CREATE TABLE `gtid_slave_pos` (
`domain_id` int(10) unsigned NOT NULL,
`sub_id` bigint(20) unsigned NOT NULL,
`server_id` int(10) unsigned NOT NULL,
`seq_no` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`domain_id`,`sub_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Replication slave GTID position';

However, `select @@global.gtid_current_pos` now returns empty. Is there a way to force a write of the GTID so we can retrieve it for configuring our slaves, without having to make a mutation? We would really like to upgrade our MariaDB but had to roll back due to this bug.

Comment by Pat K [ 2022-02-22 ]

Right, it may have been introduced in 10.4.18, and possibly any minor versions from the '2021-02-22' update onward are affected.

I just saw this again in 10.4.22, and it doesn't appear that the 10.4.24 changelog is addressing anything related to it.

For now, it seems copying the default/pristine 'gtid_slave_pos.ibd' from ProgramData/MariaDB 'mysql' directory into the affected service directory solves the problem, but obviously this is quite the hack.

Comment by Pat K [ 2022-02-22 ]

Tania, were you doing anything relating to 'CHANGE MASTER...' before you experienced the problem?

Or was it just a routine service stop/start?

Comment by Tania S Engel [ 2022-02-22 ]

Our heavy test automation has one master and 2 slaves, and it frequently changes the master. The crash does happen with a stop/start, though only after running for a while following a master change. It is possible we didn't have any MySQL mutations against the replicated database in that time period. We ask the master to generate a dump and get the GTID under the global read-only lock (FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = ON)
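
The dump-and-record step described above can be sketched as follows (hedged: the dump invocation itself is our guess at the automation's behavior, not something stated in this ticket):

```sql
-- On the master: quiesce writes and capture the GTID position that
-- will be handed to the slaves as {master.Gtid}.
FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = ON;
SELECT @@global.gtid_binlog_pos;
-- ... generate the dump here, e.g. with mysqldump ...
SET GLOBAL read_only = OFF;
UNLOCK TABLES;
```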

We then have our slaves import that dump and run:

CHANGE MASTER TO MASTER_HOST='{master.IpAddress}',
  MASTER_PORT=3306, MASTER_USER='{master.SlaveUser}',
  MASTER_PASSWORD='{master.SlavePassword}',
  MASTER_USE_GTID = slave_pos;

SET GLOBAL gtid_slave_pos = '{master.Gtid}';

START SLAVE;

If we hit this bug, then once we apply the workaround we are stuck with no knowledge of the GTID, and we need one to be generated so we can retrieve it when this node becomes the master and must generate the dump. This is where we aren't sure how to proceed.

Comment by Juan [ 2022-03-23 ]

Hello TSass - to get the data from this table, you can copy the problematic ibd file to a server not having the issue (take care to use a server with a similar version number, because this won't work when the structure is different, and the structure of this table does change in 10.5) and run "select *" against it. The original gtid_slave_pos.ibd attached to this case on August 30, for example, loads fine on 10.4.18 servers running on CentOS and shows the information.

To do this without restarting a server, you can use transportable table space to do the import:

https://mariadb.com/kb/en/innodb-file-per-table-tablespaces/#copying-transportable-tablespaces
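
A hedged sketch of that transportable-tablespace import, following the linked KB page (the copy-table name is a placeholder, and the helper server must match the table's structure/version as noted above):

```sql
-- On a healthy helper server of a similar version, create an empty
-- table with the matching definition and discard its tablespace:
CREATE TABLE test.gtid_slave_pos_copy LIKE mysql.gtid_slave_pos;
ALTER TABLE test.gtid_slave_pos_copy DISCARD TABLESPACE;

-- Copy the problematic gtid_slave_pos.ibd into the helper's datadir
-- as test/gtid_slave_pos_copy.ibd, then:
ALTER TABLE test.gtid_slave_pos_copy IMPORT TABLESPACE;

-- The recovered GTID state:
SELECT * FROM test.gtid_slave_pos_copy;
```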

Comment by Juan [ 2022-03-26 ]

TSass, here's a better workaround, since this is still happening and appears to affect various versions of MariaDB server on Windows Server 2012, 2016, and 2020:

1. delete mysql/gtid_slave_pos.* from the replica's data directory.
2. start the replica's service
3. on the replica, run "show slave status\G" and record the values of Master_Log_File and Exec_Master_Log_Pos:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 10.0.0.87
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: MDB-10-5-15-W2016-87-bin.000001
           Read_Master_Log_Pos: 8442835
                Relay_Log_File: MDB-10-5-15-W2016-88-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: MDB-10-5-15-W2016-87-bin.000001
              Slave_IO_Running: No
             Slave_SQL_Running: No
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 1146
                    Last_Error: Unable to load replication GTID slave state from mysql.gtid_slave_pos: Table 'mysql.gtid_slave_pos' doesn't exist
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 8442835

4. back on the primary server, run the following command:

select binlog_gtid_pos( "MDB-10-5-15-W2016-87-bin.000001", 8442835);
+--------------------------------------------------------------+
| binlog_gtid_pos( "MDB-10-5-15-W2016-87-bin.000001", 8442835) |
+--------------------------------------------------------------+
| 87-87-25161                                                  |
+--------------------------------------------------------------+
1 row in set (0.047 sec)

**note that I am using the values retrieved from the replica when running this query on the primary, and getting a gtid position back.

5. going back to the replica, create the missing table:

use mysql;
 
drop table if exists gtid_slave_pos; create table gtid_slave_pos(
domain_id int(10) unsigned NOT NULL,
sub_id bigint(20) unsigned NOT NULL, 
server_id int(10) unsigned NOT NULL, 
seq_no bigint(20) unsigned NOT NULL,
PRIMARY KEY (domain_id,sub_id) 
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Replication slave GTID position';

6. set the global slave position from the retrieved position above:

set global gtid_slave_pos='87-87-25161';

7. if replication was already configured to use slave_pos, you can skip this step. Otherwise, you now need to define the master:

change master to
master_host='10.0.0.87', master_port=3306,
master_user='repl', master_password='repl',
master_use_gtid=slave_pos;

8. you can now start replication:

start slave;

...and after a second you should see replication progressing normally with "show slave status\G"

Comment by Juan [ 2022-03-26 ]

Reproduced consistently: 2x Windows Server 2016 Standard Build 14393.4169 VM w. 2 i7 cores ea. @ 2.60GHz, 4G RAM & 128G NVMe ea., CygWin, OpenSSH, & NT Resource Kit tools:

  • Install MariaDB 10.5.15 x2 with identical configs (except for server ID):

    [mysqld]
    datadir=C:/Program Files/MariaDB 10.5/data
    port=3306
    innodb_buffer_pool_size=2048M
    character-set-server=utf8
     
    log_bin
    innodb_autoinc_lock_mode=2
    innodb_flush_method=normal
    binlog_format=ROW
    binlog_row_image=MINIMAL
    server_id=88
    gtid_domain_id=88
     
    wait_timeout=900
     
    relay_log_purge=ON
    relay_log_recover=ON
     
    [client]
    port=3306
    plugin-dir=C:/Program Files/MariaDB 10.5/lib/plugin
    

-Configure replication using gtid_slave_pos

-Tail error log on replica

-Create simple table:

use test;
create table base (
  id int(10) unsigned not null auto_increment,
  modified timestamp not null default current_timestamp() on update current_timestamp(),
  val int(10) unsigned default null,
  blah varchar(511) default null,
  primary key (id),
  key base_modified (modified),
  key base_blah100 (blah(100)),
  key base_val (val)
) engine=innodb default charset=utf8mb4;

-Start update streams in screens or separate shells. I tested with 8 simultaneously running this same loop:

SECONDS=0; n=0; while true
do
mysql -h 10.0.0.87 -u root -e "insert into base(blah) values( concat( @@hostname, ':',@@gtid_current_pos, ':', now()));" test
let n++
[ $(( $n % 10000 )) -eq 0 ] && echo "$(date), $n rows added, $(( $n / $SECONDS ))tps"
done

-On replica in a Windows command shell, create and run onoff.bat:

set /A i=0
:a
net stop MariaDB
net start MariaDB
set /A i+=1
echo %i%
timeout /T 5
goto a

Within 10-100 iterations you should see the corrupted gtid_slave_pos problem preventing the replica from starting.

  • Things that don't change the outcome: innodb_flush_log_at_trx_commit, innodb_flush_method, innodb_change_buffering, innodb_adaptive_hash_index, sync_binlog, sync_relay_log, relay_log_recover.
  • Converting gtid_slave_pos to transactional Aria makes no noticeable change. Using MyISAM for gtid_slave_pos, the timing seems different but ultimately the result is the same.

Comment by Andrei Elkin [ 2022-03-29 ]

juan.vera, salute! Thanks for exploring it! If I understand the issue correctly, there's a crash in binlog_background_thread caused by a corrupted gtid_slave_pos.

Also, if you still have an error log with the stack, e.g. from MariaDB 10.6.7, it'd be helpful to see one.

Cheers,

Andrei

Comment by Juan [ 2022-03-30 ]

Hello Elkin!

Attached cs0390600-MDEV-26473-20220329-WinServ2016Std-1607-MDB-10-6-7-W2016-88.dmp & cs0390600-MDEV-26473-2022032-WinServ2016Std-1607-MDB-10-6-7-W2016-88.err generated just now with the crash.

**Note these were generated with the gtid_slave_pos table converted to MyISAM. I don't know why, but of all the tested engine types only MyISAM noticeably extends the interval between these crashes, approximately doubling it, though it ultimately does not prevent them.
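
For reference, the engine conversion mentioned here was presumably done along these lines (a hedged sketch; the exact statement used is not shown in this ticket):

```sql
-- Convert the GTID state table to MyISAM for testing...
ALTER TABLE mysql.gtid_slave_pos ENGINE=MyISAM;

-- ...and back to InnoDB once testing is done:
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
```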

Please let me know if this works for you, and if not I'll convert the table back to InnoDB & generate a new error log & dump file for you.

Comment by Andrei Elkin [ 2022-03-31 ]

juan.vera, judging from the source code it looks very probable that raising the gtid-cleanup-batch-size value, say doubling it, may eliminate the crash.
I am still looking for the actual root cause of the issue.

Comment by Dim [ 2022-04-08 ]

Hi @Elkin, I also have this issue when stopping/starting MariaDB. I have 3 servers running MariaDB 10.5.8 on Windows 10, set up as a replication ring: SV1 > SV2 > SV3 > SV1. Because of a bug in 10.5.8 I needed to upgrade to 10.5.9 a few weeks ago: I first turned off all three servers, updated using the .MSI, and then ran all 3 servers with no error.

Next I started partitioning two tables on the three servers. I spent 3 days partitioning the first table by running the partition command on SV1 and letting it replicate to the other servers; this completed for all three servers with no issue. A few days later I started partitioning the second table on all three servers at the same time, instead of sequentially like the first one: on each server I logged in via the mysql command line, ran `SET sql_log_bin = OFF`, and started the partition command, but forgot to turn off replication (STOP SLAVE). It took 1 day to complete, after which I restarted MariaDB on all three servers. SV1 and SV2 started, but SV3 showed the same error as this issue.

I then removed SV3 from replication and set up SV1 and SV2 to replicate each other. That worked fine for 1 or 2 weeks, but this morning SV2 couldn't start, with the same issue as SV3. I applied the workaround by running STOP SLAVE on SV1, deleting mysql/gtid_slave_pos.* on SV2, starting MariaDB on SV2, running `STOP SLAVE`, and then dropping and re-creating the gtid_slave_pos table following @Juan (except steps 6 and 7, because I don't use `master_use_gtid=slave_pos;`). Then I ran `START SLAVE` on SV2 and went back to SV1 to run `CHANGE MASTER TO`, updating the `MASTER_LOG_FILE` and `MASTER_LOG_POS` for SV2 so it won't replicate the DROP and re-create of gtid_slave_pos from SV2.

The server-id values follow the server numbers (SV1 server-id=1, SV2 server-id=2, and so on, with log_slave_updates = ON), but I don't set gtid_domain_id on any of the three servers. Could this also be a factor? I wonder whether this issue sometimes happens because I turn the three servers off/on every day.

Comment by Andrei Elkin [ 2022-04-08 ]

The info from Juan should be sufficient for engineering to start on. I am thus 'providing feedback' to move this ticket onto the weekly priority list.

Comment by Brandon Nesterenko [ 2022-04-13 ]

Hi Andrei! This is ready for review.

My analysis:
During mysqld initialization, if the number of GTIDs added since the last purge of the mysql.gtid_slave_pos table is greater than or equal to the --gtid-cleanup-batch-size value, a race condition can occur. Specifically, the binlog background thread will submit the bg_gtid_delete_pending job to the mysql handle manager; however, the mysql handle manager may not yet be initialized, leading to crashes.
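
That trigger condition can be gauged roughly on a replica before a restart (hedged: the server's pending-delete accounting is internal, so comparing the table's row count with the batch size is only an approximation):

```sql
-- Rows accumulated in the replica's GTID state table...
SELECT COUNT(*) AS accumulated_rows FROM mysql.gtid_slave_pos;

-- ...versus the threshold at which the binlog background thread
-- submits the bg_gtid_delete_pending job during startup:
SELECT @@global.gtid_cleanup_batch_size;
```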

The patch forces the mysql handle manager to initialize/start before the binlog
background thread is created.

Patch: 95825c5
BB: bb-10.4-MDEV-26473

Comment by Juan [ 2022-04-14 ]

Hi Elkin
Thank you very much for that recommendation. With gtid_cleanup_batch_size=1024 I have been unable to reproduce the crash in over 10,000 restarts of an active slave with 16 threads updating the master. So this very much looks like it was the problem; no doubt Brandon's patch fixes it, and we also have an effective workaround: simply raise gtid_cleanup_batch_size until the race condition becomes sufficiently unlikely.
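
For anyone applying the same workaround, a hedged fragment (the variable is dynamic, so it can be raised at runtime, but it should also go into the config file so it survives the very restarts that trigger the bug):

```sql
SET GLOBAL gtid_cleanup_batch_size = 1024;

-- and persist it in my.ini under [mysqld]:
--   gtid_cleanup_batch_size=1024
```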

Thank you both!

Comment by Andrei Elkin [ 2022-04-22 ]

Review notes are made on GH.

Comment by Andrei Elkin [ 2022-04-26 ]

bnestere: I think
> larger numbers would only delay the crash
At first I thought it would prevent any crash, but it actually depends on a number of factors, one of which is the unpredictable pace of the binlog background thread. So in theory it could be lazy at shutdown time while the table holds more than 32K records, and then at restart the initialization-time garbage collection may hit that not-yet-initialized mutex.

Comment by Dim [ 2022-04-27 ]

@Andrei, if I set gtid_cleanup_batch_size=1024 like @Juan mentioned, does it prevent the crash completely?

Comment by Brandon Nesterenko [ 2022-05-05 ]

Hi juan.vera,

That is correct. And for completeness, this bug should also exist in all released versions of 10.6, 10.7, and 10.8. That is, you won't be able to downgrade anything 10.6+ to circumvent this bug. 10.5.8 is the "most recent" unaffected version.

  • Brandon
Generated at Thu Feb 08 09:45:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.