MariaDB Server

MDEV-26473: mysqld got exception 0xc0000005 (rpl_slave_state/rpl_load_gtid_slave_state)

Details

    Description

      Our custom app went through an install on 2021-07-29 where we dumped the master DB (with master info/pos included), imported it into the new slave, and proceeded to run with replication - this started in line 87 of the attached file.

      On 2021-08-04, we upgraded our custom app (does NOT upgrade MariaDB) which runs the following commands between the times shown:

      2021-08-04 10:01:39

      stop slave;
      CHANGE MASTER TO MASTER_CONNECT_RETRY = 1, MASTER_HEARTBEAT_PERIOD = 90, MASTER_USER = 'mvp_repl_slave', MASTER_PASSWORD = '####';
      start slave;
      DELETE FROM user WHERE !((User='root' AND Host='localhost') OR (User='mariadb.sys' AND Host='localhost'));
      FLUSH PRIVILEGES;
      GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE, CREATE, DROP, CREATE VIEW, SHOW VIEW, FILE, SUPER, REPLICATION CLIENT ON *.* TO mvp_local@'localhost' IDENTIFIED BY '####';
      GRANT SELECT, EXECUTE, SUPER, REPLICATION CLIENT ON *.* TO mvp_peer@'192.168.2.2' IDENTIFIED BY '####';
      GRANT SELECT, EXECUTE, SUPER, REPLICATION CLIENT ON *.* TO mvp_peer@'192.168.2.3' IDENTIFIED BY '####';
      FLUSH PRIVILEGES;

      2021-08-04 10:01:52

      The app then stops the service (2021-08-04 10:01:57) and restarts it (2021-08-04 10:02:03).

      This resulted in the following error, which has NOT been readily reproducible:

      ntdll.dll!RtlpUnWaitCriticalSection()
      ntdll.dll!RtlEnterCriticalSection()
      ntdll.dll!RtlEnterCriticalSection()
      mysqld.exe!mysql_manager_submit()[sql_manager.cc:51]
      mysqld.exe!rpl_slave_state::update()[rpl_gtid.cc:358]
      mysqld.exe!rpl_load_gtid_slave_state()[rpl_rli.cc:1930]
      mysqld.exe!binlog_background_thread()[log.cc:10026]
      mysqld.exe!pthread_start()[my_winthread.c:62]
      ucrtbase.dll!o_realloc_base()
      KERNEL32.DLL!BaseThreadInitThunk()
      ntdll.dll!RtlUserThreadStart()

      Activity

            juan.vera (Juan) added a comment (edited)

            Hi Elkin,
            Thank you very much for that recommendation. With gtid_cleanup_batch_size=1024 I have been unable to reproduce the crash in over 10,000 restarts of an active slave with 16 threads updating the master. So this very much looks like it was the problem; no doubt Brandon's patch fixes it, and we also have an effective workaround: simply raise gtid_cleanup_batch_size until the chance of hitting the race condition is acceptably low.

            Thank you both!
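            The workaround described above can be sketched as follows. This is an illustrative fragment based on Juan's comment, not an official recommendation; 1024 is simply the value he tested with:

            ```sql
            -- Raise the GTID cleanup batch size so that background garbage
            -- collection of mysql.gtid_slave_pos runs less often, narrowing
            -- the window for the race condition.
            SET GLOBAL gtid_cleanup_batch_size = 1024;
            ```

            To make the setting survive restarts, the same value would also go into the server config file (e.g. gtid_cleanup_batch_size=1024 under [mysqld] in my.ini).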

            Elkin (Andrei Elkin) added a comment

            Review notes are made on GH.

            Elkin (Andrei Elkin) added a comment (edited)

            bnestere: I think

            > larger numbers would only delay the crash

            At first I thought it would prevent any crash, but it actually depends on a number of factors, one of which is the unpredictable pace of the binlog background thread. So in theory it could be lazy at shutdown time while the table holds more than 32K records, and then at restart the garbage collection run during initialization may hit that not-yet-initialized mutex.

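            Elkin's point above can be checked empirically. A sketch (assuming the default slave-state table mysql.gtid_slave_pos) to gauge how many rows are pending cleanup around shutdown time:

            ```sql
            -- If this count is well above gtid_cleanup_batch_size at shutdown,
            -- a garbage-collection run early in the next startup becomes more likely.
            SELECT COUNT(*) AS pending_rows FROM mysql.gtid_slave_pos;
            SHOW GLOBAL VARIABLES LIKE 'gtid_cleanup_batch_size';
            ```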
            dimavn (Dim) added a comment

            @Andrei, if I set gtid_cleanup_batch_size=1024 like @Juan mentioned, does it prevent the crash completely?


            bnestere (Brandon Nesterenko) added a comment

            Hi juan.vera,

            That is correct. And for completeness, this bug should also exist in all released versions of 10.6, 10.7, and 10.8; that is, downgrading within the 10.6+ series will not circumvent it. 10.5.8 is the most recent unaffected version.

            • Brandon

            People

              bnestere (Brandon Nesterenko)
              paddyK (Pat K)
              Votes: 2
              Watchers: 11

