Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5217

Assorted failures with parallel replication on concurrent DML/DDL flow and periodic FLUSH LOGS

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • None
    • 10.0.6
    • None
    • None

    Description

      I've got several types of failures by running the provided test with the same parameters over and over again. Here is the statistics, more details on each type of the failure below:

      Out of 10 consequent runs,

      Error in Log_event::read_log_event(): 'Event too big' 1
      Error in Log_event::read_log_event(): 'read error' 4
      Error in Log_event::read_log_event(): 'Event too small' 2
      Error reading relay log event: Aborting slave SQL thread because of partial event read 1
      Could not execute .... event (Can't find record) 2

      In addition to "Could not execute .... event", I also at least once had a subsequent crash with "safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked".

      Examples of error messages:

      131031 11:25:27 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 2818048, event_type: 0
      131031 11:25:27 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
      131031 11:25:27 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      131031 11:25:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000019' position 287

      131031 11:19:17 [ERROR] Slave SQL thread: I/O error reading event(errno: 175  cur_log->error: 12)
      131031 11:19:17 [ERROR] Error reading relay log event: Aborting slave SQL thread because of partial event read
      131031 11:19:17 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      131031 11:19:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000048' position 287

      131031 10:54:58 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.DD; Can't find record in 'DD', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000033, end_log_pos 688, Internal MariaDB error code: 1032
      131031 10:54:58 [ERROR] Slave SQL: Can't find record in 'DD', Internal MariaDB error code: 1032
      131031 10:54:58 [Warning] Slave: Can't find record in 'DD' Error_code: 1032
      131031 10:54:58 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000033' position 326

      131031 10:51:49 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
      131031 10:51:49 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
      131031 10:51:49 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      131031 10:51:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000052' position 326

      Interestingly, in this case slave didn't stop but later also issued another error:

      131031 10:31:22 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1987208563, event_type: 100
      131031 10:31:22 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
      131031 10:31:22 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      131031 10:31:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000022' position 4
      131031 10:35:55 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.CC; Can't find record in 'CC', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000081, end_log_pos 1184, Internal MariaDB error code: 1032
      131031 10:35:55 [Warning] Slave: Can't find record in 'CC' Error_code: 1032
      131031 10:35:55 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000081' position 287
      131031 10:35:55 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000081' at position 326

      safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5805 at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5631
      131031  3:45:35 [ERROR] mysqld got signal 6 ;

      RQG grammar (parallel-replication-6.yy):

      query:
      	dml_ddl | dml_ddl | dml_ddl | dml_ddl | dml_ddl |
      	FLUSH LOGS
      ;
       
      create_drop_table:
      	CREATE TABLE IF NOT EXISTS { 't_'.abs($$) } LIKE _table | 
      	DROP TABLE IF EXISTS { 't_'.abs($$) } ;
       
      dml_ddl:
      	delete | insert | create_drop_table ;
       
      delete:
      	DELETE FROM _table WHERE `col_int_nokey` BETWEEN _tinyint[invariant] AND _tinyint[invariant] + 2 ;
       
      insert:
      	INSERT INTO _table ( `col_int_key` , `col_int_nokey` )   VALUES ( _tinyint, _tinyint ), ( _tinyint, _tinyint ) ;

      RQG command line (works on the standard lp:randgen tree):

      perl ./runall-new.pl --grammar=parallel-replication-6.yy  --threads=12 --queries=100M --duration=300  --mysqld=--slave-parallel-threads=8 --rpl_mode=row --basedir=<your basedir> --vardir=<your vardir>

      revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55
      revno: 3690
      branch-nick: 10.0-knielsen

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment - - edited

            I'll run the test again with the bugfix for MDEV-5206, maybe it will help here as well.

            Upd: it didn't help, I'm still getting the errors on revno 3691.

            Upd2: Same happens if I remove FLUSH LOGS from the flow at all, and instead reduce the value of max-binlog-size that that it rotates frequently enough (e.g. set it to 4096).

            elenst Elena Stepanova added a comment - - edited I'll run the test again with the bugfix for MDEV-5206 , maybe it will help here as well. Upd: it didn't help, I'm still getting the errors on revno 3691. Upd2: Same happens if I remove FLUSH LOGS from the flow at all, and instead reduce the value of max-binlog-size that that it rotates frequently enough (e.g. set it to 4096).

            Ok, this was a great testcase, there were a lot of bugs uncovered. Everytime I
            fixed one and re-ran the test, another turned up

            I seem to have fixed all issues seen with this particular test now, at least I
            was able to run the runall command 10 times or so in a row without any
            failures. I also think I have fixes for all the problems you mentioned (or at
            least similar-looking ones):

            • I fixed several things in error handling, where an error stop would be
              handled incorrectly and cause the server to hang or otherwise misbehave.
            • I fixed that an error set by one thread could be cleared by another thread,
              leaving the SQL thread stopped with no apparent error in SHOW SLAVE STATUS
              (this is probably MDEV-5219, but I will check that to be sure).
            • I fixed that the current relay log position for the driver SQL thread was
              not correctly updated, which could cause the SQL thread to start reading
              events at the wrong position around relay log rotates, seen as various
              kinds of corrupt events during event read.
            • I fixed a race in wait_for_commit that could cause unlock of uninitialised
              mutex.
            • I fixed a missing wait between threads that could cause incorrect parallel
              execution of conflicting MyISAM queries (when --log-slave-updates=0), seen
              as "Could not execute .... event (Can't find record)" or similar.
            knielsen Kristian Nielsen added a comment - Ok, this was a great testcase, there were a lot of bugs uncovered. Everytime I fixed one and re-ran the test, another turned up I seem to have fixed all issues seen with this particular test now, at least I was able to run the runall command 10 times or so in a row without any failures. I also think I have fixes for all the problems you mentioned (or at least similar-looking ones): I fixed several things in error handling, where an error stop would be handled incorrectly and cause the server to hang or otherwise misbehave. I fixed that an error set by one thread could be cleared by another thread, leaving the SQL thread stopped with no apparent error in SHOW SLAVE STATUS (this is probably MDEV-5219 , but I will check that to be sure). I fixed that the current relay log position for the driver SQL thread was not correctly updated, which could cause the SQL thread to start reading events at the wrong position around relay log rotates, seen as various kinds of corrupt events during event read. I fixed a race in wait_for_commit that could cause unlock of uninitialised mutex. I fixed a missing wait between threads that could cause incorrect parallel execution of conflicting MyISAM queries (when --log-slave-updates=0), seen as "Could not execute .... event (Can't find record)" or similar.
            j j added a comment -

            Was this applied to MariaDB 5.5.x? If not, see here: https://mariadb.atlassian.net/browse/MDEV-7124

            j j added a comment - Was this applied to MariaDB 5.5.x? If not, see here: https://mariadb.atlassian.net/browse/MDEV-7124

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.