[MDEV-4816] rpl.rpl_trunc_temp fails in 10.0-serg Created: 2013-07-25  Updated: 2014-01-20  Resolved: 2014-01-20

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
PartOf
is part of MDEV-4786 merge 10.0-monty → 10.0 Closed

 Description   

rpl.rpl_trunc_temp test fails in 10.0-serg. The failure looks like this:

CURRENT_TEST: rpl.rpl_trunc_temp
mysqltest: In included file "./include/wait_for_binlog_event.inc": 
included from /data0/psergey/dev2/10.0-serg/mysql-test/suite/rpl/t/rpl_trunc_temp.test at line 52:
At line 33: ERROR: failed while waiting for $wait_binlog_event in binlog 

In the log, one can see

**** SHOW BINLOG EVENTS on slave ****
binlog_name = 'slave-bin.000001'
SHOW BINLOG EVENTS IN 'slave-bin.000001';
Log_name        Pos     Event_type      Server_id       End_log_pos     Info
slave-bin.000001        4       Format_desc     2       248     Server ver: 10.0.3-MariaDB-debug-log, Binlog ver: 4
slave-bin.000001        248     Gtid_list       2       273     []
slave-bin.000001        273     Binlog_checkpoint       2       312     slave-bin.000001
slave-bin.000001        312     Gtid    1       350     GTID 0-1-1
slave-bin.000001        350     Query   1       446     use `test`; create temporary table t1 (n int)
 
**** SHOW RELAYLOG EVENTS on slave ****
relaylog_name = 'slave-relay-bin.000002'
SHOW RELAYLOG EVENTS IN 'slave-relay-bin.000002';
Log_name        Pos     Event_type      Server_id       End_log_pos     Info
slave-relay-bin.000002  4       Format_desc     2       248     Server ver: 10.0.3-MariaDB-debug-log, Binlog ver: 4
slave-relay-bin.000002  248     Rotate  1       0       master-bin.000001;pos=4
slave-relay-bin.000002  292     Format_desc     1       248     Server ver: 10.0.3-MariaDB-debug-log, Binlog ver: 4
slave-relay-bin.000002  536     Gtid_list       1       273     []
slave-relay-bin.000002  561     Binlog_checkpoint       1       313     master-bin.000001
slave-relay-bin.000002  601     Gtid    1       351     GTID 0-1-1
slave-relay-bin.000002  639     Query   1       447     use `test`; create temporary table t1 (n int)
slave-relay-bin.000002  735     Gtid    1       485     BEGIN GTID 0-1-2
slave-relay-bin.000002  773     Query   1       572     use `test`; insert into t1 values(1)
slave-relay-bin.000002  860     Query   1       641     COMMIT
slave-relay-bin.000002  929     Gtid    1       679     BEGIN GTID 0-1-3
slave-relay-bin.000002  967     Query   1       756     use `test`; delete from t1
slave-relay-bin.000002  1044    Query   1       825     COMMIT
slave-relay-bin.000002  1113    Gtid    1       863     GTID 0-1-4
slave-relay-bin.000002  1151    Query   1       937     use `test`; truncate t1
slave-relay-bin.000002  1225    Gtid    1       975     GTID 0-1-5
slave-relay-bin.000002  1263    Query   1       1085    use `test`; DROP /*!40005 TEMPORARY */ TABLE IF EXISTS `t1`



 Comments   
Comment by Sergei Petrunia [ 2013-07-25 ]

Debugging, I found out that

  • "CREATE TEMPORARY TABLE t1" is successfully replicated and written to slave's binary log
  • INSERT INTO t1 ... , and other statements are replicated and run on the slave, but they are not written to the binary log
  • Replication doesn't break - if one runs regular updates they get replicated.
  • However, some statements, despite being successfully run on the slave, do not end up in slave's binary log
Comment by Sergei Petrunia [ 2013-07-25 ]

I was debugging commands from rpl_trunc_temp on 10.0 and 10.0-serg.

Breakpoint 3, mysql_execute_command (thd=0xbb9eb90) at /home/psergey/dev2/10.0-serg/sql/sql_parse.cc:2177
$386 = {string =

{str = 0xbb8a2d9 "create temporary table t1 (n int)", length = 33}

, cs = 0x8f715a0}
...

Breakpoint 3, mysql_execute_command (thd=0xbb9eb90) at /home/psergey/dev2/10.0-serg/sql/sql_parse.cc:2177
$395 = {string =

{str = 0x8efabe8 "BEGIN", length = 5}

, cs = 0x8f0f1a0}
...
Breakpoint 3, mysql_execute_command (thd=0xbb9eb90) at /home/psergey/dev2/10.0-serg/sql/sql_parse.cc:2177
$399 = {string =

{str = 0xbb8a2d9 "insert into t1 values(1)", length = 24}, cs = 0x8f715a0}
...
Breakpoint 9, trans_commit_stmt (thd=0xbb9eb90) at /home/psergey/dev2/10.0-serg/sql/transaction.cc:334
$405 = {string = {str = 0xbb8a2d9 "insert into t1 values(1)", length = 24}

, cs = 0x8f715a0}
(gdb) p thd->current_stmt_binlog_format
$406 = BINLOG_FORMAT_STMT
...

Breakpoint 15, THD::decide_logging_format (this=0xbb9eb90, tables=0xaeecd5a8) at /home/psergey/dev2/10.0-serg/sql/sql_class.cc:4906
(gdb) p this->query_string
$408 = {string =

{str = 0xbb8a2d9 "COMMIT", length = 6}

, cs = 0x8f715a0}
(gdb) p this->current_stmt_binlog_format
$409 = BINLOG_FORMAT_STMT

Here, the problems start:
THD::decide_logging_format() sees mysql.gtid_slave_pos as a table that's part of the statement.
This table has

table->table->no_replicate == true

and the code in THD::decide_logging_format() executes these lines for it:

if (lex->is_stmt_unsafe() || lex->is_stmt_row_injection()

(flags_write_all_set & HA_BINLOG_STMT_CAPABLE) == 0) { /* log in row format! */ set_current_stmt_binlog_format_row_if_mixed(); }

after which we have
(gdb) p lex->is_stmt_unsafe()
$418 = true

In 10.0, we have:

(gdb) p lex->is_stmt_unsafe()
$310 = false

Comment by Sergei Petrunia [ 2013-07-25 ]

This difference causes another (in the same function
10.0:
(gdb) p this->current_stmt_binlog_format
$311 = BINLOG_FORMAT_STMT

10.0-serg:
(gdb) p this->current_stmt_binlog_format
$421 = BINLOG_FORMAT_ROW

Comment by Sergei Petrunia [ 2013-07-25 ]

... and the above difference stays. When I run another INSERT statement, at statement start I have the same BINLOG_FORMAT_STMT vs BINLOG_FORMAT_ROW difference.

Comment by Sergei Petrunia [ 2013-07-25 ]

THD::decide_logging_format() eventually executes this code:

if ((replicated_tables_count == 0) || ! is_write)

{ DBUG_PRINT("info", ("decision: no logging, no replicated table affected")); set_binlog_local_stmt_filter(); }
Comment by Sergei Petrunia [ 2013-08-02 ]

Discussed with Sergei Vojtovich. Came up with this exploration patch:

=== modified file 'sql/sql_class.cc'
— sql/sql_class.cc 2013-07-24 10:43:57 +0000
+++ sql/sql_class.cc 2013-08-02 08:37:54 +0000
@@ -5001,7 +5001,7 @@ int THD::decide_logging_format(TABLE_LIS
DBUG_PRINT("info", ("table: %s; ha_table_flags: 0x%llx",
table->table_name, flags));

  • if (table->table->no_replicate)
    + if (table->table->no_replicate && !(table->alias && !strcmp(table->alias, "gtid_slave_pos")))
    {
    /*
    The statement uses a table that is not replicated.

With this patch, the test passes:

==============================================================================

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
rpl.rpl_trunc_temp 'mix' [ pass ] 1497
rpl.rpl_trunc_temp 'row' [ skipped ] Neither MIXED nor STATEMENT binlog format
rpl.rpl_trunc_temp 'stmt' [ pass ] 1321
--------------------------------------------------------------------------
The servers were restarted 2 times
Spent 2.818 of 15 seconds executing testcases

Comment by Sergei Petrunia [ 2013-08-02 ]

Basic idea:

  • introduce a separate flag, "this is a gtid table", set it as appropriate.
  • use it in THD::decide_logging_format()

Also need to:

  • Run the testsuite with the above patch, see if anything shows up.
Comment by Sergei Petrunia [ 2013-08-02 ]

With the above patch (with strcmp) I get the following tests to fail:

unit.dbug
perfschema.relaylog 'mix'
perfschema.relaylog 'row'
perfschema.relaylog 'stmt'

This is a proper subset of the failures one gets without the patch.

Will work on polishing the patch

Comment by Sergei Petrunia [ 2013-08-03 ]

Pushed the fix to 10.0-serg

Comment by Kristian Nielsen [ 2013-09-03 ]

Something has gone wrong with this issue. The proposed patch was rejected in review:

https://lists.launchpad.net/maria-developers/msg05966.html

However, somehow the patch now nevertheless has ended up in main 10.0

Comment by Sergei Petrunia [ 2013-09-03 ]

Kristian, this was intended. See the review request https://lists.launchpad.net/maria-developers/msg05958.html:

> In order to get rid of test failures of 10.0-serg, I have pushed the below patch.

I pushed the patch during your vacation as a way to get the testsuite to pass. After that, you got back and rejected it. I did not act on the rejection, though.

I've checked the current 10.0 - if I revert the patch for mdev-4816 with "patch -R", the testcase starts to fail again.

I could act on your review comments, or you can take over for this issue.

Comment by Kristian Nielsen [ 2013-11-27 ]

I pushed to 10.0-base a patch that corrects the GTID code to clear
the OPTION_BIN_LOG flag rather than set table->no_replicate.

I still think you should revert the original patch in 10.0.

(I tried to merge my 10.0-base patch to 10.0 to check if it solves the
original problem, but there are lots of other unmerged changes that cause a
lot of unrelated conflicts, so I did not manage to complete the merge).

Comment by Sergei Petrunia [ 2013-11-27 ]

Ok. I'm fine with reverting the patch myself, or having someone else revert it.

I've tried "patch -R" on current 10.0, and it applied. Should I push the reverse patch into 10.0?

Comment by Sergei Petrunia [ 2014-01-20 ]

Pushed a change that reverts my original patch.

Generated at Thu Feb 08 06:59:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.