[MDEV-9044] Binlog corruption in Galera Created: 2015-10-30  Updated: 2015-12-18  Resolved: 2015-12-18

Status: Closed
Project: MariaDB Server
Component/s: Galera, Replication
Affects Version/s: 10.1.6, 10.1.8
Fix Version/s: 10.1.10

Type: Bug Priority: Blocker
Reporter: Bryan Seitz Assignee: Nirbhay Choubey (Inactive)
Resolution: Fixed Votes: 2
Labels: galera, replication
Environment:

6x Galera Nodes:
Kernel: 3.10.0-229.14.1.el7.x86_64
Disk: Local Virtio 200GB
Hypervisor: CentOS7 KVM
vCPUS: 4
Memory: 16GB

4x Slaves:
Kernel: 3.10.0-229.14.1.el7.x86_64
Disk: Local Virtio 40GB
Hypervisor: CentOS7 KVM
vCPUS: 2
Memory: 4GB

1x Garb:
Kernel: 3.10.0-229.14.1.el7.x86_64
Physical Node

MariaDB Package: MariaDB-server-10.1.8-1.el7.centos.x86_64

Slave Config:

[unknown] bryan_seitz@v-pdnsauth-001-net:/etc/my.cnf.d$ grep -v ^# server.cnf
 
[server]
 
[mysqld]
server_id=101
datadir=/var/lib/mysql
user=mysql
default_storage_engine=InnoDB
thread_cache_size = 128
max-allowed-packet = 64M
skip-name-resolve
sync_binlog=1
 
log_bin=binlog
log_slave_updates=1
expire_logs_days=7
binlog_format=ROW
replicate_do_db=powerdns
slave_net_timeout=60
relay_log=v-pdnsauth-001-net-relay-bin
 
 
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log-buffer-size=64M
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 512M
 
[embedded]
 
[mariadb]
 
[mariadb-10.1]

Galera Config:

[mysqld]
server_id=1
datadir=/var/lib/mysql
user=mysql
default_storage_engine=InnoDB
skip-name-resolve
 
### WSRep
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://100.65.3.71,100.65.3.72,100.65.3.73,100.65.11.71,100.65.11.72,100.65.11.73,10.51.22.253
wsrep_node_address=100.65.3.71
wsrep_node_name=v-mysql-001-net.ash2.symcpe.net
wsrep_slave_threads=8
wsrep_sst_method=rsync
wsrep_cluster_name=net_cluster_1
wsrep_provider_options="gcache.size=256M"
wsrep_provider_options="evs.suspect_timeout=PT5S"
wsrep_provider_options="evs.inactive_timeout=PT15S"
wsrep_provider_options="evs.send_window=4"
wsrep_provider_options="evs.user_send_window=2"
wsrep_provider_options="evs.inactive_check_period=PT2S"
wsrep_provider_options="gmcast.segment=0"
wsrep_certify_nonPK=1
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
wsrep_debug=0
wsrep_convert_LOCK_to_trx=0
wsrep_retry_autocommit=5
wsrep_auto_increment_control=1
wsrep_replicate_myisam=0
wsrep_causal_reads=0
gtid_strict_mode=1
gtid-domain-id=2
innodb-force-primary-key=1
enforce-storage-engine=innodb
### Tuning
tmp_table_size=64M
max_allowed_packet=1073741824
key_buffer_size=24M
query_cache_type=0
query_cache_size=0
max_connections=500
thread_cache_size=512
open_files_limit=65534
table_definition_cache=1024
table_open_cache=2048
 
### InnoDB
innodb_flush_method=O_DIRECT
innodb_log_files_in_group=2
innodb-log-buffer-size=64M
innodb_log_file_size=256M
innodb_flush_log_at_trx_commit=0
innodb_file_per_table=1
innodb_buffer_pool_size=1G
innodb-doublewrite=1
innodb-stats-on-metadata=0
innodb_autoinc_lock_mode=2
innodb-thread-concurrency=0
 
 
### Binary logging
binlog_format=ROW
log_bin=/data/mysql-logs/mysql-bin
expire_logs_days=7
sync_binlog=1
binlog_checksum=1
log_slave_updates=1
ignore_db_dirs=.ssh
ignore_db_dirs=lost+found
 
### SST
[sst]
progress=1
 
### EOF


Attachments: PNG File MariaDB.graffle_2015-10-30_01-40-42.png     File node1-binlog-000014.gz     File node3-binlog-000014.gz    
Sprint: 10.1.10

 Description   

Seeing binlog corruption and other errors on slave nodes.

Slave bootstrap procedure:
Dump on cluster node 1:

mysqldump -u root -p -B powerdns -c --add-drop-database --add-drop-table -e --routines --master-data --gtid  > v-mysql-001-net.sql

On each slave node:

mariadb> stop slave
mariadb> reset slave
mysql -u root -p < v-mysql-001-net.sql
mariadb> start slave
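Combined into one pass per slave, the procedure above amounts to roughly the following (a sketch only; credentials and host names are placeholders, and this needs the running servers, so it is not directly runnable here):

```shell
# On cluster node 1: take the dump (options as in the description above).
mysqldump -u root -p -B powerdns -c --add-drop-database --add-drop-table \
    -e --routines --master-data --gtid > v-mysql-001-net.sql

# On each slave node: stop and reset replication, reload, restart.
mysql -u root -p -e 'STOP SLAVE; RESET SLAVE;'
mysql -u root -p < v-mysql-001-net.sql
mysql -u root -p -e 'START SLAVE;'
```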

Sometimes this works, sometimes it does not. Even when it works, replication dies minutes later with errors like:

Binlog issues:

[unknown] root@v-mysql-003-net:/data/mysql-logs# ls -l
total 60736
-rw-rw---- 1 mysql mysql 5366340 Oct 27 10:20 mysql-bin.000001
-rw-rw---- 1 mysql mysql 5455473 Oct 27 11:20 mysql-bin.000002
-rw-rw---- 1 mysql mysql 5342334 Oct 27 12:20 mysql-bin.000003
-rw-rw---- 1 mysql mysql 5757401 Oct 27 13:20 mysql-bin.000004
-rw-rw---- 1 mysql mysql 5632098 Oct 27 14:20 mysql-bin.000005
-rw-rw---- 1 mysql mysql 5740158 Oct 27 15:20 mysql-bin.000006
-rw-rw---- 1 mysql mysql 5720343 Oct 27 16:20 mysql-bin.000007
-rw-rw---- 1 mysql mysql 5474330 Oct 27 17:20 mysql-bin.000008
-rw-rw---- 1 mysql mysql 5891716 Oct 27 18:20 mysql-bin.000009
-rw-rw---- 1 mysql mysql 5670778 Oct 27 19:20 mysql-bin.000010
-rw-rw---- 1 mysql mysql 5481569 Oct 27 20:20 mysql-bin.000011
-rw-rw---- 1 mysql mysql  574991 Oct 27 20:43 mysql-bin.000012
-rw-rw---- 1 mysql mysql     414 Oct 27 20:43 mysql-bin.000013
-rw-rw---- 1 mysql mysql     442 Oct 27 20:43 mysql-bin.index
[unknown] root@v-mysql-003-net:/data/mysql-logs# mysqlbinlog mysql-bin.000011 > /dev/null
ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 858861106, event_type: 49
ERROR: Could not read entry at offset 4421267: Error in log format or read error.
[unknown] root@v-mysql-003-net:/data/mysql-logs#
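The manual check above can be generalized: a small helper can sweep every binlog in a directory through mysqlbinlog and flag the files that fail to parse. A sketch, assuming the `mysql-bin.*` naming used here; the second argument exists only so the checker command can be substituted:

```shell
# Sweep a directory of binary logs through mysqlbinlog and report which
# files fail to parse. Index files are skipped. The checker defaults to
# mysqlbinlog but can be overridden (e.g. for testing).
scan_binlogs() {
    dir=$1
    checker=${2:-mysqlbinlog}
    for f in "$dir"/mysql-bin.*; do
        [ -f "$f" ] || continue
        case "$f" in *.index) continue ;; esac
        if "$checker" "$f" > /dev/null 2>&1; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
        fi
    done
}
```

Running `scan_binlogs /data/mysql-logs` on the node above would flag mysql-bin.000011 as CORRUPT.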

Slave errors:

                Relay_Log_Pos: 4
        Relay_Master_Log_File: mysql-bin.000078
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: powerdns
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 927253
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-1-76468, which is not in the master's binlog'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-1-76468
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)

or

                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'mysql-bin.000075' at 415, the last event read from 'mysql-bin.000074' at 41011, the last byte read from 'mysql-bin.000074' at 41042.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3

or

MariaDB [(none)]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: mysql-cluster1-net.ash2.symcpe.net
                  Master_User: slave_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000071
          Read_Master_Log_Pos: 458
               Relay_Log_File: v-pdnsauth-001-net-relay-bin.000002
                Relay_Log_Pos: 750
        Relay_Master_Log_File: mysql-bin.000071
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: powerdns
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 458
              Relay_Log_Space: 1065
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000071' at 458, the last event read from 'mysql-bin.000071' at 454149, the last byte read from 'mysql-bin.000071' at 454168.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-3-69910
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)

more

                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event '.' at 4, the last event read from 'mysql-bin.000001' at 60529261, the last byte read from 'mysql-bin.000001' at 60529590.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 5
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-3-9495
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)
 
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event '.' at 4, the last event read from 'mysql-bin.000011' at 4420261, the last byte read from 'mysql-bin.000011' at 4420590.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-3-9490
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)
 



 Comments   
Comment by Bryan Seitz [ 2015-10-30 ]

Attached architecture diagram.

Each DNS (PowerDNS) slave node async replicates the powerdns database from the Galera cluster.
The Galera cluster consists of 6 nodes + 1 garb between 3 sites.

Comment by Bryan Seitz [ 2015-10-30 ]

Any and all logs are available upon request.

Note: I have uploaded some logs gathered while troubleshooting with mgriffin.

Comment by Nirbhay Choubey (Inactive) [ 2015-10-30 ]

bryan_seitz Can you share the config files? Did you try it with 10.0 galera server?
Wondering if this issue is exclusive to 10.1.

Comment by Bryan Seitz [ 2015-10-30 ]

the config files are in the environment section. I had upgraded this cluster from 10.0 to try and solve another issue I had so it is possible this is 10.1 exclusive.

Comment by Nirbhay Choubey (Inactive) [ 2015-10-30 ]

By config files, I meant mariadb server configuration files (my.cnf).

Comment by Bryan Seitz [ 2015-10-30 ]

In the ticket click on the Environment drop down.

Comment by Nirbhay Choubey (Inactive) [ 2015-10-30 ]

Ok, I see it now.

Comment by Bryan Seitz [ 2015-10-31 ]

A data point to note: I had mysql-zrm running, and it was calling FLUSH LOGS, which was rotating the binlogs about once an hour. I believe this is what was triggering the corruption; mgriffin can correct me if I am wrong. If that is the case, there is definitely a code bug in 10.1.8.
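The hourly rotation mysql-zrm was performing amounts to something like the following cron fragment (an illustration only, not the actual zrm configuration; the :20 minute mark matches the binlog mtimes in the listing above, and credential handling is omitted):

```shell
# Illustrative crontab entry: rotate the binary logs at 20 past each hour.
20 * * * *  mysql -u root -e 'FLUSH LOGS'
```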

Comment by Bryan Seitz [ 2015-11-03 ]

I have not had a single slave issue or binlog corruption since mysql-zrm stopped calling FLUSH LOGS every hour. I would definitely investigate the flushing code for an issue, though.

Comment by Bryan Seitz [ 2015-11-04 ]

Something flushed my binlogs and the slave issue happened again.

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 100.65.3.7
                  Master_User: slave_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 32577604
               Relay_Log_File: v-pdnsauth-001-net-relay-bin.000002
                Relay_Log_Pos: 32574849
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table: powerdns.%
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 32577474
              Relay_Log_Space: 32575290
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event '.' at 4, the last event read from 'mysql-bin.000004' at 32577604, the last byte read from 'mysql-bin.000004' at 32577949.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-2-63752
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)

Also interesting: a restart of the server fixed this, but this is very scary:

[530013.093482] mysqldump[18357]: segfault at 0 ip 00007f9147612494 sp 00007ffdf204b5b0 error 4 in mysqldump[7f91475be000+300000]

Comment by Bryan Seitz [ 2015-11-05 ]

I reinstalled fresh, imported from backup+grants and it's still doing it.

Comment by Bryan Seitz [ 2015-11-05 ]

Unless we can come to some sort of fix/workaround, I will not be using 10.1.8 any longer. Please advise.

Comment by Nirbhay Choubey (Inactive) [ 2015-11-05 ]

bryan_seitz Does it happen even without flush?

Comment by Bryan Seitz [ 2015-11-06 ]

I am not sure; that is the only time I observed the behavior. Of note, Percona XtraDB Cluster does NOT exhibit this behavior.

Comment by Nirbhay Choubey (Inactive) [ 2015-11-13 ]

Can't reproduce. Here is what I have tried so far :

  +----------------+       +----------------+         +---------------+
  |MariaDB Galera 1+------->MariaDB Galera 2+--------->MariaDB (slave)|
  +------------+--++       |   (master)     |         +---------------+
               ^  ^        +----^-----------+
+---------+    |  |             |
| Client  |    |  |    +--------+---+
|(INSERTs)+----+  +----+   Client   |
+---------+            |(FLUSH LOGS)|
                       +------------+

Comment by Nirbhay Choubey (Inactive) [ 2015-11-13 ]

bryan_seitz Do you have an affected binary log file that you could share?

Comment by Bryan Seitz [ 2015-11-13 ]

I am pretty sure I FTP'd up the logs for you guys w/ the ticket #. Also I even tried dumping all my DBs, recreating from scratch after removing everything and installing Maria 10.1.8 fresh, re-importing, and it still did it.

Comment by Bryan Seitz [ 2015-11-13 ]

Also not sure if it matters but I had 6 nodes + a garb.

Comment by Karl E. Jørgensen [ 2015-11-25 ]

We are experiencing (what appears to be) the same errors

A couple of observations:

  • We have 3 nodes in the cluster
  • Due to legacy reasons (and MDEV-9037), all write activity is directed towards node #1
  • We have events which (due to replication) run some statements with sql_log_bin=0
  • We get corruptions in the master binary log on node #1 only.
  • Binary log is not corrupted on nodes #2 or #3 - but we cannot use the binary log from there, due to MDEV-9037 ...
  • We do not use FLUSH LOGS...
Comment by Nirbhay Choubey (Inactive) [ 2015-11-25 ]

jorginator Hi! Thank you for reporting this. Could you please share the server logs and possibly the corrupted binary logs (https://mariadb.com/kb/en/mariadb/reporting-bugs/#what-to-do-when-you-get-a-crash-after-installing-a-debug-binary)?

Comment by Karl E. Jørgensen [ 2015-11-25 ]

Hi - thanks for responding so quickly!

Mariadb server logs are boring - no sign of trouble in there. We are currently working around the problem by skipping to the next binary log, and the only messages in there are stuff like

[Note] Start binlog_dump to slave_server(31651), pos(muiddb-cluster-bin.000105, 4)

which occur when we do START SLAVE on the downstream servers.
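The skip-to-the-next-binlog workaround mentioned above presumably looks something like this on a downstream slave (a sketch only: the file name is invented following the muiddb-cluster-bin pattern, the GTID mode must be switched off to use a file position, and this deliberately discards the unread tail of the corrupt log):

```sql
-- Hypothetical skip-ahead: restart replication from position 4
-- (the start) of the next binary log file.
STOP SLAVE;
CHANGE MASTER TO
    MASTER_USE_GTID = no,
    MASTER_LOG_FILE = 'muiddb-cluster-bin.000106',
    MASTER_LOG_POS  = 4;
START SLAVE;
```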

As for corrupted binary logs, we have quite a few!

They are identifiable on the master server - e.g.:

[root@muiddb01 binary-logs]# mysqlbinlog muiddb-cluster-bin.000121 > /dev/null
ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 4127195136, event_type: 91
ERROR: Could not read entry at offset 79569: Error in log format or read error.

Unfortunately, the application data in the binary logs is proprietary, which prevents me from uploading them to a public forum, so I may email them directly to you instead...

Comment by Karl E. Jørgensen [ 2015-11-25 ]

Running mysqlbinlog on the corrupt files does not reveal any identifiable pattern. The event types listed vary, which makes me suspect that the actual cause is the preceding binary log entry...

Comment by Nirbhay Choubey (Inactive) [ 2015-11-25 ]

Could you tail the last few lines (say, tail -200) of mysqlbinlog's output and share it?

Comment by Karl E. Jørgensen [ 2015-11-25 ]

I have sent you (via email) a (small-ish) corrupted binlog - hope this helps!

Comment by Karl E. Jørgensen [ 2015-11-25 ]

FYI: In an attempt to avoid too many data discrepancies, we have decreased max_binlog_size to 98304. In theory, this should mean that we skip fewer transactions when we force a skip to the next binary log. The file I sent (and mentioned above) was produced while this setting was in effect.

Comment by Karl E. Jørgensen [ 2015-11-26 ]

A little bit of progress:

  • Going back through the corrupted binary logs, the corruption appears most frequently (but not always) to occur after a "big" transaction - in our case nearly always an insert of many rows (1200 rows or so). This is a single-statement INSERT INTO .. SELECT ... FROM - we don't know whether 1200 individual single-row INSERT statements in a transaction would produce the same corruption.
  • We switched off the job that does the above inserts, and no new binary log corruption has occurred overnight!

Obviously, we cannot leave this switched off, but it may help guide diagnostics?
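Based on the description above, the statement shape that tends to precede the corruption would be roughly the following (table and column names are invented for illustration; only the shape - one large single-statement multi-row insert, replicated in ROW format - comes from the report):

```sql
-- One transaction, one statement, ~1200 rows written at once.
INSERT INTO reporting.daily_rollup
SELECT * FROM staging.daily_rows
WHERE batch_date = CURDATE();
```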

Comment by Karl E. Jørgensen [ 2015-11-26 ]

I'm not sure which configuration variables are useful - here are the ones we have which refer to either "log" or "bin":

Variable Value
aria_checkpoint_log_activity 1048576
aria_log_file_size 1073741824
aria_log_purge_type immediate
aria_sync_log_dir NEWFILE
back_log 200
binlog_annotate_row_events OFF
binlog_cache_size 32768
binlog_checksum NONE
binlog_commit_wait_count 0
binlog_commit_wait_usec 100000
binlog_direct_non_transactional_updates OFF
binlog_format ROW
binlog_optimize_thread_scheduling ON
binlog_row_image FULL
binlog_stmt_cache_size 32768
character_set_filesystem binary
encrypt_binlog OFF
expire_logs_days 5
general_log OFF
general_log_file muiddb01.log
gtid_binlog_pos 2-322139-13293205
gtid_binlog_state 2-31670-12668861,2-322141-1109,2-322140-7985379,2-322139-13293205
ignore_db_dirs lost+found,.ssh,backup,innodb-logs,binary-logs,relay-logs,undo
innodb_api_enable_binlog OFF
innodb_encrypt_log OFF
innodb_flush_log_at_timeout 1
innodb_flush_log_at_trx_commit 0
innodb_locks_unsafe_for_binlog OFF
innodb_log_arch_dir /var/lib/mysql/innodb-logs
innodb_log_arch_expire_sec 0
innodb_log_archive OFF
innodb_log_block_size 512
innodb_log_buffer_size 8388608
innodb_log_checksum_algorithm INNODB
innodb_log_compressed_pages OFF
innodb_log_file_size 1073741824
innodb_log_files_in_group 4
innodb_log_group_home_dir /var/lib/mysql/innodb-logs
innodb_mirrored_log_groups 1
innodb_online_alter_log_max_size 134217728
innodb_scrub_log OFF
innodb_scrub_log_speed 256
innodb_undo_logs 128
innodb_use_global_flush_log_at_trx_commit ON
log_bin ON
log_bin_basename /var/lib/mysql/binary-logs/muiddb-cluster-bin
log_bin_index /var/lib/mysql/binary-logs/muiddb-cluster-bin.index
log_bin_trust_function_creators OFF
log_error  
log_output FILE
log_queries_not_using_indexes ON
log_slave_updates ON
log_slow_filter filesort,filesort_on_disk,full_join,tmp_table,tmp_table_on_disk
log_slow_rate_limit 1
log_slow_verbosity innodb,query_plan,explain
log_tc_size 24576
log_warnings 3
max_binlog_cache_size 18446744073709547520
max_binlog_size 98304
max_binlog_stmt_cache_size 18446744073709547520
max_relay_log_size 1073741824
relay_log /var/lib/mysql/relay-logs/log
relay_log_basename /var/lib/mysql/relay-logs/log
relay_log_index /var/lib/mysql/relay-logs/log.index
relay_log_info_file relay-log.info
relay_log_purge ON
relay_log_recovery OFF
relay_log_space_limit 0
slow_query_log ON
slow_query_log_file /var/lib/mysql/slow.log
sql_log_bin ON
sql_log_off OFF
sync_binlog 0
sync_relay_log 10000
sync_relay_log_info 10000
version 10.1.8-MariaDB-log
wsrep_forced_binlog_format NONE
wsrep_log_conflicts ON
wsrep_provider_options base_host = 172.16.18.139; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 4G; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 172.16.18.139; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 7; socket.checksum = 2;
Comment by Karl E. Jørgensen [ 2015-11-26 ]

FYI: It is also worth noting that in our case we do not get MariaDB crashes. The node that produces the corrupted binary log appears to carry on, oblivious to the fact that the binary log is corrupted (which makes sense, as it never actually reads that log). However, the slave servers do notice the corruption, and their replication then grinds to a halt.

Comment by Karl E. Jørgensen [ 2015-12-01 ]

Interesting: In an attempt at working around the problem, we shut down nodes #2 and #3 (leaving us with a one-node cluster), without any configuration changes. So far this seems to have worked - we have been unable to reproduce any binary log corruption since then.

My working theory is that the wsrep code somehow corrupts memory, which then causes binary log corruption. Shutting down the other nodes probably affects code paths in the wsrep code, as it no longer needs to send write sets to other nodes...

Comment by Karl E. Jørgensen [ 2015-12-02 ]

Good news: I have managed to reproduce the problem in my local test cluster, using the MariaDB 10.1.8 debian package (compiled from source) and a public data set from https://github.com/datacharmer/test_db . This set does not play games with sql_log_bin like our application does - simply loading the dataset (using employees.sql) causes corruption in the receiving node (node #1 in my case). I suspect the resulting binary logs will be instructive.

I have attached node1-binlog-000014.gz and node3-binlog-000014.gz (gzipped binary logs) - they both cover (almost exactly) the same time period, and the activity on the database is entirely from loading employees.sql - as far as I can see, they line up pretty well.
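For reference, the reproduction described above can be outlined as follows (a procedure sketch only: it needs a running Galera cluster with log_bin enabled, and the binlog path and file number are assumptions for illustration):

```shell
# Load the public employees data set through one Galera node.
git clone https://github.com/datacharmer/test_db.git
cd test_db
mysql -u root -p < employees.sql

# On the node that received the write set, check whether its binary
# log still parses cleanly (path and file number are examples):
mysqlbinlog /var/lib/mysql/binary-logs/mysql-bin.000014 > /dev/null \
    && echo "binlog OK" || echo "binlog corrupt"
```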

Comment by Karl E. Jørgensen [ 2015-12-03 ]

I set up a test environment and managed to narrow it down to the offending commit: https://github.com/MariaDB/server/commit/6309a30 : we have binary log corruption at and after this commit, but not before.

Comment by Sergei Golubchik [ 2015-12-04 ]

From: nirbhay_c
To: serg
Subject: MDEV-9044 and the new definition of my_b_fill()

I am working on MDEV-9044 (binlog corruption), and it seems related to
the recent changes to my_b_fill() in 10.1. Reverting the following patch
fixes this issue.

https://github.com/MariaDB/server/commit/6309a30

But, I'd assume that reverting this patch would break encryption.

Comparing my_b_fill() between 10.0 and 10.1: while it actually filled
the buffer in 10.0, in 10.1 it seemingly does nothing
(Buffer/Count being 0).

static inline size_t my_b_fill(IO_CACHE *info)
{
  /* Discard whatever is still buffered. */
  info->read_pos= info->read_end;
  /* _my_b_read() is expected to refill the buffer; on success, return the
     number of bytes now available (read_end - read_pos). If the buffer was
     not actually refilled, this evaluates to 0. */
  return _my_b_read(info,0,0) ? 0 : info->read_end - info->read_pos;
}

Comment by Nirbhay Choubey (Inactive) [ 2015-12-04 ]

serg I seem to have found the issue. It's in IO_CACHE. Will submit a patch soon.

Comment by Karl E. Jørgensen [ 2015-12-07 ]

Feel free to poke me when you have a patch ready for testing - we are keen to test this.

Comment by Nirbhay Choubey (Inactive) [ 2015-12-07 ]

http://lists.askmonty.org/pipermail/commits/2015-December/008704.html

Comment by Karl E. Jørgensen [ 2015-12-08 ]

Nice - I applied the patch to mariadb-10.1.8 and built - it works!

Thanks!

Comment by Sergei Golubchik [ 2015-12-18 ]

ok to push

Comment by Nirbhay Choubey (Inactive) [ 2015-12-18 ]

https://github.com/MariaDB/server/commit/58b54b7d1a4ce4e3a9537c058bb5ba300c88c0de
https://github.com/MariaDB/server/commit/5b94ea71c3812a24549485bc8e9f74b52a92b728

Generated at Thu Feb 08 07:31:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.