[MDEV-9423] Cannot add new node to the cluster: Binlog file '/var/log/mysql/mariadb-bin.000001' not found in binlog index, needed for recovery. Aborting. Created: 2016-01-15 Updated: 2016-08-21 Resolved: 2016-06-29
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.1.10 |
| Fix Version/s: | 10.1.15 |
| Type: | Bug | Priority: | Major |
| Reporter: | Krasimir | Assignee: | Nirbhay Choubey (Inactive) |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | None |
| Environment: | Linux fm-first 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u1 (2015-12-14) x86_64 GNU/Linux |
| Issue Links: |
| Sprint: | 10.1.15 |
| Description |
|
It looks like the master wasn't shut down properly, and now a new node trying to join this cluster attempts to run a recovery, but the binlog file doesn't exist. Starting the daemon with
|
| Comments |
| Comment by Elena Stepanova [ 2016-01-16 ] |
|
How exactly did you initialize the node? Did you copy the entire data directory from an existing node? If so, did you copy the binary logs and binary log index as well? |
| Comment by Krasimir [ 2016-01-16 ] |
|
No data was copied from the master. |
| Comment by Elena Stepanova [ 2016-01-16 ] |
|
Your data directory is apparently /var/lib/mysql. Did you also clean up /var/log/mysql, where your binary logs are configured to reside? |
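For context, a binlog path like the one in this error typically comes from a `log_bin` setting pointing outside the datadir. A minimal my.cnf fragment illustrating such a split layout (the exact option file here is an assumption, inferred only from the paths in this report):

```ini
[mysqld]
# Data files live in the default Debian datadir...
datadir = /var/lib/mysql
# ...while binary logs and the binlog index are written under
# /var/log/mysql, producing files like mariadb-bin.000001
log_bin = /var/log/mysql/mariadb-bin
```

With a layout like this, wiping /var/lib/mysql alone can leave stale binlogs and the binlog index behind in /var/log/mysql.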
| Comment by Krasimir [ 2016-01-16 ] |
|
Both dirs were empty. I resolved the issue by running RESET MASTER; on the main master and then added the node again. So my theory is that the master didn't shut down cleanly the last time, and this affected the new node trying to run the recovery. |
| Comment by Elena Stepanova [ 2016-01-16 ] |
|
The error log looks strange for a message coming from another node, but I cannot rule it out right away. Assigning to nirbhay_c, who can say for sure. |
| Comment by Andrew Garner [ 2016-04-14 ] |
|
I can reproduce this reliably in my environment on MariaDB 10.1.13 / Percona XtraBackup 2.3.4 on Debian 8 with wsrep_sst_method=xtrabackup-v2. I just create a simple InnoDB table, start a loop writing to the cluster from one node, and force SST on another node, and I get results similar to:
Recovery started from an empty MySQL datadir and an empty /var/log/mysql/. I found that commenting out the lines in wsrep_sst_xtrabackup-v2 that copy Percona XtraBackup's archived binary log back to /var/log/mysql/ kept this from happening. I suppose that binlog recovery is initiated by the presence of the binary log which xtrabackup copies from the donor. I'm not sure of the ramifications of skipping that recovery. |
| Comment by Igor Mazur [ 2016-04-14 ] |
|
I'm getting exactly the same problem (also MariaDB 10.1.13). 2016-04-14 6:59:00 140693871286208 [Note] Recovering after a crash using /var/log/mysql/mariadb-bin 2016-04-14 6:59:00 140693871286208 [ERROR] Can't init tc log Here is the list of log files on the donor: And on the new node: For some reason, during the SST the donor rotated the log mariadb-bin.000036, and a new log file mariadb-bin.000037 was created and copied to the new node. |
| Comment by Igor Mazur [ 2016-04-14 ] |
|
Maybe an important observation: I used: After changing to: I've successfully added a node to the cluster twice in a row. |
| Comment by Andrew Garner [ 2016-04-14 ] |
|
Another data point: I see that the binary log bundled in the xtrabackup (v2.3.4) output has only a single binlog checkpoint:
Here mysql-bin.000330 corresponds to the previous binary log and the backup image has mysql-bin.000331. wsrep_sst_rsync generates a binary log with two checkpoints:
This was an earlier run: 000283 corresponds to the current (copied) binlog and 000282 to the previous (uncopied) log. I assume some bug in xtrabackup is probably causing this; manually running FTWRL + FLUSH BINARY LOGS always gives me two checkpoints in the latest log. Maybe there's a race condition here that rsync doesn't trip over (maybe because it holds the lock longer?) |
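The one-vs-two checkpoint difference can be checked directly by scanning a binlog for MariaDB's BINLOG_CHECKPOINT_EVENT (type code 161 in the 10.x event numbering). A minimal sketch, assuming a plain v4-format log with 19-byte common event headers; the function name and any paths passed to it are mine, not from this report:

```python
import struct

BINLOG_MAGIC = b"\xfebin"
BINLOG_CHECKPOINT_EVENT = 161  # MariaDB-specific event type
# v4 common header: timestamp(4) type(1) server_id(4) event_len(4) next_pos(4) flags(2)
HEADER_LEN = 19

def count_checkpoint_events(path):
    """Count BINLOG_CHECKPOINT_EVENTs in a binary log file."""
    count = 0
    with open(path, "rb") as f:
        if f.read(4) != BINLOG_MAGIC:
            raise ValueError("not a binlog file: %s" % path)
        while True:
            header = f.read(HEADER_LEN)
            if len(header) < HEADER_LEN:
                break
            _ts, ev_type, _server_id, ev_len = struct.unpack("<IBII", header[:13])
            if ev_len < HEADER_LEN:
                raise ValueError("corrupt event header")
            if ev_type == BINLOG_CHECKPOINT_EVENT:
                count += 1
            f.seek(ev_len - HEADER_LEN, 1)  # skip the event body
    return count
```

Per the observation above, running this on the latest log of a correct copy should report two checkpoint events, while the log bundled by xtrabackup shows only one.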
| Comment by Andrew Garner [ 2016-04-14 ] |
|
Actually, there is a race here after all. In my environment, I see up to a second of delay between running FLUSH BINARY LOGS/SHOW MASTER STATUS and when the current binary log has its second binlog checkpoint written. Given this Python harness:
I see this output:
So under FTWRL, the binary log size is still changing after FLUSH BINARY LOGS returns and before we run UNLOCK TABLES; if we read the binary log too fast (even under the FTWRL lock), it is missing some data. |
| Comment by Andrew Garner [ 2016-04-14 ] |
|
Adding FLUSH LOCAL ENGINE LOGS prior to copying the binary log seems to eliminate this problem. I'm not sure if this is bad MariaDB behavior or if xtrabackup should be expected to flush engine logs here. I opened a bug with Percona against percona-xtrabackup that includes a minor patch against xtrabackup 2.3 that resolved this problem for me: |
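The statement ordering this fix implies can be sketched as follows; `cursor` stands for any DB-API-style cursor and `copy_fn` for the actual binlog file copy, both hypothetical names (only the SQL statements themselves come from this thread):

```python
def copy_binlog_safely(cursor, copy_fn):
    """Run the binlog-copy sequence with the extra engine-log flush.

    cursor:  a DB-API-style cursor connected to the donor (hypothetical)
    copy_fn: callable performing the file copy while the server is locked
    """
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("FLUSH BINARY LOGS")
        # The extra flush forces the pending binlog checkpoint event to be
        # written out before we read the file, closing the race described above.
        cursor.execute("FLUSH LOCAL ENGINE LOGS")
        copy_fn()
    finally:
        cursor.execute("UNLOCK TABLES")
```

The key point is that the engine-log flush happens after FLUSH BINARY LOGS and before the copy, all while FTWRL is still held.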
| Comment by Aaron Lager [ 2016-04-14 ] |
|
I just wanted to add that I'm also having this issue. |
| Comment by Aaron Lager [ 2016-04-14 ] |
|
I just reverted to xtrabackup version 2.2 and was finally able to get the node to join without error. |
| Comment by Andrew Garner [ 2016-04-26 ] |
|
I verified that xtrabackup 2.2 is also affected by this same issue (it also does not run FLUSH ENGINE LOGS prior to copying the binlog). This is racy, so it will sometimes work under the right conditions (with any supported xtrabackup version). |
| Comment by Nirbhay Choubey (Inactive) [ 2016-06-28 ] |
|
http://lists.askmonty.org/pipermail/commits/2016-June/009505.html |