|
CrewOne,
Can you paste the binlog event on which it chokes (mysql-bin.056373, end_log_pos 217054452, Gtid 0-765-1299574704), and the output of
USE Voorraadbeheer;
|
SHOW CREATE TABLE InklaarPrijs;
|
SHOW CREATE TABLE InklaarLog;
|
SHOW TRIGGERS LIKE 'InklaarPrijs';
|
SHOW TRIGGERS LIKE 'InklaarLog';
|
and also attach your cnf file(s) from master and slave?
You said the server would not start again – what happens when it attempts to start? And what happens after you delete the relay logs?
Thanks.
|
|
When I delete the relay logs, the server starts. But obviously the replication does not.
I can't restart the replication after that because I do not know the correct position. I can restore the backup again and start the replication - but eventually it will crash. (Usually it will run for a while and replicate a bit before it crashes.)
But this is the second time this happened. Last week we had the exact same situation. This time is the same except with a fresh backup/restore cycle.
I've never managed to get any 10.1 version stable - not as a master and not as a slave (replicating from 10.0.) The idea was to replicate from 10.0 to 10.1 and see how stable 10.1 performs, but so far no luck.
We also tried 10.1 as a master once, but that crashed as well. So we went back to 10.0.
|
|
CREATE TABLE `InklaarPrijs` (
|
`EAN` char(13) NOT NULL,
|
`DT` timestamp(3) NULL DEFAULT NULL,
|
`Listprijs` decimal(7,2) DEFAULT NULL,
|
`Valuta` char(8) DEFAULT NULL,
|
`KortingMP` decimal(5,2) DEFAULT NULL,
|
PRIMARY KEY (`EAN`)
|
) ENGINE=InnoDB DEFAULT CHARSET=utf8
|
|
CREATE TABLE `InklaarLog` (
|
`DT` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
|
`EAN` char(13) NOT NULL,
|
`Listprijs` decimal(7,2) DEFAULT NULL,
|
`Valuta` char(8) DEFAULT NULL,
|
`KortingMP` decimal(5,2) DEFAULT NULL,
|
PRIMARY KEY (`DT`,`EAN`)
|
) ENGINE=InnoDB DEFAULT CHARSET=utf8
|
|
Before_Insert_Inkaarlog
|
BEGIN
|
DECLARE kortingmp_ncoi DECIMAL(5,2);
|
IF NEW.KortingMP IS NOT NULL THEN
|
SELECT U.Kortingmp_NCOI INTO kortingmp_ncoi
|
FROM Manpo.Boeken B
|
JOIN Manpo.Uitgevers U USING ( UitgeverID )
|
WHERE B.EAN = NEW.EAN;
|
|
IF kortingmp_ncoi IS NOT NULL AND NEW.KortingMP = kortingmp_ncoi THEN
|
SET NEW.KortingMP = null;
|
END IF;
|
END IF;
|
END
|
|
After_Insert_Inklaarlog
|
BEGIN
|
INSERT INTO Voorraadbeheer.InklaarPrijs ( EAN, DT, Listprijs, Valuta, KortingMP )
|
VALUES ( NEW.EAN, NEW.DT, NEW.Listprijs, NEW.Valuta, NEW.KortingMP )
|
ON DUPLICATE KEY UPDATE DT=VALUES(DT),
|
Listprijs = IFNULL(VALUES(Listprijs), Listprijs),
|
Valuta = IFNULL(VALUES(Valuta), Valuta),
|
KortingMP = IFNULL(VALUES(KortingMP), KortingMP);
|
END
|
|
|
This does not seem to do anything:
mysqlbinlog --stop-position 217054452 mysql-bin.056376 | tail -n 20
|
# at 217040113
|
# at 217041126
|
# at 217042139
|
# at 217043152
|
# at 217044165
|
# at 217045178
|
# at 217046191
|
# at 217047204
|
# at 217048217
|
# at 217049230
|
# at 217050243
|
# at 217051256
|
# at 217052269
|
# at 217053282
|
# at 217054295
|
DELIMITER ;
|
# End of log file
|
ROLLBACK /* added by mysqlbinlog */;
|
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
|
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
|
|
|
Thanks.
This does not seem to do anything
Just dump the whole log then. Besides, we are more interested in the log 373, that's where the crash happened at the first place. Although, you can dump both (or all four of them).
Please use mysqlbinlog --verbose --base64-output=DECODE-ROWS mysql-bin.056373.
Regarding the structure of tables and triggers, is it certainly the same on master and slave?
|
|
I attached what I hoped is the relevant section. Also the complete, zipped, file which extracts to 2.2Gb of data.
file: (98Mb zipped) http://we.tl/cvkJu6nxr8
mysqlbinlog --stop-position=217064452 --verbose --base64-output=DECODE-ROWS mysql-bin.056373
|
|
|
The slave has been freshly restored and the structure has been verified using SQLyog. Even so, it's preferred that MariaDB does not crash - ever - even if structures aren't the same. [Doh... :)]
|
|
CrewOne,
Different structures are not an excuse for the crash, but a hint for us to reproduce the problem and find the root cause. So, are they actually different?
|
|
Nope, the same as far as I can tell.
|
|
CrewOne,
Thanks for the data. For future, you can upload files to ftp.askmonty.org/private. 2G is not a problem as long as it's okay for your upload. Also, only MariaDB developers have access to the data there, which allows to maintain some privacy.
I can reproduce at least a part of the problem, initially I thought it was an unrelated issue and filed it separately as MDEV-9567, but, now I start thinking it might be the whole reason. However, I'd like to try to check it. Could you please upload the binary log mysql-bin.056373 itself (not decoded)? Maybe I'll be able to get something this way. If you know which values were actually inserted by that unfortunate event, it could also help – since the event appears to be corrupted, there is no saying what was there.
|
|
I will upload a file called MDEV-9560.7z to that FTP shortly, it is the requested binlog file. (from the master - not the slave)
Also, I will have to continue development and that means I have to revert the slave back to a 10.0 server today. Maybe I can setup a 10.1 slave on a seperate machine later, if I can help doing that or running a debug build, let me know.
|
|
I have reinstalled 10.0.23 - but after I start it, it seems to crash with more or less the same error.
From gdb:
#0 0x0000555555bbd7b6 in unpack_row(rpl_group_info*, TABLE*, unsigned int, unsigned char const*, st_bitmap const*, unsigned char const**, unsigned long*, unsigned char const*) ()
|
#1 0x0000555555bb99dd in Update_rows_log_event::do_exec_row(rpl_group_info*) ()
|
#2 0x0000555555bab865 in Rows_log_event::do_apply_event(rpl_group_info*) ()
|
#3 0x00005555558d760d in apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*) ()
|
#4 0x00005555558da06e in handle_slave_sql ()
|
#5 0x00007ffff7570182 in start_thread (arg=0x7ffff60c3700) at pthread_create.c:312
|
#6 0x00007ffff645547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
|
I had to erase all logs and database files, I hope replication will work now.
|
|
CrewOne,
Thanks for the data.
I can now reproduce the crash, but there is something that does not add up.
The only way I know to get the error log that you pasted in the description is to set slave_skip_errors=1105 on slave; but it's not in the slave cnf file.
Is it possible that you have another cnf somewhere which is also used by the slave, or that you set this option on mysqld command line?
|
|
Indeed, you are right. I'm sorry, but the slave.cnf you have seems to be missing a part. (I edited out the top bit of comments, but seem to have been a bit too enthousiastic.)
replicate-wild-ignore-table = tmp.%
|
replicate-wild-ignore-table = Voorraadbeheer.VBMIH_tekst_cache
|
replicate-wild-ignore-table = Managementboek_Klanten.OrderschermRegels
|
replicate-wild-ignore-table = Voorraadbeheer.GeprinteVoorraad
|
replicate-wild-ignore-table = php_sessions.%
|
replicate-ignore-db=Computerboek_Nieuwsbrieven
|
replicate-ignore-db=Juridischboek_Nieuwsbrieven
|
|
relay-log-space-limit=5G
|
slave-parallel-threads=4
|
slave-skip-errors=1105
|
transaction-isolation = READ-COMMITTED
|
That particular line, skipping 1105 was introduced when we tried replicating but it stopped a lot with that error. I remember reading about that somehwere, but honestly... it has been a long time ago. I think it was when 10.1 hit the first GA release.
Are you saying now that if I remove that line, replication will work from 10.0 -> 10.1?
|
|
Okay, thanks, now it makes sense.
Are you saying now that if I remove that line, replication will work from 10.0 -> 10.1?
I don't think so (will tell you for sure later today), but I think it won't crash, at least not this way, more likely replication will just abort as it usually does.
Skipping the error apparently re-shuffles things a bit and makes the server hit an error which it would otherwise never reach.
|
|
So, here is a summary of what we have here. I can still be wrong in some details, if so, serg will know out the truth as he's investigating now.
Problem 1: corrupt event
It happens when 10.0 server executes the following with the binary log enabled in row format:
SET binlog_format=ROW;
|
DROP TABLE IF EXISTS mdev9560;
|
CREATE TABLE mdev9560 (ts TIMESTAMP(3));
|
INSERT INTO mdev9560 VALUES ('2016-02-15 12:50:06.123');
|
The important part here is TIMESTAMP(3).
When mysqlbinlog attempts to read this binlog, it reports a corrupted event:
*!100001 SET @@session.gtid_seq_no=3*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 648
|
# at 695
|
#160218 19:41:03 server id 1 end_log_pos 695 Table_map: `test`.`mdev9560` mapped to number 173
|
#160218 19:41:03 server id 1 end_log_pos 731 Write_rows: table id 173 flags: STMT_END_F
|
### INSERT INTO `test`.`mdev9560`
|
### SET
|
### @1=1588445526
|
### INSERT INTO `test`.`mdev9560`
|
### SET
|
### @1=***Corrupted replication event was detected. Not printing the value***
|
# at 731
|
#160218 19:41:03 server id 1 end_log_pos 758 Xid = 1190
|
COMMIT/*!*/;
|
# at 758
|
#160218 19:41:09 server id 1 end_log_pos 801 Rotate to mysql-bin.000002 pos: 4
|
DELIMITER ;
|
However, replication to another 10.0 works on some reason.
Problem 2: assertion failure on 10.1 debug server
When 10.1 slave running debug server attempts to replicate the events above from 10.0 master, it goes down with the assertion failure:
|
Stack trace from 10.1 commit d80b8442a68093106e00a9a38b7b2c593002a72c
|
2016-02-18 19:44:01 140700899031808 [ERROR] Slave SQL: Could not read field '' of table 'test.mdev9560', Gtid 0-1-3, Internal MariaDB error code: 1610
|
mysqld: /src/10.1/sql/log_event.cc:11696: virtual int Write_rows_log_event::do_exec_row(rpl_group_info*): Assertion `0' failed.
|
160218 19:44:01 [ERROR] mysqld got signal 6 ;
|
|
#7 0x00007ff778f951d2 in __assert_fail () from /lib64/libc.so.6
|
#8 0x000056524fb022a9 in Write_rows_log_event::do_exec_row (this=0x7ff752487070, rgi=0x7ff752418000) at /src/10.1/sql/log_event.cc:11696
|
#9 0x000056524fafdf5b in Rows_log_event::do_apply_event (this=0x7ff752487070, rgi=0x7ff752418000) at /src/10.1/sql/log_event.cc:9989
|
#10 0x000056524f72a043 in Log_event::apply_event (this=0x7ff752487070, rgi=0x7ff752418000) at /src/10.1/sql/log_event.h:1343
|
#11 0x000056524f71fe00 in apply_event_and_update_pos (ev=0x7ff752487070, thd=0x7ff752422870, rgi=0x7ff752418000, rpt=0x0) at /src/10.1/sql/slave.cc:3374
|
#12 0x000056524f720884 in exec_relay_log_event (thd=0x7ff752422870, rli=0x7ff75b183e60, serial_rgi=0x7ff752418000) at /src/10.1/sql/slave.cc:3704
|
#13 0x000056524f72397f in handle_slave_sql (arg=0x7ff75b182240) at /src/10.1/sql/slave.cc:4773
|
#14 0x00007ff77aede0a4 in start_thread () from /lib64/libpthread.so.0
|
#15 0x00007ff77904c04d in clone () from /lib64/libc.so.6
|
Problem 3: Replication failure on 10.1 release server
When 10.1 slave is running a non-debug server, it does not have the debug assertion, so it shows the replication failure instead:
|
From error log
|
2016-02-18 19:46:15 140401416502016 [ERROR] Slave SQL: Could not read field '' of table 'test.mdev9560', Gtid 0-1-3, Internal MariaDB error code: 1610
|
2016-02-18 19:46:15 140401416502016 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.mdev9560; Unknown error, Error_code: 1105; handler error No Error!; the event's master log mysql-bin.000001, end_log_pos 731, Gtid 0-1-3, Internal MariaDB error code: 1105
|
2016-02-18 19:46:15 140401416502016 [Warning] Slave: Unknown error Error_code: 1105
|
2016-02-18 19:46:15 140401416502016 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 610
|
Problem 4: Hard crash on 10.1 release server running with slave-skip-errors=1105
For this crash to happen, a more complicated scenario needs to be run on the 10.0 master (probably it's not minimal though):
SET binlog_format=ROW;
|
DROP TABLE IF EXISTS mdev9560, mdev9560_log;
|
|
CREATE TABLE mdev9560 (
|
`DT` timestamp(3) NOT NULL,
|
`EAN` char(13) NULL,
|
`Listprijs` decimal(7,2) DEFAULT NULL,
|
`Valuta` char(8) DEFAULT NULL
|
);
|
CREATE TABLE mdev9560_log (
|
`EAN` char(13) NOT NULL,
|
`DT` timestamp(3) NULL,
|
`Listprijs` decimal(7,2) DEFAULT NULL,
|
`Valuta` char(8) DEFAULT NULL,
|
PRIMARY KEY (`EAN`)
|
);
|
|
CREATE TRIGGER tr AFTER INSERT on mdev9560 FOR EACH ROW
|
INSERT INTO mdev9560_log ( EAN, DT, Listprijs, Valuta )
|
VALUES ( NEW.EAN, NEW.DT, NEW.Listprijs, NEW.Valuta )
|
ON DUPLICATE KEY UPDATE DT=VALUES(DT),
|
Listprijs = IFNULL(VALUES(Listprijs), Listprijs),
|
Valuta = IFNULL(VALUES(Valuta), Valuta);
|
|
INSERT INTO mdev9560 VALUES ('2016-02-15 12:50:06.123','9999999999999',99.99,'EUR');
|
INSERT INTO mdev9560 VALUES ('2016-02-15 12:50:07.123','9999999999999',99.99,'EUR');
|
On slave, server crashes with signal 11 and can't start anymore while replication is enabled (replication starts, crashes again and so on):
2016-02-18 19:56:09 139713576487680 [ERROR] Slave SQL: Could not read field 'Valuta' of table 'test.mdev9560', Gtid 0-1-5, Internal MariaDB error code: 1610
|
2016-02-18 19:56:09 139713576487680 [Warning] Slave SQL: Could not execute Write_rows_v1 event on table test.mdev9560; Unknown error, Error_code
|
: 1105; handler error No Error!; the event's master log mysql-bin.000001, end_log_pos 1681, Gtid 0-1-5, Internal MariaDB error code: 1105
|
2016-02-18 19:56:09 139713576487680 [ERROR] Slave SQL: Could not read field 'Valuta' of table 'test.mdev9560_log', Gtid 0-1-5, Internal MariaDB
|
error code: 1610
|
2016-02-18 19:56:09 139713576487680 [Warning] Slave SQL: Could not execute Write_rows_v1 event on table test.mdev9560_log; Unknown error, Error_
|
code: 1105; Unknown error, Error_code: 1105; handler error No Error!; the event's master log mysql-bin.000001, end_log_pos 1739, Gtid 0-1-5, Int
|
ernal MariaDB error code: 1105
|
2016-02-18 19:56:09 139713576487680 [ERROR] Slave SQL: Could not read field 'Valuta' of table 'test.mdev9560', Gtid 0-1-6, Internal MariaDB erro
|
r code: 1610
|
2016-02-18 19:56:09 139713576487680 [Warning] Slave SQL: Could not execute Write_rows_v1 event on table test.mdev9560; Unknown error, Error_code
|
: 1105; handler error No Error!; the event's master log mysql-bin.000001, end_log_pos 1978, Gtid 0-1-6, Internal MariaDB error code: 1105
|
2016-02-18 19:56:09 139713576487680 [ERROR] Slave SQL: Could not read field 'Valuta' of table 'test.mdev9560_log', Gtid 0-1-6, Internal MariaDB
|
error code: 1610
|
160218 19:56:09 [ERROR] mysqld got signal 11 ;
|
|
Stack trace from 10.1.11 binary tarball
|
#2 <signal handler called>
|
#3 unpack_row (rgi=rgi@entry=0x7f116d414600, table=0x7f116d489c08, colcnt=4, row_data=0x0, cols=cols@entry=0x7f116cc1a1e8, current_row_end=current_row_end@entry=0x7f116cc1a260, master_reclength=master_reclength@entry=0x7f116cc1a208, row_end=0x7f116d4ed0f2 "") at /home/buildbot/buildbot/build/sql/rpl_record.cc:226
|
#4 0x00005586fb1a10e8 in unpack_current_row (cols=0x7f116cc1a1e8, rgi=0x7f116d414600, this=0x7f116cc1a148) at /home/buildbot/buildbot/build/sql/log_event.h:4440
|
#5 Update_rows_log_event::do_exec_row (this=0x7f116cc1a148, rgi=0x7f116d414600) at /home/buildbot/buildbot/build/sql/log_event.cc:12463
|
#6 0x00005586fb193c92 in Rows_log_event::do_apply_event (this=0x7f116cc1a148, rgi=0x7f116d414600) at /home/buildbot/buildbot/build/sql/log_event.cc:9989
|
#7 0x00005586faef7ea1 in Log_event::apply_event (this=this@entry=0x7f116cc1a148, rgi=rgi@entry=0x7f116d414600) at /home/buildbot/buildbot/build/sql/log_event.h:1343
|
#8 0x00005586faeedd23 in apply_event_and_update_pos (ev=ev@entry=0x7f116cc1a148, thd=thd@entry=0x7f116d42d008, rgi=rgi@entry=0x7f116d414600, rpt=rpt@entry=0x0) at /home/buildbot/buildbot/build/sql/slave.cc:3374
|
#9 0x00005586faef142a in exec_relay_log_event (serial_rgi=0x7f116d414600, rli=0x7f1174e9e9a0, thd=0x7f116d42d008) at /home/buildbot/buildbot/build/sql/slave.cc:3704
|
#10 handle_slave_sql (arg=0x7f1174e9d000) at /home/buildbot/buildbot/build/sql/slave.cc:4773
|
#11 0x00007f11994180a4 in start_thread () from /lib64/libpthread.so.0
|
Since everything starts with a corrupted event on 10.0 master, which is recognized by mysqlbinlog from the same build, I suppose that the root problem is in 10.0. However, the crash is apparently a separate problem, maybe it can be even triggered by some other scenario, not involving timestamps, microseconds, and such.
As already discussed on IRC, I suggest to try to fix the 10.0 part of the problem before 10.0.24 release, so I raise it to a blocker for now. However, it won't fix already existing binary logs. I don't know why 10.0 => 10.0 replication works in the scenarios above, might be just some luck, in which case it's not reliable; but if there is any real logic behind this ability, it would be very good to keep it (and even have it in 10.1 as well), to allow to replicate from older 10.0 binary logs.
|
Problem 1: corrupt event
This is another manifestation of MDEV-5377. The event is not corrupted, but mysqlbinlog cannot parse it properly, and thinks that the event is corrupted.
Old MariaDB temporal data types with microseconds don't store enough metadata in RBR events, so it's only possible to read these RBR event on the slave that has exactly the same table structure as the master.
Because mysqlbinlog has no tables at all, it cannot parse these events correctly.
|
Problem 2: assertion failure on 10.1 debug server
and
Problem 4: Hard crash on 10.1 release server running with slave-skip-errors=1105
this will be fixed in the next 10.1 release (10.1.12)
|
Problem 3: Replication failure on 10.1 release server
This is a combined effect of MDEV-5377 and MDEV-5528.
Because of MDEV-5377 temporal times could not be always replicated using RBR. And they were incompatible with MySQL temporal types (because MariaDB had them before MySQL). Later we have added MySQL-compatible temporal types and in MDEV-5528 provided an option to use them by default. MySQL-compatible types can be safely replicated using RBR events.
So your 10.0 server creates old MariaDB timestamp(3) field. While your 10.1 server creates new MariaDB timestamp(3). If you use RBR, old timestamp(3) can be only replicated to identically defined table. But your 10.0 and 10.1 tables aren't identical, that's why the replication fails.
I'll see if we can recognize this situation on the slave and automatically create the compatible table structure.
But until then you can start your 10.1 server with --disable-mysql56-temporal-format option. The replication will work.
|
|
CrewOne,
If you start your replication from a clean (empty) datadir, --disable-mysql56-temporal-format mentioned above might be a good workaround for you for now, even before you get fixes for crashes.
If you start replication from an already existing 10.1 directory where tables with microseconds have already been created, the workaround will be a little more complicated, tables will need to be ALTER-ed. If there are many tables, it should be possible to generate ALTER statements automatically based on the contents of INFORMATION_SCHEMA.COLUMNS.
|
|
The patch is Ok to push.
I'd only add tests for the other data types (TIME and DATETIME).
Currently it tests only TIMESTAMP.
|
|
Is it safe to assume a fix will be in the next version? Currently replication is working 10.0->10.0 and I can wait a few weeks before trying 10.1 again.
|
|
Yes, it will be.
By the way, the release schedule is on our jira main page (http://mariadb.org/jira) and next 10.1 is planned for this week.
if you don't see the roadmap, please log in — it's a jira bug JRA-59335 that hopefully will be fixed soon
|