[MDEV-7864] [PATCH] Slave SQL: stopping on non-last RBR event with annotations results in SEGV (signal 11) Created: 2015-03-28 Updated: 2015-10-21 Resolved: 2015-04-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0.17 |
| Fix Version/s: | 10.0.18 |
| Type: | Bug | Priority: | Major |
| Reporter: | Daniel Black | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | replication, verified | ||
| Environment: |
x86_64, RHEL6, 5.5.39 master to 10.0.17 slave on same machine |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
while replicating from a 5.5.39 master to a 10.0.17 slave the following repeatable segfault resulted.
As the error quite rightly points out the roes are missing on the slave
Raw binlog in private uploads |
| Comments |
| Comment by Elena Stepanova [ 2015-03-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I haven't got it reproduced the easy way. Thanks. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-03-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I had set binlog_commit_wait_count to 0 at runtime when this first occurred (to enable replication catch up), however the same segv occurred on restart after doing a start slave '5.5'.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-04-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
More trying, still no crash. As I understand, you only start one slave, and it crashes on the very first event it attempts to execute; so, I don't see much room for concurrency there. And at least one crash happened when your slave only had one connection total (apparently the one that started the slave), hence no concurrency from direct slave updates, either. So, I've run out of ideas for tonight. Would you be able to do any of
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've got the binlog however don't have the slave dir. bit more info because I didn't have a backtrace
As serial_rgi is the only rpl_group_info in handle_slave_sql. It must be this. The destructor is also called from handle_slave_sql so this is the one delete at the bottom of the handle_slave_sql function. Which variable?... using objdump -C --disassemble mysqld > mysqld.objdump
The backtrace shows this occurred at 0x68e148 setting some address to 0. As this is before call to Deferred_log_events::~Deferred_log_events() the segv is in free_annotate_event
The segv occurs on the assignment of this->thd->variables.binlog_annotate_row_events= 0; this->thd (aka serial_rgi->thd) is an invalid address. And we see in the bottom of handle_slave_sql serial_rgi->thd is set to 0 before the destructor is called. Really bad move since free_annotate_event needs to set it to this->thd->variables.binlog_annotate_row_events to 0. bottom of handle_slave_sql
So any row based event that fails with an annotation should trigger this error. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-04-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In case the event fails, wouldn't m_annotate_event be 0, so this->thd->variables.binlog_annotate_row_events= 0 doesn't really get executed? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Obviously m_annotate_event not 0 because of the position in the backtrace. The only place this can be set is delete_or_keep_event_post_apply that is the only place where there is a call to set_annotate_event. It isn't set back to 0 in delete_or_keep_event_post_apply by the call to rgi->free_annotate_event() because this only occurs if get_flags(Rows_log_event::STMT_END_F). There are a couple of other places in rpl_parallel.cc where free_annotate_event gets called however since its not 0 at the time of segfault its not relevant. So you'd need an annotate event, followed by more than one binlog event in the group, where it isn't the last one that fails. This matches the contents of the binlog I have where there are over 10 delete rows events in the goup. Another parallel replication possibility, I'm not 100% sure this is possible, is exec_relay_log_event line int res= rli->parallel.do_event(serial_rgi, ev, event_size); was called. Is it possible that the annotate event and the failing delete rows event where executed in different threads and the delete rows was executed first, hence calling delete_or_keep_event_post_apply and clearing the m_annotate_event followed by the annotate event being run, setting m_annotate_event to non-0. Either way, the fix is the same:
serial_rgi isn't used before its deletion so there is no need to set serial_rgi->thd to 0. Alternately call serial_rgi->free_annotate_event before it however it does break the encapsulation of serial_rgi a bit more. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
working test case: https://github.com/openquery/mariadb-server/tree/MDEV-7864_binlog_annotations_segv | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
added fix too: https://github.com/openquery/mariadb-server/commit/cb55f425ea36321d2a00326405cd86ae3e108b43.patch cleaned up test cases more. Need to skip check of warnings in error log. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
changed title because this is the general case when stopping a slave with binlog annotations and the rbr event isn't the last one. To be clear this only affects 10.0, doesn't affect 5.5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
corrected test case. Fixed regression. https://github.com/MariaDB/server/pull/49 test run: https://travis-ci.org/openquery/mariadb-server/jobs/59977445 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-04-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pushed into 10.0.18. Thanks, Daniel for the bug analysis and test case! I fixed the bug in a different way. I think the clearing of serial_rgi->thd
So I fixed by simply reverting that part of that change (one line). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-04-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hm, I now noticed you actually suggested this way to fix in an earlier comment. I still think this is the right fix (just removing the clear of serial_rgi->thd). But do let me know if you think there is a reason that a different fix is needed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Black [ 2015-04-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
While I agree that what you've done is the simplest fix. I added the condition here in free_annotate_events because I wasn't 100% sure that all the calls to free_annotate event in rpl_parallel.cc occurred when the thread was valid. Are you? Is one extra non-zero check in a thread termination to prevent a future segfault worthwhile ? https://github.com/MariaDB/server/blob/71c5275fbe0a62ea1950695c158cc7256fc7195e/sql/rpl_rli.h#L746 I assumed the reason this bug crept in in the first place is because the ambiguity about whether this was a structure that just stores fields or actually exhibits some abstraction of data responsibility for its on management. To move it more into the latter category is the reason for the other few changes https://github.com/MariaDB/server/pull/49/files |