[MDEV-6718] Server crashed in Gtid_log_event::Gtid_log_event with parallel replication Created: 2014-09-10  Updated: 2014-11-13  Resolved: 2014-11-13

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.14
Fix Version/s: 10.0.15

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: parallelslave, replication
Environment:

Linux 3.10.0-123.6.3.el7.bz114641.ppc64 #1 SMP


Attachments: File binlogs.tar.gz     HTML File full_threads     HTML File threads    

 Description   

#3  <signal handler called>
#4  0x000000001069f6dc in Gtid_log_event::Gtid_log_event (this=0x3fff7a425c70, buf=0x3fff74b6bdb0 "6m\fT\242\001", event_len=40, description_event=0x3fff73be02b0) at 10.0/sql/log_event.cc:6392
#5  0x0000000010692a58 in Log_event::read_log_event (buf=0x3fff74b6bdb0 "6m\fT\242\001", event_len=40, error=0x3fffb1575bf0, description_event=0x3fff73be02b0, crc_check=1 '\001') at 10.0/sql/log_event.cc:1666
#6  0x0000000010692000 in Log_event::read_log_event (file=0x3fffb1576210, log_lock=0x0, description_event=0x3fff73be02b0, crc_check=1 '\001') at 10.0/sql/log_event.cc:1503
#7  0x000000001048a1e0 in retry_event_group (rgi=0x3fff7ae67800, rpt=0x3fffb0fb8ab0, orig_qev=0x3fff74b26a70) at 10.0/sql/rpl_parallel.cc:373
#8  0x000000001048b63c in handle_rpl_parallel_thread (arg=0x3fffb0fb8ab0) at 10.0/sql/rpl_parallel.cc:779
#9  0x000000001097cd58 in pfs_spawn_thread (arg=0x3fffb0e6ec70) at 10.0/storage/perfschema/pfs.cc:1860
#10 0x00003fffb248c460 in .start_thread () from /lib64/power8/libpthread.so.0

Stack trace from:

revision-id: jplindst@mariadb.org-20140825103533-p9jodvh91r9tueo1
revno: 4362
branch-nick: 10.0

All threads' stack traces are attached.
Binary logs from master are attached.

It happened once so far during tests on Power.
The test was run with ~64 threads on master, SBR, 8 slave parallel threads, thread_handling=pool-of-threads. Here is the complete RQG command line (for informational purposes, it does not necessarily reproduce the issue):

perl ./runall-new.pl \
--queries=100000000 \
--no-mask \
--seed=1410100487 \
--threads=64 \
--duration=600 \
--queries=100M \
--reporters=QueryTimeout,Backtrace,ErrorLog,Deadlock \
--validators=TransformerNoComparator \
--transformers=ExecuteAsPreparedTwice \
--redefine=conf/mariadb/general-workarounds.yy \
--redefine=conf/mariadb/10.0-features-redefine.yy \
--mysqld=--log_output=FILE \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--performance_schema=1 \
--mysqld=--query_cache_size=64M \
--grammar=conf/replication/replication-ddl_sql.yy \
--gendata=conf/replication/replication-ddl_data.zz \
--rpl_mode=statement \
--mysqld=--slave-skip-errors=1054,1317,1049,1305,1539,1505 \
--mysqld=--slave_parallel_threads=8 \
--mysqld=--thread_handling=pool-of-threads \
--mtr-build-thread=300 \
--mask=57406 \
--basedir1=<10.0 basedir>
--vardir1=<vardir>

revision-id: elenst@montyprogram.com-20140905081817-dei34iw9d0dhywd1
revno: 979
branch-nick: randgen-mariadb-patches



 Comments   
Comment by Kristian Nielsen [ 2014-09-29 ]

It is crashing on this line:

uint8 post_header_len= description_event->post_header_len[GTID_EVENT-1];

The context is the Gtid_log_event constructor:

Gtid_log_event::Gtid_log_event(const char *buf, uint event_len,
const Format_description_log_event *description_event)
: Log_event(buf, description_event), seq_no(0), commit_id(0)
{
uint8 header_size= description_event->common_header_len;
uint8 post_header_len= description_event->post_header_len[GTID_EVENT-1];
if (event_len < header_size + post_header_len ||
post_header_len < GTID_HEADER_LEN)
return;

Looks like maybe the description_event pointer is incorrect or something like that?

Comment by Kristian Nielsen [ 2014-09-29 ]

I think I can repeat the problem by running the test case rpl.rpl_parallel_retry a lot of times with --mem --parallel=3 --repeat=1000 (on standard amd64 linux).

At least the problem looks very similar, corrupt description_event below retry_event_group(). It fails in various ways depending on exactly when the corruption triggers, but the root cause seems likely to be the same.

Comment by Kristian Nielsen [ 2014-09-30 ]

Here is a proposed patch:

http://lists.askmonty.org/pipermail/commits/2014-September/006667.html

The problem is incorrect handling of the format description event that is used
when reading events to be retried from the relay log. It was using a
description event owned by another thread (the SQL driver thread), so there
was a rare race where this would change (during relay log rotation) just at
the moment where a worker thread would attempt to simultaneously use it during
retrying a transaction that failed due to a deadlock.

Comment by Kristian Nielsen [ 2014-11-13 ]

Pushed to 10.0.15:

http://lists.askmonty.org/pipermail/commits/2014-November/006974.html

Generated at Thu Feb 08 07:14:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.