Details
-
Task
-
Status: Open (View Workflow)
-
Trivial
-
Resolution: Unresolved
-
None
-
None
-
None
Description
This revision adds a _db_flush() call to every DBUG_ASSERT():
revid:monty@mysql.com-20110130104144-v3osvsbyl45zd5o3
|
 |
-#define DBUG_ASSERT(A) assert(A)
|
+#define DBUG_ASSERT(A) do { _db_flush_(); assert(A); } while(0)
|
Unfortunately this causes occasional crashes inside DBUG_ASSERT. I am starting
to hit this a lot in my MDEV-26 tree, but it is also visible elsewhere, like
here:
I suspect that the original change is just wrong and that it needs protection
against concurrent threads updating the DBUG data structures. Unfortunately
this is not entirely trivial to do. I assume adding lock/unlock to every
DBUG_ASSERT would make the test suite a lot slower, which is undesirable. On
the other hand, the obvious
#define DBUG_ASSERT(A) do { if (!A) _db_flush_(); assert(A); } while(0)
|
evaluates A twice, which may cause problems, and assigning A to a temp
variable also is not good, as we want the assert() macro to have access to the
actual expression A to print it in the error log. So I am not sure how to fix
this properly.
For now, I will remove the _db_flush() from DBUG_ASSERT(), so I avoid
crashes. So I filed this task for you in case you want to bring it back in, in
a way that does not cause crash.
Here are details of my analysis of the crash:
#3 <signal handler called>
|
#4 0x0000000000d0aa44 in _db_flush_ () at /home/knielsen/my/10.0/work-10.0-mdev26/dbug/dbug.c:2139
|
#5 0x00000000005b3dd6 in next_event (rli=0x4820038) at /home/knielsen/my/10.0/work-10.0-mdev26/sql/slave.cc:5316
|
#6 0x00000000005ad98e in exec_relay_log_event (thd=0x4859480, rli=0x4820038) at /home/knielsen/my/10.0/work-10.0-mdev26/sql/slave.cc:3042
|
#7 0x00000000005b016f in handle_slave_sql (arg=0x481e940) at /home/knielsen/my/10.0/work-10.0-mdev26/sql/slave.cc:3965
|
#8 0x0000000000ca07fd in pfs_spawn_thread (arg=0x4915890) at /home/knielsen/my/10.0/work-10.0-mdev26/storage/perfschema/pfs.cc:1015
|
#9 0x00007f98cd89cb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
|
The crash is inside an assert:
5316 DBUG_ASSERT(rli->cur_log_fd == -1); // foreign descriptor
|
#define DBUG_ASSERT(A) do { _db_flush_(); assert(A); } while(0)
|
2135 void _db_flush_()
|
2136 {
|
2137 CODE_STATE *cs;
|
2138 get_code_state_or_return;
|
2139 (void) fflush(cs->stack->out_file->file);
|
2140 }
|
The crash happens because cs->stack->out_file is NULL, as shown by
disassembly:
Dump of assembler code for function _db_flush_:
|
0x0000000000d0aa20 <+0>: push %rbp
|
0x0000000000d0aa21 <+1>: mov %rsp,%rbp
|
0x0000000000d0aa24 <+4>: sub $0x10,%rsp
|
0x0000000000d0aa28 <+8>: callq 0xd06f48 <code_state>
|
0x0000000000d0aa2d <+13>: mov %rax,-0x8(%rbp)
|
0x0000000000d0aa31 <+17>: cmpq $0x0,-0x8(%rbp)
|
0x0000000000d0aa36 <+22>: je 0xd0aa4f <_db_flush_+47>
|
0x0000000000d0aa38 <+24>: mov -0x8(%rbp),%rax
|
0x0000000000d0aa3c <+28>: mov 0x20(%rax),%rax
|
0x0000000000d0aa40 <+32>: mov 0x10(%rax),%rax
|
=> 0x0000000000d0aa44 <+36>: mov (%rax),%rax
|
0x0000000000d0aa47 <+39>: mov %rax,%rdi
|
0x0000000000d0aa4a <+42>: callq 0x584280 <fflush@plt>
|
0x0000000000d0aa4f <+47>: leaveq
|
0x0000000000d0aa50 <+48>: retq
|
End of assembler dump.
|
(gdb) info reg
|
rax 0x0 0
|
However, the cs->stack->out_file is not NULL when printed in the
debugger. Maybe because there is a race between different threads?
(gdb) p cs->stack->out_file
|
$19 = (sFILE *) 0x142a5c0
|
(gdb) p *cs->stack->out_file
|
$20 = {file = 0x7f98cc051880, used = 1073741823}
|
The "used" value seems quite suspicious, looks corrupt. Maybe it was free()d?
It looks to me that there is missing protection in _db_flush() against another
thread changing cs->stack->out_file, eg FreeState() sets it to NULL... ?