[MDEV-7026] Race in InnoDB/XtraDB mutex implementation can stall or hang the server Created: 2014-11-05  Updated: 2016-07-11  Resolved: 2014-11-25

Status: Closed
Project: MariaDB Server
Component/s: Platform Power, Storage Engine - InnoDB, Storage Engine - XtraDB
Affects Version/s: 5.5.40, 10.0.14
Fix Version/s: 5.5.41, 10.0.15

Type: Bug Priority: Blocker
Reporter: Kristian Nielsen Assignee: Sergey Vojtovich
Resolution: Fixed Votes: 0
Labels: buildbot

Issue Links:
Duplicate
is duplicated by MDEV-7090 binlog.binlog_checkpoint failed in bu... Closed
Relates
relates to MDEV-6478 MariaDB on Power8 Closed
relates to MDEV-6654 Combine ib_mutex_t::waiters and ib_mu... Closed
relates to MDEV-10341 InnoDB: Failing assertion: mutex_own(... Closed

 Description   

This failure is seen in Buildbot. It is a race that occurs somewhat rarely,
but it can be consistently reproduced manually by running tests in a loop.

The failure can occur in different tests, for example like this:

http://buildbot.askmonty.org/buildbot/builders/bintar-trusty-p8-debug/builds/173

worker[2] mysql-test-run: WARNING: Timeout after mysql-test-run waited 180 seconds for the process [mysqld.2 - pid: 31323, winpid: 31323] to create a pid file.
rpl.rpl_incident 'stmt'                  w2 [ fail ]
        Test ended at 2014-10-31 12:04:27
 
CURRENT_TEST: rpl.rpl_incident
 
 
Failed to start mysqld.2
mysqltest failed but provided no output
The result from queries just before the failure was:
< snip >
call mtr.add_suppression("Slave SQL.*The incident LOST_EVENTS occured on the master.* 1590");
include/wait_for_slave_sql_error.inc [errno=1590]
Last_SQL_Error = 'The incident LOST_EVENTS occured on the master. Message: <none>'
**** On Slave ****
SELECT * FROM t1;
a
1
2
3
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
START SLAVE;
SELECT * FROM t1;
a
1
2
3
4
include/check_slave_is_running.inc
DROP TABLE t1;
include/rpl_end.inc

More failures are seen in a cross-reference search:

http://buildbot.askmonty.org/buildbot/reports/cross_reference#branch=&revision=&platform=-p8&dt=&bbnum=&typ=&info=&fail_name=&fail_variant=&fail_info_short=&fail_info_full=Failed+to+start+mysqld&limit=100

It looks like it only fails in debug builds on Power8, but this is not 100%
confirmed.

Looking into the error log, we consistently see that it hangs during creation
of the Doublewrite buffer:

InnoDB: Doublewrite buffer not found: creating new

Normally there should be a message after this that the doublewrite buffer was
created successfully, but when the error occurs, instead the server hangs at
this point.

Here is a GDB stacktrace from when it hangs:

Thread 1 (Thread 0x3fff82fb5d90 (LWP 74233)):
#0  0x00003fff82f22190 in __pthread_cond_wait (cond=0x10029bf3150, mutex=0x10029bf30c8) at pthread_cond_wait.c:187
#1  0x0000000010b4f0ac in safe_cond_wait (cond=0x10029bf3150, mp=0x10029bf30a0, file=0x10d1a638 "/home/skysql/bzr/knielsen/5.5/storage/xtradb/os/os0sync.c", line=207)
    at /home/skysql/bzr/knielsen/5.5/mysys/thr_mutex.c:493
#2  0x0000000010a7eae4 in os_cond_wait (cond=0x10029bf3150, mutex=0x10029bf30a0) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/os/os0sync.c:207
#3  0x0000000010a7f6d8 in os_event_wait_low (event=0x10029bf30a0, reset_sig_count=2) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/os/os0sync.c:609
#4  0x00000000108f544c in sync_array_wait_event (arr=0x100292584d0, index=0) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/sync/sync0arr.c:459
#5  0x00000000108f9a20 in mutex_spin_wait (mutex=0x10029bf2ed8, file_name=0x10d01150 "/home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c", line=985)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/sync/sync0sync.c:636
#6  0x00000000109e48d4 in mutex_enter_func (mutex=0x10029bf2ed8, file_name=0x10d01150 "/home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c", line=985)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/include/sync0sync.ic:222
#7  0x00000000109e4a18 in pfs_mutex_enter_func (mutex=0x10029bf2ed8, file_name=0x10d01150 "/home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c", line=985)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/include/sync0sync.ic:251
#8  0x00000000109ee980 in fil_mutex_enter_and_prepare_for_io (space_id=0) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c:985
#9  0x00000000109f97c8 in _fil_io (type=11, sync=0, space_id=0, zip_size=0, block_offset=157, byte_offset=0, len=16384, buf=0x3fff7b538000, message=0x3fff7af542c0, trx=0x0)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c:5446
#10 0x00000000109933fc in buf_flush_write_block_low (bpage=0x3fff7af542c0) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1192
#11 0x0000000010993c64 in buf_flush_page (buf_pool=0x10029c07798, bpage=0x3fff7af542c0, flush_type=BUF_FLUSH_LIST) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1400
#12 0x0000000010994010 in buf_flush_try_neighbors (space=0, offset=157, flush_type=BUF_FLUSH_LIST, n_flushed=48, n_to_flush=18446744073709551614)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1541
#13 0x00000000109944c8 in buf_flush_page_and_try_neighbors (bpage=0x3fff7af542c0, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x3fffd3831ed0)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1638
#14 0x00000000109949dc in buf_flush_flush_list_batch (buf_pool=0x10029c07798, min_n=18446744073709551614, lsn_limit=18446744073709551615) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1768
#15 0x0000000010994b48 in buf_flush_batch (buf_pool=0x10029c07798, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=18446744073709551615)
    at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:1848
#16 0x00000000109952a8 in buf_flush_list (min_n=18446744073709551614, lsn_limit=18446744073709551615) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/buf/buf0flu.c:2065
#17 0x0000000010a53a1c in log_preflush_pool_modified_pages (new_oldest=18446744073709551615, sync=1) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/log/log0log.c:1874
#18 0x0000000010a548c8 in log_make_checkpoint_at (lsn=18446744073709551615, write_always=1) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/log/log0log.c:2313
#19 0x000000001091f0e8 in trx_sys_create_doublewrite_buf () at /home/skysql/bzr/knielsen/5.5/storage/xtradb/trx/trx0sys.c:407
#20 0x00000000108f2fbc in innobase_start_or_create_for_mysql () at /home/skysql/bzr/knielsen/5.5/storage/xtradb/srv/srv0start.c:2080
#21 0x0000000010867b90 in innobase_init (p=0x100292581c0) at /home/skysql/bzr/knielsen/5.5/storage/xtradb/handler/ha_innodb.cc:3225
#22 0x000000001047364c in ha_initialize_handlerton (plugin=0x100291a3d68) at /home/skysql/bzr/knielsen/5.5/sql/handler.cc:472
#23 0x0000000010255b68 in plugin_initialize (tmp_root=0x3fffd3833690, plugin=0x100291a3d68, argc=0x11126e70 <remaining_argc>, argv=0x10029128880, options_only=false)
    at /home/skysql/bzr/knielsen/5.5/sql/sql_plugin.cc:1372
#24 0x0000000010256718 in plugin_init (argc=0x11126e70 <remaining_argc>, argv=0x10029128880, flags=0) at /home/skysql/bzr/knielsen/5.5/sql/sql_plugin.cc:1613
#25 0x000000001015a080 in init_server_components () at /home/skysql/bzr/knielsen/5.5/sql/mysqld.cc:4337
#26 0x000000001015b808 in mysqld_main (argc=100, argv=0x10029128880) at /home/skysql/bzr/knielsen/5.5/sql/mysqld.cc:4934
#27 0x00000000101509d8 in main (argc=15, argv=0x3fffd3833f98) at /home/skysql/bzr/knielsen/5.5/sql/main.cc:25

Here is how I reproduced it, running the test in a loop and waiting for it to
hang:

bzr branch lp:maria/5.5
cd 5.5
export PATH=/opt/at7.0/bin:$PATH EXTRA_FLAGS="-O2 -Wuninitialized -DFORCE_INIT_OF_VARS" EXTRA_CONFIGS="--with-embedded-privilege-control" AM_EXTRA_MAKEFLAGS="VERBOSE=1"
cmake . -DCMAKE_BUILD_TYPE=Debug -DMYSQL_MAINTAINER_MODE=ON
time make -j4 package
(cd mysql-test && for i in `seq 40000` ; do echo XXX $i XXX ; time ./mtr --parallel=4 rpl.last_insert_id rpl.rpl_row_blob_innodb --verbose-restart || exit 1 ; done)



 Comments   
Comment by Kristian Nielsen [ 2014-11-05 ]

The problem is apparently that a waiting thread is not woken up when an InnoDB mutex is released:

	while (!event->is_set && event->signal_count == reset_sig_count)
		os_cond_wait(&(event->cond_var), &(event->os_mutex));
 
(gdb) p event->is_set
$3 = 0
(gdb) p event->signal_count
$4 = 2
(gdb) p reset_signal_count
No symbol "reset_signal_count" in current context.
(gdb) p reset_sig_count
$5 = 2
(gdb) p  !event->is_set && event->signal_count == reset_sig_count
$6 = 1

The wakeup condition is not true...

(gdb) p *mutex
$8 = {event = 0x10029bf30a0, lock_word = 0 '\000', waiters = 1, list = {prev = 0x10029bf2f70, next = 0x11953928 <srv_misc_tmpfile_mutex>}, 
  cfile_name = 0x10d01150 "/home/skysql/bzr/knielsen/5.5/storage/xtradb/fil/fil0fil.c", cline = 1673, thread_id = 18446744073709551615, magic_n = 979585, count_os_wait = 1, count_using = 1611, 
  count_spin_loop = 6, count_spin_rounds = 175, count_os_yield = 3, lspent_time = 0, lmax_spent_time = 0, mutex_type = 0, cmutex_name = 0x10d02228 "&fil_system->mutex", pfs_psi = 0x3fff81ff2330}
(gdb) p mutex->lock_word
$9 = 0 '\000'

But the mutex is not locked (lock_word == 0).

Looks like some issue with the InnoDB mutex implementation...
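The state in the dump (lock_word == 0 yet waiters == 1, and the event never signalled) is the classic lost-wakeup pattern. Here is a simplified, hypothetical sketch of the two sides of the protocol, with invented names (not the actual XtraDB source), showing where the reordering can bite:

```c
#include <stdatomic.h>

/* Toy model of ib_mutex_t; field names mirror the gdb dump above. */
typedef struct {
    atomic_int lock_word;   /* 0 = free, 1 = locked */
    atomic_int waiters;     /* 1 if a thread intends to sleep */
} toy_mutex;

static int wakeups;         /* stand-in for mutex_signal_object() */

/* Releasing side, as in mutex_exit_func(): store, then load. */
static void toy_exit(toy_mutex *m)
{
    atomic_store_explicit(&m->lock_word, 0, memory_order_relaxed);
    /* Race: without a full (StoreLoad) barrier here, the CPU may satisfy
       the load of waiters before the store to lock_word is visible,
       see waiters == 0, and skip the wakeup. */
    if (atomic_load_explicit(&m->waiters, memory_order_relaxed))
        wakeups++;          /* would call mutex_signal_object() */
}

/* Slow path of the acquiring side, as in mutex_spin_wait(). */
static int toy_enter_slow(toy_mutex *m)
{
    atomic_store_explicit(&m->waiters, 1, memory_order_relaxed);
    /* Symmetric race: this load may see a stale "locked" value while the
       waiters store above is not yet visible to the releasing thread. */
    if (atomic_load_explicit(&m->lock_word, memory_order_relaxed) != 0) {
        /* os_event_wait() -- can sleep forever if the release above
           missed the waiters flag: exactly the hang in the stack trace. */
        return 0;
    }
    return 1;               /* free: go retry the test-and-set */
}
```

If both relaxed loads are satisfied early, the releaser sees waiters == 0 while the acquirer sees lock_word != 0: the acquirer sleeps and nobody is left to wake it.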

Comment by Sergey Vojtovich [ 2014-11-06 ]

There is a comment in mutex_exit_func():

        /* A problem: we assume that mutex_reset_lock word
        is a memory barrier, that is when we read the waiters
        field next, the read must be serialized in memory
        after the reset. A speculative processor might
        perform the read first, which could leave a waiting
        thread hanging indefinitely.
 
        Our current solution call every second
        sync_arr_wake_threads_if_sema_free()
        to wake up possible hanging threads if
        they are missed in mutex_signal_object. */

Even though we added "os_isync" between the "lock_word store" and the "waiters load", it is not enough. Only a full memory barrier guarantees StoreLoad order.

So we still rely on the error monitor thread to wake up such stale threads. But the error monitor thread uses mutexes itself and may get stuck just like other threads. To work around this, Monty added log_get_lsn_nowait(), which basically does a trylock instead of a lock. This was added in revision 4271 (monty@mariadb.org-20140819162835-sorv0ogd39f7mui8) of 5.5.

In this case I suspect the error monitor thread is either not started yet or is stuck on a different mutex (other than the log_get_lsn mutex).
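For reference, the ordering constraint described here is the store-buffering (Dekker) pattern: each side stores its own flag and then loads the other's, and only a full barrier (sync on POWER, a locked op or mfence on x86) keeps the load from completing before the store is globally visible; isync does not. A runnable sketch using C11 fences (names invented for illustration):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Store-buffering litmus test.  With the seq_cst fences, the outcome
   r1 == 0 && r2 == 0 is forbidden; with isync-strength (or no) barriers
   it is allowed on POWER, which is the lost-wakeup race of this bug. */
static atomic_int flag_a, flag_b;
static int r1, r2;

static void *side_a(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r1 = atomic_load_explicit(&flag_b, memory_order_relaxed);
    return NULL;
}

static void *side_b(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r2 = atomic_load_explicit(&flag_a, memory_order_relaxed);
    return NULL;
}

/* Returns how many iterations ended with both threads reading 0;
   the fences guarantee this is always 0. */
static int run_litmus(int iters)
{
    int bad = 0;
    for (int i = 0; i < iters; i++) {
        pthread_t ta, tb;
        atomic_store(&flag_a, 0);
        atomic_store(&flag_b, 0);
        pthread_create(&ta, NULL, side_a, NULL);
        pthread_create(&tb, NULL, side_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        if (r1 == 0 && r2 == 0)
            bad++;
    }
    return bad;
}
```

Weakening the fences to acquire/release (or removing them) makes the "both read 0" outcome legal, which in the mutex corresponds to the releaser skipping the wakeup while the waiter goes to sleep.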

Comment by Sergey Vojtovich [ 2014-11-06 ]

See also https://mariadb.atlassian.net/browse/MDEV-6654

Comment by Kristian Nielsen [ 2014-11-06 ]

I wrote up a detailed analysis on maria-developers@ of what appears to be the
cause of the bug.

https://lists.launchpad.net/maria-developers/msg07860.html

Incidentally, it appears that this same bug is also what is
causing various strange hangs for a number of users lately.

I agree that in this case the error monitor thread is not yet started (it was
not seen in the full stack trace). And in the user hang cases, the error
monitor thread itself gets stuck on a mutex (I saw this in the case I
investigated).

However, I strongly disagree that the whole idea of using the error monitor
thread, or trylock, is in any way an acceptable solution. If the mutex
implementation is broken, it should be fixed, not worked around with horrible
hacks.

Comment by Sergey Vojtovich [ 2014-11-12 ]

Kristian, please review fix for this bug.

Comment by Sergey Vojtovich [ 2014-11-12 ]

...second patch:

Date: Wed, 12 Nov 2014 11:49:27 +0000 (UTC)
From: Sergey Vojtovich <svoj@mariadb.org>
To: commits@mariadb.org
Subject: [Commits] Rev 4346: MDEV-7026 - Occasional hang during startup on Power8 in lp:maria/5.5

Comment by Kristian Nielsen [ 2014-11-14 ]

Review sent to maria-developers@:

https://lists.launchpad.net/maria-developers/msg07896.html

Comment by Sergey Vojtovich [ 2014-11-18 ]

Kristian, please review updated patch.

Comment by Kristian Nielsen [ 2014-11-18 ]

New review sent to maria-developers@:

https://lists.launchpad.net/maria-developers/msg07913.html

Comment by Kristian Nielsen [ 2014-11-19 ]

Patch to clarify suggestions in previously made review:

http://lists.askmonty.org/pipermail/commits/2014-November/007050.html
lp:~maria-captains/maria/5.5-mdev7026

Comment by Sergey Vojtovich [ 2014-11-25 ]

Fixed in 10.0.15, 5.5.41:

revno: 4367
revision-id: knielsen@knielsen-hq.org-20141119125646-njq02h7irmu9s71b
parent: sergii@pisem.net-20141118231952-fk3o69075dn3ravv
committer: Kristian Nielsen <knielsen@knielsen-hq.org>
branch nick: mariadb-5.5
timestamp: Wed 2014-11-19 13:56:46 +0100
message:
  MDEV-7026: Race in InnoDB/XtraDB mutex implementation can stall or hang the server.
 
  The bug was that full memory barrier was missing in the code that ensures that
  a waiter on an InnoDB mutex will not go to sleep unless it is guaranteed to be
  woken up again by another thread currently holding the mutex. This made
  possible a race where a thread could get stuck waiting for a mutex that is in
  fact no longer locked. If that thread was also holding other critical locks,
  this could stall the entire server. There is an error monitor thread that can
  break the stall; it runs about once per second. But if the error monitor
  thread itself got stuck or was not running, then the entire server could hang
  indefinitely.
 
  This was introduced on i386/amd64 platforms in 5.5.40 and 10.0.13 by an
  incorrect patch that tried to fix the similar problem for PowerPC.
 
  This commit reverts the incorrect PowerPC patch, and instead implements a fix
  for PowerPC that does not change i386/amd64 behaviour, making PowerPC work
  similarly to i386/amd64.
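In sketch form (hypothetical names, not the committed code), the fix amounts to placing a full barrier between the lock_word store and the waiters load on the release path, so the waiters check cannot be satisfied before the unlock is globally visible:

```c
#include <stdatomic.h>

/* Hypothetical, simplified release path after the fix; the real code
   lives in storage/xtradb (mutex_exit_func). */
typedef struct {
    atomic_int lock_word;   /* 0 = free, 1 = locked */
    atomic_int waiters;
} toy_mutex;

static int wakeups;         /* stand-in for mutex_signal_object() */

static void toy_exit_fixed(toy_mutex *m)
{
    atomic_store_explicit(&m->lock_word, 0, memory_order_release);
    /* Full StoreLoad barrier: sync on POWER, mfence/locked op on x86.
       Guarantees that if a waiter set waiters before sleeping, we
       observe it here and issue the wakeup. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&m->waiters, memory_order_relaxed))
        wakeups++;          /* would call mutex_signal_object() */
}
```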

Comment by Stewart Smith [ 2015-07-17 ]

The use of __sync_synchronize() here is incorrect, as it generates a sync instruction, which is a heavyweight barrier only needed if you're writing device drivers.

__sync_lock_release() is sufficient.

Realistically, __atomic_test_and_set() and __atomic_clear() should be used these days.
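A minimal lock along the lines suggested here, as a sketch only (note that acquire/release ordering on the lock word alone still would not order the separate waiters load discussed earlier in this issue):

```c
#include <stdbool.h>

/* Hypothetical spinlock built on the GCC __atomic builtins. */
typedef struct { bool locked; } spin_t;

static void spin_lock(spin_t *l)
{
    /* __atomic_test_and_set returns the previous value: loop until we
       are the thread that flipped it from clear to set. */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;                   /* spin; real code would pause/yield here */
}

static void spin_unlock(spin_t *l)
{
    /* Plain release store, no heavyweight sync instruction. */
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}
```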

Comment by Sergey Vojtovich [ 2015-07-17 ]

stewart-ibm, right. A good mutex implementation should never need to issue a heavyweight sync.

In 5.5 and 10.0 the old mutex implementation was kept intact as much as possible, and for that implementation this sync is a must.

In 10.1 there was an aim (MDEV-6654) to reimplement InnoDB mutexes. This was suspended because InnoDB also reimplemented mutexes in MySQL-5.7. We'll probably get back to this when we have InnoDB from 5.7.

Comment by Stewart Smith [ 2015-07-17 ]

You should be able to remove the __sync_synchronize() and everything will still work. There's no reason for the sync instruction to be there.

Comment by Kristian Nielsen [ 2015-07-17 ]

I do not plan to re-visit this issue myself. But be sure to read the mailing list discussion on this in the archive; it went into great detail on what kind of barriers are needed in the current implementation, and why.

Comment by Sergey Vojtovich [ 2015-07-17 ]

stewart-ibm, we added __sync_synchronize() to make sure "lock_word" is stored before "waiters" is loaded. That is to ensure StoreLoad order.

You say it's not needed. If so, how can we ensure StoreLoad order in this case?

Generated at Thu Feb 08 07:16:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.