Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 5.5.36
Fix Version/s: None
Component/s: None
Environment: Ubuntu 14.04 - Trusty on AArch64, Juno development board.
Description
Hi,
We've run into an issue with MariaDB when running the Sysbench "oltp.lua" test with 8 threads. The server daemon crashed, most often with an assertion failure at storage/xtradb/fil/fil0fil.c:5288:
fil_node_complete_io(
/*=================*/
	fil_node_t*	node,	/*!< in: file node */
	fil_system_t*	system,	/*!< in: tablespace memory cache */
	ulint		type)	/*!< in: OS_FILE_WRITE or OS_FILE_READ; marks
				the node as modified if
				type == OS_FILE_WRITE */
{
	ut_ad(node);
	ut_ad(system);
	ut_ad(mutex_own(&(system->mutex)));

	ut_a(node->n_pending > 0);	<-- failure point

	node->n_pending--;
An attached debugger gave the following backtrace:
(gdb) bt full
#0  0x0000007fb1d44d18 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        _sys_result = 0
        pd = 0x7fa2fff1a0
        pid = <optimised out>
        selftid = 5661
#1  0x0000007fb1d4818c in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f00000000,
            sa_sigaction = 0x7f00000000}, sa_mask = {__val = {548445445976,
              404, 1, 404, 0, 366924161824, 548195526816, 366921716804,
              366927454208, 3, 0, 548434850424, 366927869208, 366936283408,
              548195527344, 548434850424}}, sa_flags = 5288,
          sa_restorer = 0xa2fff1a0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x000000556e3d1448 in fil_node_complete_io (node=<optimised out>,
    system=<optimised out>, type=<optimised out>)
    at /build/buildd/mariadb-5.5-5.5.36/storage/xtradb/fil/fil0fil.c:5288
No locals.
#3  0x000000556e3db800 in fil_aio_wait (segment=segment@entry=3)
    at /build/buildd/mariadb-5.5-5.5.36/storage/xtradb/fil/fil0fil.c:5705
        ret = <optimised out>
        fil_node = 0x7fb14a0e78
        message = 0x7fa54d4350
        type = 10
        space_id = 0
#4  0x000000556e3592a4 in io_handler_thread (arg=<optimised out>)
    at /build/buildd/mariadb-5.5-5.5.36/storage/xtradb/srv/srv0start.c:486
        segment = 3
#5  0x0000007fb220ae2c in start_thread (arg=0x7fa2fff1a0)
    at pthread_create.c:314
        pd = 0x7fa2fff1a0
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {548195529120,
              548981639624, 548449456128, 0, 548449452032, 548195529312,
              548195527344, 548443965168, 8388608, 548449472512,
              548195527056, 13770210553321828185, 0, 13770210553602140361,
              0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {
            0x0, 0x0, 0x7fb220ad7c <start_thread>, 0x7fa2fff1a0}, data = {
            prev = 0x0, cleanup = 0x0, canceltype = -1306481284}}}
        not_first_call = 0
        pagesize_m1 = <optimised out>
        sp = <optimised out>
        freesize = <optimised out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x0000007fb1dd9c40 in clone ()
    at ../ports/sysdeps/unix/sysv/linux/aarch64/nptl/../clone.S:96
No locals.
|
Once the daemon had crashed, we were sometimes unable to start it again without wiping out the database and re-installing it.
Having done some digging, it is apparent that there is a problem in the mutex_exit code path, in particular at:
http://bazaar.launchpad.net/~maria-captains/maria/5.5/view/head:/storage/xtradb/include/sync0sync.ic#L106
A load-acquire is used to exit the mutex rather than a store-release. This leads to unpredictable results for architectures with a weak memory model.
We have the following in program order:
- mutex_enter -> load-acquire lock, loop until it is 0, then set to 1 relaxed
- protected work
- mutex_exit -> load-acquire lock, set it to 0 regardless.
However, the following sequence of events can be observed by another core:
- mutex_enter -> load-acquire lock, loop until it is 0, then set to 1 relaxed
- some of the protected work
- mutex_exit -> load-acquire lock, set it to 0 regardless.
- some more of the protected work (not protected).
The above can (and, on our test system, did) lead to severe data corruption that prevents the daemon from even restarting.
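To make the ordering hazard concrete, here is a minimal sketch of just the exit path, written with the GCC __atomic builtins. It is illustrative only, not the actual sync0sync.ic code, and lock_word is a hypothetical name:

typedef unsigned char byte;

/* BROKEN exit, roughly what is described above: an acquire-ordered load of
   the lock word followed by a plain store of 0.  Nothing orders the writes
   made inside the critical section before that store, so on a weakly
   ordered CPU such as AArch64 another core may see the mutex as free while
   some of the protected writes are still pending. */
void
mutex_exit_broken(volatile byte* lock_word)
{
	(void) __atomic_load_n(lock_word, __ATOMIC_ACQUIRE);
	*lock_word = 0;
}

/* CORRECT exit: a store-release guarantees that every write made while the
   mutex was held becomes visible before the lock word reads 0. */
void
mutex_exit_fixed(volatile byte* lock_word)
{
	__atomic_store_n(lock_word, (byte) 0, __ATOMIC_RELEASE);
}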
I've attached an emergency patch that re-introduces __sync_lock_release to release the mutex. This fixes the crash and data corruption issues for me, but I understand from comments in the code that there were issues with this function in the past. Could the gcc intrinsics be moved over to the __atomic_* functions? Ideally:
To acquire the lock:
__atomic_exchange_n(ptr, (byte) new_val, __ATOMIC_ACQUIRE)
To release the lock:
__atomic_store_n(ptr, (byte) new_val, __ATOMIC_RELEASE)
(which also worked on my test system).
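For reference, a rough sketch of how the two proposed intrinsics would pair up around a lock word (requires GCC 4.7 or later; the function names and spin loop are illustrative, not the existing XtraDB API):

typedef unsigned char byte;

/* Take the lock word with acquire semantics; the previous value is
   returned, so 0 means the caller now owns the mutex. */
static inline byte
lock_word_test_and_set(volatile byte* ptr)
{
	return __atomic_exchange_n(ptr, (byte) 1, __ATOMIC_ACQUIRE);
}

/* Release the lock word with a store-release. */
static inline void
lock_word_release(volatile byte* ptr)
{
	__atomic_store_n(ptr, (byte) 0, __ATOMIC_RELEASE);
}

/* Example use: spin until the exchange observes 0.  Real code would back
   off or yield inside the loop. */
static inline void
spin_mutex_enter(volatile byte* ptr)
{
	while (lock_word_test_and_set(ptr) != 0) {
	}
}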
I believe this issue may affect other versions of MariaDB, but I've only tested 5.5.36.
Cheers,
–
Steve Capper