Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27088

Server crash on ARM (WMM architecture) due to missing barriers in lf-hash

Details

    Description

      MariaDB server crashes on ARM (weak memory model architecture) while concurrently executing l_find to load node->key and add_to_purgatory to store node->key = NULL. l_find then uses key (which is NULL), to pass it to a comparison function.

      The specific problem is the out-of-order execution that happens on a weak memory model architecture. Two essential reorderings are possible, which need to be prevented.

      a) As l_find has no barriers in place between the optimistic read of the key field lf_hash.cc#L117 and the verification of link lf_hash.cc#L124, the processor can reorder the load to happen after the while-loop.

      In that case, a concurrent thread executing add_to_purgatory on the same node can be scheduled to store NULL at the key field lf_alloc-pin.c#L253 before key is loaded in l_find.

      b) A node is marked as deleted by a CAS in l_delete lf_hash.cc#L247 and taken off the list with an upfollowing CAS lf_hash.cc#L252. Only if both CAS succeed, the key field is written to by add_to_purgatory. However, due to a missing barrier, the relaxed store of key lf_alloc-pin.c#L253 can be moved ahead of the two CAS operations, which makes the value of the local purgatory list stored by add_to_purgatory visible to all threads operating on the list. As the node is not marked as deleted yet, the same error occurs in l_find.

      Stack trace

      Thread 179 "lf-t" received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0xffff9e7cf1d0 (LWP 1256877)]
      0x0000fffff772c8d8 in memcmp () from /lib/aarch64-linux-gnu/libc.so.6
      (gdb) bt
      #0  0x0000fffff772c8d8 in memcmp () from /lib/aarch64-linux-gnu/libc.so.6
      #1  0x0000000000405580 in my_strnncoll_binary (cs=<optimized out>, s=0x0, slen=4,
          t=0xffff9e7ce968 "\234\267\277xz", tlen=<optimized out>, t_is_prefix=<optimized out>)
          at /workspace/mariadb-server/strings/ctype-bin.c:87
      #2  0x0000000000405038 in l_find (head=<optimized out>, cs=<optimized out>,
          hashnr=<optimized out>, key=<optimized out>, keylen=<optimized out>,
          cursor=<optimized out>, pins=<optimized out>, callback=<optimized out>)
          at /workspace/mariadb-server/mysys/lf_hash.c:133
      #3  0x0000000000403bec in l_delete (head=0xffffe8003500, cs=0x434458 <my_charset_bin>,
          hashnr=807645945, key=<optimized out>, keylen=4, pins=<optimized out>)
          at /workspace/mariadb-server/mysys/lf_hash.c:231
      #4  lf_hash_delete (pins=<optimized out>, key=<optimized out>, keylen=4,
          hash=<optimized out>) at /workspace/mariadb-server/mysys/lf_hash.c:455
      #5  test_lf_hash (arg=<optimized out>)
          at /workspace/mariadb-server/unittest/mysys/lf-t.c:155
      #6  0x0000fffff7b17088 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
      #7  0x0000fffff777fffc in ?? () from /lib/aarch64-linux-gnu/libc.so.6
      

      In frame #1 my_strnncoll_binary gets s=0x0 as a source pointer from l_find and hands that to memcmp, which causes a SEGV upon NULL pointer input.

      Reproduce

      The crash can be observed by running the lf-t unittest, as well as the oltp_write sysbench benchmark upon the most recent repository version. The server is build using optimization level -O0 or -O1. No other change to the configuration is done. unittest/mysys/lf-t is run until the crash is triggered.

      The probability of hitting the bug can be increased easily.

      The build script (build.sh) used to generate the faulty binary is attached.
      To increase the probability the following changes can be done.

      Output

      The current repository version produces a crash. (unfixed)

      ./lf-t
      # N CPUs: 96
      1..6
      # Testing lf_hash (without my_thread_init) with 30 threads, 3000000 iterations... 
      Bail out! Signal 11 thrown
      # 6 tests planned,  0 failed,  0 was last executed
      

      The version with the propsed fix finishes as expected.

      ./lf-t
      # N CPUs: 96
      1..6
      # Testing lf_hash (without my_thread_init) with 30 threads, 3000000 iterations... 
      # 0 mallocs, 30 pins in stack, 4096 hash size, 37857309 inserts, 721149 scans
      ok 1 - tested lf_hash (without my_thread_init) in 82.0496 secs (0)
      # 6 tests planned but only 1 executed
      

      Solution

      Find attached a diff that changes three accesses to be atomic.

      1. optimistic read of key in l_find lf_hash.cc#L117
      2. read of link for verification lf_hash.cc#L124
      3. write of key in add_to_purgatory lf_alloc-pin.c#L253

      (1 & 2) The implicit SC fence of the atomic load of key and link disallows reordering the load of key after load of link. (3) The implicit RELEASE fence of the atomic write of key disallows reordering the write of key to happen before the CAS that removed the node from the list.

      Attached is also a minimal working example for the bug, which was used to verify the fixed version with GenMC, a model checker for weak memory models. The model checker can be applied to the attached MWE to get an execution on a weak memory model architecture that leads to the bug. It was also applied to a much more complete extraction of the actual lock-free hash from MariaDB to verify the data structure. However, due to ease of presentation, only the MWE is included here. If there is further interest in the extracted lf-hash and its verification with GenMC, it can be supplied.

      Relation to MDEV-23510

      The problem described in MDEV-23510 is the result of the herein described bug.
      However, the commit attached to it d30c1331a18d875e553f3fcf544997e4f33fb943, is not fixing the bug, but hiding it, partially.

      The bug is not related to the cache line size, or splitting between two cache lines.
      Still the commit affects the crash.
      Moving link and key right next to each other in memory, plus removing the volatile declaration of link allows an optimization to take place during code generation for ARM that combines two load ldr instructions for consecutive memory positions into a single ldp instruction for loading a pair. This instruction loads link and key together in a single instruction. Reordering this instruction to happen after the loop verification (L124) is not possible, as the load of link needs to happen for the verification. As such the load of key cannot be reordered to be executed after verification.

      However, depending on a specific compiler optimization (Aarch64 Load Store Optimization) for correctness of the algorithm is risky and the algorithm is still incorrect. This can easily be seen as optimization levels -O0 and -O1 do not produce assembly containing the necessary pair load ldp of link and key. As such, both versions do crash on ARM. This fragile state might break due to changes in the struct LF_SLIST, changes to the algorithm or compiler changes.

      Attachments

        1. abstracted_crash_mwe.c
          2 kB
          Martin Beck
        2. abstracted_crash_mwe_fixed.c
          2 kB
          Martin Beck
        3. build.sh
          0.3 kB
          Martin Beck
        4. purgatory_fix.diff
          2 kB
          Martin Beck
        5. parallel.sh
          0.2 kB
          Martin Beck

        Issue Links

          Activity

            danblack Daniel Black added a comment -

            Thansk mbeck for the detail and why the previous only partially worked.

            dan@eq-arm1:~/build-mariadb-10.2$ /usr/bin/clang --version
            Debian clang version 11.0.1-2
            Target: aarch64-unknown-linux-gnu
            Thread model: posix
             
            dan@eq-arm1:~/build-mariadb-10.2$ cmake  ../mariadb-10.2/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-O1" -DCMAKE_CXX_FLAGS=-O1
             
            dan@eq-arm1:~/build-mariadb-10.2$ cmake --build . --verbose
            [1/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -P /home/dan/mariadb-10.2/cmake/info_bin.cmake
            [2/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -P /home/dan/mariadb-10.2/cmake/info_src.cmake
            [3/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -DCOMPILER=/usr/bin/clang -DSOURCE_DIR=/home/dan/mariadb-10.2 -DBINARY_DIR=/home/dan/build-mariadb-10.2 "-DABI_HEADERS=/home/dan/mariadb-10.2/include/mysql/plugin_audit.h;/home/dan/mariadb-10.2/include/mysql/plugin_ftparser.h;/home/dan/mariadb-10.2/include/mysql.h;/home/dan/mariadb-10.2/include/mysql/psi/psi_abi_v1.h;/home/dan/mariadb-10.2/include/mysql/psi/psi_abi_v2.h;/home/dan/mariadb-10.2/include/mysql/client_plugin.h;/home/dan/mariadb-10.2/include/mysql/plugin_auth.h;/home/dan/mariadb-10.2/include/mysql/plugin_password_validation.h;/home/dan/mariadb-10.2/include/mysql/plugin_encryption.h" -P /home/dan/mariadb-10.2/cmake/do_abi_check.cmake
            [4/7] /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings   -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -MF unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o.d -o unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/lf-t.c
            [5/7] /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings   -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -MF unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o.d -o unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/my_atomic-t.c
            [6/7] : && /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -o unittest/mysys/lf-t  -lpthread  unittest/mytap/libmytap.a  mysys/libmysys.a  dbug/libdbug.a  mysys/libmysys.a  dbug/libdbug.a  strings/libstrings.a  -lz  -lm  -ldl  -lpthread && :
            [7/7] : && /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -o unittest/mysys/my_atomic-t  -lpthread  unittest/mytap/libmytap.a  mysys/libmysys.a  dbug/libdbug.a  mysys/libmysys.a  dbug/libdbug.a  strings/libstrings.a  -lz  -lm  -ldl  -lpthread && :
            dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t
            # N CPUs: 160, atomic ops: gcc-atomics-smp
            1..7
            ok 1 - my_atomic_initialize() returned 0
            # Testing lf_hash (without my_thread_init) with 160 threads, 300000 iterations... 
            # 0 mallocs, 160 pins in stack, 4096 hash size, 16167030 inserts, 54494 scans
            ok 2 - tested lf_hash (without my_thread_init) in 115.919 secs (0)
            # 7 tests planned but only 2 executed
            dan@eq-arm1:~/build-mariadb-10.2$ cmake --build .
            [7/7] Linking CXX executable unittest/mysys/lf-t
            dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t
            # N CPUs: 160, atomic ops: gcc-atomics-smp
            1..7
            ok 1 - my_atomic_initialize() returned 0
            # Testing lf_hash (without my_thread_init) with 2 threads, 30000000 iterations... 
            # 0 mallocs, 2 pins in stack, 2048 hash size, 14880152 inserts, 7568227 scans
            ok 2 - tested lf_hash (without my_thread_init) in 126.009 secs (0)
            # 7 tests planned but only 2 executed
            dan@eq-arm1:~/build-mariadb-10.2$  /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings   -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -MF unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o.d -o unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/lf-t.c
            dan@eq-arm1:~/build-mariadb-10.2$ /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti  -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -o unittest/mysys/lf-t  -lpthread  unittest/mytap/libmytap.a  mysys/libmysys.a  dbug/libdbug.a  mysys/libmysys.a  dbug/libdbug.a  strings/libstrings.a  -lz  -lm  -ldl  -lpthread
            dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t
            # N CPUs: 160, atomic ops: gcc-atomics-smp
            1..7
            ok 1 - my_atomic_initialize() returned 0
            # Testing lf_hash (without my_thread_init) with 2 threads, 30000000 iterations... 
            # 0 mallocs, 2 pins in stack, 1024 hash size, 14999984 inserts, 7490617 scans
            ok 2 - tested lf_hash (without my_thread_init) in 107.189 secs (0)
            # 7 tests planned but only 2 executed
             
            dan@eq-arm1:~/build-mariadb-10.2$ more /proc/cpuinfo 
            processor	: 0
            BogoMIPS	: 50.00
            Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
            CPU implementer	: 0x41
            CPU architecture: 8
            CPU variant	: 0x3
            CPU part	: 0xd0c
            CPU revision	: 1
            

            So I'm having trouble reproducing it. Your threads shows 30 despite being 2 in the descriptoin.

            With changes, this does passes mtr and lf-t in loop

            I do believe you however:

            https://github.com/MariaDB/server/pull/1951 (10.2) - please check I did the casts right here.
            https://github.com/MariaDB/server/pull/1952 (minor differences in 10.5 compared to 10.2)

            Should the unit tests CYCLES be increased to your reproducer values?

            danblack Daniel Black added a comment - Thansk mbeck for the detail and why the previous only partially worked. dan@eq-arm1:~/build-mariadb-10.2$ /usr/bin/clang --version Debian clang version 11.0.1-2 Target: aarch64-unknown-linux-gnu Thread model: posix   dan@eq-arm1:~/build-mariadb-10.2$ cmake ../mariadb-10.2/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-O1" -DCMAKE_CXX_FLAGS=-O1   dan@eq-arm1:~/build-mariadb-10.2$ cmake --build . --verbose [1/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -P /home/dan/mariadb-10.2/cmake/info_bin.cmake [2/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -P /home/dan/mariadb-10.2/cmake/info_src.cmake [3/7] cd /home/dan/build-mariadb-10.2 && /usr/bin/cmake -DCOMPILER=/usr/bin/clang -DSOURCE_DIR=/home/dan/mariadb-10.2 -DBINARY_DIR=/home/dan/build-mariadb-10.2 "-DABI_HEADERS=/home/dan/mariadb-10.2/include/mysql/plugin_audit.h;/home/dan/mariadb-10.2/include/mysql/plugin_ftparser.h;/home/dan/mariadb-10.2/include/mysql.h;/home/dan/mariadb-10.2/include/mysql/psi/psi_abi_v1.h;/home/dan/mariadb-10.2/include/mysql/psi/psi_abi_v2.h;/home/dan/mariadb-10.2/include/mysql/client_plugin.h;/home/dan/mariadb-10.2/include/mysql/plugin_auth.h;/home/dan/mariadb-10.2/include/mysql/plugin_password_validation.h;/home/dan/mariadb-10.2/include/mysql/plugin_encryption.h" -P /home/dan/mariadb-10.2/cmake/do_abi_check.cmake [4/7] /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -MF unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o.d -o unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/lf-t.c [5/7] /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -MF unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o.d -o unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/my_atomic-t.c [6/7] : && /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -o unittest/mysys/lf-t -lpthread unittest/mytap/libmytap.a mysys/libmysys.a dbug/libdbug.a mysys/libmysys.a dbug/libdbug.a strings/libstrings.a -lz -lm -ldl -lpthread && : [7/7] : && /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti -O2 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/my_atomic-t.dir/my_atomic-t.c.o -o unittest/mysys/my_atomic-t -lpthread unittest/mytap/libmytap.a mysys/libmysys.a dbug/libdbug.a mysys/libmysys.a dbug/libdbug.a strings/libstrings.a -lz -lm -ldl -lpthread && : dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t # N CPUs: 160, atomic ops: gcc-atomics-smp 1..7 ok 1 - my_atomic_initialize() returned 0 # Testing lf_hash (without my_thread_init) with 160 threads, 300000 iterations... # 0 mallocs, 160 pins in stack, 4096 hash size, 16167030 inserts, 54494 scans ok 2 - tested lf_hash (without my_thread_init) in 115.919 secs (0) # 7 tests planned but only 2 executed dan@eq-arm1:~/build-mariadb-10.2$ cmake --build . [7/7] Linking CXX executable unittest/mysys/lf-t dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t # N CPUs: 160, atomic ops: gcc-atomics-smp 1..7 ok 1 - my_atomic_initialize() returned 0 # Testing lf_hash (without my_thread_init) with 2 threads, 30000000 iterations... # 0 mallocs, 2 pins in stack, 2048 hash size, 14880152 inserts, 7568227 scans ok 2 - tested lf_hash (without my_thread_init) in 126.009 secs (0) # 7 tests planned but only 2 executed dan@eq-arm1:~/build-mariadb-10.2$ /usr/bin/clang -DDBUG_TRACE -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -Iinclude -I/home/dan/mariadb-10.2/include -I/home/dan/mariadb-10.2/unittest/mytap -O1 -fstack-protector --param=ssp-buffer-size=4 -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -DHAVE_YASSL -DYASSL_PREFIX -DHAVE_OPENSSL -DMULTI_THREADED -MD -MT unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -MF unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o.d -o unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -c /home/dan/mariadb-10.2/unittest/mysys/lf-t.c dan@eq-arm1:~/build-mariadb-10.2$ /usr/bin/clang++ -O1 -fstack-protector --param=ssp-buffer-size=4 -fno-rtti -g -DNDEBUG -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wdeclaration-after-statement -Wextra -Wformat-security -Wno-init-self -Wno-null-conversion -Wno-unused-parameter -Wno-unused-private-field -Woverloaded-virtual -Wvla -Wwrite-strings -Wl,-z,relro,-z,now unittest/mysys/CMakeFiles/lf-t.dir/lf-t.c.o -o unittest/mysys/lf-t -lpthread unittest/mytap/libmytap.a mysys/libmysys.a dbug/libdbug.a mysys/libmysys.a dbug/libdbug.a strings/libstrings.a -lz -lm -ldl -lpthread dan@eq-arm1:~/build-mariadb-10.2$ ./unittest/mysys/lf-t # N CPUs: 160, atomic ops: gcc-atomics-smp 1..7 ok 1 - my_atomic_initialize() returned 0 # Testing lf_hash (without my_thread_init) with 2 threads, 30000000 iterations... # 0 mallocs, 2 pins in stack, 1024 hash size, 14999984 inserts, 7490617 scans ok 2 - tested lf_hash (without my_thread_init) in 107.189 secs (0) # 7 tests planned but only 2 executed   dan@eq-arm1:~/build-mariadb-10.2$ more /proc/cpuinfo processor : 0 BogoMIPS : 50.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0c CPU revision : 1 So I'm having trouble reproducing it. Your threads shows 30 despite being 2 in the descriptoin. With changes, this does passes mtr and lf-t in loop I do believe you however: https://github.com/MariaDB/server/pull/1951 (10.2) - please check I did the casts right here. https://github.com/MariaDB/server/pull/1952 (minor differences in 10.5 compared to 10.2) Should the unit tests CYCLES be increased to your reproducer values?
            mbeck Martin Beck added a comment -

            The number of CYCLES and THREADS did influence the probability of hitting the bug during execution. I just reran with 30 threads and a cycle count of 30000 and hit the bug within a few seconds.

            With the previous patch (compiled with -O1; at commit e0ba68ba34f5623cfa3c61e2e1966971703297f5) I get about 6% (213/3200) chance to hit the bug:

            $ for i in `seq 100`; do ./parallel.sh ./lf-t; done | grep 'Bail out' | wc -l
            213
            

            Without that patch (compiled with -O3 at commit 666565c7f00b0f39bbb459a428efd0093ed05fc8) the chance to hit the bug is about 38% (1247/3200):

            $ for i in `seq 100`; do ./parallel.sh ./lf-t; done | grep 'Bail out' | wc -l
            1247
            

            On 10.2.41, without changing THREADS or CYCLES, just the vanilla source, 0.5% chance (16/3200):

            $ cmake  ../mariadb-10.2/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-O1" -DCMAKE_CXX_FLAGS=-O1
            $ cmake --build . --parallel=100 --verbose
            $ for i in `seq 100`; do ../parallel.sh unittest/mysys/lf-t; done | grep 'Bail out' | wc -l
            16
            

            parallel.sh is attached, but basically just spawns 32 processes in parallel and waits.
            parallel.sh

            clang --version
            clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)
            Target: aarch64-unknown-linux-gnu
            Thread model: posix
            InstalledDir: /usr/bin
            

            The default CYCLE count is rather low to exhibit concurrency bugs, as there is not much time for the OS to generate many interleavings. The probability of this bug appearing is increased by increasing CYCLES. 30000 was just a tradeoff that worked in my case.

            I will review the propositions (pull 1951 / 1952 later)

            mbeck Martin Beck added a comment - The number of CYCLES and THREADS did influence the probability of hitting the bug during execution. I just reran with 30 threads and a cycle count of 30000 and hit the bug within a few seconds. With the previous patch (compiled with -O1; at commit e0ba68ba34f5623cfa3c61e2e1966971703297f5 ) I get about 6% (213/3200) chance to hit the bug: $ for i in `seq 100`; do ./parallel.sh ./lf-t; done | grep 'Bail out' | wc -l 213 Without that patch (compiled with -O3 at commit 666565c7f00b0f39bbb459a428efd0093ed05fc8 ) the chance to hit the bug is about 38% (1247/3200): $ for i in `seq 100`; do ./parallel.sh ./lf-t; done | grep 'Bail out' | wc -l 1247 On 10.2.41, without changing THREADS or CYCLES, just the vanilla source, 0.5% chance (16/3200): $ cmake ../mariadb-10.2/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS="-O1" -DCMAKE_CXX_FLAGS=-O1 $ cmake --build . --parallel=100 --verbose $ for i in `seq 100`; do ../parallel.sh unittest/mysys/lf-t; done | grep 'Bail out' | wc -l 16 parallel.sh is attached, but basically just spawns 32 processes in parallel and waits. parallel.sh clang --version clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/bin The default CYCLE count is rather low to exhibit concurrency bugs, as there is not much time for the OS to generate many interleavings. The probability of this bug appearing is increased by increasing CYCLES. 30000 was just a tradeoff that worked in my case. I will review the propositions (pull 1951 / 1952 later)
            danblack Daniel Black added a comment -

            Thank you very much mbeck and svoj

            danblack Daniel Black added a comment - Thank you very much mbeck and svoj
            danblack Daniel Black added a comment -

            note lf.hash test still failing - https://buildbot.mariadb.org/#/builders/295/builds/3971/steps/6/logs/stdio

            if you see why please create/reopen a bug report.

            danblack Daniel Black added a comment - note lf.hash test still failing - https://buildbot.mariadb.org/#/builders/295/builds/3971/steps/6/logs/stdio if you see why please create/reopen a bug report.

            People

              danblack Daniel Black
              mbeck Martin Beck
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.