Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17092

ERROR: AddressSanitizer: use-after-poison around lock_trx_handle_wait_low

Details

    Description

      RQG testing with the test conf/mariadb/tables_stress* on 
      bb-10.2-marko commit 0aa9db1b4877ed411eae4175c059546fa07faa4f
      compiled with ASAN
       
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
      #5  0x00007f8f6062737a in __GI_abort () at abort.c:89
      #6  0x00007f8f61db9169 in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.3
      #7  0x00007f8f61dae06b in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.3
      #8  0x00007f8f61da7457 in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.3
      #9  0x00007f8f61da7b47 in __asan_report_load1 () from /usr/lib/x86_64-linux-gnu/libasan.so.3
      #10 0x000055ad8d70277d in lock_trx_handle_wait_low (trx=0x7f8f55009778)
        at storage/innobase/lock/lock0lock.cc:6926
      #11 0x000055ad8d702948 in lock_trx_handle_wait (trx=0x7f8f55009778)
        at storage/innobase/lock/lock0lock.cc:6950
      #12 0x000055ad8d5daa5b in innobase_kill_query (thd=0x62a0001e0208)
        at storage/innobase/handler/ha_innodb.cc:5249
      #13 0x000055ad8d0f4dc5 in kill_handlerton (thd=0x62a0001e0208, plugin=0x60200002e618, level=0x7f8f2d52d8c4)
         at sql/handler.cc:808
      #14 0x000055ad8cb695fc in plugin_foreach_with_mask (thd=0x62a0001e0208, func=0x55ad8d0f4c96 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x7f8f2d52d8c4)
         at sql/sql_plugin.cc:2398
      #15 0x000055ad8d0f4e99 in ha_kill_query (thd=0x62a0001e0208, level=THD_ABORT_ASAP)
         at sql/handler.cc:815
      #16 0x000055ad8ca7cdbc in THD::awake (this=0x62a0001e0208, state_to_set=KILL_CONNECTION_HARD)
         at sql/sql_class.cc:1710
      #17 0x000055ad8cb4dce7 in kill_one_thread (thd=0x62a0001bc208, id=135, kill_signal=KILL_QUERY_HARD, type=KILL_TYPE_ID)
         at sql/sql_parse.cc:8897
      #18 0x000055ad8cb4e669 in sql_kill (thd=0x62a0001bc208, id=135, state=KILL_QUERY_HARD, type=KILL_TYPE_ID)
         at sql/sql_parse.cc:9003
      #19 0x000055ad8cb38361 in mysql_execute_command (thd=0x62a0001bc208)
         at sql/sql_parse.cc:5472
      #20 0x000055ad8cb47f28 in mysql_parse (thd=0x62a0001bc208, rawbuf=0x62b0000c4220 "KILL QUERY @kill_id  /* E_R Thread31 QNO 771 CON_ID 288 */", length=58, parser_state=0x7f8f2d52f560, is_com_multi=false, is_next_command=false)
         at sql/sql_parse.cc:8009
      #21 0x000055ad8cb22c46 in dispatch_command (command=COM_QUERY, thd=0x62a0001bc208, packet=0x629003886209 " KILL QUERY @kill_id  /* E_R Thread31 QNO 771 CON_ID 288 */ ", packet_length=60, is_com_multi=false, is_next_command=false)
         at sql/sql_parse.cc:1824
      #22 0x000055ad8cb1fce7 in do_command (thd=0x62a0001bc208)
         at sql/sql_parse.cc:1377
      #23 0x000055ad8ce50fc2 in do_handle_one_connection (connect=0x60800000b2a8)
         at sql/sql_connect.cc:1335
      #24 0x000055ad8ce509a7 in handle_one_connection (arg=0x60800000b2a8)
         at sql/sql_connect.cc:1241
      #25 0x00007f8f612666da in start_thread (arg=0x7f8f2d531700) at pthread_create.c:456
       
      SUMMARY: AddressSanitizer: use-after-poison /work_m/bb-10.2-marko/storage/innobase/lock/lock0lock.cc:6926 in lock_trx_handle_wait_low
      Shadow bytes around the buggy address:
        0x0ff26a9f92b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
        0x0ff26a9f92c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
        0x0ff26a9f92d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
        0x0ff26a9f92e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
        0x0ff26a9f92f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 f7 f7 f7 
      =>0x0ff26a9f9300: f7 f7 f7 f7 f7 f7 f7 f7 f7[f7]f7 f7 f7 f7 f7 f7 
        0x0ff26a9f9310: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 
        0x0ff26a9f9320: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 
        0x0ff26a9f9330: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 
        0x0ff26a9f9340: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 
        0x0ff26a9f9350: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
      

      Attachments

        Issue Links

          Activity

            Sad to see 5530a93f4. My comments are based on 10.5.

            First of all it doesn't really fix the broader problem, but rather silences symptoms.

            +  if (trx_t* trx= thd_to_trx(thd))
            +  {
             ---- THE GAP ---
            +    lock_mutex_enter();
            +    trx_sys_mutex_enter();
            

            While going through THE GAP concurrent thread may release (or even free?) trx back to trx_pool and another thread may reuse it. In effect it may break innocent transaction, even worse it may break innocent transaction of another user which it is not permitted to kill.

            Secondly, trx is found via server_threads/thd->ha_data and not via trx_sys.trx_list. The latter is officially protected by trx_sys.mutex, the former is not. The fact that trx_sys.mutex helps here is pure coincidence and can in no way be considered as meaningful design.

            Thirdly, the bug is in the server and not in InnoDB. Consider the following...

            KILL thread:

            rdlock(server_threads.rwlock);
            thd= server_threads.find();
            lock(thd->LOCK_thd_kill);
            unlock(server_thread.rwlock);
            thd->awake()/.../innobase_kill_query();
            unlock(thd->LOCK_thd_kill);
            

            InnoDB thd->ha_data is reset only upon disconnect, in ha_close_connection(). Relevant pseudo code:

            wrlock(server_thread.rwlock);
            server_threads.erase(thd);
            unlock(server_thread.rwlock);
            thd->free_connection()/ha_close_connection()/innobase_close_connection()/trx_free()
            

            If you look carefully into the above, you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake(). Which is the bug. And it is partially fixed in THD::~THD(), that is destructor waits for KILL completion:

              /*
                Other threads may have a lock on LOCK_thd_kill to ensure that this
                THD is not deleted while they access it. The following mutex_lock
                ensures that no one else is using this THD and it's now safe to delete
              */
              if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data);
              mysql_mutex_lock(&LOCK_thd_kill);
              mysql_mutex_unlock(&LOCK_thd_kill);
              if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data);
            

            However this only works when THD is actually destroyed, whereas it must've been called when THD is released to the thread cache.

            If the above empty critical section would've been moved to thd->free_connection(), the whole 5530a93f4 would not be needed.

            svoj Sergey Vojtovich added a comment - Sad to see 5530a93f4. My comments are based on 10.5. First of all it doesn't really fix the broader problem, but rather silences symptoms. + if (trx_t* trx= thd_to_trx(thd)) + { ---- THE GAP --- + lock_mutex_enter(); + trx_sys_mutex_enter(); While going through THE GAP concurrent thread may release (or even free?) trx back to trx_pool and another thread may reuse it. In effect it may break innocent transaction, even worse it may break innocent transaction of another user which it is not permitted to kill. Secondly, trx is found via server_threads / thd->ha_data and not via trx_sys.trx_list . The latter is officially protected by trx_sys.mutex , the former is not. The fact that trx_sys.mutex helps here is pure coincidence and can in no way be considered as meaningful design. Thirdly, the bug is in the server and not in InnoDB. Consider the following... KILL thread: rdlock(server_threads.rwlock); thd= server_threads.find(); lock(thd->LOCK_thd_kill); unlock(server_thread.rwlock); thd->awake()/.../innobase_kill_query(); unlock(thd->LOCK_thd_kill); InnoDB thd->ha_data is reset only upon disconnect, in ha_close_connection() . Relevant pseudo code: wrlock(server_thread.rwlock); server_threads.erase(thd); unlock(server_thread.rwlock); thd->free_connection()/ha_close_connection()/innobase_close_connection()/trx_free() If you look carefully into the above, you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake() . Which is the bug. And it is partially fixed in THD::~THD() , that is destructor waits for KILL completion: /* Other threads may have a lock on LOCK_thd_kill to ensure that this THD is not deleted while they access it. The following mutex_lock ensures that no one else is using this THD and it's now safe to delete */ if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data); mysql_mutex_lock(&LOCK_thd_kill); mysql_mutex_unlock(&LOCK_thd_kill); if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data); However this only works when THD is actually destroyed, whereas it must've been called when THD is released to the thread cache. If the above empty critical section would've been moved to thd->free_connection() , the whole 5530a93f4 would not be needed.

            I must revoke my first claim because:

            • trx objects seem to be freed only on shutdown
            • even though another trx->mysql_thd == thd binding is not impossible, it can't be established while we're holding thd->LOCK_thd_kill.

            So there seem to be no practical issues with the fix, despite absence of meaningful design: the fact that KILL may not be completed up through victim THD being released to thread cache/stay cached/re-initialised for another connection sounds fragile.

            svoj Sergey Vojtovich added a comment - I must revoke my first claim because: trx objects seem to be freed only on shutdown even though another trx->mysql_thd == thd binding is not impossible, it can't be established while we're holding thd->LOCK_thd_kill . So there seem to be no practical issues with the fix, despite absence of meaningful design: the fact that KILL may not be completed up through victim THD being released to thread cache/stay cached/re-initialised for another connection sounds fragile.

            svoj, sorry, I only now noticed your comments. I agree that the homebrew memory pool is a somewhat unfortunate design (which MariaDB inherited from MySQL). We might want to revise it at some point. Maybe trx_t should be a part of the InnoDB data of THD. Maybe removing unnecessary bloat from it would fix any memory allocator overhead.

            I am not proud of my solution, but I did not want to touch any code outside InnoDB. Do you have a better solution in mind? (Note: MDEV-21910 may have changed this flow again, without fixing the reported issue yet.)

            marko Marko Mäkelä added a comment - svoj , sorry, I only now noticed your comments. I agree that the homebrew memory pool is a somewhat unfortunate design (which MariaDB inherited from MySQL). We might want to revise it at some point. Maybe trx_t should be a part of the InnoDB data of THD . Maybe removing unnecessary bloat from it would fix any memory allocator overhead. I am not proud of my solution, but I did not want to touch any code outside InnoDB. Do you have a better solution in mind? (Note: MDEV-21910 may have changed this flow again, without fixing the reported issue yet.)

            @marko, yes, there's better solution that I've explained in my previous comment. InnoDB code should stay intact (as if fix for this bug was not pushed) and the real problem should be fixed in the server.

            I'm concerned (as far as I can be in the current situation) that trx_sys.mutex (or whatever it is called now) is abused for unrelated matters.

            A healthy code that we've designed back in May (with Eugene IIRC) was supposed to encapsulate trx_sys.trx_list and trx_sys.mutex so that the latter exclusively protects the former. Now this code looks less healthy because it got infected by another sick code.

            svoj Sergey Vojtovich added a comment - @marko, yes, there's better solution that I've explained in my previous comment. InnoDB code should stay intact (as if fix for this bug was not pushed) and the real problem should be fixed in the server. I'm concerned (as far as I can be in the current situation) that trx_sys.mutex (or whatever it is called now) is abused for unrelated matters. A healthy code that we've designed back in May (with Eugene IIRC) was supposed to encapsulate trx_sys.trx_list and trx_sys.mutex so that the latter exclusively protects the former. Now this code looks less healthy because it got infected by another sick code.

            I filed MDEV-23536 for the better suggestion.

            marko Marko Mäkelä added a comment - I filed MDEV-23536 for the better suggestion.

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.