[MDEV-26781] InnoDB hangs on SPATIAL INDEX when using SUX_LOCK_GENERIC
Created: 2021-10-07  Updated: 2022-04-06  Resolved: 2022-04-06

Status: Closed
Project: MariaDB Server
Component/s: GIS, Storage Engine - InnoDB
Affects Version/s: 10.6, 10.7
Fix Version/s: 10.6.8, 10.7.4, 10.8.3, 10.9.1

Type: Bug
Priority: Major
Reporter: Marko Mäkelä
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 0
Labels: hang, regression-10.6
Environment:

Any OS for which futex support has not been implemented (not Linux, OpenBSD, Microsoft Windows)


Issue Links:
Relates
relates to MDEV-26476 InnoDB is missing futex support on so... Closed

 Description   

The test innodb_gis.rtree_purge hangs easily when built with the fallback SUX_LOCK_GENERIC implementation, which is the only option on operating systems for which we lack a futex-like interface:

cmake -DCMAKE_CXX_FLAGS=-DSUX_LOCK_GENERIC .
cmake --build .
./mtr --parallel=auto --repeat=10 innodb_gis.rtree_purge

This appears to be a deadlock involving DML and purge:

#9  mtr_t::x_lock (this=0x7f4e5212b720, file=0x55d1d4ee3728 "/mariadb/10.6/storage/innobase/btr/btr0cur.cc", line=1461, lock=0x7f4de40f7808) at /mariadb/10.6/storage/innobase/include/mtr0mtr.h:240
#10 0x000055d1d49e7e9d in btr_cur_search_to_nth_level_func (index=index@entry=0x7f4de40f7698, level=level@entry=0, tuple=tuple@entry=0x7f4de41a19b8, mode=mode@entry=PAGE_CUR_RTREE_INSERT, latch_mode=<optimized out>, latch_mode@entry=33, cursor=cursor@entry=0x7f4e5212b420, ahi_latch=0x0, mtr=0x7f4e5212b720, autoinc=0) at /mariadb/10.6/storage/innobase/btr/btr0cur.cc:1461
#33 0x000055d1d40d1aad in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7f4de4000d48, packet=0x7f4e5212d400 "\377\377\377\377", packet@entry=0x7f4de4107f59 "insert into t select @p,@p from seq_1_to_130", packet_length=packet_length@entry=44, blocking=blocking@entry=true) at /mariadb/10.6/sql/sql_parse.cc:1896

At the same time, a purge task is attempting to acquire a shared latch on the same lock object (lock=0x7f4de40f7808):

#9  0x000055d1d49e7ae1 in mtr_t::s_lock (lock=0x7f4de40f7808, line=1505, file=0x55d1d4ee3728 "/mariadb/10.6/storage/innobase/btr/btr0cur.cc", this=0x7f4e427fb170) at /mariadb/10.6/storage/innobase/include/mtr0mtr.h:229
#10 btr_cur_search_to_nth_level_func (index=index@entry=0x7f4de40f7698, level=level@entry=0, tuple=tuple@entry=0x7f4df8004b18, mode=mode@entry=PAGE_CUR_RTREE_LOCATE, latch_mode=<optimized out>, latch_mode@entry=2, cursor=cursor@entry=0x7f4e427faee0, ahi_latch=0x0, mtr=0x7f4e427fb170, autoinc=0) at /mariadb/10.6/storage/innobase/btr/btr0cur.cc:1505
#11 0x000055d1d4b045a4 in rtr_pcur_open (index=index@entry=0x7f4de40f7698, tuple=tuple@entry=0x7f4df8004b18, mode=mode@entry=PAGE_CUR_RTREE_LOCATE, latch_mode=latch_mode@entry=2, cursor=cursor@entry=0x7f4e427faee0, mtr=mtr@entry=0x7f4e427fb170) at /mariadb/10.6/storage/innobase/gis/gis0sea.cc:574
#12 0x000055d1d491e118 in row_search_index_entry (index=index@entry=0x7f4de40f7698, entry=entry@entry=0x7f4df8004b18, mode=mode@entry=2, pcur=pcur@entry=0x7f4e427faee0, mtr=mtr@entry=0x7f4e427fb170) at /mariadb/10.6/storage/innobase/row/row0row.cc:1300
#13 0x000055d1d490de0c in row_purge_remove_sec_if_poss_leaf (node=node@entry=0x55d1d693c068, index=index@entry=0x7f4de40f7698, entry=entry@entry=0x7f4df8004b18) at /mariadb/10.6/storage/innobase/row/row0purge.cc:524

With the futex-based implementation this works fine. That is, MariaDB running on Linux, OpenBSD, or Microsoft Windows should not be affected.

An easy fix could be to compose the ssux_lock out of 2 std::atomic fields (like in MDEV-25404) also when using SUX_LOCK_GENERIC.
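
To illustrate the idea of a lock composed of two std::atomic fields, here is a minimal sketch. It is not the MariaDB ssux_lock code: the type two_word_rw_lock, its member functions, and the WRITER flag are hypothetical names, only non-blocking try-lock operations are shown, and the update (U) mode of a real SUX lock is omitted.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical, simplified illustration of a reader/writer lock that is
// composed of two 32-bit atomic words. Not the MariaDB implementation.
class two_word_rw_lock
{
  // Assumed encoding: the top bit of 'readers' marks an exclusive holder.
  static constexpr uint32_t WRITER = 1U << 31;

  std::atomic<uint32_t> writer{0};   // 0 = free, 1 = held by a writer
  std::atomic<uint32_t> readers{0};  // reader count, or WRITER when exclusive

public:
  // Try to acquire a shared lock; fails if a writer holds the lock.
  bool try_rd_lock()
  {
    uint32_t r = readers.load(std::memory_order_relaxed);
    while (!(r & WRITER))
    {
      if (readers.compare_exchange_weak(r, r + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
        return true;
    }
    return false;
  }

  void rd_unlock() { readers.fetch_sub(1, std::memory_order_release); }

  // Try to acquire an exclusive lock: first the writer word, then
  // claim the readers word if and only if no reader is present.
  bool try_wr_lock()
  {
    uint32_t w = 0;
    if (!writer.compare_exchange_strong(w, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
      return false;
    uint32_t r = 0;
    if (readers.compare_exchange_strong(r, WRITER,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
      return true;
    // Readers are present: back out so that other writers may try.
    writer.store(0, std::memory_order_release);
    return false;
  }

  void wr_unlock()
  {
    readers.store(0, std::memory_order_release);
    writer.store(0, std::memory_order_release);
  }
};
```

Keeping the writer state and the reader count in separate words means that each word can have its own wait queue, which is what a blocking variant of this sketch would need.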



 Comments   
Comment by Marko Mäkelä [ 2022-04-06 ]

The buggy implementation would occasionally cause hangs, not only in SPATIAL INDEX tests but also in other functionality. On FreeBSD, before MDEV-26476 was implemented, an encryption test would hang.

This was fixed by implementing a minimal layer to emulate futex:

  • The ssux_lock::writer will always be srw_mutex_impl. For the fallback implementation, srw_mutex_impl::mutex and srw_mutex_impl::cond will provide a wait queue for the 32-bit field srw_mutex_impl::lock.
  • The wait queue for the 32-bit field ssux_lock::readers will be provided by a new condition variable ssux_lock::readers_cond that will be used together with writer.mutex.
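
The general technique of providing a wait queue for a 32-bit word with a mutex and a condition variable can be sketched as follows. This is an assumption-laden illustration, not the actual srw_mutex_impl code: the type emulated_futex_word and the functions wait() and store_and_wake() are hypothetical names chosen for this example.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical illustration of emulating futex-style wait/wake on a
// 32-bit word with a mutex and a condition variable.
struct emulated_futex_word
{
  std::atomic<uint32_t> word{0};
  std::mutex mutex;              // serializes waiters and wakers
  std::condition_variable cond;  // acts as the wait queue for 'word'

  // Block for as long as 'word' still holds 'expected'
  // (roughly what a futex wait operation provides).
  void wait(uint32_t expected)
  {
    std::unique_lock<std::mutex> guard(mutex);
    cond.wait(guard, [&] {
      return word.load(std::memory_order_acquire) != expected;
    });
  }

  // Publish a new value and wake every waiter
  // (roughly what a futex wake operation provides).
  void store_and_wake(uint32_t value)
  {
    {
      // Updating the word while holding the mutex ensures that a waiter
      // cannot check the old value and then miss the notification.
      std::lock_guard<std::mutex> guard(mutex);
      word.store(value, std::memory_order_release);
    }
    cond.notify_all();
  }
};
```

In this sketch, one thread would call wait(old_value) while another thread calls store_and_wake(new_value). Because the predicate is rechecked under the mutex, spurious wakeups are harmless and a wake-up cannot be lost between the check and the sleep.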

Compared to the old buggy implementation, sizeof(ssux_lock) and sizeof(sux_lock) will increase by 4 bytes. In the buggy implementation, there was only one 32-bit lock word, while now there are two: writer and readers.
