[MDEV-30227] [ERROR] [FATAL] InnoDB: fdatasync() returned 9 Created: 2022-12-14  Updated: 2022-12-15  Resolved: 2022-12-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.11.2, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, race, regression-10.5

Issue Links:
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed

 Description   

origin/bb-10.11-new-innodb-defaults a781237172f6a904a0a9512a64d5686288735605 2022-12-13T09:52:24+02:00
Per Marko: The problems affects releases >= 10.5
 
fbwt_value:
    OFF |
    ON ;
query:
    FLUSH TABLE _table FOR EXPORT |
    SET GLOBAL innodb_data_file_buffering = fbwt_value ;
query_init:
    ;
thread_connect:
    ;

After some time the server aborts with

  1. 2022-12-14T04:43:42 [1233099] | [rr 1234348 53137]2022-12-14 4:41:52 22 [ERROR] [FATAL] InnoDB: fdatasync() returned 9
  2. 2022-12-14T04:43:42 [1233099] | [rr 1234348 53153]221214 4:41:52 [rr 1234348 53156][ERROR] mysqld got signal 6 ;
    (rr) bt
    #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=109951340336704) at ./nptl/pthread_kill.c:44
    #1 __pthread_kill_internal (signo=6, threadid=109951340336704) at ./nptl/pthread_kill.c:78
    #2 GI_pthread_kill (threadid=109951340336704, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
    #3 0x0000151d40946476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4 0x0000151d4092c7f3 in __GI_abort () at ./stdlib/abort.c:79
    #5 0x00005621f92b04da in ib::fatal::~fatal (this=this@entry=0x64000a951390, __in_chrg=<optimized out>) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/ut/ut0ut.cc:527
    #6 0x00005621f8f812c6 in os_file_sync_posix (file=file@entry=54) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/os/os0file.cc:876
    #7 0x00005621f8f85ab4 in os_file_flush_func (file=54) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/os/os0file.cc:940
    #8 0x00005621f94ce80f in fil_space_t::flush_low (this=this@entry=0x61200006cf40) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/fil/fil0fil.cc:514
    #9 0x00005621f91cd39d in fil_space_t::flush<false> (this=0x61200006cf40) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/include/fil0fil.h:1521
    #10 0x00005621f91cbf08 in row_quiesce_table_start (table=0x618000046108, trx=trx@entry=0x4a873573b540) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/row/row0quiesce.cc:531
    #11 0x00005621f8d992af in ha_innobase::external_lock (this=0x61d00025c6b8, thd=<optimized out>, lock_type=<optimized out>) at /data/Server/bb-10.11-new-innodb-defaults/storage/innobase/handler/ha_innodb.cc:15876
    #12 0x00005621f837ca0a in handler::ha_external_lock (this=0x61d00025c6b8, thd=thd@entry=0x62b00020d218, lock_type=lock_type@entry=0) at /data/Server/bb-10.11-new-innodb-defaults/sql/handler.cc:7094
    #13 0x00005621f88ce719 in lock_external (thd=thd@entry=0x62b00020d218, tables=0x606000196078, count=1) at /data/Server/bb-10.11-new-innodb-defaults/sql/lock.cc:396
    #14 0x00005621f88cf628 in mysql_lock_tables (thd=thd@entry=0x62b00020d218, sql_lock=sql_lock@entry=0x606000196058, flags=flags@entry=4096) at /data/Server/bb-10.11-new-innodb-defaults/sql/lock.cc:341
    #15 0x00005621f88d10bc in mysql_lock_tables (thd=thd@entry=0x62b00020d218, tables=tables@entry=0x629000e88ab0, count=count@entry=1, flags=flags@entry=4096) at /data/Server/bb-10.11-new-innodb-defaults/sql/lock.cc:304
    #16 0x00005621f7883dbe in lock_tables (thd=thd@entry=0x62b00020d218, tables=0x629000e88380, count=<optimized out>, flags=flags@entry=4096) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_base.cc:5818
    #17 0x00005621f7886d48 in open_and_lock_tables (thd=thd@entry=0x62b00020d218, options=..., tables=tables@entry=0x629000e88380, derived=derived@entry=false, flags=flags@entry=4096, prelocking_strategy=prelocking_strategy@entry=0x64000a9527c0)
        at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_base.cc:5578
    #18 0x00005621f7f4159e in open_and_lock_tables (prelocking_strategy=0x64000a9527c0, flags=4096, derived=false, tables=0x629000e88380, thd=0x62b00020d218) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_base.h:279
    #19 flush_tables_with_read_lock (thd=thd@entry=0x62b00020d218, all_tables=0x629000e88380) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_reload.cc:588
    #20 0x00005621f7a361e7 in mysql_execute_command (thd=thd@entry=0x62b00020d218, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_parse.cc:5416
    #21 0x00005621f7a3befa in mysql_parse (thd=thd@entry=0x62b00020d218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x64000a953610) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_parse.cc:8000
    #22 0x00005621f7a42736 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62b00020d218, packet=packet@entry=0x629000e7e219 "FLUSH TABLE C FOR EXPORT /* E_R Thread6 QNO 8 CON_ID 22 */ ", packet_length=packet_length@entry=61,
        blocking=blocking@entry=true) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_parse.cc:1894
    #23 0x00005621f7a47795 in do_command (thd=0x62b00020d218, blocking=blocking@entry=true) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_parse.cc:1407
    #24 0x00005621f7eafd3e in do_handle_one_connection (connect=<optimized out>, connect@entry=0x608000003238, put_in_cache=put_in_cache@entry=true) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_connect.cc:1415
    #25 0x00005621f7eb03a4 in handle_one_connection (arg=0x608000003238) at /data/Server/bb-10.11-new-innodb-defaults/sql/sql_connect.cc:1317
    #26 0x0000151d40998b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
    #27 0x0000151d40a29bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
  1. GIT_SHOW: HEAD -> master, origin/master, origin/HEAD 572309e7c8e1561daea5903bb707591cef89e1ee 2022-12-13T19:50:23+01:00
  2. rqg.pl : Version 4.2.1 (2022-12)
    #
  3. $RQG_HOME/rqg.pl \
  4. --duration=300 \
  5. --queries=10000000 \
  6. --no_mask \
  7. --seed=random \
  8. --gendata \
  9. --engine=InnoDB \
  10. --rpl_mode=none \
  11. --max_gd_duration=1200 \
  12. -mysqld=-connect_timeout=60 \
  13. -mysqld=-interactive_timeout=28800 \
  14. -mysqld=-log-output=none \
  15. -mysqld=-loose-idle_write_transaction_timeout=0 \
  16. -mysqld=-innodb-buffer-pool-size=256M \
  17. -mysqld=-innodb_page_size=4K \
  18. -mysqld=-slave_net_timeout=60 \
  19. -mysqld=-loose_innodb_lock_schedule_algorithm=fcfs \
  20. -mysqld=-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
  21. -mysqld=-loose-plugin-load-add=provider_lz4.so \
  22. -mysqld=-log-bin \
  23. -mysqld=-net_read_timeout=30 \
  24. -mysqld=-wait_timeout=28800 \
  25. -mysqld=-loose-innodb_read_only_compressed=OFF \
  26. -mysqld=-net_write_timeout=60 \
  27. -mysqld=-loose-idle_transaction_timeout=0 \
  28. -mysqld=-loose-idle_readonly_transaction_timeout=0 \
  29. -mysqld=-log_bin_trust_function_creators=1 \
  30. -mysqld=-loose-innodb_fatal_semaphore_wait_threshold=300 \
  31. -mysqld=-innodb-lock-wait-timeout=50 \
  32. -mysqld=-loose-table_lock_wait_timeout=50 \
  33. -mysqld=-plugin-load-add=file_key_management.so \
  34. -mysqld=-lock-wait-timeout=86400 \
  35. -mysqld=-sync-binlog=1 \
  36. -mysqld=-loose-debug_assert_on_not_freed_memory=0 \
  37. -mysqld=-loose-max-statement-time=30 \
  38. --reporters=Backtrace,Deadlock,None \
  39. --validators=None \
  40. --threads=10 \
  41. --grammar=<see above> \
  42. <local settings>
    (rr)


 Comments   
Comment by Marko Mäkelä [ 2022-12-14 ]

Thank you, mleich. I believe that this bug was introduced by MDEV-23855. A race condition with unlink() or DROP TABLE was fixed in MDEV-24391, but this one is a race condition between close() and fdatasync(). Can you please check if the following patch would fix this?

diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h
index f44aae05bb3..00486d3a7d8 100644
--- a/storage/innobase/include/fil0fil.h
+++ b/storage/innobase/include/fil0fil.h
@@ -1516,7 +1516,7 @@ template<bool have_reference> inline void fil_space_t::flush()
   ut_ad(purpose == FIL_TYPE_TABLESPACE || purpose == FIL_TYPE_IMPORT);
   if (have_reference)
     flush_low();
-  else if (!(acquire_low() & STOPPING))
+  else if (!(acquire_low() & (STOPPING | CLOSING)))
   {
     flush_low();
     release();

Your test case depends on MDEV-30136, which will cause all data files to be closed and reopened on the SET GLOBAL statement. I think that the race condition is possible, albeit almost impossible to reproduce, in earlier versions as well. Files could also be closed due to innodb_open_files (see MDEV-25215 for some discussion).

Comment by Marko Mäkelä [ 2022-12-14 ]

The motivation of the fix is that any thread that has expressed its intention to close the file will be responsible for invoking os_file_flush_func() on it. Hence, if that flag is set, the current thread can rely on the other thread to take care of the flushing. In the rr replay trace that I analyzed, the other thread that had closed the file immediately before the crash occurred had also invoked fdatasync() on it.

Comment by Marko Mäkelä [ 2022-12-14 ]

Sorry, it looks like the patch would leak a reference. We need to release() if either flag was set. I guess that in the STOPPING case (deleting the tablespace), nothing was checking for the reference count to be 0 afterwards.

diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h
index f44aae05bb3..b26b16965fc 100644
--- a/storage/innobase/include/fil0fil.h
+++ b/storage/innobase/include/fil0fil.h
@@ -1516,9 +1516,10 @@ template<bool have_reference> inline void fil_space_t::flush()
   ut_ad(purpose == FIL_TYPE_TABLESPACE || purpose == FIL_TYPE_IMPORT);
   if (have_reference)
     flush_low();
-  else if (!(acquire_low() & STOPPING))
+  else
   {
-    flush_low();
+    if (!(acquire_low() & (STOPPING | CLOSING)))
+      flush_low();
     release();
   }
 }

Comment by Matthias Leich [ 2022-12-15 ]

origin/bb-10.11-new-innodb-defaults 5505ded98c35499d3cd130e8a3373389d76069e9 2022-12-14T15:42:50+02:00
with the patch from above applied.
Result of ~1800 runs of the simplified RQG tests:
No replay of the bad effect and also no other bad effect.
Without the patch I get a replay with less than 100 test runs.

Generated at Thu Feb 08 10:14:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.