Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33011

mariabackup --backup: FATAL ERROR: ... Can't open datafile cool_down/t3

Details

    Description

      The problem was found on
         origin/10.6-MDEV-32939 028a35b50640fe94e659d58fe8c7241af2696e89
      but Marko means its not caused by the modifications for 10.6-MDEV-32939.
       
      Scenario:
      1. Start the source DB server, generate some unimportant initial data
      2. One connection runs a random mix of
           - CREATE TABLE <table> IF NOT EXISTS ...
           - RENAME TABLE <table>  TO cool_down . <table>
           Please note that the schema "cool_down" was never created.
      3. During 2. is ongoing some "mariabackup --backup" connects to the source DB server
           and wants to backup data to some other location.
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19322][00] 2023-12-13 13:53:51 Finished backing up non-InnoDB tables and files
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19406][01] 2023-12-13 13:53:51 Copying /data/results/1702475558/8/1/data//aria_log.00000001 to /data/results/1702475558/8/1_clone/data/aria_log.00000001
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19426][01] 2023-12-13 13:53:51         ...done
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19508][01] 2023-12-13 13:53:51 Copying /data/results/1702475558/8/1/data//aria_log_control to /data/results/1702475558/8/1_clone/data/aria_log_control
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19528][01] 2023-12-13 13:53:51         ...done
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19855][00] 2023-12-13 13:53:51 Waiting for log copy thread to read lsn 80003
      # 2023-12-13T13:53:51 [1677124] | [rr 1712197 19954][00] FATAL ERROR: 2023-12-13 13:53:51 Can't open datafile cool_down/t3
       
      It wonders me that some
           RENAME <existing table> TO <not existing schema> . <tablename> which should fail during parsing/check prerequisites
           gets memorized by the system at all.
           And if at all than why as something which was executed with success/needs attention when making a backup?
       
      pluto:/data/results/1702475558/MB-1270$ _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio # mariabackup --backup --> Can't open datafile cool_down/t3
      pluto:/data/results/1702475558/MB-1270$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio # source DB server
       
      RQG
      ------
      # git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
      #
      # GIT_SHOW: HEAD -> master, origin/master, origin/HEAD 1ed8eb38d6f08d76cb64aebdec71e668389df20f 2023-12-04T18:55:11+01:00
      # rqg.pl  : Version 4.4.0 (2023-08)
      #
      # $RQG_HOME/rqg.pl \
      # --duration=300 \
      # --queries=10000000 \
      # --no_mask \
      # --seed=random \
      # --engine=InnoDB \
      # --gendata_sql=conf/mariadb/table_stress.sql \
      # --gendata=conf/mariadb/table_stress.zz \
      # --max_gd_duration=1200 \
      # --rpl_mode=none \
      # --mysqld=--plugin-load-add=file_key_management.so \
      # --mysqld=--innodb-buffer-pool-size=5M \
      # --mysqld=--lock-wait-timeout=86400 \
      # --mysqld=--loose-table_lock_wait_timeout=50 \
      # --mysqld=--file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      # --mysqld=--log-output=none \
      # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      # --mysqld=--wait_timeout=28800 \
      # --mysqld=--connect_timeout=60 \
      # --mysqld=--net_write_timeout=60 \
      # --mysqld=--innodb-lock-wait-timeout=50 \
      # --mysqld=--innodb_page_size=4K \
      # --mysqld=--interactive_timeout=28800 \
      # --mysqld=--slave_net_timeout=60 \
      # --mysqld=--log-bin \
      # --mysqld=--sync-binlog=1 \
      # --mysqld=--loose-plugin-load-add=provider_lz4.so \
      # --mysqld=--net_read_timeout=30 \
      # --mysqld=--loose-max-statement-time=30 \
      # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      # --mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
      # --mysqld=--loose-innodb_read_only_compressed=OFF \
      # --mysqld=--loose-idle_write_transaction_timeout=0 \
      # --mysqld=--sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION \
      # --mysqld=--loose-innodb-log-file-size=200M \
      # --mysqld=--loose-idle_transaction_timeout=0 \
      # --mysqld=--log_bin_trust_function_creators=1 \
      # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      # --reporters=Deadlock,Mariabackup_linux,None \
      # --validators=None \
      # --grammar=MB-1270.yy \
      # --threads=1 \
      # <local settings>
       
       
      
      

      Attachments

        1. 10.5.patch
          7 kB
        2. 10.6.patch
          8 kB
        3. MB-1270.cfg
          45 kB
        4. MB-1270.yy
          0.5 kB

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            As far as I can tell, this only requires the following to be executed on the server:

            CREATE DATABASE test;
            CREATE TABLE test.t3 …;
            -- fails because of non-existing schema cool_down
            RENAME TABLE test.t3 TO cool_down.t3;
            

            Exactly at LSN 80003, the table was renamed back to test.t3.

            marko Marko Mäkelä added a comment - - edited As far as I can tell, this only requires the following to be executed on the server: CREATE DATABASE test; CREATE TABLE test.t3 …; -- fails because of non-existing schema cool_down RENAME TABLE test.t3 TO cool_down.t3; Exactly at LSN 80003, the table was renamed back to test.t3 .

            There indeed was no mkdir call for the schema cool_down in the server execution trace. I only saw one for CREATE SCHEMA IF NOT EXISTS rqg. Bootstrap invoked mkdir in mysql_create_db() for the schemata test, mysql, and performance_schema, in that order. I assume that an external process had created the directory. MariaDB or InnoDB does not keep much track of schema directories. The only thing I know of is the db.opt file that will be accessed by the SQL layer.

            LSN 80003 corresponds to the following (empty after rollback) transaction commit:

            #8  0x000055ef8602e6a0 in trx_t::rollback_finish (this=0x22fb58c86940) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:66
            #9  trx_t::rollback_low (this=this@entry=0x22fb58c86940, savept=savept@entry=0x0) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:133
            #10 0x000055ef8602512f in trx_t::rollback (this=this@entry=0x22fb58c86940, savept=savept@entry=0x0) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:176
            #11 0x000055ef85eba90f in row_rename_table_for_mysql (old_name=old_name@entry=0x54162fbd0140 "test/t3", new_name=new_name@entry=0x54162fbcff00 "cool_down/t3", trx=trx@entry=0x22fb58c86940, 
                use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/storage/innobase/row/row0mysql.cc:2846
            #12 0x000055ef85aa4c40 in innobase_rename_table (trx=trx@entry=0x22fb58c86940, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3", use_fk=use_fk@entry=true)
                at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:13824
            #13 0x000055ef85aee415 in ha_innobase::rename_table (this=<optimized out>, from=<optimized out>, to=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:14230
            #14 0x000055ef850e3fc5 in handler::ha_rename_table (this=this@entry=0x62b0001275d8, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3")
                at /data/Server/10.6-MDEV-32939_3/sql/handler.cc:5393
            #15 0x000055ef84abf636 in mysql_rename_table (base=base@entry=0x615000002118, old_db=old_db@entry=0x62b000126408, old_name=old_name@entry=0x54162fbd1f20, new_db=new_db@entry=0x62b000126b38, 
                new_name=new_name@entry=0x54162fbd1f30, id=id@entry=0x54162fbd1f40, flags=<optimized out>) at /data/Server/10.6-MDEV-32939_3/sql/sql_table.cc:5055
            

            So, we indeed failed to rename the file from test/t3.ibd to cool_down/t3.ibd: rename returned -1 and set errno=ENOENT (No such file or directory):

            #0  rename (old=0x602000019310 "./test/t3.ibd", new=0x6030001425d0 "./cool_down/t3.ibd") at ../sysdeps/unix/sysv/linux/rename.c:27
            #1  0x000055ef85d42081 in os_file_rename_func (oldpath=oldpath@entry=0x602000019310 "./test/t3.ibd", newpath=newpath@entry=0x6030001425d0 "./cool_down/t3.ibd")
                at /data/Server/10.6-MDEV-32939_3/storage/innobase/os/os0file.cc:1432
            #2  0x000055ef85d18459 in mtr_t::commit_file (this=this@entry=0x54162fbcd530, space=..., name=name@entry=0x6030001425d0 "./cool_down/t3.ibd")
                at /data/Server/10.6-MDEV-32939_3/storage/innobase/mtr/mtr0mtr.cc:386
            #3  0x000055ef862f32f3 in fil_space_t::rename (this=0x6120000e0a40, path=path@entry=0x6030001425d0 "./cool_down/t3.ibd", log=log@entry=true, replace=replace@entry=false)
                at /data/Server/10.6-MDEV-32939_3/storage/innobase/fil/fil0fil.cc:1894
            #4  0x000055ef8623d46b in dict_table_t::rename_tablespace (this=this@entry=0x618000070908, new_name=..., replace=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/dict/dict0dict.cc:1458
            #5  0x000055ef8625b95b in dict_table_rename_in_cache (table=table@entry=0x618000070908, new_name=..., replace_new_file=replace_new_file@entry=false) at /data/Server/10.6-MDEV-32939_3/include/span.h:121
            #6  0x000055ef85ebac34 in row_rename_table_for_mysql (old_name=old_name@entry=0x54162fbd0140 "test/t3", new_name=new_name@entry=0x54162fbcff00 "cool_down/t3", trx=trx@entry=0x22fb58c86940, 
                use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/include/span.h:66
            #7  0x000055ef85aa4c40 in innobase_rename_table (trx=trx@entry=0x22fb58c86940, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3", use_fk=use_fk@entry=true)
                at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:13824
            #8  0x000055ef85aee415 in ha_innobase::rename_table (this=<optimized out>, from=<optimized out>, to=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:14230
            #9  0x000055ef850e3fc5 in handler::ha_rename_table (this=this@entry=0x62b0001275d8, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3")
                at /data/Server/10.6-MDEV-32939_3/sql/handler.cc:5393
            #10 0x000055ef84abf636 in mysql_rename_table (base=base@entry=0x615000002118, old_db=old_db@entry=0x62b000126408, old_name=old_name@entry=0x54162fbd1f20, new_db=new_db@entry=0x62b000126b38, 
                new_name=new_name@entry=0x54162fbd1f30, id=id@entry=0x54162fbd1f40, flags=<optimized out>) at /data/Server/10.6-MDEV-32939_3/sql/sql_table.cc:5055
            

            The actual bug seems to be that we write a FILE_RENAME record for attempting to rename the file (which we must do according to the rules of write-ahead logging), but nothing for "renaming it back" when we roll back the operation.

            marko Marko Mäkelä added a comment - There indeed was no mkdir call for the schema cool_down in the server execution trace. I only saw one for CREATE SCHEMA IF NOT EXISTS rqg . Bootstrap invoked mkdir in mysql_create_db() for the schemata test , mysql , and performance_schema , in that order. I assume that an external process had created the directory. MariaDB or InnoDB does not keep much track of schema directories. The only thing I know of is the db.opt file that will be accessed by the SQL layer. LSN 80003 corresponds to the following (empty after rollback) transaction commit: #8 0x000055ef8602e6a0 in trx_t::rollback_finish (this=0x22fb58c86940) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:66 #9 trx_t::rollback_low (this=this@entry=0x22fb58c86940, savept=savept@entry=0x0) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:133 #10 0x000055ef8602512f in trx_t::rollback (this=this@entry=0x22fb58c86940, savept=savept@entry=0x0) at /data/Server/10.6-MDEV-32939_3/storage/innobase/trx/trx0roll.cc:176 #11 0x000055ef85eba90f in row_rename_table_for_mysql (old_name=old_name@entry=0x54162fbd0140 "test/t3", new_name=new_name@entry=0x54162fbcff00 "cool_down/t3", trx=trx@entry=0x22fb58c86940, use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/storage/innobase/row/row0mysql.cc:2846 #12 0x000055ef85aa4c40 in innobase_rename_table (trx=trx@entry=0x22fb58c86940, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3", use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:13824 #13 0x000055ef85aee415 in ha_innobase::rename_table (this=<optimized out>, from=<optimized out>, to=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:14230 #14 0x000055ef850e3fc5 in handler::ha_rename_table (this=this@entry=0x62b0001275d8, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3") at /data/Server/10.6-MDEV-32939_3/sql/handler.cc:5393 #15 0x000055ef84abf636 in mysql_rename_table (base=base@entry=0x615000002118, old_db=old_db@entry=0x62b000126408, old_name=old_name@entry=0x54162fbd1f20, new_db=new_db@entry=0x62b000126b38, new_name=new_name@entry=0x54162fbd1f30, id=id@entry=0x54162fbd1f40, flags=<optimized out>) at /data/Server/10.6-MDEV-32939_3/sql/sql_table.cc:5055 So, we indeed failed to rename the file from test/t3.ibd to cool_down/t3.ibd : rename returned -1 and set errno=ENOENT (No such file or directory): #0 rename (old=0x602000019310 "./test/t3.ibd", new=0x6030001425d0 "./cool_down/t3.ibd") at ../sysdeps/unix/sysv/linux/rename.c:27 #1 0x000055ef85d42081 in os_file_rename_func (oldpath=oldpath@entry=0x602000019310 "./test/t3.ibd", newpath=newpath@entry=0x6030001425d0 "./cool_down/t3.ibd") at /data/Server/10.6-MDEV-32939_3/storage/innobase/os/os0file.cc:1432 #2 0x000055ef85d18459 in mtr_t::commit_file (this=this@entry=0x54162fbcd530, space=..., name=name@entry=0x6030001425d0 "./cool_down/t3.ibd") at /data/Server/10.6-MDEV-32939_3/storage/innobase/mtr/mtr0mtr.cc:386 #3 0x000055ef862f32f3 in fil_space_t::rename (this=0x6120000e0a40, path=path@entry=0x6030001425d0 "./cool_down/t3.ibd", log=log@entry=true, replace=replace@entry=false) at /data/Server/10.6-MDEV-32939_3/storage/innobase/fil/fil0fil.cc:1894 #4 0x000055ef8623d46b in dict_table_t::rename_tablespace (this=this@entry=0x618000070908, new_name=..., replace=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/dict/dict0dict.cc:1458 #5 0x000055ef8625b95b in dict_table_rename_in_cache (table=table@entry=0x618000070908, new_name=..., replace_new_file=replace_new_file@entry=false) at /data/Server/10.6-MDEV-32939_3/include/span.h:121 #6 0x000055ef85ebac34 in row_rename_table_for_mysql (old_name=old_name@entry=0x54162fbd0140 "test/t3", new_name=new_name@entry=0x54162fbcff00 "cool_down/t3", trx=trx@entry=0x22fb58c86940, use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/include/span.h:66 #7 0x000055ef85aa4c40 in innobase_rename_table (trx=trx@entry=0x22fb58c86940, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3", use_fk=use_fk@entry=true) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:13824 #8 0x000055ef85aee415 in ha_innobase::rename_table (this=<optimized out>, from=<optimized out>, to=<optimized out>) at /data/Server/10.6-MDEV-32939_3/storage/innobase/handler/ha_innodb.cc:14230 #9 0x000055ef850e3fc5 in handler::ha_rename_table (this=this@entry=0x62b0001275d8, from=from@entry=0x54162fbd0f80 "./test/t3", to=to@entry=0x54162fbd11c0 "./cool_down/t3") at /data/Server/10.6-MDEV-32939_3/sql/handler.cc:5393 #10 0x000055ef84abf636 in mysql_rename_table (base=base@entry=0x615000002118, old_db=old_db@entry=0x62b000126408, old_name=old_name@entry=0x54162fbd1f20, new_db=new_db@entry=0x62b000126b38, new_name=new_name@entry=0x54162fbd1f30, id=id@entry=0x54162fbd1f40, flags=<optimized out>) at /data/Server/10.6-MDEV-32939_3/sql/sql_table.cc:5055 The actual bug seems to be that we write a FILE_RENAME record for attempting to rename the file (which we must do according to the rules of write-ahead logging), but nothing for "renaming it back" when we roll back the operation.

            The root cause is the WAL logging of file operation when the actual operation fails afterwards. It creates a situation with a log entry for a operation that would always fail. I could simulate both the backup scenario error and Innodb recovery failure exploiting the weakness. The error during backup repeats in 10.6 but the base issue is also present is 10.5 which eventually fails at backup --prepare stage. Also, the recovery issue happens with 10.5, 10.6 10.11 all the same.

            1. backup test

            SET DEBUG_DBUG="+d,checkpoint_after_file_create";
            CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB;
            INSERT INTO t1 VALUES(1);
             
            --error ER_ERROR_ON_RENAME
            RENAME TABLE t1 TO non_existent_schema.t1;
             
            exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir;
            exec $XTRABACKUP --prepare --target-dir=$targetdir;
            

            2. Recovery test

            CREATE TABLE t1(col1 int PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES(100);
             
            --error ER_ERROR_ON_RENAME
            RENAME TABLE t1 TO non_existent_schema.t1;
             
            SET GLOBAL innodb_fast_shutdown=2;
            --source include/restart_mysqld.inc
             
            SELECT * FROM t1;
            DROP TABLE t1;
            
            

            The WAL guarantees Atomicity of mini-transactions and Serializability in Innodb. One interesting point to note here is for the file operations we don't need WAL for ACID as operations like create, drop, rename can be considered atomic and the operations done on the dependent tablespaces are serialized. The main reason we log these operations are for tools like backup and PITR and other operations which would use the redo logs as way to roll forward from a previous snapshot of the database. Now, this requirement doesn't necessitates WAL i.e. that is the log has to be written before the operation. It should actually be fine to log the operations after it completed successfully as long as it is done before the base transaction commits. In fact a lot of WAL complexities would not be required. However, this would deviate from the current design we follow in MariaDB for FILE create/delete/rename and I don't intend to induce such changes as part of bug-fix. I mention it over here to keep a note of the same and mention as a viable design.

            Since we are following WAL for file operations, I checked how it is working today in MariaDB.

            1. We need to write and sync the log before the actual operation is persisted by Resource Manager (RM). This looks fine in code.

            2. Checkpoint must consult RM that the operation is complete before moving checkpoint LSN (TAIL) ahead. It could be tricky and right now I see the operation blocks checkpoint as
            an work around which could affect the system performance but is functionally correct.

            3. Once logged the operation must eventually complete successfully, or it is a major catastrophe. Right now, we fail for rename and handle it as normal error and it is the problem.

            4. The WAL replay must be idempotent or able to ignore if operation is already done by RM. This is usually done by comparing the Page LSN but somewhat tricky for file operation. Currently the operations are accumulated till end and looks to be a workable solution.

            I created a patch to address (3) for RENAME operation for the specific case where the destination schema name was missing. The patch checks for the missing schema before logging in an attempt to avoid the failure after WAL log is written/flushed. I also checked that the schema cannot be dropped or there cannot be any race with other rename to the same file. This is protected by the MDL lock in SQL today.

            The patch should this be a good improvement over the current situation and solves the issue at hand. I am attaching here the patch for 10.6 and 10.5 which are slightly different due to function and code organization but the solution is same.

            • 10.5.patch
            • 10.6.patch
            debarun Debarun Banerjee added a comment - The root cause is the WAL logging of file operation when the actual operation fails afterwards. It creates a situation with a log entry for a operation that would always fail. I could simulate both the backup scenario error and Innodb recovery failure exploiting the weakness. The error during backup repeats in 10.6 but the base issue is also present is 10.5 which eventually fails at backup --prepare stage. Also, the recovery issue happens with 10.5, 10.6 10.11 all the same. 1. backup test SET DEBUG_DBUG= "+d,checkpoint_after_file_create" ; CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB; INSERT INTO t1 VALUES(1);   --error ER_ERROR_ON_RENAME RENAME TABLE t1 TO non_existent_schema.t1;   exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir; exec $XTRABACKUP --prepare --target-dir=$targetdir; 2. Recovery test CREATE TABLE t1(col1 int PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES(100);   --error ER_ERROR_ON_RENAME RENAME TABLE t1 TO non_existent_schema.t1;   SET GLOBAL innodb_fast_shutdown=2; --source include/restart_mysqld.inc   SELECT * FROM t1; DROP TABLE t1; The WAL guarantees Atomicity of mini-transactions and Serializability in Innodb. One interesting point to note here is for the file operations we don't need WAL for ACID as operations like create, drop, rename can be considered atomic and the operations done on the dependent tablespaces are serialized. The main reason we log these operations are for tools like backup and PITR and other operations which would use the redo logs as way to roll forward from a previous snapshot of the database. Now, this requirement doesn't necessitates WAL i.e. that is the log has to be written before the operation. It should actually be fine to log the operations after it completed successfully as long as it is done before the base transaction commits. In fact a lot of WAL complexities would not be required. However, this would deviate from the current design we follow in MariaDB for FILE create/delete/rename and I don't intend to induce such changes as part of bug-fix. I mention it over here to keep a note of the same and mention as a viable design. Since we are following WAL for file operations, I checked how it is working today in MariaDB. 1. We need to write and sync the log before the actual operation is persisted by Resource Manager (RM). This looks fine in code. 2. Checkpoint must consult RM that the operation is complete before moving checkpoint LSN (TAIL) ahead. It could be tricky and right now I see the operation blocks checkpoint as an work around which could affect the system performance but is functionally correct. 3. Once logged the operation must eventually complete successfully, or it is a major catastrophe. Right now, we fail for rename and handle it as normal error and it is the problem. 4. The WAL replay must be idempotent or able to ignore if operation is already done by RM. This is usually done by comparing the Page LSN but somewhat tricky for file operation. Currently the operations are accumulated till end and looks to be a workable solution. I created a patch to address (3) for RENAME operation for the specific case where the destination schema name was missing. The patch checks for the missing schema before logging in an attempt to avoid the failure after WAL log is written/flushed. I also checked that the schema cannot be dropped or there cannot be any race with other rename to the same file. This is protected by the MDL lock in SQL today. The patch should this be a good improvement over the current situation and solves the issue at hand. I am attaching here the patch for 10.6 and 10.5 which are slightly different due to function and code organization but the solution is same. 10.5.patch 10.6.patch

            Hi Marko,

            Please see if the patch makes sense.
            https://github.com/MariaDB/server/pull/3081

            debarun Debarun Banerjee added a comment - Hi Marko, Please see if the patch makes sense. https://github.com/MariaDB/server/pull/3081

            Thank you for the analysis. Right, any WAL for file operations is mutually exclusive with log checkpoint. This might be worth improving if we at some point can move to a design where a single WAL is ensuring crash-safe DDL operations and this would remain the last bottleneck. Currenltly, at least two files (ib_logfile0 and ddl_recovery.log, optionally a binlog file) will have to be durably written during a DDL operation. The overhead around some main-memory data structures in InnoDB should be negligible compared to the overhead of the I/O related system calls.

            I am not sure if we could do away with the file operation log records in the WAL. Possibly it could work if we durably wrote the undo log of the DDL transaction so that crash recovery would be aware of the changes, and during the rollback of an incomplete DDL transaction we would be able to roll back the changes. Actually, that is what we are currently doing, always: first the roll-forward recovery will get us to the state of the crash, and then the transaction rollback will rename the files back.

            If we eliminated the roll-forward part and the FILE_ records, I think that the main challenge would be the discovery of the mapping between numeric tablespace ID and file name during crash recovery. Currently, that low-level crash recovery is dealing with files and data pages; it does not know how to interpret the page contents (such as undo log pages).

            A possible alternative solution to this problem could be the following:

            1. log_buffer_flush_to_disk(true) or similar, to ensure that the TRX_UNDO_RENAME_TABLE record will be recovered in case the server is killed.
            2. Attempt to create an empty file at the destination path. (This could fail for a number of other reasons, such as missing write permissions on the directory.)
            3. If it succeeds and if fstat(2) tells that the stat::st_dev is the same as for the source path (same file system), then durably write the FILE_RENAME record.
            4. Rename the file, replacing the empty file.

            In case the server was killed before the FILE_RENAME record was durably written, the rollback of the TRX_UNDO_RENAME_TABLE record would remove the destination file if it is empty.

            marko Marko Mäkelä added a comment - Thank you for the analysis. Right, any WAL for file operations is mutually exclusive with log checkpoint. This might be worth improving if we at some point can move to a design where a single WAL is ensuring crash-safe DDL operations and this would remain the last bottleneck. Currenltly, at least two files ( ib_logfile0 and ddl_recovery.log , optionally a binlog file) will have to be durably written during a DDL operation. The overhead around some main-memory data structures in InnoDB should be negligible compared to the overhead of the I/O related system calls. I am not sure if we could do away with the file operation log records in the WAL. Possibly it could work if we durably wrote the undo log of the DDL transaction so that crash recovery would be aware of the changes, and during the rollback of an incomplete DDL transaction we would be able to roll back the changes. Actually, that is what we are currently doing, always: first the roll-forward recovery will get us to the state of the crash, and then the transaction rollback will rename the files back. If we eliminated the roll-forward part and the FILE_ records, I think that the main challenge would be the discovery of the mapping between numeric tablespace ID and file name during crash recovery. Currently, that low-level crash recovery is dealing with files and data pages; it does not know how to interpret the page contents (such as undo log pages). A possible alternative solution to this problem could be the following: log_buffer_flush_to_disk(true) or similar, to ensure that the TRX_UNDO_RENAME_TABLE record will be recovered in case the server is killed. Attempt to create an empty file at the destination path. (This could fail for a number of other reasons, such as missing write permissions on the directory.) If it succeeds and if fstat(2) tells that the stat::st_dev is the same as for the source path (same file system), then durably write the FILE_RENAME record. Rename the file, replacing the empty file. In case the server was killed before the FILE_RENAME record was durably written, the rollback of the TRX_UNDO_RENAME_TABLE record would remove the destination file if it is empty.

            marko Interesting comment. Although I think creating 2 files at the same time is kind of creating anomaly and doesn't look very good to me. Changing the things on disk could have multiple ways of impacting the system and need to be more carefully decided. e.g. While rolling back it might face both the old and new file and in current case if it replaces the old with new file (which is empty) it could go wrong. I think changes f such magnitude should be more carefully considered with other viable / better alternatives.

            One future way could be separate out the meta information that redo log contains. Such logs are never needed for roll forward and backup and any other application should ignore them. This is a case where we actually should log the meta information before changing the name and doesn't have a lot to do with WAL consistency or roll forward. It would however need bigger consideration such as adding new redo logs possibly not allowing degrade with a crashed DB etc.

            I think it should be fine to skip right now, as the scenario is also very unlikely. The current bug scenario is much more likely and can happen when user mistyped the schema name or something.

            debarun Debarun Banerjee added a comment - marko Interesting comment. Although I think creating 2 files at the same time is kind of creating anomaly and doesn't look very good to me. Changing the things on disk could have multiple ways of impacting the system and need to be more carefully decided. e.g. While rolling back it might face both the old and new file and in current case if it replaces the old with new file (which is empty) it could go wrong. I think changes f such magnitude should be more carefully considered with other viable / better alternatives. One future way could be separate out the meta information that redo log contains. Such logs are never needed for roll forward and backup and any other application should ignore them. This is a case where we actually should log the meta information before changing the name and doesn't have a lot to do with WAL consistency or roll forward. It would however need bigger consideration such as adding new redo logs possibly not allowing degrade with a crashed DB etc. I think it should be fine to skip right now, as the scenario is also very unlikely. The current bug scenario is much more likely and can happen when user mistyped the schema name or something.

            For now, we can handle the missing directory as a special case and be done with this.

            One more possible fix that would not require any changes to file formats would be to refine recovery so that a failure to apply a FILE_RENAME operation would be tolerated for a while, in an anticipation that there will be an opposite FILE_RENAME that was written as part of a rollback, or that there will be an incomplete DDL transaction for the rename. Log checkpoints would have to be blocked in the time while an unaccounted-for FILE_RENAME operation exists.

            marko Marko Mäkelä added a comment - For now, we can handle the missing directory as a special case and be done with this. One more possible fix that would not require any changes to file formats would be to refine recovery so that a failure to apply a FILE_RENAME operation would be tolerated for a while, in an anticipation that there will be an opposite FILE_RENAME that was written as part of a rollback, or that there will be an incomplete DDL transaction for the rename. Log checkpoints would have to be blocked in the time while an unaccounted-for FILE_RENAME operation exists.

            People

              debarun Debarun Banerjee
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.