Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15636

mariabackup --lock-ddl-per-table hangs in FLUSH TABLES due to MDL conflict if ALTER TABLE issued

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.2.13
    • 10.2.14, 10.3.6
    • Backup
    • None

    Description

      Trying to take a DDL aware backup with mariabackup --lock-ddl-per-table causes backup to wait indefinitely in FLUSH NO_WRITE_TO_BINLOG TABLES if an ALTER TABLE is issued during the backup.

      To reproduce, run backup with:

      mariabackup --lock-ddl-per-table  --backup --user user --password password --stream=xbstream --parallel=4 --tmpdir=/tmp | gzip > backup.xbs.gz
      

      While backup is running, issue an alter table to add a column, eg:

      alter table testpk add column MARIA_ID bigint unsigned not null auto_increment primary key
      

      The backup will hang in FLUSH TABLES forever.

      Show processlist shows that there is some kind of conflict between the MDL taken by the lock-ddl-per-table and the flush tables command:

      MariaDB [(none)]> show processlist;
      +----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                           | Info                                                                                       | Progress |
      +----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
      |  1 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
      |  2 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
      |  3 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
      |  4 | system user |           | NULL | Daemon  | NULL | InnoDB purge coordinator        | NULL                                                                                       |    0.000 |
      |  5 | system user |           | NULL | Daemon  | NULL | InnoDB shutdown handler         | NULL                                                                                       |    0.000 |
      | 63 | root        | localhost | rick | Query   |  113 | Waiting for table metadata lock | alter table testpk add column MARIA_ID bigint unsigned not null auto_increment primary key |    0.000 |
      | 80 | root        | localhost | NULL | Query   |   52 | Waiting for table flush         | FLUSH NO_WRITE_TO_BINLOG TABLES                                                            |    0.000 |
      | 81 | root        | localhost | NULL | Sleep   |  117 |                                 | NULL                                                                                       |    0.000 |
      | 82 | root        | localhost | NULL | Query   |    0 | init                            | show processlist                                                                           |    0.000 |
      +----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
      9 rows in set (0.00 sec)
      
      

      Thread dump show what looks like a metadata deadlock, both threads waiting in MDL_wait:

           10 libaio::??(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1893),LinuxAIOHandler::poll(os0file.cc:2038),os_aio_linux_handler(os0file.cc:2092),os_aio_handler(os0file.cc:2092),fil_aio_wait(fil0fil.cc:5330),io_handler_thread(srv0start.cc:343),start_thread,clone
            3 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:935),srv_worker_thread(srv0srv.cc:935),start_thread,clone
            1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:935),srv_purge_coordinator_suspend(srv0srv.cc:935),srv_purge_coordinator_thread(srv0srv.cc:935),start_thread,clone
            1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_resize_thread(buf0buf.cc:3083),start_thread,clone
            1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_dump_thread(buf0dump.cc:777),start_thread,clone
            1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),thd_destructor_proxy(mysql_thread.h:1149),start_thread,clone
            1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),handle_slave_background(mysql_thread.h:1149),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_monitor_thread(srv0srv.cc:1754),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_error_monitor_thread(srv0srv.cc:1923),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),pc_sleep_if_needed(buf0flu.cc:2718),buf_flush_page_cleaner_coordinator(buf0flu.cc:2718),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),lock_wait_timeout_thread(lock0wait.cc:547),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),ib_wqueue_timedwait(ut0wqueue.cc:163),fts_optimize_thread(fts0opt.cc:3032),start_thread,clone
            1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),dict_stats_thread(dict0stats_bg.cc:451),start_thread,clone
            1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),timer_handler(mysql_thread.h:1186),start_thread,clone
            1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),my_service_thread_sleep(mysql_thread.h:1186),ma_checkpoint_background(ma_checkpoint.c:709),start_thread,clone
            1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),MDL_wait::timed_wait(mysql_thread.h:1186),TABLE_SHARE::wait_for_old_version(table.cc:4410),close_cached_tables(sql_base.cc:454),reload_acl_and_cache(sql_reload.cc:334),mysql_execute_command(sql_parse.cc:5384),mysql_parse(sql_parse.cc:7902),dispatch_command(sql_parse.cc:1806),do_command(sql_parse.cc:1360),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
            1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),MDL_wait::timed_wait(mysql_thread.h:1186),MDL_context::acquire_lock(mdl.cc:2138),MDL_context::upgrade_shared_lock(mdl.cc:2333),mysql_inplace_alter_table(sql_table.cc:7245),mysql_alter_table(sql_table.cc:7245),Sql_cmd_alter_table::execute(sql_alter.cc:324),mysql_execute_command(sql_parse.cc:6208),mysql_parse(sql_parse.cc:7902),dispatch_command(sql_parse.cc:1806),do_command(sql_parse.cc:1360),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
            1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),inline_mysql_cond_timedwait(mysql_thread.h:1186),cache_thread(mysql_thread.h:1186),one_thread_per_connection_end(mysql_thread.h:1186),do_handle_one_connection(sql_connect.cc:1354),handle_one_connection(sql_connect.cc:1241),start_thread,clone
            1 __poll,vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1242),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
            1 __poll,handle_connections_sockets(mysqld.cc:6584),mysqld_main(mysqld.cc:6050),__libc_start_main,_start
            1 nanosleep,os_thread_sleep(os0thread.cc:230),srv_master_sleep(srv0srv.cc:2412),srv_master_thread(srv0srv.cc:2412),start_thread,clone
            1 nanosleep,os_thread_sleep(os0thread.cc:230),btr_defragment_thread(btr0defragment.cc:772),start_thread,clone
            1 do_sigwait,__sigwait,signal_hand(mysqld.cc:3505),start_thread,clone
      

      Attachments

        Issue Links

          Activity

            manjot Manjot Singh (Inactive) added a comment - - edited

            I don't think the solution to this - killing app/user queries - is one that is viable for production deployments.

            manjot Manjot Singh (Inactive) added a comment - - edited I don't think the solution to this - killing app/user queries - is one that is viable for production deployments.
            wlad Vladislav Vaintroub added a comment - - edited

            you have option no-lock. In Percona you do not have option- it is always no-lock implied
            Also, it is not the first kill in a backup.
            You can propose other solutions, and Improve on a patch.

            So, there are the options

            1. no lock-ddl-per-table
            2. lock-ddl-per-table with KILL
            3. lock-ddl-per-table without KILL, with --no-lock (i.e without FTWRL)

            I do not see any other, that can be done without changing server. If FTWRL kept existing MDL locks, instead of releasing and doing implicit commit, not KILLing would be easy to implement, but implicit COMMIT is documented and cannot be changed.

            wlad Vladislav Vaintroub added a comment - - edited you have option no-lock. In Percona you do not have option- it is always no-lock implied Also, it is not the first kill in a backup. You can propose other solutions, and Improve on a patch. So, there are the options 1. no lock-ddl-per-table 2. lock-ddl-per-table with KILL 3. lock-ddl-per-table without KILL, with --no-lock (i.e without FTWRL) I do not see any other, that can be done without changing server. If FTWRL kept existing MDL locks, instead of releasing and doing implicit commit, not KILLing would be easy to implement, but implicit COMMIT is documented and cannot be changed.

            We are working on implementing a better backup lock strategy, see MDEV-5336. With this backup strategy there is no reason to use --ddl-lock-per-table anymore. One can then also do DDL's while the backup is running (until the last 2 stages which should in most cases be very short).

            monty Michael Widenius added a comment - We are working on implementing a better backup lock strategy, see MDEV-5336 . With this backup strategy there is no reason to use --ddl-lock-per-table anymore. One can then also do DDL's while the backup is running (until the last 2 stages which should in most cases be very short).

            People

              wlad Vladislav Vaintroub
              rpizzi Rick Pizzi (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.