[MDEV-15636] mariabackup --lock-ddl-per-table hangs in FLUSH TABLES due to MDL conflict if ALTER TABLE issued Created: 2018-03-22  Updated: 2018-11-28  Resolved: 2018-04-02

Status: Closed
Project: MariaDB Server
Component/s: Backup
Affects Version/s: 10.2.13
Fix Version/s: 10.2.14, 10.3.6

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Relates
relates to MDEV-5336 Implement BACKUP STAGE for safe exter... Closed
relates to MDEV-16791 mariabackup : allow consistent backup... Closed
relates to MDEV-17772 3 way lock : ALTER, MDL, BACKUP STAG... Closed

 Description   

Trying to take a DDL-aware backup with mariabackup --lock-ddl-per-table causes the backup to hang indefinitely in FLUSH NO_WRITE_TO_BINLOG TABLES if an ALTER TABLE is issued while the backup is running.

To reproduce, run backup with:

mariabackup --lock-ddl-per-table  --backup --user user --password password --stream=xbstream --parallel=4 --tmpdir=/tmp | gzip > backup.xbs.gz

While the backup is running, issue an ALTER TABLE that adds a column, e.g.:

alter table testpk add column MARIA_ID bigint unsigned not null auto_increment primary key

The backup will hang in FLUSH TABLES forever.
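The hang can be reproduced without mariabackup: --lock-ddl-per-table essentially holds a shared metadata lock on each table from a long-lived backup connection. A minimal sketch of the three-session interaction (the exact statement mariabackup uses to take its MDL is an assumption here; any statement on the table inside an open transaction has the same effect):

```sql
-- Session A: stands in for the backup's MDL-holding connection.
-- The shared metadata lock on testpk is kept until COMMIT/ROLLBACK.
START TRANSACTION;
SELECT 1 FROM testpk LIMIT 1;

-- Session B: blocks waiting to upgrade to an exclusive MDL,
-- because session A still holds its shared MDL on testpk.
ALTER TABLE testpk ADD COLUMN MARIA_ID BIGINT UNSIGNED NOT NULL
    AUTO_INCREMENT PRIMARY KEY;

-- Session C: stands in for mariabackup's flush step. It waits for
-- the table instance opened by the blocked ALTER to be closed,
-- which never happens, so all three sessions are now stuck.
FLUSH NO_WRITE_TO_BINLOG TABLES;
```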

SHOW PROCESSLIST shows a conflict between the MDL acquired by --lock-ddl-per-table and the FLUSH TABLES statement:

MariaDB [(none)]> show processlist;
+----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                           | Info                                                                                       | Progress |
+----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
|  1 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
|  2 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
|  3 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                       |    0.000 |
|  4 | system user |           | NULL | Daemon  | NULL | InnoDB purge coordinator        | NULL                                                                                       |    0.000 |
|  5 | system user |           | NULL | Daemon  | NULL | InnoDB shutdown handler         | NULL                                                                                       |    0.000 |
| 63 | root        | localhost | rick | Query   |  113 | Waiting for table metadata lock | alter table testpk add column MARIA_ID bigint unsigned not null auto_increment primary key |    0.000 |
| 80 | root        | localhost | NULL | Query   |   52 | Waiting for table flush         | FLUSH NO_WRITE_TO_BINLOG TABLES                                                            |    0.000 |
| 81 | root        | localhost | NULL | Sleep   |  117 |                                 | NULL                                                                                       |    0.000 |
| 82 | root        | localhost | NULL | Query   |    0 | init                            | show processlist                                                                           |    0.000 |
+----+-------------+-----------+------+---------+------+---------------------------------+--------------------------------------------------------------------------------------------+----------+
9 rows in set (0.00 sec)
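To confirm who holds which metadata lock, the metadata_lock_info plugin (shipped with MariaDB, but not loaded by default) can be queried; the backup's sleeping connection (Id 81 above) should show up as the holder of the lock on testpk:

```sql
-- Load the plugin if it is not already installed.
INSTALL SONAME 'metadata_lock_info';

-- List held metadata locks per thread.
SELECT THREAD_ID, LOCK_MODE, LOCK_TYPE, TABLE_SCHEMA, TABLE_NAME
FROM information_schema.METADATA_LOCK_INFO;
```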

A thread dump shows what looks like a metadata lock deadlock, with both threads waiting in MDL_wait:

     10 libaio::??(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1893),LinuxAIOHandler::poll(os0file.cc:2038),os_aio_linux_handler(os0file.cc:2092),os_aio_handler(os0file.cc:2092),fil_aio_wait(fil0fil.cc:5330),io_handler_thread(srv0start.cc:343),start_thread,clone
      3 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:935),srv_worker_thread(srv0srv.cc:935),start_thread,clone
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:935),srv_purge_coordinator_suspend(srv0srv.cc:935),srv_purge_coordinator_thread(srv0srv.cc:935),start_thread,clone
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_resize_thread(buf0buf.cc:3083),start_thread,clone
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_dump_thread(buf0dump.cc:777),start_thread,clone
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),thd_destructor_proxy(mysql_thread.h:1149),start_thread,clone
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),handle_slave_background(mysql_thread.h:1149),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_monitor_thread(srv0srv.cc:1754),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_error_monitor_thread(srv0srv.cc:1923),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),pc_sleep_if_needed(buf0flu.cc:2718),buf_flush_page_cleaner_coordinator(buf0flu.cc:2718),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),lock_wait_timeout_thread(lock0wait.cc:547),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),ib_wqueue_timedwait(ut0wqueue.cc:163),fts_optimize_thread(fts0opt.cc:3032),start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),dict_stats_thread(dict0stats_bg.cc:451),start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),timer_handler(mysql_thread.h:1186),start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),my_service_thread_sleep(mysql_thread.h:1186),ma_checkpoint_background(ma_checkpoint.c:709),start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),MDL_wait::timed_wait(mysql_thread.h:1186),TABLE_SHARE::wait_for_old_version(table.cc:4410),close_cached_tables(sql_base.cc:454),reload_acl_and_cache(sql_reload.cc:334),mysql_execute_command(sql_parse.cc:5384),mysql_parse(sql_parse.cc:7902),dispatch_command(sql_parse.cc:1806),do_command(sql_parse.cc:1360),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),MDL_wait::timed_wait(mysql_thread.h:1186),MDL_context::acquire_lock(mdl.cc:2138),MDL_context::upgrade_shared_lock(mdl.cc:2333),mysql_inplace_alter_table(sql_table.cc:7245),mysql_alter_table(sql_table.cc:7245),Sql_cmd_alter_table::execute(sql_alter.cc:324),mysql_execute_command(sql_parse.cc:6208),mysql_parse(sql_parse.cc:7902),dispatch_command(sql_parse.cc:1806),do_command(sql_parse.cc:1360),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),inline_mysql_cond_timedwait(mysql_thread.h:1186),cache_thread(mysql_thread.h:1186),one_thread_per_connection_end(mysql_thread.h:1186),do_handle_one_connection(sql_connect.cc:1354),handle_one_connection(sql_connect.cc:1241),start_thread,clone
      1 __poll,vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1242),do_handle_one_connection(sql_connect.cc:1335),handle_one_connection(sql_connect.cc:1241),start_thread,clone
      1 __poll,handle_connections_sockets(mysqld.cc:6584),mysqld_main(mysqld.cc:6050),__libc_start_main,_start
      1 nanosleep,os_thread_sleep(os0thread.cc:230),srv_master_sleep(srv0srv.cc:2412),srv_master_thread(srv0srv.cc:2412),start_thread,clone
      1 nanosleep,os_thread_sleep(os0thread.cc:230),btr_defragment_thread(btr0defragment.cc:772),start_thread,clone
      1 do_sigwait,__sigwait,signal_hand(mysqld.cc:3505),start_thread,clone



 Comments   
Comment by Manjot Singh (Inactive) [ 2018-07-18 ]

I don't think the solution to this (killing app/user queries) is viable for production deployments.

Comment by Vladislav Vaintroub [ 2018-07-18 ]

You have the --no-lock option. In Percona XtraBackup you do not have that option; no-lock is always implied.
Also, this is not the first KILL in a backup.
You can propose other solutions and improve on the patch.

So, these are the options:

1. no --lock-ddl-per-table
2. --lock-ddl-per-table with KILL
3. --lock-ddl-per-table without KILL, with --no-lock (i.e. without FTWRL)

I do not see any other option that can be implemented without changing the server. If FTWRL kept existing MDL locks instead of releasing them and performing an implicit commit, avoiding the KILL would be easy to implement, but the implicit COMMIT is documented behavior and cannot be changed.
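For reference, the three options above map to invocations roughly like these (flags as discussed in this issue; the streaming and user/password parts are carried over from the reproduction command and are placeholders):

```sh
# 1. No per-table DDL locking: concurrent DDL can invalidate the backup.
mariabackup --backup --user user --password password \
    --stream=xbstream > backup.xbs

# 2. Per-table MDL; with the fix, queries that would block
#    FLUSH TABLES forever are killed so the backup can proceed.
mariabackup --lock-ddl-per-table --backup --user user --password password \
    --stream=xbstream > backup.xbs

# 3. Per-table MDL without FTWRL (--no-lock), so no KILL is needed.
mariabackup --lock-ddl-per-table --no-lock --backup \
    --user user --password password --stream=xbstream > backup.xbs
```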

Comment by Michael Widenius [ 2018-11-28 ]

We are working on implementing a better backup lock strategy, see MDEV-5336. With this backup strategy there is no reason to use --lock-ddl-per-table anymore. One can then also run DDL while the backup is running (until the last 2 stages, which should in most cases be very short).
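The BACKUP STAGE interface from MDEV-5336 (available from MariaDB 10.4) stages the locking so that DDL is only blocked near the very end of the backup; a rough sketch of the sequence a backup tool runs:

```sql
BACKUP STAGE START;        -- backup begins; DDL and DML still allowed
BACKUP STAGE FLUSH;        -- flush tables that are not in use
BACKUP STAGE BLOCK_DDL;    -- new DDL is blocked; copy remaining files
BACKUP STAGE BLOCK_COMMIT; -- commits blocked briefly for a consistent point
BACKUP STAGE END;          -- release all backup locks
```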

Generated at Thu Feb 08 08:22:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.