[MDEV-11947] InnoDB purge workers fail to shut down Created: 2017-01-31  Updated: 2017-02-09  Resolved: 2017-02-06

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB, Tests
Affects Version/s: 10.0, 10.2
Fix Version/s: 10.0.30, 10.1.22, 10.2.4

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: 10.2-rc

Attachments: HTML File threads1    
Issue Links:
PartOf
is part of MDEV-10604 Create a list of unstable MTR tests t... Closed
Relates
relates to MDEV-11802 innodb.innodb_bug14676111 fails in bu... Closed

 Description   

Upd:
Apparently, the problem appeared in 10.2 tree with this push:
https://internal.askmonty.org/buildbot/builders/kvm-bintar-centos5-amd64/builds/4239


The failure can appear in buildbot like this:
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-amd64/builds/4574/steps/test/logs/stdio

rpl.rpl_report_port 'mix'                w4 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2017-01-30 18:47:48
line
2017-01-30 18:47:45 1075733824 [ERROR] mysqld: Table './mysql/gtid_slave_pos' is marked as crashed and should be repaired
2017-01-30 18:47:45 1075733824 [Warning] Checking table:   './mysql/gtid_slave_pos'
2017-01-30 18:47:45 1075733824 [ERROR] mysql.gtid_slave_pos: 1 client is using or hasn't closed the table properly
^ Found warnings in /usr/local/mariadb-10.2.4-linux-x86_64/mysql-test/var/4/log/mysqld.2.err
ok

The test restarts slave server as a part of the test flow.
Error log shows that the slave starts shutdown but cannot finish it. After 60 second timeout it gets killed, restarts and thus reports gtid tables as crashed.
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-amd64/builds/4574/steps/test/logs/mysqld.2.err.4

2017-01-30 18:46:44 46971208279872 [Note] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: ready for connections.
Version: '10.2.4-MariaDB-log'  socket: '/usr/local/mariadb-10.2.4-linux-x86_64/mysql-test/var/tmp/4/mysqld.2.sock'  port: 16061  MariaDB Server
2017-01-30 18:46:44 1326331200 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='16060', master_log_file='master-bin.000001', master_log_pos='4'.
2017-01-30 18:46:44 1327240512 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 4, relay log './slave-relay-bin.000001' position: 4
2017-01-30 18:46:44 1326937408 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16060',replication started in log 'master-bin.000001' at position 4
2017-01-30 18:46:44 1328149824 [Note] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld (root[root] @ localhost [127.0.0.1]): Normal shutdown
 
2017-01-30 18:46:44 1328149824 [Note] Event Scheduler: Purging the queue. 0 events
2017-01-30 18:46:44 1327240512 [Note] Error reading relay log event: slave SQL thread was killed
2017-01-30 18:46:44 1327240512 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329
2017-01-30 18:46:44 1326937408 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329
2017-01-30 18:47:45 47863460431680 [Note] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld (mysqld 10.2.4-MariaDB-log) starting as process 24713 ...
2017-01-30 18:47:45 47863460431680 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4162)
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'partition' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Uses event mutexes
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Using Linux native AIO
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Number of pools: 1
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Using generic crc32 instructions
2017-01-30 18:47:45 47863460431680 [ERROR] InnoDB: Linux Native AIO not supported. You can either move tmpdir to a file system that supports native AIO or you can set innodb_use_native_aio to FALSE to avoid this message.
2017-01-30 18:47:45 47863460431680 [ERROR] InnoDB: Linux Native AIO check on tmpdirreturned error[22]
2017-01-30 18:47:45 47863460431680 [Warning] InnoDB: Linux Native AIO disabled.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Initializing buffer pool, total size = 8M, instances = 1, chunk size = 8M
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Completed initialization of buffer pool
2017-01-30 18:47:45 1176955200 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Highest supported file format is Barracuda.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Log scan progressed past the checkpoint lsn 1238748
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Doing recovery: scanned up to log sequence number 1238757
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Doing recovery: scanned up to log sequence number 1238757
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Database was not shutdown normally!
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Starting crash recovery.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Creating shared tablespace for temporary tables
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2017-01-30 18:47:45 47863460431680 [Note] InnoDB: 5.7.14 started; log sequence number 1238757
2017-01-30 18:47:45 1271363904 [Note] InnoDB: Loading buffer pool(s) from /dev/shm/var_auto_nOan/4/mysqld.2/data/ib_buffer_pool
2017-01-30 18:47:45 1271363904 [Note] InnoDB: Buffer pool(s) load completed at 170130 18:47:45
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'SEQUENCE' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_DATAFILES' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_TABLESTATS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_LOCKS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_SEMAPHORE_WAITS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMPMEM' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMP_PER_INDEX' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMP' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_DELETED' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMP_RESET' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_LOCK_WAITS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_TABLESPACES_ENCRYPTION' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_BUFFER_PAGE_LRU' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_FIELDS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMPMEM_RESET' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'FEEDBACK' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_COLUMNS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_INDEX_TABLE' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_CMP_PER_INDEX_RESET' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_MUTEXES' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_INDEX_CACHE' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_FOREIGN_COLS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_BEING_DELETED' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_BUFFER_POOL_STATS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_TRX' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_FOREIGN' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_TABLES' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_DEFAULT_STOPWORD' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_FT_CONFIG' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_BUFFER_PAGE' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_TABLESPACES' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_METRICS' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_INDEXES' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_SYS_VIRTUAL' is disabled.
2017-01-30 18:47:45 47863460431680 [Note] Plugin 'INNODB_TABLESPACES_SCRUBBING' is disabled.
2017-01-30 18:47:45 47863460431680 [Warning] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: unknown variable 'loose-feedback-debug-startup-interval=20'
2017-01-30 18:47:45 47863460431680 [Warning] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: unknown variable 'loose-feedback-debug-first-interval=60'
2017-01-30 18:47:45 47863460431680 [Warning] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: unknown variable 'loose-feedback-debug-interval=60'
2017-01-30 18:47:45 47863460431680 [Warning] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: unknown option '--loose-pam-debug'
2017-01-30 18:47:45 47863460431680 [Warning] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld: unknown variable 'loose-debug-sync-timeout=300'
2017-01-30 18:47:45 47863460431680 [Note] Recovering after a crash using slave-bin
2017-01-30 18:47:45 47863460431680 [Note] Starting crash recovery...
2017-01-30 18:47:45 47863460431680 [Note] Crash recovery finished.
2017-01-30 18:47:45 1075733824 [ERROR] mysqld: Table './mysql/gtid_slave_pos' is marked as crashed and should be repaired
2017-01-30 18:47:45 1075733824 [Warning] Checking table:   './mysql/gtid_slave_pos'
2017-01-30 18:47:45 1075733824 [ERROR] mysql.gtid_slave_pos: 1 client is using or hasn't closed the table properly

The problem is reproducible on corresponding VMs (e.g. vm-centos5-amd64-install), not reliably, but easily enough by executing the test with --repeat. I ran it with --repeat=20 --force-restart), hit the problem nearly every time.

Increasing the timeout does not help.

Could not reproduce it without InnoDB enabled on the slave.

Stack trace from all threads from the hanging server is attached. The stack trace does not change within reasonable time.


This is most likely the same problem in innodb_fts.innodb_fts_plugin, now without a slave:
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-x86/builds/6274/steps/test/logs/stdio
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-x86/builds/6274/steps/test/logs/mysqld.1.err.4


innodb_zip.innochecksum_3 has a different presentation, but probably the same root cause:
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-x86/builds/6274/steps/test/logs/stdio
http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-x86/builds/6274/steps/test/logs/mysqld.1.err.1

innodb_zip.innochecksum_3 'innodb'       w1 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2017-01-30 19:08:12
line
2017-01-30 19:08:07 3086788336 [Warning] InnoDB: innodb_checksum_algorithm is set to "strict_none" but the page [page id: space=0, page number=190] contains a valid checksum "crc32".  Accepting the page as valid. Change innodb_checksum_algorithm to "none" to silently accept such pages or rewrite all pages so that they contain "none" checksum.
2017-01-30 19:08:07 3086788336 [Warning] InnoDB: innodb_checksum_algorithm is set to "strict_none" but the page [page id: space=0, page number=191] contains a valid checksum "crc32".  Accepting the page as valid. Change innodb_checksum_algorithm to "none" to silently accept such pages or rewrite all pages so that they contain "none" checksum.
^ Found warnings in /usr/local/mariadb-10.2.4-linux-i686/mysql-test/var/1/log/mysqld.1.err
ok

It also restart the server, and one of restarts the server hangs on shutdown.
After getting killed and crash-recovered, the server produces warnings as above.



 Comments   
Comment by Marko Mäkelä [ 2017-01-31 ]

The innodb_zip.innochecksum_3 failure should not be a direct consequence of any hang. The test is shutting down the server, and after that it is invoking innochecksum --write=none to rewrite the dummy page checksums. It does not matter if the server was forcibly killed (well, except if the forced kill happened to cause a corrupted page, because innochecksum does not use any doublewrite buffer).

Curiously, the innochecksum_3.test sometimes assigns restart_options, sometimes $restart_parameters. Does restart_options have any effect?

The page numbers (space 0, page 190 and 191) are the very last 2 pages of the InnoDB doublewrite buffer in the system tablespace when using the default innodb_page_size=16k.
extra/innochecksum.cc contains some code to specifically avoid touching any pages in the doublewrite buffer:

			if (is_system_tablespace) {
				/* enable when page is double write buffer.*/
				skip_page = is_page_doublewritebuffer(buf);

So, it seems that we should either suppress the warning or simply zero out all doublewrite buffer pages and disable the doublewrite buffer while running this test.

Comment by Elena Stepanova [ 2017-01-31 ]

marko,
And yet, it seems that the warnings appear after the shutdown hang and recovery. I don't recommend suppressing them until you make sure how it's related, but it's up to you at the end.

Comment by Marko Mäkelä [ 2017-01-31 ]

The doublewrite buffer is not emptied on a shutdown. Instead, it is emptied on startup after any doublewrite recovery has been done. I think that a reasonable fix for the innochecksum tests would be to simply disable the InnoDB doublewrite buffer. In that case, those pages should be all zero, and there would be no problem (well, except if the forced kill happened to occur during a write request).

Comment by Marko Mäkelä [ 2017-02-02 ]

Let us look at the shutdown hang (forced kill+restart after 61 seconds) mentioned in the Description:

2017-01-30 18:46:44 1327240512 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329
2017-01-30 18:46:44 1326937408 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329
 
2017-01-30 18:47:45 47863460431680 [Note] /usr/local/mariadb-10.2.4-linux-x86_64/bin/mysqld (mysqld 10.2.4-MariaDB-log) starting as process 24713 ...

Normally InnoDB should display a message early in the shutdown process:

	ib::info() << "Starting shutdown...";

But there is no "InnoDB: Starting shutdown..." in the log, and also the global variables are suggesting that shutdown was never initiated:

(gdb) p srv_shutdown_state
$1 = SRV_SHUTDOWN_NONE
(gdb) p srv_fast_shutdown
$2 = 1
(gdb) p innodb_inited 
$3 = true
(gdb) p srv_was_started 
$4 = 1

Comment by Marko Mäkelä [ 2017-02-02 ]

The problem is that srv_purge_coordinator_thread() exited without ensuring that all worker threads actually exited. In the core dump that I examined, one worker thread is waiting for an event, which apparently was lost, and the protocol is failing to account for lost signals. Ever since MDEV-5800 was finished, that srv_worker_thread() would be associated with a THD object, thus preventing close_connections() from finishing. This is the reason why unireg_end() and the plugin shutdown never gets called.

We must make the purge subsystem more robust, so that signals will never be lost. This problem could explain the intermittent failures of the MySQL 5.7 test innodb.index_merge_threshold.

Comment by Marko Mäkelä [ 2017-02-02 ]

https://github.com/MariaDB/server/commit/49de5997ef6dd9f3b97d2c2ea81dc50170f929c2

Comment by Jan Lindström (Inactive) [ 2017-02-03 ]

ok to push.

Comment by Marko Mäkelä [ 2017-02-03 ]

I will try to backport this to 10.0, because I think that purge is sometimes ‘stuck’ with no apparent reason there. A natural explanation would be a lost signal.

Comment by Marko Mäkelä [ 2017-02-03 ]

The list of unstable tests in MDEV-10604 mentions a test that could be addressed by this fix, that is, by ensuring that the purge threads does not hang on shutdown. The log says:

| innodb.innodb-get-fk                              | 2016-03-01 12:27:56 |          1 | # ?-------- Apparently timeout on valgrind, one-time in March, logs not available

Comment by Marko Mäkelä [ 2017-02-03 ]

Note: innodb.innodb-get-fk in 10.0 never actually started the server as --innodb-read-only. The test could be a better telltale sign of a shutdown hang on 10.1, where the test could fail to restart the server because --innodb-read-only would refuse to restart the server in certain cases (after MDEV-11814 always when redo log apply would be needed).

Comment by Marko Mäkelä [ 2017-02-03 ]

10.0 patch: https://github.com/MariaDB/server/commit/e1ad0e5a3f1c2bdd11499166ff8f14f763878cfb

Comment by Jan Lindström (Inactive) [ 2017-02-06 ]

ok to push.

Generated at Thu Feb 08 07:53:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.