For the above failure on kvm-asan, data.tar.xz is a copy of the data directory. To my surprise, it recovered fine both when restored to /dev/shm (using mmap(2) based access to the log) and to some other directory (reading the log by pread(2)).
To my surprise, the recovery would start from a much later LSN:
That is, the first checkpoint is 0xed25 (60709) and the second one is 0x1bd88f (1824911). Even if I discard the second checkpoint, the recovery will succeed for me:
2022-06-29 9:13:27 0 [Note] InnoDB: Starting final batch to recover 6 pages from redo log.
2022-06-29 9:13:27 0 [Note] InnoDB: 128 rollback segments are active.
2022-06-29 9:13:27 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2022-06-29 9:13:27 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2022-06-29 9:13:27 0 [Note] InnoDB: log sequence number 1952036; transaction id 67
At the indicated LSN 1800619 (0x1b79ab) of the failure message, I see the end of a mini-transaction that the checkpoint pointed to (0x1b7913). In the MDEV-14425 log format, the LSN conveniently is equal to the byte offset in the ib_logfile0 that was created at database initialization and did not wrap around yet. Here is the mini-transaction in question:
od -Ax -t x1 -t a -j 0x1b7913 -N 152 data/ib_logfile0
1b7913 b0 01 05 00 2e 2f 74 65 73 74 2f 74 31 2e 69 62
0 soh enq nul . / t e s t / t 1 . i b
1b7923 64 b0 12 01 00 2e 2f 6d 79 73 71 6c 2f 69 6e 6e
1b7953 6e 6e 6f 64 62 5f 69 6e 64 65 78 5f 73 74 61 74
n n o d b _ i n d e x _ s t a t
1b7963 73 2e 69 62 64 b0 01 06 00 2e 2f 74 65 73 74 2f
s . i b d 0 soh ack nul . / t e s t /
1b7973 74 32 2e 69 62 64 b0 01 07 00 2e 2f 74 65 73 74
t 2 . i b d 0 soh bel nul . / t e s t
1b7983 2f 74 33 2e 69 62 64 b0 01 08 00 2e 2f 74 65 73
/ t 3 . i b d 0 soh bs nul . / t e s
1b7993 74 2f 74 34 2e 69 62 64 fa 00 00 00 00 00 00 00
t / t 4 . i b d z nul nul nul nul nul nul nul
1b79a3 00 ed 25 01 9e 41 cb 2d
nul m % soh rs A K -
1b79ab
This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records of files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), and a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), and the end-of-mini-transaction marker (sequence bit=1) and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.
Most failures occur on FreeBSD, but the above is the first case where a copy of the data directory is available yet. Here are two failures for 10.5:
10.5 35f2cdcb99f4f8b39895070fde3f825329106361
encryption.innodb-redo-nokeys 'cbc,innodb' w3 [ fail ] Found warnings/errors in server log file!
Test ended at 2022-06-23 04:51:11
line
2022-06-23 4:51:08 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1381251 between the checkpoint 52621 and the end 1446400.
^ Found warnings in /tmp/var/3/log/mysqld.1.err
…
encryption.innodb-redo-nokeys 'ctr,innodb' w4 [ fail ] Found warnings/errors in server log file!
Test ended at 2022-06-23 04:53:50
line
2022-06-23 4:53:47 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1171764 between the checkpoint 52621 and the end 1236992.
^ Found warnings in /tmp/var/4/log/mysqld.1.err
In each case, it seems possible that the latest checkpoint was from the shutdown of the server bootstrap.
serg, to me, this looks like a race condition in mtr. I suspect that the new server process was started while the old one is still being killed by the test (while writing a checkpoint). Before MDEV-24393, such a race would have been prevented by advisory locks on the files. If it is the case, fixing MDEV-28662 should fix this bug.
I think that I will add some retry loop there, like we did when InnoDB did not obey --skip-external-locking. It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.
Marko Mäkelä
added a comment - An attempted fix of MDEV-28495 confirmed my hypothesis that the old, supposedly-to-be-killed server was still running:
bb-10.3-MDEV-28495 66d7599c2792555b52c570c3047bea310e59fc3b
encryption.innodb-read-only 'cbc,innodb' w3 [ fail ]
Test ended at 2022-07-27 13:41:59
CURRENT_TEST: encryption.innodb-read-only
Server [mysqld.1 - pid: 643356, winpid: 643356, exit: 256] failed during test run
…
2022-07-27 13:41:58 0 [Note] InnoDB: Completed initialization of buffer pool
2022-07-27 13:41:58 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 9
2022-07-27 13:41:58 0x7f409aaa5d80 InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/fil/fil0fil.cc line 1781
InnoDB: Failing assertion: !os_file_lock(node->handle, node->name)
I think that I will add some retry loop there, like we did when InnoDB did not obey --skip-external-locking . It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.
Already in 10.3, several kill-and-restart tests were affected by this, also outside the encryption suite. Apparently, the probability of log checkpoints occurring near the server kill is much higher with encryption tests (which are writing relatively much data right before killing the server).
Marko Mäkelä
added a comment - Already in 10.3, several kill-and-restart tests were affected by this, also outside the encryption suite. Apparently, the probability of log checkpoints occurring near the server kill is much higher with encryption tests (which are writing relatively much data right before killing the server).
MDEV-28495 should hopefully fix this, by ensuring (via the lock on the system tablespace file) that the being-killed server process has actually been killed.
Marko Mäkelä
added a comment - MDEV-28495 should hopefully fix this, by ensuring (via the lock on the system tablespace file) that the being-killed server process has actually been killed.
An additional fix could help with those cases that are not ‘rescued’ by InnoDB advisory file locks. I suspect that some Spider tests could have failed due to a new server being started before the killed server has properly terminated.
Marko Mäkelä
added a comment - - edited An additional fix could help with those cases that are not ‘rescued’ by InnoDB advisory file locks. I suspect that some Spider tests could have failed due to a new server being started before the killed server has properly terminated.
MDEV-31568 was filed for the observation that the advisory file locking on (only) the InnoDB system tablespace does not work, starting with the MariaDB Server 10.5 release series.
Marko Mäkelä
added a comment - MDEV-31568 was filed for the observation that the advisory file locking on (only) the InnoDB system tablespace does not work, starting with the MariaDB Server 10.5 release series.
People
Marko Mäkelä
Marko Mäkelä
Votes:
0Vote for this issue
Watchers:
3Start watching this issue
Dates
Created:
Updated:
Resolved:
Git Integration
Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.
{"report":{"fcp":1790.4000000953674,"ttfb":326.7000002861023,"pageVisibility":"visible","entityId":112340,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{},"userDeviceMemory":8,"userDeviceProcessors":64,"apdex":0.5,"journeyId":"7c18dc47-c4f7-4321-b759-4419c2baf8f3","navigationType":0,"readyForUser":1944.2000002861023,"redirectCount":0,"resourceLoadedEnd":1488.9000000953674,"resourceLoadedStart":335.90000009536743,"resourceTiming":[{"duration":492.80000019073486,"initiatorType":"link","name":"https://jira.mariadb.org/s/2c21342762a6a02add1c328bed317ffd-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/css/_super/batch.css","startTime":335.90000009536743,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":335.90000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":828.7000002861023,"responseStart":0,"secureConnectionStart":0},{"duration":492.7999997138977,"initiatorType":"link","name":"https://jira.mariadb.org/s/7ebd35e77e471bc30ff0eba799ebc151-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/css/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&slack-enabled=true&whisper-enabled=true","startTime":336.2000002861023,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":336.2000002861023,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":829,"responseStart":0,"secureConnectionStart":0},{"duration":948.8000001907349,"initiatorType":"script","name":"https://jira.mariadb.org/s/0917945aaa57108d00c5076fea35e069-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/js/_super/batch.js?locale=en","startTime":336.5,"connectEnd":336.5,"connectStart":336.5,"domainLookupEnd":336.5,"domainLookupStart":336.5,"fetchStart":336.5,"redirectEnd":0,"redirectStart":0,"requestStart":856.7000002861023,"responseEnd":1285.3000001907349,"responseStart":894.2000002861023,"secureConnectionStart":336.5},{"duration":1151.9000000953674,"initiatorType":"script","name":"https://jira.mariadb.org/s/2d8175ec2fa4c816e8023260bd8c1786-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/js/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en&slack-enabled=true&whisper-enabled=true","startTime":337,"connectEnd":337,"connectStart":337,"domainLookupEnd":337,"domainLookupStart":337,"fetchStart":337,"redirectEnd":0,"redirectStart":0,"requestStart":857.1000003814697,"responseEnd":1488.9000000953674,"responseStart":896.1000003814697,"secureConnectionStart":337},{"duration":559.5999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/a9324d6758d385eb45c462685ad88f1d-CDN/lu2cib/820016/12ta74/c92c0caa9a024ae85b0ebdbed7fb4bd7/_/download/contextbatch/js/atl.global,-_super/batch.js?locale=en","startTime":341.1000003814697,"connectEnd":341.1000003814697,"connectStart":341.1000003814697,"domainLookupEnd":341.1000003814697,"domainLookupStart":341.1000003814697,"fetchStart":341.1000003814697,"redirectEnd":0,"redirectStart":0,"requestStart":860.6000003814697,"responseEnd":900.7000002861023,"responseStart":899.5,"secureConnectionStart":341.1000003814697},{"duration":564.4000000953674,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":341.40000009536743,"connectEnd":341.40000009536743,"connectStart":341.40000009536743,"domainLookupEnd":341.40000009536743,"domainLookupStart":341.40000009536743,"fetchStart":341.40000009536743,"redirectEnd":0,"redirectStart":0,"requestStart":860.7000002861023,"responseEnd":905.8000001907349,"responseStart":902,"secureConnectionStart":341.40000009536743},{"duration":562.5999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":341.6000003814697,"connectEnd":341.6000003814697,"connectStart":341.6000003814697,"domainLookupEnd":341.6000003814697,"domainLookupStart":341.6000003814697,"fetchStart":341.6000003814697,"redirectEnd":0,"redirectStart":0,"requestStart":860.9000000953674,"responseEnd":904.2000002861023,"responseStart":900.9000000953674,"secureConnectionStart":341.6000003814697},{"duration":504,"initiatorType":"link","name":"https://jira.mariadb.org/s/b04b06a02d1959df322d9cded3aeecc1-CDN/lu2cib/820016/12ta74/a2ff6aa845ffc9a1d22fe23d9ee791fc/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":341.80000019073486,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":341.80000019073486,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":845.8000001907349,"responseStart":0,"secureConnectionStart":0},{"duration":564.3000001907349,"initiatorType":"script","name":"https://jira.mariadb.org/rest/api/1.0/shortcuts/820016/47140b6e0a9bc2e4913da06536125810/shortcuts.js?context=issuenavigation&context=issueaction","startTime":342,"connectEnd":342,"connectStart":342,"domainLookupEnd":342,"domainLookupStart":342,"fetchStart":342,"redirectEnd":0,"redirectStart":0,"requestStart":861,"responseEnd":906.3000001907349,"responseStart":902.5,"secureConnectionStart":342},{"duration":504,"initiatorType":"link","name":"https://jira.mariadb.org/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.css?jira.create.linked.issue=true","startTime":342.1000003814697,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":342.1000003814697,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":846.1000003814697,"responseStart":0,"secureConnectionStart":0},{"duration":564.3000001907349,"initiatorType":"script","name":"https://jira.mariadb.org/s/5d5e8fe91fbc506585e83ea3b62ccc4b-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.js?jira.create.linked.issue=true&locale=en","startTime":342.30000019073486,"connectEnd":342.30000019073486,"connectStart":342.30000019073486,"domainLookupEnd":342.30000019073486,"domainLookupStart":342.30000019073486,"fetchStart":342.30000019073486,"redirectEnd":0,"redirectStart":0,"requestStart":861.2000002861023,"responseEnd":906.6000003814697,"responseStart":903.5,"secureConnectionStart":342.30000019073486},{"duration":1126.9000000953674,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":344.2000002861023,"connectEnd":344.2000002861023,"connectStart":344.2000002861023,"domainLookupEnd":344.2000002861023,"domainLookupStart":344.2000002861023,"fetchStart":344.2000002861023,"redirectEnd":0,"redirectStart":0,"requestStart":1116.5,"responseEnd":1471.1000003814697,"responseStart":1464.7000002861023,"secureConnectionStart":344.2000002861023},{"duration":1131.4000000953674,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":344.30000019073486,"connectEnd":344.30000019073486,"connectStart":344.30000019073486,"domainLookupEnd":344.30000019073486,"domainLookupStart":344.30000019073486,"fetchStart":344.30000019073486,"redirectEnd":0,"redirectStart":0,"requestStart":1389.4000000953674,"responseEnd":1475.7000002861023,"responseStart":1471.9000000953674,"secureConnectionStart":344.30000019073486},{"duration":262.40000009536743,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":1443.5,"connectEnd":1443.5,"connectStart":1443.5,"domainLookupEnd":1443.5,"domainLookupStart":1443.5,"fetchStart":1443.5,"redirectEnd":0,"redirectStart":0,"requestStart":1669.9000000953674,"responseEnd":1705.9000000953674,"responseStart":1704.4000000953674,"secureConnectionStart":1443.5}],"fetchStart":0,"domainLookupStart":0,"domainLookupEnd":0,"connectStart":0,"connectEnd":0,"requestStart":132,"responseStart":326,"responseEnd":342,"domLoading":330,"domInteractive":2077,"domContentLoadedEventStart":2077,"domContentLoadedEventEnd":2140,"domComplete":2936,"loadEventStart":2936,"loadEventEnd":2937,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":2045.5},{"name":"bigPipe.sidebar-id.end","time":2046.3000001907349},{"name":"bigPipe.activity-panel-pipe-id.start","time":2046.4000000953674},{"name":"bigPipe.activity-panel-pipe-id.end","time":2048.800000190735},{"name":"activityTabFullyLoaded","time":2164.5}],"measures":[],"correlationId":"595b349725008","effectiveType":"4g","downlink":10,"rtt":0,"serverDuration":116,"dbReadsTimeInMs":18,"dbConnsTimeInMs":27,"applicationHash":"9d11dbea5f4be3d4cc21f03a88dd11d8c8687422","experiments":[]}}
An attempted fix of
MDEV-28495confirmed my hypothesis that the old, supposedly-to-be-killed server was still running:bb-10.3-MDEV-28495 66d7599c2792555b52c570c3047bea310e59fc3b
encryption.innodb-read-only 'cbc,innodb' w3 [ fail ]
Test ended at 2022-07-27 13:41:59
CURRENT_TEST: encryption.innodb-read-only
Server [mysqld.1 - pid: 643356, winpid: 643356, exit: 256] failed during test run
…
2022-07-27 13:41:58 0 [Note] InnoDB: Completed initialization of buffer pool
2022-07-27 13:41:58 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 9
2022-07-27 13:41:58 0x7f409aaa5d80 InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/fil/fil0fil.cc line 1781
InnoDB: Failing assertion: !os_file_lock(node->handle, node->name)
I think that I will add some retry loop there, like we did when InnoDB did not obey --skip-external-locking. It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.