[MDEV-18009] Missing redo log flush in innodb.instant_alter_crash Created: 2018-12-14 Updated: 2019-05-09 Resolved: 2018-12-20 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Tests |
| Affects Version/s: | 10.4 |
| Fix Version/s: | 10.4.2, 10.3.15 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | recovery | ||
| Issue Links: |
|
| Description |
|
The test innodb.instant_alter_crash is supposed to flush the redo log before each time it kills the server. For some reason, it does not seem to be doing that. I was only able to repeat
This bug could be affecting earlier versions as well. |
| Comments |
| Comment by Eugene Kosov (Inactive) [ 2018-12-18 ] | ||||||||||||||||||||||||||||||||||
|
At last I have a simplified test case. I reduced it from innodb.instant_alter_crash and used a revision from before the fix for
| ||||||||||||||||||||||||||||||||||
| Comment by Eugene Kosov (Inactive) [ 2018-12-18 ] | ||||||||||||||||||||||||||||||||||
|
I ran the test with --gdb='b fsync;commands;bt;c;end;r' to watch the flushes to the redo log. This is the last flush before sleep 2:
The LSN is 157508 here. Then, while the sleep is in progress, the next flush happens:
The LSN here is 158121, which is greater than the previous 157508. | ||||||||||||||||||||||||||||||||||
| Comment by Eugene Kosov (Inactive) [ 2018-12-18 ] | ||||||||||||||||||||||||||||||||||
|
The asynchronous ALTER sleeps just before commit, and thus before fsync(), so it is expected that no flush happens when the server is killed without first sleeping for 2 seconds. I see nothing wrong in that test case. | ||||||||||||||||||||||||||||||||||
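The reasoning above can be illustrated with a toy model. This is a minimal sketch, not InnoDB code: ToyRedoLog, alter_then_kill and all member names are hypothetical, and the only values taken from this report are the two LSNs (613 is simply 158121 - 157508). It assumes that appending records only advances the in-memory LSN and that the flush happens at commit.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a write-ahead log. Hypothetical names; not InnoDB code.
struct ToyRedoLog {
    std::uint64_t lsn;          // end of the log in memory
    std::uint64_t flushed_lsn;  // end of the log that has reached disk

    explicit ToyRedoLog(std::uint64_t start) : lsn(start), flushed_lsn(start) {}

    // Appending records only advances the in-memory LSN.
    void append(std::uint64_t bytes) {
        assert(bytes > 0);
        lsn += bytes;
    }

    // Stands in for write + fsync(): everything up to lsn becomes durable.
    void flush() { flushed_lsn = lsn; }

    // After a kill, recovery can only see what was flushed.
    std::uint64_t recoverable_after_kill() const { return flushed_lsn; }
};

// The ALTER writes its records, then the server is killed either
// before or after the commit that would have flushed them.
std::uint64_t alter_then_kill(bool commit_before_kill) {
    ToyRedoLog log(157508);  // LSN at the last flush before the sleep
    log.append(613);         // ALTER records: 157508 + 613 = 158121
    if (commit_before_kill) {
        log.flush();         // in this model, commit triggers the flush
    }
    return log.recoverable_after_kill();
}
```

In this model, alter_then_kill(false) leaves recovery at LSN 157508, while alter_then_kill(true) reaches 158121, which matches the observation that the second flush only appears once the sleep gives the ALTER time to commit.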
| Comment by Eugene Kosov (Inactive) [ 2018-12-18 ] | ||||||||||||||||||||||||||||||||||
|
After discussing this with marko, I changed my test case to execute a DML statement right before the kill. Its purpose is to flush the redo log, specifically the entries of the not yet committed ALTER TABLE. It must call fsync(), and it does. Everything works as expected, so I do not have a small test case now. But in instant_alter_crash I can see the real problem: fsync() is not called for the DML. Here is how it should look:
But in instant_alter_crash, trx_flush_log_if_needed() is never called because trx->must_flush_log_later is somehow false here:
At first glance the condition looks suspicious: it skips the flush when no later flush is required, yet "not later" should mean "flush now"? | ||||||||||||||||||||||||||||||||||
| Comment by Eugene Kosov (Inactive) [ 2018-12-18 ] | ||||||||||||||||||||||||||||||||||
|
DELETE FROM t1 has nothing to do because t1 is empty. That is why it does not write anything to the redo log, and so it does not flush it either. | ||||||||||||||||||||||||||||||||||
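The effect of the empty DELETE can be sketched in the same spirit. This is a simplified model under stated assumptions, not the server's actual logic: ToyLog, run_dml_and_commit, alter_records_durable and the byte counts are all hypothetical. The model assumes that a commit flushes the log only when the statement itself wrote redo records, and that a flush covers everything written so far, including records of other, uncommitted transactions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy log; not InnoDB code.
struct ToyLog {
    std::uint64_t lsn;          // end of the log in memory
    std::uint64_t flushed_lsn;  // end of the log on disk
};

constexpr std::uint64_t kBytesPerRow = 40;  // arbitrary illustrative size

// A DML statement that changes `rows_changed` rows and then commits.
// Model assumption: the commit flushes only if the statement wrote redo.
void run_dml_and_commit(ToyLog& log, unsigned rows_changed) {
    const std::uint64_t before = log.lsn;
    log.lsn += rows_changed * kBytesPerRow;
    if (log.lsn != before) {
        log.flushed_lsn = log.lsn;  // flush covers all earlier records too
    }
    assert(log.flushed_lsn <= log.lsn);
}

// Are the records of the uncommitted ALTER TABLE on disk when the
// server is killed right after the DML?
bool alter_records_durable(unsigned rows_changed_by_dml) {
    ToyLog log{1000, 1000};
    log.lsn += 500;                         // uncommitted ALTER TABLE
    const std::uint64_t alter_end = log.lsn;
    run_dml_and_commit(log, rows_changed_by_dml);
    return log.flushed_lsn >= alter_end;
}
```

With zero changed rows, as for a DELETE on an empty table, alter_records_durable(0) is false; with at least one changed row it is true. Under these assumptions, the statement issued before the kill has to modify at least one row for the flush to take place.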
| Comment by Marko Mäkelä [ 2019-05-09 ] | ||||||||||||||||||||||||||||||||||
|
I backported the adjustment of the test to 10.3.15 and applied a minor follow-up fix. |