[MDEV-31311] The test innodb.log_file_size_online occasionally hangs Created: 2023-05-19 Updated: 2023-07-11 Resolved: 2023-07-03
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.9, 10.10, 10.11, 11.0, 11.1 |
| Fix Version/s: | 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | hang |
| Issue Links: |
|
| Description |
|
The test innodb.log_file_size_online occasionally hangs. One failure that I checked seemed to have an idle buf_flush_page_cleaner() thread and the SET GLOBAL thread looping in innodb_log_file_size_update(). Possibly a wake-up of the page cleaner is missing. We will need more data from a debugger in order to understand what is going on. The contents of the global data structure log_sys during the hang need to be known. |
| Comments |
| Comment by Marko Mäkelä [ 2023-06-28 ] |
|
I observed a failure of this test on an IA-32 GNU/Linux builder:
Here, the omission of cmake -DWITH_DBUG_TRACE=OFF should have contributed to some slowness, but I do not think it should have caused the hang. |
| Comment by Marko Mäkelä [ 2023-06-29 ] |
|
oleg.smirnov provided the contents of log_sys and some of buf_pool.flush_list from one hang. The latest checkpoint is slightly ahead of log_sys.resize_lsn, which might be the reason that the page cleaner is not being woken up. For some reason, the log resizing had not been completed when the previous checkpoint finished. I will have to review the logic in detail and come up with a patch, so that the page cleaner will keep running in this case. |
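That suspicion can be stated as a tiny model. Everything here is hypothetical (should_wake_cleaner is an invented name, and the real comparison in the server involves more state); it only encodes the condition described above:

```cpp
#include <cstdint>

using lsn_t = uint64_t;

// Hypothetical model of the suspected logic: the page cleaner is only
// woken while the latest checkpoint LSN has not yet reached the resize
// target. If the checkpoint ends up slightly *ahead* of resize_lsn, the
// wake-up branch is never taken, the checkpoint never advances again,
// and the resize never completes.
bool should_wake_cleaner(lsn_t checkpoint_lsn, lsn_t resize_lsn) {
  return checkpoint_lsn < resize_lsn;
}
```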
| Comment by Marko Mäkelä [ 2023-06-30 ] |
|
I tried a one-line change to wake up the page cleaner, but it did not help. Both before and after that change, we would actually have log_sys.first_lsn updated to be not earlier than log_sys.resize_lsn; that is, the resizing appeared to be sort-of completed in log_t::write_checkpoint(). oleg.smirnov is now testing a fix to log_t::write_checkpoint() that could ensure proper completion, or provide more information in case it hangs again. I tried and failed to reproduce this hang on my local AMD64 GNU/Linux system, both with and without the "fake PMEM" for the log stored in /dev/shm, and with and without the revised fix. On Buildbot, the hangs are by far most frequent on various Windows builders, but there have been some hangs on kvm-fulltest2 (a 32-bit GNU/Linux debug build), on IBM AIX, and on some FreeBSD builders. |
| Comment by Marko Mäkelä [ 2023-06-30 ] |
|
It could be that in log_t::resize_start() we had log_sys.write_lsn==log_sys.first_lsn (that is, no log had been written), or possibly exactly 511 bytes had been written (due to some background processing, such as purging some history of committed transactions after ./mtr --bootstrap). In both my GNU/Linux environment and oleg.smirnov’s Microsoft Windows environment, the log would be resized from 10MiB to 5MiB at server startup, and on my system that will cause some log to be written:
Between that point of time and the resizing to 4MiB there were no log writes in a local non-hanging rr replay trace that I analyzed. On my system, the page cleaner would be woken up by resizing:
I will provide one more patch to oleg.smirnov, to add some output to log_t::resize_start(), so that we will know more about what happens in those runs that result in a hang. If the hang is caused by there being exactly 511 bytes of log writes, the hang might go away if the test is run as follows:
It would also be useful to save a copy of var/install.db for a run that leads to a hang. It might be more easily reproducible by starting the server on a copy of such a data directory. (The bootstrap is not exactly deterministic.) |
| Comment by Marko Mäkelä [ 2023-07-03 ] |
|
oleg.smirnov ran some more tests for me. If the test is run as
then it would not hang at all. This suggests that in order for the hang to be possible, there must be some concurrent log writes (due to the purge of history of committed transactions that were run during the bootstrap). He also provided me with a copy of the data directory from the hang, the contents of log_sys, as well as some output from additional diagnostics:
In the copy of the data directory, there is a freshly created ib_logfile0 (size: 5MiB) and a being-created ib_logfile101. Both files contain identical log record payload starting at byte offset 0x3000. The 1094 bytes correspond to the difference between the latest LSN and the latest checkpoint LSN (which is also the creation LSN of both log files): 0xe5da-0xe194=1094. All pending log had been written to both files. Log resizing would not complete unless there is another log checkpoint. The log_sys.resize_lsn was 0xe194 (at the latest checkpoint), so no flushing of buffer pool pages was ever initiated, and no log checkpoint occurred. Based on my reading of log_t::write_checkpoint(), I think that the last 2 hunks of the following should fix this:
The last part of the last hunk is only there to make the error handling a little more robust. That error was never triggered during any test. |
| Comment by Marko Mäkelä [ 2023-07-03 ] |
|
What worked in the end was to invoke buf_flush_ahead() on buf_pool.get_oldest_modification(0)+1, to ensure that the checkpoint will be advanced and the resizing will eventually complete. |
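A minimal sketch of that fix, with stubbed-out stand-ins for everything except the two calls named in the comment (buf_flush_ahead and buf_pool.get_oldest_modification); the real functions live in InnoDB's buffer pool code and take more context than is modelled here:

```cpp
#include <cstdint>

using lsn_t = uint64_t;

// Minimal stand-in for the buffer pool; only the one accessor named in
// the fix is modelled.
struct BufPool {
  lsn_t oldest_dirty;   // oldest modification LSN in buf_pool.flush_list
  lsn_t get_oldest_modification(lsn_t fallback) const {
    return oldest_dirty ? oldest_dirty : fallback;
  }
};

// Stand-in for buf_flush_ahead(): record the LSN up to which the page
// cleaner was asked to flush.
static lsn_t last_flush_target = 0;
void buf_flush_ahead(lsn_t target) { last_flush_target = target; }

// The fix as described: target one byte past the oldest modification, so
// that at least one dirty page gets flushed, the checkpoint can advance
// past log_sys.resize_lsn, and the resize eventually completes instead
// of waiting forever.
void kick_resize(const BufPool &pool) {
  buf_flush_ahead(pool.get_oldest_modification(0) + 1);
}
```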