[MDEV-11044] SIGSEGV on server startup Created: 2016-10-13 Updated: 2017-07-05 Resolved: 2017-07-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.0.22, 10.0.25, 10.1.14 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Matthias Fehring | Assignee: | Marko Mäkelä |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
openSUSE Tumbleweed, Linux 4.7.6, glibc 2.24, Akonadi 5.3.1, CPU: AMD FX 6300, RAM 16GB |
||
| Attachments: |
|
| Description |
|
When trying to startup the MariaDB server for KDE Akonadi, the server crashes with a segmentation fault. |
| Comments |
| Comment by Elena Stepanova [ 2016-10-13 ] | ||||||||||||||||||
|
Could you please also attach the server error log? Thanks. | ||||||||||||||||||
| Comment by Elena Stepanova [ 2016-10-19 ] | ||||||||||||||||||
|
Copying the stack trace from the attached gdb.txt to make it searchable in JIRA:
| ||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2016-10-28 ] | ||||||||||||||||||
|
Is it possible that you run out-of-memory ? | ||||||||||||||||||
| Comment by Matthias Fehring [ 2016-10-29 ] | ||||||||||||||||||
|
No, i don't think so. There is enough free memory and it crashes directly.
| ||||||||||||||||||
| Comment by Marko Mäkelä [ 2016-11-23 ] | ||||||||||||||||||
|
Could the undo log page be corrupted, causing the "len" parameter to be very big?
The last statement assumes innodb_page_size=16k (asking to dump 0x4000 bytes from undo_rec-0x1008). Use the last statement only if the previous one does not work. | ||||||||||||||||||
| Comment by Matthias Fehring [ 2016-11-23 ] | ||||||||||||||||||
|
| ||||||||||||||||||
| Comment by Marko Mäkelä [ 2016-11-28 ] | ||||||||||||||||||
|
Thanks for the update. At byte 0x1008 in undo_page.bin there are 16 zero bytes before what looks like the start of an undo log record. Let me try to decode the page dump.
It could be easiest for me to get the data files (at least the InnoDB system tablespace files and the ib_logfile*) to repeat the problem myself. I understand if you do not want to disclose confidential data. | ||||||||||||||||||
| Comment by Matthias Fehring [ 2016-12-06 ] | ||||||||||||||||||
|
All that happend after a system crash caused by a game (X:Rebirth). As the storage contains information about private and business e-mails I am not willing and legally not allowed to disclose the data.
| ||||||||||||||||||
| Comment by Marko Mäkelä [ 2016-12-20 ] | ||||||||||||||||||
|
Thanks, Matthias! If there was a system crash that was resolved by forcibly switching the power off, it is possible that we are witnessing the results of another part of the hardware/firmware/software stack failing. Maybe fsync() is not working on the hardware as expected, and some writes are being reordered, causing a situation where some of the recovered InnoDB data pages correspond to a wrong logical time (too old or too new LSN). Admittedly, it is unlikely that a Linux kernel hang triggered by a GPU intensive game would cause the hard disk firmware to hang. Perhaps the system was not completely frozen, and it was just in the middle of performing a write when the power was switched off? How were the configuration parameters innodb_doublewrite, innodb_flush_method, innodb_checksum_algorithm, innodb_checksums set? Were any non-default InnoDB or XtraDB configuration parameters specified? I did some more code review. It seems to me that purge should sequentially read the undo logs of each purgeable transaction. The wrong purge_sys->offset (4104 == 0x1008) should actually have been read from the same page, at some earlier address than 0x1008. The next-record pointer is in the first two bytes of the undo log record, and it must not be past the currently marked end of the page; see trx_undo_page_get_next_rec(). It seems that trx_undo_page_get_next_rec() can start reading an undo page from an arbitrary offset (purge_sys->hdr_offset) that we could have read as corrupted from some other page. There are quite a few bytes "0x10 0x08" in the undo log page, and of course it is also possible that purge_sys->hdr_offset was directly initialized as 0x1008 from some other page. With the available data, it is not possible for me to debug this further. If you still have the data and are willing to investigate, it would be interesting to see which undo pages were accessed by trx0purge.cc before the corruption was noticed. That could then allow us to determine the cause of the corruption. Because I do not remember seeing anything like this while working on MySQL and InnoDB since 2003, I would tend to believe that the problem lies outside InnoDB. That said, I am not fully familiar of the changes made in XtraDB or MariaDB. | ||||||||||||||||||
| Comment by Matthias Fehring [ 2016-12-20 ] | ||||||||||||||||||
|
Hello Marko, I attached the complete config of the server instance as it is returned by the --verbose --help option. I still have the data available. What futher information can I provide? | ||||||||||||||||||
| Comment by Marko Mäkelä [ 2017-01-10 ] | ||||||||||||||||||
|
Hello Matthias, The page checksum of the attached file undo_page.bin (which was extracted from gdb) is valid, which suggests that there was no recent redo log activity for the page, or a redo log checkpoint was written by one of the startup attempts before the code crashed when accessing the seemingly corrupted undo log. As far as I can tell, given that you (quite understandably) do not want to share the data files for deeper analysis, the only way to move forward with this is that someone starts up the server on the data files under a debugger, setting breakpoints in InnoDB code to figure out where the corruption originally occurs. Less likely options are that someone else runs into this bug, or that you are able to create a reduced test case for repeating the problem. One more thing that you could perhaps try is to check the diagnostics of the storage medium. Does smartctl -a report any serious errors, such as sector relocation? I am thinking of a possible chain of events like this: The server crashed. On the first crash recovery attempt (applying the redo log), reading the undo log page failed, and the hard drive substituted an empty page (filled with zeros). This would pass the InnoDB page checksum. Then, the redo log records would be applied and the page would be rewritten, except for some ‘holes’ that were not covered by the redo log records. Also, I wonder if there is some tool like memtest86 which would test the reliability of file systems and the block device. Running memtest86 or similar could also be a good idea, because InnoDB page checksums are only calculated immediately before writing a block back to disk. A low-probability RAM corruption caused by unreliable hardware could go unnoticed for a long time. | ||||||||||||||||||
| Comment by Matthias Fehring [ 2017-06-28 ] | ||||||||||||||||||
|
While using Thunderbird since this issue happened to my Kontact MariaDB database, I totally forgot about this issue report. I checked the SMART values of my storage device and everything is fine with it. I also die an overnight memtest86 run to test my system memory, also without any issues. To be able to use KDE's Kontact again, I deleted the old data (including the corrupted database) an redownloaded the emails from the IMAP server. So I am sorry not being able to help to resolve this anymore. Maybe it was a weird combination of different circumstances that led tot this situation. I think this can be closed as WONTFIX or something similar. Thanks to all for your efforts. |