[MDEV-4867] Incomplete Binlog writes on disk-full error, also breaking slave replication Created: 2013-08-09 Updated: 2022-09-08 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 10.0.3, 5.5.32, 5.3.12 |
| Fix Version/s: | 5.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Arjen Lentz | Assignee: | Michael Widenius |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Linux, probably any |
||
| Description |
|
It appears the binlog and replication code still has write calls that aren't checked for successful completion. On disk full, the binlog gets truncated. As a direct consequence and example, a slave will report this error: Obviously it's not just a problem with replication, but also for having correct binlogs for point-in-time recovery. Suggested solution: writes should check for success, and in case of a disk-full error, report and retry correctly. The "wait until space has been freed up" approach that's used on other disk-full conditions, would be fine here also. |
| Comments |
| Comment by Elena Stepanova [ 2013-09-01 ] |
|
Hi, I'm getting "wait until" when the binary log cannot be written (on 5.5 and on 10.0, didn't check 5.3): 130902 0:25:22 [Warning] mysqld: Disk is full writing '/data/smalldisk/master-bin.000001' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space) I do the following:
Is there anything special in the use case you described that I've missed? |
| Comment by Martijn Otto [ 2013-09-02 ] |
|
We recently ran into exactly this problem too. The server does indeed continue to write to the binlog after free space has been created again, however that does leave it with a broken binlog. Say you are trying to write 1024 bytes to the binlog, but only 512 bytes are available. 512 bytes will be written, after which the disk is full and the message is written to the error log. After creating free space, the remaining 512 bytes are not written to the file. Instead, the next binlog entry is written. Any slaves replicating this server will notice this and stop replicating. The only way to get them to work again is to guess at which position the binlog continues and set that as the new master_log_pos. |
| Comment by Elena Stepanova [ 2013-09-02 ] |
|
I see, thanks for the info, I will check it out. |
| Comment by Elena Stepanova [ 2013-09-16 ] |
|
In my manual tests on the current 5.5 tree the rest of the event (or the event group) was written to the binary log after the space was freed.
I tried it with RBR (atomic transactions) and with SBR (atomic and multi-statement transactions). Possibly the initially described erroneous behavior somehow depends on the event type(s) and their contents, or on server startup parameters, or on the file system, or it's caused by a specific combination of free/required disk space, or it is only possible in the concurrent scenario. I'm assigning this to Kristian, maybe he'll be able to see the reason of the problem by code inspection (when he has a chance to inspect it). |
| Comment by Michael Widenius [ 2014-03-23 ] |
|
Elena did all the rights steps to try to reproduce this, so there is not much more I can do regarding testing. I took a look at all code in log.cc and related code but could not find anything wrong. The logic for file full is quite simply and is the same for any event: I am reasonable confident in the my_write() code, so the error is very likely somewhere else. I would however need more data to be able to find out where. If you could upload the last 100K of the binary log where things went wrong to the private folder at ftp://ftp.askmonty.org, then we would have more to go on to find and fix this. |
| Comment by Michael Widenius [ 2014-03-23 ] |
|
I have changed the priority of this to "minor" until we find a way to repeat it or more information of what could be different from our test and the system where this happens. |
| Comment by Arjen Lentz (Inactive) [ 2014-03-23 ] |
|
Thanks Monty Regards -------- Original message -------- Michael Widenius commented on MDEV-4867: Elena did all the rights steps to try to reproduce this, so there is not much more I can do regarding testing. I took a look at all code in log.cc and related code but could not find anything wrong. The logic for file full is quite simply and is the same for any event: I am reasonable confident in the my_write() code, so the error is very likely somewhere else. I would however need more data to be able to find out where. If you could upload the last 100K of the binary log where things went wrong to the private folder at ftp://ftp.askmonty.org, then we would have more to go on to find and fix this. – |