[MDEV-27803] MariaDB binlog corruption when "No space left on device" and stuck session killed by client - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2(EOL), 10.6
Fix Version/s: 10.6
Component/s: Replication
Labels:
None

Description

MariaDB server should be recoverable during storage-full condition, but in following condition the binlog corrupted during storage-full (0 space on disk). Could cause binlog replay failure, and replication failure.

The issue is reproducible in 10.6.5 and was also seen in 10.2.40.

Issue description:

When MariaDB server runs out of storage, it fails to write binlog file because of "No space left on device". At this time, the server is still running.

2022-02-10 23:51:28 6 [Warning] mysqld: Disk is full writing '/data/log/binlog/mysql-bin-changelog.000001' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)

2022-02-10 23:51:28 6 [Warning] mysqld: Retry in 60 secs. Message reprinted in 600 secs

As it will keep retrying the binlog writing, it suppose to recover after releasing some storage or adding more storage.

However in this condition, if the stuck session is killed by the client, the binlog writing will break and couldn't recover.

After the binlog corrupted:

new insert query will fail with errors of errno: 11 "Resource temporarily unavailable")

MariaDB [(none)]> INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');

2022-02-10 23:55:46 11 [ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")

Using mysqlbinlog to parse the problematic binlog file will fail with following error:

# /usr/local/mysql/bin/mysqlbinlog /data/log/binlog/mysql-bin-changelog.000001 > /tmp/0001

ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 673207109, event_type: 32

The only way to recovery in this scenario is to restart MariaDB server. However it left the problematic binlog and it can't be replayed. If there're replicas, replication will also fail because of it:

Last_IO_Error: Relay log write failure: could not queue event from master

How to reproduce:

The issue can be reproduced with following steps, using source code https://github.com/MariaDB/server/tree/mariadb-10.6.5/, and building/installing on AWS EC2 instance.

Create an EC2 instance of Amazon Linux 2, add a EBS volume storage of 1 GB. Mount the volume to `/data` on instance:

sudo su -

fdisk -l

mkdir /data

mkfs.ext4 /dev/nvme1n1

mount /dev/nvme1n1 /data

Checkout MariaDB 10.6.5, build and install. Here's the parameters and commands I used:

yum-builddep -y mariadb-server

yum install -y git gcc gcc-c++ bison libxml2-devel libevent-devel rpm-build

git clone https://github.com/MariaDB/server.git --branch mariadb-10.6.5 --depth 1

cd server && cmake . && make -j `nproc` &&make install

Prepare

pkill mysqld && sleep 1

sudo rm -rf /data/*

sudo mkdir -p /data/log/binlog /data/log/error /data/log/innodb/ /data/db/innodb/ /data/tmp/

sudo chown `whoami`:`whoami` /data -R

Init the DB and start with `--log-bin` in background

sudo /usr/local/mysql/scripts/mysql_install_db \

 --no-defaults --user=`whoami` --datadir=/data/db --basedir=/usr/local/mysql/ --innodb-log-group-home-dir /data/log/innodb --innodb-log-file-size 134217728 \

 --auth-root-authentication-method=normal --force --skip-name-resolve --skip-test-db --cross-bootstrap --innodb-data-home-dir /data/db/innodb

/usr/local/mysql/bin/mysqld --no-defaults --user=`whoami` --datadir=/data/db --basedir=/usr/local/mysql/ --innodb-log-group-home-dir /data/log/innodb --innodb-log-file-size 134217728 --innodb-data-home-dir /data/db/innodb  \

--binlog_cache_size 32768  --binlog_format MIXED   --max_binlog_size 134217728   --sync-binlog 1 --log-bin='/data/log/binlog/mysql-bin-changelog'  --skip-grant-tables --server_id=2 &

Connect to the database and create test db/table:

/usr/local/mysql/bin/mysql -e "\

create database t; \

CREATE TABLE t.t1 (a INT, b MEDIUMTEXT) ENGINE=Innodb; \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');"

Fill up the storage

let size1=`df /dev/nvme1n1 | sed -n '2p' | awk '{print $4}'`*1024

fallocate -l $size1 /data/1

cat /data/1 > /data/2

Keep inserting a few times until query stuck

/usr/local/mysql/bin/mysql -e " \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \

INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');"

At this time, check the mysql-error log should contain following lines

2022-02-09 21:22:30 31 [Warning] mysqld: Disk is full writing '/data/log/binlog/mysql-bin-changelog.000011' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)

2022-02-09 21:22:30 31 [Warning] mysqld: Retry in 60 secs. Message reprinted in 600 secs

Show processlist shows query state is "Commit":

MariaDB [(none)]> show full processlist;

+----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+

| Id | User | Host      | db   | Command | Time | State    | Info                                                                                                                       | Progress |

+----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+

| 11 | root | localhost | NULL | Query   |   48 | Commit   | INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') |    0.000 |

| 12 | root | localhost | NULL | Query   |    0 | starting | show full processlist                                                                                                      |    0.000 |

+----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+

2 rows in set (0.000 sec)

Kill the stuck client using ctrl+c

MariaDB [(none)]> INSERT INTO t.t1 VALUES (1, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");

^CCtrl-C -- query killed. Continuing normally.

^CCtrl-C -- query killed. Continuing normally.

ERROR 2013 (HY000): Lost connection to MySQL server during query

MariaDB [(none)]> Ctrl-C -- exit!

Aborted

At this point `show processlist`shows the query "Command=Killed" and "State=Commit"

#/usr/local/mysql/bin/mysql -e "show processlist"

+----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+

| Id | User | Host      | db   | Command | Time | State    | Info                                                                                                 | Progress |

+----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+

|  5 | root | localhost | NULL | Killed  |   22 | Commit   | INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |    0.000 |

|  8 | root | localhost | NULL | Query   |    0 | starting | show processlist                                                                                     |    0.000 |

+----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+

Wait for 1 min (Time in processlist becomes 59)
Then should see this error:

[ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 28 "No space left on device")

release storage by rm /data/1

[21:24:54][root][~]$ rm /data/1

rm: remove regular file '/data/1'? y

Ideally at this time since there's enough storage, the binlog should be recovered.

However, the binlog is already corrupted at this point:
use mysqlbinlog to parse the binlog will see errors like:

[root@ip-172-31-41-130 tmp]# /usr/local/mysql/bin/mysqlbinlog /data/log/binlog/mysql-bin-changelog.000001 > /tmp/0001

ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 673207109, event_type: 32

Retry the inserting it will show errors about the binlog writing.

/usr/local/mysql/bin/mysql -e "INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')"

ERROR 1026 (HY000) at line 1: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")

Error log:

2022-02-10 23:18:17 11 [ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")

Attachments

Issue Links

relates to

MDEV-14462 Confusing error message: ib_logfiles are too small for innodb_thread_concurrency=0

Closed

MDEV-27436 binlog corruption (/tmp no space left on device at the same moment)

Closed

MariaDB Server

MariaDB binlog corruption when "No space left on device" and stuck session killed by client

Details

Description

Issue description:

How to reproduce:

Attachments

Issue Links

Activity

People

Dates

Git Integration