Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27803

MariaDB binlog corruption when "No space left on device" and stuck session killed by client

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.2(EOL), 10.6
    • 10.6
    • Replication
    • None

    Description

      MariaDB server should be recoverable during storage-full condition, but in following condition the binlog corrupted during storage-full (0 space on disk). Could cause binlog replay failure, and replication failure.

      The issue is reproducible in 10.6.5 and was also seen in 10.2.40.

      Issue description:

      When MariaDB server runs out of storage, it fails to write binlog file because of "No space left on device". At this time, the server is still running.

      2022-02-10 23:51:28 6 [Warning] mysqld: Disk is full writing '/data/log/binlog/mysql-bin-changelog.000001' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
      2022-02-10 23:51:28 6 [Warning] mysqld: Retry in 60 secs. Message reprinted in 600 secs
      

      As it will keep retrying the binlog writing, it suppose to recover after releasing some storage or adding more storage.

      However in this condition, if the stuck session is killed by the client, the binlog writing will break and couldn't recover.

      After the binlog corrupted:

      1. new insert query will fail with errors of errno: 11 "Resource temporarily unavailable")

        MariaDB [(none)]> INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
        2022-02-10 23:55:46 11 [ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")
        

      2. Using mysqlbinlog to parse the problematic binlog file will fail with following error:

        # /usr/local/mysql/bin/mysqlbinlog /data/log/binlog/mysql-bin-changelog.000001 > /tmp/0001
        ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 673207109, event_type: 32
        

      The only way to recovery in this scenario is to restart MariaDB server. However it left the problematic binlog and it can't be replayed. If there're replicas, replication will also fail because of it:

      Last_IO_Error: Relay log write failure: could not queue event from master
      

      How to reproduce:

      The issue can be reproduced with following steps, using source code https://github.com/MariaDB/server/tree/mariadb-10.6.5/, and building/installing on AWS EC2 instance.

      1. Create an EC2 instance of Amazon Linux 2, add a EBS volume storage of 1 GB. Mount the volume to `/data` on instance:

        sudo su -
        fdisk -l
        mkdir /data
        mkfs.ext4 /dev/nvme1n1
        mount /dev/nvme1n1 /data
        

      2. Checkout MariaDB 10.6.5, build and install. Here's the parameters and commands I used:

        yum-builddep -y mariadb-server
        yum install -y git gcc gcc-c++ bison libxml2-devel libevent-devel rpm-build
        git clone https://github.com/MariaDB/server.git --branch mariadb-10.6.5 --depth 1
        cd server && cmake . && make -j `nproc` &&make install
        

      3. Prepare

        pkill mysqld && sleep 1
        sudo rm -rf /data/*
        sudo mkdir -p /data/log/binlog /data/log/error /data/log/innodb/ /data/db/innodb/ /data/tmp/
        sudo chown `whoami`:`whoami` /data -R
        

      4. Init the DB and start with `--log-bin` in background

        sudo /usr/local/mysql/scripts/mysql_install_db \
         --no-defaults --user=`whoami` --datadir=/data/db --basedir=/usr/local/mysql/ --innodb-log-group-home-dir /data/log/innodb --innodb-log-file-size 134217728 \
         --auth-root-authentication-method=normal --force --skip-name-resolve --skip-test-db --cross-bootstrap --innodb-data-home-dir /data/db/innodb
         
        /usr/local/mysql/bin/mysqld --no-defaults --user=`whoami` --datadir=/data/db --basedir=/usr/local/mysql/ --innodb-log-group-home-dir /data/log/innodb --innodb-log-file-size 134217728 --innodb-data-home-dir /data/db/innodb  \
        --binlog_cache_size 32768  --binlog_format MIXED   --max_binlog_size 134217728   --sync-binlog 1 --log-bin='/data/log/binlog/mysql-bin-changelog'  --skip-grant-tables --server_id=2 &
        

      5. Connect to the database and create test db/table:

        /usr/local/mysql/bin/mysql -e "\
        create database t; \
        CREATE TABLE t.t1 (a INT, b MEDIUMTEXT) ENGINE=Innodb; \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');"
        

      6. Fill up the storage

        let size1=`df /dev/nvme1n1 | sed -n '2p' | awk '{print $4}'`*1024
        fallocate -l $size1 /data/1
        cat /data/1 > /data/2
        

      7. Keep inserting a few times until query stuck

        /usr/local/mysql/bin/mysql -e " \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); \
        INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');"
        

      8. At this time, check the mysql-error log should contain following lines

        2022-02-09 21:22:30 31 [Warning] mysqld: Disk is full writing '/data/log/binlog/mysql-bin-changelog.000011' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
        2022-02-09 21:22:30 31 [Warning] mysqld: Retry in 60 secs. Message reprinted in 600 secs
        

      9. Show processlist shows query state is "Commit":

        MariaDB [(none)]> show full processlist;
        +----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+
        | Id | User | Host      | db   | Command | Time | State    | Info                                                                                                                       | Progress |
        +----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+
        | 11 | root | localhost | NULL | Query   |   48 | Commit   | INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') |    0.000 |
        | 12 | root | localhost | NULL | Query   |    0 | starting | show full processlist                                                                                                      |    0.000 |
        +----+------+-----------+------+---------+------+----------+----------------------------------------------------------------------------------------------------------------------------+----------+
        2 rows in set (0.000 sec)
        

      10. Kill the stuck client using ctrl+c

        MariaDB [(none)]> INSERT INTO t.t1 VALUES (1, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
        ^CCtrl-C -- query killed. Continuing normally.
        ^CCtrl-C -- query killed. Continuing normally.
        ERROR 2013 (HY000): Lost connection to MySQL server during query
        MariaDB [(none)]> Ctrl-C -- exit!
        Aborted
        

      11. At this point `show processlist`shows the query "Command=Killed" and "State=Commit"

        #/usr/local/mysql/bin/mysql -e "show processlist"
        +----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+
        | Id | User | Host      | db   | Command | Time | State    | Info                                                                                                 | Progress |
        +----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+
        |  5 | root | localhost | NULL | Killed  |   22 | Commit   | INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |    0.000 |
        |  8 | root | localhost | NULL | Query   |    0 | starting | show processlist                                                                                     |    0.000 |
        +----+------+-----------+------+---------+------+----------+------------------------------------------------------------------------------------------------------+----------+
        

      12. Wait for 1 min (Time in processlist becomes 59)
        Then should see this error:

        [ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 28 "No space left on device")
        

      13. release storage by rm /data/1

        [21:24:54][root][~]$ rm /data/1 
        rm: remove regular file '/data/1'? y
        

      14. Ideally at this time since there's enough storage, the binlog should be recovered.
      15. However, the binlog is already corrupted at this point:
        use mysqlbinlog to parse the binlog will see errors like:

        [root@ip-172-31-41-130 tmp]# /usr/local/mysql/bin/mysqlbinlog /data/log/binlog/mysql-bin-changelog.000001 > /tmp/0001
        ERROR: Error in Log_event::read_log_event(): 'Event truncated', data_len: 673207109, event_type: 32
        

      16. Retry the inserting it will show errors about the binlog writing.

        /usr/local/mysql/bin/mysql -e "INSERT INTO t.t1 VALUES (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')"
        ERROR 1026 (HY000) at line 1: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")
        

        Error log:

        2022-02-10 23:18:17 11 [ERROR] mysqld: Error writing file '/data/log/binlog/mysql-bin-changelog' (errno: 11 "Resource temporarily unavailable")
        

      Attachments

        Issue Links

          Activity

            People

              Elkin Andrei Elkin
              wenhug Hugo Wen
              Votes:
              2 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.