Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23303

deadlocks after upgrade from 10.4.12 to 10.5.4

Details

    Description

      A couple of weeks ago we have upgraded our MariaDB + Galera (3 nodes) cluster to the last stable version - 10.5.4. In our configuration, all writes go to 1 node only. And we started facing deadlocks (gap locking) issue on that node. Nothing was changed from the DB side except upgrade and nothing was changed from the application side.

      Could you please advise on this and please let me know what additional information will you require for further investigation?

      Thank you.

      Attachments

        1. galera.cnf
          5 kB
        2. galera_error_log_part.log
          274 kB
        3. galera_error.log_2node.gz
          4.17 MB
        4. galera_error.log_3node.gz
          4.88 MB
        5. screenshot-1.png
          screenshot-1.png
          1.23 MB

        Activity

          sun4ezzz Aleksandr Omelchuk created issue -
          elenst Elena Stepanova made changes -
          Field Original Value New Value
          Component/s Galera [ 10124 ]
          Fix Version/s 10.5 [ 23123 ]
          Assignee Jan Lindström [ jplindst ]
          sun4ezzz Aleksandr Omelchuk made changes -
          Description A couple of weeks ago we have upgraded our MariaDB + Galera (3 nodes) cluster to the last stable version - 10.5.4. And we started facing deadlocks (gap locking) issue. Nothing was changed from the DB side except upgrade and nothing was changed from the application side.

          Could you please advise on this and please let me know what additional information will you require for further investigation?

          Thank you.
          A couple of weeks ago we have upgraded our MariaDB + Galera (3 nodes) cluster to the last stable version - 10.5.4. In our configuration, all writes go to 1 node only. And we started facing deadlocks (gap locking) issue on that node. Nothing was changed from the DB side except upgrade and nothing was changed from the application side.

          Could you please advise on this and please let me know what additional information will you require for further investigation?

          Thank you.

          You could start sharing your config files and error logs but you must be aware that gap locks are normal on InnoDB and they are needed for consistency. 10.5 does not anymore support innobase_locks_unsafe_for_binlog so you may need to use READ COMMITTED transaction isolation level instead on some cases or you need to handle deadlocks because of gap locking in your application.

          jplindst Jan Lindström (Inactive) added a comment - You could start sharing your config files and error logs but you must be aware that gap locks are normal on InnoDB and they are needed for consistency. 10.5 does not anymore support innobase_locks_unsafe_for_binlog so you may need to use READ COMMITTED transaction isolation level instead on some cases or you need to handle deadlocks because of gap locking in your application.
          jplindst Jan Lindström (Inactive) made changes -
          Labels upgrade need_feedback upgrade
          sun4ezzz Aleksandr Omelchuk made changes -
          Attachment galera_error_log_part.log [ 52980 ]
          Attachment galera.cnf [ 52981 ]

          Thank you for the update.
          Attached to the ticket our config file and part from error log before upgrade, during upgrade and with "innodb_print_all_deadlocks" enabled.

          As for the gap locks, we are aware of that mechanism and we use READ COMMITTED isolation level in our transactions. (We also tried to set the "READ COMMITTED" for the whole DB but that didn't help).
          As I mentioned in the description we used the same configuration from the DB side and application side before and after the upgrade. And only after the upgrade the deadlocks start happening.

          Please take a look.
          Thank you.

          sun4ezzz Aleksandr Omelchuk added a comment - Thank you for the update. Attached to the ticket our config file and part from error log before upgrade, during upgrade and with "innodb_print_all_deadlocks" enabled. As for the gap locks, we are aware of that mechanism and we use READ COMMITTED isolation level in our transactions. (We also tried to set the "READ COMMITTED" for the whole DB but that didn't help). As I mentioned in the description we used the same configuration from the DB side and application side before and after the upgrade. And only after the upgrade the deadlocks start happening. Please take a look. Thank you.

          Can you provide show create table `stock`.`stocks`

          jplindst Jan Lindström (Inactive) added a comment - Can you provide show create table `stock`.`stocks`

          MariaDB [stock]> show create table stocks\G
          *************************** 1. row ***************************
                 Table: stocks
          Create Table: CREATE TABLE `stocks` (
            `stock_key` varchar(100) NOT NULL,
            `last_update` datetime NOT NULL,
            `location_id` varchar(50) NOT NULL,
            `product_id` varchar(50) NOT NULL,
            `stock_data` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`stock_data`)),
            `available_for_sale` decimal(15,5) NOT NULL DEFAULT 0.00000,
            `selling_scheme` varchar(15) NOT NULL DEFAULT 'GENERAL',
            `loc_type` varchar(15) DEFAULT NULL,
            `stock_source_key` varchar(100) DEFAULT NULL,
            PRIMARY KEY (`stock_key`),
            KEY `fk_stock_source_key_stock_key` (`stock_source_key`),
            KEY `idx_location_id` (`location_id`),
            CONSTRAINT `fk_stock_source_key_stock_key` FOREIGN KEY (`stock_source_key`) REFERENCES `stocks` (`stock_key`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1
          

          sun4ezzz Aleksandr Omelchuk added a comment - MariaDB [stock]> show create table stocks\G *************************** 1. row *************************** Table: stocks Create Table: CREATE TABLE `stocks` ( `stock_key` varchar(100) NOT NULL, `last_update` datetime NOT NULL, `location_id` varchar(50) NOT NULL, `product_id` varchar(50) NOT NULL, `stock_data` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`stock_data`)), `available_for_sale` decimal(15,5) NOT NULL DEFAULT 0.00000, `selling_scheme` varchar(15) NOT NULL DEFAULT 'GENERAL', `loc_type` varchar(15) DEFAULT NULL, `stock_source_key` varchar(100) DEFAULT NULL, PRIMARY KEY (`stock_key`), KEY `fk_stock_source_key_stock_key` (`stock_source_key`), KEY `idx_location_id` (`location_id`), CONSTRAINT `fk_stock_source_key_stock_key` FOREIGN KEY (`stock_source_key`) REFERENCES `stocks` (`stock_key`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          Table has self-reference foreign key. Any write operation (INSERT, UPDATE, DELETE) on table stocks might require row lock also to rows parent if one exists. For INSERTS we naturally need to make sure that primary key column remains unique. GAP-locks are also necessary and it seems that 10.5 might be stricter on rules at least Galera case.

          jplindst Jan Lindström (Inactive) added a comment - Table has self-reference foreign key. Any write operation (INSERT, UPDATE, DELETE) on table stocks might require row lock also to rows parent if one exists. For INSERTS we naturally need to make sure that primary key column remains unique. GAP-locks are also necessary and it seems that 10.5 might be stricter on rules at least Galera case.
          jplindst Jan Lindström (Inactive) made changes -
          Labels need_feedback upgrade upgrade

          Based on provided information I do not think there is a Bug, just a deadlock situation that needs to be addressed on application logic.

          jplindst Jan Lindström (Inactive) added a comment - Based on provided information I do not think there is a Bug, just a deadlock situation that needs to be addressed on application logic.
          jplindst Jan Lindström (Inactive) made changes -
          Fix Version/s N/A [ 14700 ]
          Fix Version/s 10.5 [ 23123 ]
          Resolution Not a Bug [ 6 ]
          Status Open [ 1 ] Closed [ 6 ]

          Thank you for the update.
          So we will need to rollback our upgrade till we found a solution for our application.

          Could you please provide me information about that stricter rules for Galera case? Cause I didn't find any information in Change log regarding the new behavior for deadlocks.

          sun4ezzz Aleksandr Omelchuk added a comment - Thank you for the update. So we will need to rollback our upgrade till we found a solution for our application. Could you please provide me information about that stricter rules for Galera case? Cause I didn't find any information in Change log regarding the new behavior for deadlocks.

          We performed a rollback to version 10.4.12 and deadlocks gone.
          I agree that deadlocks definitely depend on application logic but the main thing here is - what have been changed in 10.5 version for that case?
          Why there is no information in Change Log regarding the new behavior?
          Is it possible that there are other undocumented changes in 10.5 version?

          Thank you.

          sun4ezzz Aleksandr Omelchuk added a comment - We performed a rollback to version 10.4.12 and deadlocks gone. I agree that deadlocks definitely depend on application logic but the main thing here is - what have been changed in 10.5 version for that case? Why there is no information in Change Log regarding the new behavior? Is it possible that there are other undocumented changes in 10.5 version? Thank you.

          With current information it must be innodb_locks_unsafe_for_binlog see https://mariadb.com/kb/en/changes-improvements-in-mariadb-105/

          jplindst Jan Lindström (Inactive) added a comment - With current information it must be innodb_locks_unsafe_for_binlog see https://mariadb.com/kb/en/changes-improvements-in-mariadb-105/

          We have never used that variable. And it is set to 0 by default.
          We always use READ COMMITTED isolation level.
          This is not the cause.

          Kindly please take a look at this issue.
          Thank you.

          sun4ezzz Aleksandr Omelchuk added a comment - We have never used that variable. And it is set to 0 by default. We always use READ COMMITTED isolation level. This is not the cause. Kindly please take a look at this issue. Thank you.

          Is it possible to reopen this ticket?

          sun4ezzz Aleksandr Omelchuk added a comment - Is it possible to reopen this ticket?
          jplindst Jan Lindström (Inactive) made changes -
          Resolution Not a Bug [ 6 ]
          Status Closed [ 6 ] Stalled [ 10000 ]

          I do not see isolation level setting on your configuration file. Can you provide full unedited error log from all nodes. Do you use only one master?

          jplindst Jan Lindström (Inactive) added a comment - I do not see isolation level setting on your configuration file. Can you provide full unedited error log from all nodes. Do you use only one master?

          If I correctly understand the first deadlock on provided error log it is caused by fact that two concurrent INSERT-clauses insert a row so that we take gap-lock to primary key row with stock_key 82736280_1.

          jplindst Jan Lindström (Inactive) added a comment - If I correctly understand the first deadlock on provided error log it is caused by fact that two concurrent INSERT-clauses insert a row so that we take gap-lock to primary key row with stock_key 82736280_1.
          sun4ezzz Aleksandr Omelchuk made changes -
          Attachment galera_error.log_2node.gz [ 53241 ]
          Attachment galera_error.log_3node.gz [ 53242 ]
          sun4ezzz Aleksandr Omelchuk made changes -
          Attachment screenshot-1.png [ 53245 ]
          sun4ezzz Aleksandr Omelchuk added a comment - - edited

          1. Yes, we have set up MariaDB connector in the sequential mode for writes, so all our writes go to 1st node as default and there we faced with deadlocks.
          2. We set isolation level inside the transaction (but once we faced with the issue we tried to set on DB level and that didn't help).
          3. Attached the logs from 2 other nodes.
          4. As for the first deadlock in error log, there are 2 inserts which was trying to put data with different PKs, but looks like locked with the same key (82736280_1).

          sun4ezzz Aleksandr Omelchuk added a comment - - edited 1. Yes, we have set up MariaDB connector in the sequential mode for writes, so all our writes go to 1st node as default and there we faced with deadlocks. 2. We set isolation level inside the transaction (but once we faced with the issue we tried to set on DB level and that didn't help). 3. Attached the logs from 2 other nodes. 4. As for the first deadlock in error log, there are 2 inserts which was trying to put data with different PKs, but looks like locked with the same key (82736280_1).

          Yes, it does lock same key using gap-lock but that is normal in this kind of self reference. I really do not see anything that would indicate a bug. We do not have any change on locking rules between 10.4 and 10.5.

          jplindst Jan Lindström (Inactive) added a comment - Yes, it does lock same key using gap-lock but that is normal in this kind of self reference. I really do not see anything that would indicate a bug. We do not have any change on locking rules between 10.4 and 10.5.

          Understood. But is it normal with any type of isolation level?
          We are not facing with that deadlocks currently on 10.4.12 and 10.4.13 versions.
          How can we proceed further with the investigation?

          Thank you.

          sun4ezzz Aleksandr Omelchuk added a comment - Understood. But is it normal with any type of isolation level? We are not facing with that deadlocks currently on 10.4.12 and 10.4.13 versions. How can we proceed further with the investigation? Thank you.

          As from documentation https://mariadb.com/kb/en/innodb-lock-modes:

          Gap locks are disabled if the innodb_locks_unsafe_for_binlog system variable is set, or the isolation level is set to READ COMMITTED.
          

          So we use READ COMMITTED isolation level and as expected we don't face with the gap locking in 10.4. But something goes wrong in 10.5.
          Please take a look.

          sun4ezzz Aleksandr Omelchuk added a comment - As from documentation https://mariadb.com/kb/en/innodb-lock-modes: Gap locks are disabled if the innodb_locks_unsafe_for_binlog system variable is set, or the isolation level is set to READ COMMITTED. So we use READ COMMITTED isolation level and as expected we don't face with the gap locking in 10.4. But something goes wrong in 10.5. Please take a look.

          Can you repeat the problematic deadlock in 10.5 easily? If you can, can you share a repeatable test case.

          jplindst Jan Lindström (Inactive) added a comment - Can you repeat the problematic deadlock in 10.5 easily? If you can, can you share a repeatable test case.
          jplindst Jan Lindström (Inactive) made changes -
          Labels upgrade need_feedback upgrade

          We have downgraded all our environments to 10.4 and currently, we can't reproduce the issue.

          sun4ezzz Aleksandr Omelchuk added a comment - We have downgraded all our environments to 10.4 and currently, we can't reproduce the issue.

          Galera had a bug https://jira.mariadb.org/browse/MDEV-23557 that could have effect here also as code used persistent b-tree cursor after mini-transaction commit that could cause page contents to change e.g. split or merge, this naturally would be visible only with high concurrency and lot of foreign key actions.

          jplindst Jan Lindström (Inactive) added a comment - Galera had a bug https://jira.mariadb.org/browse/MDEV-23557 that could have effect here also as code used persistent b-tree cursor after mini-transaction commit that could cause page contents to change e.g. split or merge, this naturally would be visible only with high concurrency and lot of foreign key actions.

          I recommend using more recent version of the server and if this problem is still reproducible, I would need more detailed description how to repeat.

          jplindst Jan Lindström (Inactive) added a comment - I recommend using more recent version of the server and if this problem is still reproducible, I would need more detailed description how to repeat.
          jplindst Jan Lindström (Inactive) made changes -
          issue.field.resolutiondate 2020-09-14 09:00:58.0 2020-09-14 09:00:58.59
          jplindst Jan Lindström (Inactive) made changes -
          Resolution Cannot Reproduce [ 5 ]
          Status Stalled [ 10000 ] Closed [ 6 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 111722 ] MariaDB v4 [ 158148 ]

          People

            jplindst Jan Lindström (Inactive)
            sun4ezzz Aleksandr Omelchuk
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.