[MDEV-21345] Deadlocks when two transactions update the same row at the same time Created: 2019-12-18  Updated: 2020-08-03

Status: Open
Project: MariaDB Server
Component/s: Locking, Storage Engine - InnoDB
Affects Version/s: 5.5.60-galera
Fix Version/s: 5.5

Type: Bug Priority: Major
Reporter: Zhidong Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: None
Environment:

docker mariadb:5.5.60


Issue Links:
Relates
relates to MDEV-23338 server crash on select for update Open

 Description   

Hi guys,

I encountered a pretty weird problem that is hard to interpret. I tried to reproduce it but with no luck.

Definitions of tables involved

CREATE TABLE `task_instance` (
  `task_id` varchar(250) NOT NULL,
  `dag_id` varchar(250) NOT NULL,
  `execution_date` datetime(6) NOT NULL,
  `start_date` datetime(6) DEFAULT NULL,
  `end_date` datetime(6) DEFAULT NULL,
  `duration` float DEFAULT NULL,
  `state` varchar(20) DEFAULT NULL,
  `try_number` int(11) DEFAULT NULL,
  `hostname` varchar(1000) DEFAULT NULL,
  `unixname` varchar(1000) DEFAULT NULL,
  `job_id` int(11) DEFAULT NULL,
  `pool` varchar(50) DEFAULT NULL,
  `queue` varchar(50) DEFAULT NULL,
  `priority_weight` int(11) DEFAULT NULL,
  `operator` varchar(1000) DEFAULT NULL,
  `queued_dttm` datetime(6) DEFAULT NULL,
  `pid` int(11) DEFAULT NULL,
  `max_tries` int(11) DEFAULT '-1',
  PRIMARY KEY (`task_id`,`dag_id`,`execution_date`),
  KEY `ti_dag_state` (`dag_id`,`state`),
  KEY `ti_pool` (`pool`,`state`,`priority_weight`),
  KEY `ti_state_lkp` (`dag_id`,`task_id`,`execution_date`,`state`),
  KEY `ti_state` (`state`),
  KEY `ti_job_id` (`job_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
 
CREATE TABLE `dag_run` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dag_id` varchar(250) DEFAULT NULL,
  `execution_date` datetime(6) DEFAULT NULL,
  `state` varchar(50) DEFAULT NULL,
  `run_id` varchar(250) DEFAULT NULL,
  `external_trigger` tinyint(1) DEFAULT NULL,
  `conf` blob,
  `end_date` datetime(6) DEFAULT NULL,
  `start_date` datetime(6) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `dag_id` (`dag_id`,`execution_date`),
  UNIQUE KEY `dag_id_2` (`dag_id`,`run_id`),
  KEY `dag_id_state` (`dag_id`,`state`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
 
CREATE TABLE `log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dttm` datetime(6) DEFAULT NULL,
  `dag_id` varchar(250) DEFAULT NULL,
  `task_id` varchar(250) DEFAULT NULL,
  `event` varchar(30) DEFAULT NULL,
  `execution_date` datetime(6) DEFAULT NULL,
  `owner` varchar(500) DEFAULT NULL,
  `extra` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

------------------------
LATEST DETECTED DEADLOCK
------------------------
191216  8:52:09
*** (1) TRANSACTION:
TRANSACTION 4AB6056BF, ACTIVE 0 sec fetching rows
mysql tables in use 2, locked 2
LOCK WAIT 5 lock struct(s), heap size 1248, 5 row lock(s)
MySQL thread id 256467507, OS thread handle 0x7f973d815700, query id 14205100829 127.0.0.1 airflow Sending data
UPDATE task_instance, dag_run SET task_instance.state=NULL WHERE task_instance.dag_id IN ('A_Dag', 'B_Dag') AND task_instance.state IN ('queued', 'scheduled') AND dag_run.dag_id = task_instance.dag_id AND dag_run.execution_date = task_instance.execution_date AND dag_run.state != 'running'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 112745 n bits 128 index `PRIMARY` of table `airflow`.`task_instance` trx id 4AB6056BF lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 4AB6056BC, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 2
MySQL thread id 256620571, OS thread handle 0x7f973d531700, query id 14205100833 192.168.1.1 airflow Updating
UPDATE task_instance SET start_date='2019-12-16 08:51:35.486074', state='running', try_number=1, hostname='192.168.1.1', unixname='airflow', job_id=119942, operator='PythonOperator', pid=60059 WHERE task_instance.task_id = 'X_Task' AND task_instance.dag_id = 'X_Dag' AND task_instance.execution_date = '2019-12-16 08:50:00'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 112745 n bits 128 index `PRIMARY` of table `airflow`.`task_instance` trx id 4AB6056BC lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 112079 n bits 232 index `ti_state` of table `airflow`.`task_instance` trx id 4AB6056BC lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (1)

general.log

191216  8:52:09
                256467507 Query SELECT 1
                256620571 Connect       airflow@192.168.1.1 as anonymous on airflow
                256620571 Query set autocommit=0
                256467507 Query UPDATE task_instance, dag_run SET task_instance.state='failed' WHERE task_instance.dag_id IN ('A_Dag', 'B_Dag') AND task_instance.state IN ('up_for_retry') AND dag_run.dag_id = task_instance.dag_id AND dag_run.execution_date = task_instance.execution_date AND dag_run.state != 'running'
                256620571 Query SELECT 1
                256620571 Query SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid FROM task_instance WHERE task_instance.task_id = 'X_Task' AND task_instance.dag_id = 'X_Dag' AND task_instance.execution_date = '2019-12-16 08:50:00'
                256467507 Query commit
                256467507 Query rollback
                256467507 Query SELECT 1
                256620571 Query INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES ('2019-12-16 08:51:35.491292', 'X_Dag', 'X_Task', 'running', '2019-12-16 08:50:00', 'Airflow', NULL)
                256467507 Query UPDATE task_instance, dag_run SET task_instance.state=NULL WHERE task_instance.dag_id IN ('A_Dag', 'B_Dag') AND task_instance.state IN ('queued', 'scheduled') AND dag_run.dag_id = task_instance.dag_id AND dag_run.execution_date = task_instance.execution_date AND dag_run.state != 'running'
                256620571 Query UPDATE task_instance SET start_date='2019-12-16 08:51:35.486074', state='running', try_number=1, hostname='192.168.1.1', unixname='airflow', job_id=119942, operator='PythonOperator', pid=60059 WHERE task_instance.task_id = 'X_Task' AND task_instance.dag_id = 'X_Dag' AND task_instance.execution_date = '2019-12-16 08:50:00'
                256620571 Query commit
                256620571 Query rollback
                256620571 Quit

BTW: The problem happens more frequent if task_instance is large.

BR//Zhidong



 Comments   
Comment by Elena Stepanova [ 2019-12-19 ]

Which exactly part appears to be a problem?

Comment by Zhidong [ 2019-12-20 ]

Hi, @Elena Stepanova

There are only two queries that matter for this issue.
1. UPDATE task_instance, dag_run SET ...(Q1) from thread 256467507(T1).
2. UPDATE task_instance SET ...(Q2) from thread 256620571(T2).

Q1 perform a scan on secondary index(`ti_state`) records, sets exclusive next-key lock on every record the search encounters, and also their corresponding clustered index(`PRIMARY`) records.
Q2 perform a PRIMARY index scan, sets lock on it, and also implicit lock secondary index(`ti_state`) records that are affected.

Deadlock won't happen if the locking process is fast enough.
Is this an expected behavior, or a bug?

BR//Zhidong

Comment by Elena Stepanova [ 2019-12-20 ]

Redirecting the question to marko.

Comment by Zhidong [ 2020-01-06 ]

Any updates? @marko @elenst

Generated at Thu Feb 08 09:06:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.