[MDEV-6751] possible deadlock with ALTER + threadpool + events Created: 2014-09-17  Updated: 2014-10-24  Resolved: 2014-10-24

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Events, Storage Engine - InnoDB
Affects Version/s: 10.0.13
Fix Version/s: 10.0.14

Type: Bug Priority: Major
Reporter: Sean Pringle Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 12.04 LTS Linux db1062 3.2.0-60-generic #91-Ubuntu SMP
Server version: 10.0.13-MariaDB-log Source distribution


Attachments: Text File db1062_lockup.ALTER.txt     Text File db1062_lockup.cmake.txt     Text File db1062_lockup.global_vars.txt     Text File db1062_lockup.i_s_query.txt     Text File db1062_lockup.processlist.txt     File db1062_lockup_gdb.2.log     File db1062_lockup_gdb.3.log     File db1062_lockup_gdb.log    

 Description   

Am running the following on a slave:

  • Largish (24h, 600M rows, 200G) ALTER TABLE
  • Events with INFORMATION_SCHEMA queries
  • Threadpool pool-of-threads active
  • Replication active
  • No other significant traffic

After several hours, MariaDB locks up: 0% CPU, no disk activity, and no response on existing or new connections via port, extra_port, or the socket.

Attached are gdb backtraces for two occurrences, examples of the ALTER and the INFORMATION_SCHEMA activity, and other info. Would appreciate any insight from devs to identify the deadlock, and to narrow down the variables for a test case that isn't 200G.

Am presently trialing the ALTER outside the threadpool using the extra_port, with all other settings unchanged.

Other notes:

  • It doesn't seem to be a thread pool overload, as there aren't enough threads in the backtrace.
  • The INFORMATION_SCHEMA event traffic uses GET_LOCK to serialize some activity and prevent pile-up.
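
The actual event bodies aren't attached, but the GET_LOCK serialization pattern described in the last bullet might look roughly like the following sketch (the event name, lock name, and I_S query here are placeholders, not the ones actually in use):

```sql
-- Hypothetical sketch of the GET_LOCK serialization pattern; event name,
-- lock name, and the INFORMATION_SCHEMA query are illustrative placeholders.
DELIMITER //
CREATE EVENT ev_is_poll
  ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
  -- Try the lock with zero timeout: if a previous run still holds it,
  -- skip this run entirely instead of queuing behind it (prevents pile-up).
  IF GET_LOCK('ev_is_poll', 0) = 1 THEN
    SELECT table_name, data_length
      FROM information_schema.tables
     WHERE table_schema = 'mydb';
    DO RELEASE_LOCK('ev_is_poll');
  END IF;
END //
DELIMITER ;
```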


 Comments   
Comment by Sean Pringle [ 2014-09-18 ]

Running the ALTER via extra_port still locked up, but I managed to catch it in action. See attached processlist and gdb trace #3. Similar to https://mariadb.atlassian.net/browse/MDEV-5551 ?

The client connections in Command="Killed" and State="Cleaning up" successfully connected according to client logs, and apparently got past authentication, but the first query always failed with "Got timeout reading communication packets".

Subsequently the client did not show open TCP connections hanging around in lsof -i tcp output. However, the server did continue to show the open TCP connections in lsof, and did not report anything in the error log with log_warnings=2.

Notice the REPLACE INTO `heartbeat` query in state Update for many seconds; that is Percona Toolkit's pt-heartbeat writing to an InnoDB table, which is usually instant. Attempting SHOW ENGINE INNODB STATUS to investigate simply hung indefinitely.

Comment by Elena Stepanova [ 2014-09-22 ]

Hi,

What happened to the ALTER query? Who or what killed it, and was it before or after you identified the situation as a deadlock?

Thanks.

Comment by Sean Pringle [ 2014-09-23 ]

In the example processlist, I killed the ALTER after everything else had already become Command=Killed without known intervention by human or tool. It didn't help.

Comment by Elena Stepanova [ 2014-09-24 ]

jplindst,

Could you please check whether this has something in common with other problems reported or observed, or whether it's a new one?

Comment by Sean Pringle [ 2014-09-24 ]

Tried disabling the thread pool, replication, and other traffic, all to no effect: the problem recurred. Each time, after the restart I waited until transactions had completed rollback.

Observing the threads stuck in buf_, mtr_, and log_ calls, I tried reverting to a single buffer pool instance, innodb_buffer_pool_instances=1 (which matches our 5.5 config). That allowed the ALTER to complete normally without a hiccup.

A flushing lock-cycle of some sort?

Comment by Elena Stepanova [ 2014-09-24 ]

If you can experiment with your instance, could you maybe try setting innodb_buffer_pool_instances to 1?
We are currently investigating a problem which looks related to multiple buffer pool instances. I am not sure at all that your deadlock is the same issue, but it won't hurt to try.
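
For reference, the suggested workaround is a static variable, so it has to go in the server config and takes effect on restart; a minimal my.cnf fragment would be:

```ini
# my.cnf — collapse to a single buffer pool instance as a workaround.
# innodb_buffer_pool_instances is not dynamic; a server restart is required.
[mysqld]
innodb_buffer_pool_instances = 1
```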

Comment by Jan Lindström (Inactive) [ 2014-09-24 ]

Do you see long semaphore wait warnings/errors in the error log?

Comment by Sean Pringle [ 2014-09-24 ]

@Elena, yes innodb_buffer_pool_instances=1 helped. See comment above.

@Jan, no long semaphore warnings, and no 600-second abort.

Comment by Elena Stepanova [ 2014-09-24 ]

"@Elena, yes innodb_buffer_pool_instances=1 helped. See comment above."

Sorry, I somehow missed that part of the comment. Thanks.

Generated at Thu Feb 08 07:14:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.