[MDEV-7998] Total lockup on two add index statements Created: 2015-04-15  Updated: 2020-12-07  Resolved: 2020-12-07

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.0.17-galera
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Jiri Kavalik Assignee: Seppo Jaakola
Resolution: Won't Fix Votes: 0
Labels: galera
Environment:

CentOS 5.11


Attachments: Text File mariadb_alter_errorlog.txt     File server.cnf     File tables.sql    

 Description   

We have a Galera cluster with two nodes - the main one handling all access, the second acting as an active backup and reporting server.
I was running an alter table add index on our main node (having already done it on the other node) with
SET GLOBAL wsrep_OSU_method='RSU';
SET SESSION wsrep_on=OFF;

I was monitoring progress in another window and, instead of "show processlist;", I unfortunately ran an older alter table add index on a different table from the shell history - it should have reported a duplicate-index error, but instead the entire node locked up and no connection (new or already established) was responding. I had to ask our server admin to restart it; he had to kill -9 it, and then it started OK. Even the added index is present, so the alter seems to have finished in the background (there was enough time for it before the stop was issued).

Attached is the relevant part of mysqld.log (the few rows before the event are from the day before; I added a couple of comments prefixed with !!!!).



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2015-04-17 ]

Hi jkavalik: I couldn't reproduce the issue using the scenario specified. Here is what I tried on a 2-node cluster:

Node 1 - Session 1:

MariaDB [test]> SET GLOBAL wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [test]> SET SESSION wsrep_on=OFF;
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [test]> alter table t1 add index(i);
Query OK, 0 rows affected (2.51 sec)
Records: 0  Duplicates: 0  Warnings: 0

Node 1 - Session 2 (Simultaneous to the above ALTER):

MariaDB [test]> alter table t2 add index(i);
Query OK, 0 rows affected, 1 warning (3.80 sec)
Records: 0  Duplicates: 0  Warnings: 1

Can you reproduce it on your end? What are the table structures and the indexes that you tried to create? And your Galera configuration options?

Comment by Jiri Kavalik [ 2015-04-21 ]

Hello nirbhay_c, I tried it on a test two-node cluster, where I created just those two tables and filled them with enough data - and I too was NOT able to reproduce it. I even ran sysbench to generate some load on the server. The test cluster was running the same server version "10.0.17-MariaDB-wsrep-log MariaDB Server, wsrep_25.10.r4144" but on a different CentOS version (6.6 instead of 5.11) and with less memory available for the buffer pool.
I am not going to try it on the production server again, and I have no core dump, because the priority was getting it up again and I was not in the right state of mind to remember to get a stack trace through GDB or otherwise. I understand that a bug report like this is unfortunately not very helpful, but I will try to provide any info you might want in case you get an idea.
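Should the hang recur, thread backtraces can be collected from the stuck server before killing it; a hypothetical sketch (that gdb is installed and that the process is named mysqld are assumptions, not part of the report):

```shell
# Hypothetical sketch: capture backtraces of all mysqld threads before
# resorting to kill -9, so developers have something to analyse.
# Assumptions: gdb is installed, the server process is called "mysqld".
GDB_CMD='gdb --batch -p "$(pidof mysqld)" -ex "thread apply all bt"'
echo "Would run: $GDB_CMD > mysqld_backtrace.txt"
# eval "$GDB_CMD" > mysqld_backtrace.txt   # run this on the hung node
```

Attaching in `--batch` mode detaches automatically after dumping, so the (already hung) server is not disturbed further.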

I uploaded server.cnf, which is the same on both nodes, except that node2 has a 16GB InnoDB buffer pool instead of 20GB.
The actual node1 had a dynamically enabled 200MB query cache, max_heap_table_size set to 1GB, and does ~500 QPS on average.
Only node1 is used by our application; node2 is an "active backup" used for reporting and for query-optimization checks on real data, so as not to slow the app down.

I attached the CREATE TABLE statements for both tables - on the test cluster I dropped the foreign key and unique constraints so I could duplicate existing rows to create more (300k in sazky, 600k in tickets; on real data they are around 3M rows).

My actual session was more like:

node1:
alter table sazky add index `udalost_status` (`udalost_id`,`status`); -- with TOI - 25secs
 
node2:
SET GLOBAL wsrep_OSU_method='RSU';
SET SESSION wsrep_on=OFF;
alter table ticket add index `user_zal_zrus_host` (`user_id`,`zalozen`,`zruseno`,`host_id`); -- 3 minutes
SET SESSION wsrep_on=ON;
SET GLOBAL wsrep_OSU_method='TOI';
 
node1:
SET GLOBAL wsrep_OSU_method='RSU';
SET SESSION wsrep_on=OFF;
alter table ticket add index `user_zal_zrus_host` (`user_id`,`zalozen`,`zruseno`,`host_id`); -- was running close to 5 minutes before hang
 
node1 - parallel to previous:
show processlist; -- multiple times through those 5 minutes
alter table sazky add index `udalost_status` (`udalost_id`,`status`); -- wanted to run show processlist and pressed UP one time too many...

I was expecting that last alter to return a "duplicate index" error, but everything stopped at that moment. Connections to node2 were working OK; only the system threads for applying replication were counting their time as idle, so even those had stopped on node1.
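For reference, on a healthy server the mistakenly repeated statement would be rejected immediately with ER_DUP_KEYNAME rather than hanging; a sketch of the expected behaviour (the error text below is the standard server response, not taken from this incident's logs):

```sql
-- Expected result of re-running the ALTER when the index already exists:
ALTER TABLE sazky ADD INDEX `udalost_status` (`udalost_id`, `status`);
-- ERROR 1061 (42000): Duplicate key name 'udalost_status'
```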

Our server admin issued "service mysql stop" on node2 (OK) and then on node1 (it did not finish), then kill -s 9 on node1, then started node1 and node2. The index user_zal_zrus_host was present in the table on node1, so either it finished just before the hang, or the server was still working internally and merely not communicating with connections. There was enough time between the unwanted second alter and stopping/killing the server for the index creation to finish (my estimate for it is ~6 minutes).

I would be glad if this could help you in some way, as you know the internals. But it seems not reproducible by repeating the steps I remember, and I understand that without a stack trace or core dump it will probably be closed.

Generated at Thu Feb 08 07:23:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.