[MCOL-4986] How to force a ColumnStore lock to be dropped and for a rollback to be abandoned? Created: 2022-02-04  Updated: 2022-03-29

Status: Open
Project: MariaDB ColumnStore
Component/s: DMLProc
Affects Version/s: 6.2.1
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Simon Heyes Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Community version with MariaDB 10.6.5 - ColumnStore 6.2.1 Release 1


Issue Links:
Duplicate
is duplicated by MCOL-4975 viewtablelock does not shows active l... Open

 Description   

I created a ColumnStore table and ran a series of heavy SQL statements which ended up crashing after several hours. The logs said it was rolling back and after leaving it for 24 hours with it still saying it is rolling back, I thought as the table and it's contents are not important I'll just drop it. viewtablelock showed that there is still a lock on the table from the original SQL and cleartablelock wouldn't clear it. So I just dropped the table thinking that might clear it using the drop table cs_tableau_threads_table_duplicates restrict.

root@apprep:/home/appadmin/jobs# mariadb --version
mariadb Ver 15.1 Distrib 10.6.5-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
root@apprep:/home/appadmin/jobs# viewtablelock
There is 1 table lock

Table LockID Process PID Session Txn CreationTime State DBRoots
tableau_app_uk.cs_tableau_threads_table_duplicates 1 DMLProc 40023 172493 2500 Tue Feb 1 15:09:11 2022 LOADING 1

root@apprep:/home/appadmin/jobs# mariadb
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 17275
Server version: 10.6.5-MariaDB-1:10.6.5+maria~focal-log mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use tableau_app_uk;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [tableau_app_uk]> desc cs_tableau_threads_table_duplicates;
ERROR 1146 (42S02): Table 'tableau_app_uk.cs_tableau_threads_table_duplicates' doesn't exist
MariaDB [tableau_app_uk]>

As you can see the table has gone but the lock is still there. How do I force the lock to be dropped and for the rollback to be abandoned?



 Comments   
Comment by David Hall (Inactive) [ 2022-02-07 ]

Columnstore stores table definitions in two places. First it's stored the
same as any other engine would store it – in a .frm file in the server.
This is the place from which SHOW TABLES and decribe table get their
information. Additional table definition (Columnstore unique) data is
stored in the engine data files.
DROP TABLE ... RESTRICT removes only the server .frm files, but leaves the
columnstore definition in the engine. Now your engine is out of sync.
CREATE TABLE cs_tableau_threads_table_duplicates ... will fail because the
engine thinks it already exists. You can repair this part of the problem
with CREATE TABLE cs_tableau_threads_table_duplicates ... comment="schema
sync only"

As for the lock getting removed, you could try restarting the database. May
help. After restart, cleartablelock may work.
I make no guarantees, but I would try:
1) restore the table definition with CREATE TABLE ... comment="schema sync
only" (Only do this if you used RESTRICT on a drop table) be sure the table
definition exactly matches the original.
2) Drop the table (without the keyword restrict)
3) Restart the system
4) viewtablelock – just in case it got cleared in the restart.
5) cleartablelock

Comment by Simon Heyes [ 2022-02-08 ]

Thanks Dave. I have followed your instructions and have recreated the "lost" table using the comment="schema sync only" option. That worked and I can now query from the table and insert into it.

Before you replied I had found that one way to clear a lock that won't go is to navigate to
/var/lib/columnstore/data1/systemFiles/dbrm
If there is a file called tablelocks
rm -f tablelocks
So after doing that viewtablelock returned No Locks.

I have rebooted the server to make sure everything is clear and after the reboot have looked in the info.log file and can see that it is attempting to rollback a transaction. I'm assuming that it is the one that I abandoned.

Feb 8 10:12:54 db2_rep DMLProc[5549]: 53.064998 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Feb 8 10:12:54 db2_rep DMLProc[5549]: 54.169917 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Feb 8 10:12:54 db2_rep DMLProc[5549]: 54.179472 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions.
Feb 8 10:12:54 db2_rep DMLProc[5549]: 54.371549 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 2500
Feb 8 10:14:23 db2_rep DMLProc[5824]: 23.566328 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Feb 8 10:14:23 db2_rep DMLProc[5824]: 23.578521 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Feb 8 10:14:23 db2_rep DMLProc[5824]: 23.581037 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions.
Feb 8 10:14:23 db2_rep DMLProc[5824]: 23.780009 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 2500
Feb 8 10:15:54 db2_rep DMLProc[6010]: 54.066551 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Feb 8 10:15:54 db2_rep DMLProc[6010]: 54.080572 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Feb 8 10:15:54 db2_rep DMLProc[6010]: 54.083079 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions.
Feb 8 10:15:54 db2_rep DMLProc[6010]: 54.527341 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 2500
Feb 8 10:17:24 db2_rep DMLProc[6262]: 24.564156 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Feb 8 10:17:24 db2_rep DMLProc[6262]: 24.576731 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Feb 8 10:17:24 db2_rep DMLProc[6262]: 24.579146 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions.
Feb 8 10:17:24 db2_rep DMLProc[6262]: 24.794323 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 2500

As you can see it seems to be in a loop and I found this previous log which looks similar. https://jira.mariadb.org/browse/MCOL-4775

If I try and create the simplest table using the columnstore engine I get an error.
MariaDB [tableau_app_uk]> create table testtable (a char(1)) engine columnstore;
ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.

Shall I just leave it for a few hours and see if it sorts itself out ?

With this version of the ColumnStore Engine that has come with the Community Edition 10.6, there doesn't seem to be a MariaDB ColumnStore Admin Console that I can find. I came across a note to say that mcsadmin had been superseded but I can't find by what !
It looks like this is a vital component for tracking what's going on.

Thanks

Simon

Comment by Simon Heyes [ 2022-02-08 ]

I have now tried some standard DML inserting into one of the ColumnStore tables and now get an error.

insert into tableau_app_uk.cs_tableau_analytics_table
select * from tableau_app_uk.cs_tableau_analytics_id_view
where analytics_id between 161329121 and 161601082

Error Code: 1030. Got error -1 "Internal error < 0 (Not system error)" from storage engine ColumnStore 3.813 sec

Looking in the err.log file.

Feb 8 10:39:57 db2_rep writeenginesplit[15942]: 57.278774 |0|0|0| E 33 CAL0000: System is not ready (0). Verify that Columnstore is up and ready
Feb 8 10:39:57 db2_rep writeenginesplit[15942]: 57.279120 |0|0|0| E 33 CAL0087: BulkLoad Error: System is not ready (0). Verify that Columnstore is up and ready

So I'm not convinced that all processes/services are started correctly and I guess this is where the mcsadmin console would come in handy.

Comment by Simon Heyes [ 2022-02-08 ]

I have just used the rollback -r command which seems to have terminated the rollback process 2500 that was running.

rollback -r 2500

The database went into readonly mode but I have put it back to readwrite with the command

dbrmctl readwrite

I have now been able to create a new columnstore table and my insert into existing columnstore tables is now working.

So it looks like all errors and issues have been cleared so would be grateful for any advice on the mcsadmin console.

Comment by Roman [ 2022-02-15 ]

Hey simonthepieman
Thanks for the report.
The issue is known for some time but we can't steadily repro it ourselves. The issue here is that the failed txn(that is presumably abruptly stopped) left some bits in Extent Map, namely in vbbm/vss files in EM storage. DMLProc on its startup looks for txns to be rolled backed and sees this bits puting a cluster into read-only. We could proceed with this issue if you you share openly or secretly the Extent Map state(the contents of /var/lib/columnstore/data1/systemFiles/dbrm) when this ghost txn happens in the cluster.

Generated at Thu Feb 08 02:54:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.