[MCOL-5279] MCS Cluster: After software update from 6.x to 22.08, queries hang on one node Created: 2022-10-25  Updated: 2023-11-21  Resolved: 2022-11-03

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 22.08.2
Fix Version/s: 22.08.3

Type: Bug Priority: Blocker
Reporter: Edward Stoever Assignee: Roman
Resolution: Fixed Votes: 1
Labels: cluster, triage

Issue Links:
PartOf
includes MCOL-5305 Query initiated at slave MDB are stuck Closed
Problem/Incident
is caused by MCOL-5166 EM to PP facility communication speed... Closed
Sprint: 2022-22
Assigned for Testing: Daniel Lee Daniel Lee (Inactive)

 Description   

"Where this join query hangs forever on the master node, but runs fine on the replicas"

SELECT ff.acct_id, COUNT(1) 
FROM ff JOIN dd  ON ff.acct_sk =dd.acct_sk 
AND ff.db_source_sk =dd.db_source_sk 
GROUP BY ff.acct_id LIMIT 10;

mariadb -e "create database test; use test;"
 
mariadb test -e "CREATE TABLE `dd`( `acct_sk` INT(11) UNSIGNED NOT NULL, `acct_id` VARCHAR(128) NOT NULL DEFAULT 'None', `db_source_sk` INT(11) UNSIGNED NOT NULL) ENGINE=Columnstore DEFAULT CHARSET=utf8mb4;"
mariadb test -e "CREATE TABLE `ff` ( `db_source_sk` INT(11) UNSIGNED NOT NULL, `acct_id` VARCHAR(128) NOT NULL DEFAULT 'None', `acct_sk` INT(11) UNSIGNED NOT NULL DEFAULT 0 ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb4;"
 
mariadb test -e "INSERT INTO dd SELECT ROUND(RAND() * 10, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * 100, 2) FROM seq_1_to_300000; "
mariadb test -e "INSERT INTO ff SELECT ROUND(RAND() * 10, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * 100, 2) FROM seq_1_to_300000;"



 Comments   
Comment by alexey vorovich (Inactive) [ 2022-10-26 ]

dleeyh please try to repro this and in your mulinode tests , and then we add the test

Comment by David Hall (Inactive) [ 2022-11-03 ]

Here's a test for some edge case

# This test makes sense when run on a multi-node cluster.
CREATE TABLE `dd`( `acct_sk` INT(11) UNSIGNED NOT NULL, `acct_id` VARCHAR(128) NOT NULL DEFAULT 'None', `db_source_sk` INT(11) UNSIGNED NOT NULL) ENGINE=Columnstore DEFAULT CHARSET=utf8mb4;
CREATE TABLE `ff` ( `db_source_sk` INT(11) UNSIGNED NOT NULL, `acct_id` VARCHAR(128) NOT NULL DEFAULT 'None', `acct_sk` INT(11) UNSIGNED NOT NULL DEFAULT 0 ) ENGINE=Columnstore DEFAULT CHARSET=utf8mb4;
 
INSERT INTO dd SELECT ROUND(RAND() * 10, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * 100, 2) FROM seq_1_to_300000;
INSERT INTO ff SELECT ROUND(RAND() * 10, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * 100, 2) FROM seq_1_to_300000;
 
# The purpose is to run the statement that doesn't hang.
SELECT * FROM (SELECT ff.acct_id, COUNT(1) FROM ff JOIN dd  ON ff.acct_sk =dd.acct_sk AND ff.db_source_sk =dd.db_source_sk GROUP BY ff.acct_id LIMIT 10)s LIMIT 0;
 
# Clean UP
DROP DATABASE mcol_5279;

Comment by Daniel Lee (Inactive) [ 2022-11-03 ]

Build verified: 22.08.3 (RC from Jenkins)

Query no longer hangs. MTR tests were able to complete.

Comment by alexey vorovich (Inactive) [ 2022-11-11 ]

additional scenario discovered and fixed

sporadic hang (repoduced in VM or k8s - not in dockercompose
In the sequence below SELECT to next host hangs. Sporadic

        self.connect(self.HOST_0)
        self.create_table()
        self.ex_sql("insert into t1 (f1) values ( 1 )",silent=False)
        self.connection.commit()
        self.ex_sql("Select count(*)  from t1")
 
        self.connect(self.HOST_1)
        self.ex_sql("use d1")
        self.ex_sql("select count(*)  from t1")

Generated at Thu Feb 08 02:56:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.