[MCOL-4515] Binlog replication getting stuck when running insert into <CS table> select .. from <CS table> Created: 2021-01-27  Updated: 2021-09-15  Resolved: 2021-02-19

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 5.5.1
Fix Version/s: 5.5.2

Type: Bug Priority: Blocker
Reporter: Allen Lee (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS Linux release 8.2.2004 (Core)
Virtualized, On Premise


Issue Links:
Duplicate
is duplicated by MCOL-4518 Replication Lagging Way Behind On 5.5.1 Closed
Relates
relates to MCOL-4784 Replication lag due to waiting for me... Closed
Sprint: 2021-2, 2021-3

 Description   

*Binlog replication is falling way behind on a 3 node MariaDB async cluster with Columnstore engine activated.

io_thread is fine. sql_thread is stuck on a columnstore query on slaves/replicas.*

Customer reported that insert into <CS table> select .. from <CStable> breaks the replication, meaning replication got stuck and never moving forward. He tried to stop replication, but also got hung and there wasn't able to stop slave. So, he had to kill mysqld process manually, but then it hit this again. He also tried to skip the event, but was not able to do it because once again stop slave hung. More details are in the ticket.

MariaDB [(none)]> show global variables like 'autocommit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | ON |
+---------------+-------+
1 row in set (0.001 sec)
 
MariaDB [(none)]> SHOW VARIABLES LIKE 'columnstore%';
+-------------------------------------------------+--------+
| Variable_name | Value |
+-------------------------------------------------+--------+
| columnstore_cache_flush_threshold | 500000 |
| columnstore_cache_inserts | OFF |
| columnstore_compression_type | SNAPPY |
| columnstore_decimal_scale | 8 |
| columnstore_derived_handler | ON |
| columnstore_diskjoin_bucketsize | 100 |
| columnstore_diskjoin_largesidelimit | 0 |
| columnstore_diskjoin_smallsidelimit | 0 |
| columnstore_double_for_decimal_math | OFF |
| columnstore_group_by_handler | ON |
| columnstore_import_for_batchinsert_delimiter | 7 |
| columnstore_import_for_batchinsert_enclosed_by | 17 |
| columnstore_local_query | 0 |
| columnstore_orderby_threads | 16 |
| columnstore_ordered_only | OFF |
| columnstore_replication_slave | OFF |
| columnstore_select_handler | ON |
| columnstore_select_handler_in_stored_procedures | ON |
| columnstore_string_scan_threshold | 10 |
| columnstore_stringtable_threshold | 20 |
| columnstore_um_mem_limit | 0 |
| columnstore_use_decimal_scale | OFF |
| columnstore_use_import_for_batchinsert | ALWAYS |
| columnstore_varbin_always_hex | OFF |
+-------------------------------------------------+--------+
24 rows in set (0.000 sec)
 
MariaDB [dwhs]> show global variables like '%bin%form%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| binlog_format | MIXED |
| wsrep_forced_binlog_format | NONE |
+----------------------------+-------+

  • show slave status and binlog output.

MariaDB [dwhs]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.201.64.92
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-bin.000007
Read_Master_Log_Pos: 19867074
Relay_Log_File: mariadb-relay.000002
Relay_Log_Pos: 5766
Relay_Master_Log_File: mariadb-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 19858617
Relay_Log_Space: 14530
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 19689
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 20
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-20-8255
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Executing
Slave_DDL_Groups: 7
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 4
 
 
MariaDB [dwhs]> show binlog events in 'mariadb-bin.000007' from 19858617 limit 2\G
*************************** 1. row ***************************
Log_name: mariadb-bin.000007
Pos: 19858617
Event_type: Gtid
Server_id: 20
End_log_pos: 19858659
Info: BEGIN GTID 0-20-8255
*************************** 2. row ***************************
Log_name: mariadb-bin.000007
Pos: 19858659
Event_type: Query
Server_id: 20
End_log_pos: 19867001
Info: use `dwhs`; insert into lead_details_ins_smry1
SELECT daydate,
hour_of_day,
dataheader_type,
dest_adv_key,
destination_program_key,
destination_program_name,
vertical,
supervertical,
csd_salesrep,
.
.
.

  • node info

    $ curl -k https://127.0.0.1:8640/cmapi/0.4.0/cluster/status --header 'Content-Type:application/json' --header 'x-api-key:********' | jq .
    % Total % Received % Xferd Average Speed Time Time Time Current
    Dload Upload Total Spent Left Speed
    100 1160 100 1160 0 0 4280 0 --:--:-- --:--:-- --:--:-- 4264
    {
    "timestamp": "2021-01-12 16:27:25.647409",
    "xx.xx.xx.91": {
    "timestamp": "2021-01-12 16:27:25.654759",
    "uptime": 683831,
    "dbrm_mode": "master",
    "cluster_mode": "readwrite",
    "dbroots": [
    "1"
    ],
    "module_id": 1,
    "services": [
    {
    "name": "workernode",
    "pid": 635181
    },
    {
    "name": "controllernode",
    "pid": 635200
    },
    {
    "name": "PrimProc",
    "pid": 635240
    },
    {
    "name": "ExeMgr",
    "pid": 635384
    },
    {
    "name": "WriteEngine",
    "pid": 635423
    },
    {
    "name": "DMLProc",
    "pid": 635444
    },
    {
    "name": "DDLProc",
    "pid": 635496
    }
    ]
    },
    "xx.xx.xx.92": {
    "timestamp": "2021-01-12 16:27:25.737338",
    "uptime": 683834,
    "dbrm_mode": "slave",
    "cluster_mode": "readonly",
    "dbroots": [
    "2"
    ],
    "module_id": 2,
    "services": [
    {
    "name": "workernode",
    "pid": 630976
    },
    {
    "name": "PrimProc",
    "pid": 631016
    },
    {
    "name": "ExeMgr",
    "pid": 631157
    },
    {
    "name": "WriteEngine",
    "pid": 631195
    }
    ]
    },
    "xx.xx.xx.93": {
    "timestamp": "2021-01-12 16:27:25.818605",
    "uptime": 683838,
    "dbrm_mode": "slave",
    "cluster_mode": "readonly",
    "dbroots": [
    "3"
    ],
    "module_id": 3,
    "services": [
    {
    "name": "workernode",
    "pid": 619787
    },
    {
    "name": "PrimProc",
    "pid": 619828
    },
    {
    "name": "ExeMgr",
    "pid": 619976
    },
    {
    "name": "WriteEngine",
    "pid": 620012
    }
    ]
    },
    "num_nodes": 3
    }
    

  • table schema

MariaDB [dwhs]> show create table lead_details_ins_smry1\G
*************************** 1. row ***************************
Table: lead_details_ins_smry1
Create Table: CREATE TABLE `lead_details_ins_smry1` (
`daydate` date DEFAULT NULL,
`hour_of_day` int(11) DEFAULT NULL,
`dataheader_type` varchar(30) DEFAULT NULL,
`dest_adv_key` int(11) DEFAULT NULL,
`destination_program_key` int(11) DEFAULT NULL,
.
.
.
`payable_leads` int(11) DEFAULT NULL,
`payable_calls` int(11) DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=latin1
 
 
MariaDB [dwhs]> show create table lead_details_ins\G
*************************** 1. row ***************************
Table: lead_details_ins
Create Table: CREATE TABLE `lead_details_ins` (
`daydate` date DEFAULT NULL,
`datacapturekey` bigint(20) DEFAULT NULL,
`parent_datacapturekey` bigint(20) DEFAULT NULL,
`hour_of_day` int(11) DEFAULT NULL,
`dataheader_key` int(11) DEFAULT NULL,
`dataheader_type` varchar(30) DEFAULT NULL,
`datacapture_status` int(11) DEFAULT NULL,
`datacapture_status_name` varchar(100) DEFAULT NULL,
`dest_adv_key` int(11) DEFAULT NULL,
.
.
.
) ENGINE=Columnstore DEFAULT CHARSET=latin1



 Comments   
Comment by Todd Stoffel (Inactive) [ 2021-01-27 ]

The temporary workaround is to

SET [SESSION] sql_log_bin = 0;

before doing:

INSERT INTO columnstore_table_2 SELECT * FROM columstore_table_1;

Comment by Alexander Barkov [ 2021-02-02 ]

select_handler::execute() got stuck in the following loop:

│  >142           while (!(err= next_row()))                                                                                                            │
│   143           {                                                                                                                                     │
│   144             if (thd->check_killed() || send_data())                                                                                             │
│   145             {                                                                                                                                   │
│   146               end_scan();                                                                                                                       │
│   147               DBUG_RETURN(-1);                                                                                                                  │
│   148             }                                                                                                                                   │
│   149           }  

This patch seems to fix the problem:

diff --git a/dbcon/mysql/ha_mcs_impl.cpp b/dbcon/mysql/ha_mcs_impl.cpp
index 622636cc..6fca51a0 100644
--- a/dbcon/mysql/ha_mcs_impl.cpp
+++ b/dbcon/mysql/ha_mcs_impl.cpp
@@ -5407,7 +5407,7 @@ int ha_mcs_impl_select_next(uchar* buf, TABLE* table)
                 thd->lex->sql_command == SQLCOM_DELETE_MULTI ||
                 thd->lex->sql_command == SQLCOM_TRUNCATE ||
                 thd->lex->sql_command == SQLCOM_LOAD))
-        return 0;
+        return HA_ERR_END_OF_FILE;
 
     if (((thd->lex)->sql_command == SQLCOM_UPDATE)  || ((thd->lex)->sql_command == SQLCOM_DELETE) ||
             ((thd->lex)->sql_command == SQLCOM_DELETE_MULTI) || ((thd->lex)->sql_command == SQLCOM_UPDATE_MULTI))

Comment by Gagan Goel (Inactive) [ 2021-02-04 ]

For QA: Set up a 2-node CS cluster, with 1 master and 1 slave, with shared storage (dbroots) between the nodes. Instructions to set up replication: https://mariadb.com/kb/en/setting-up-replication/

Set columnstore_replication_slave=OFF(default), binlog_format=MIXED (default) in the .cnf files for both master and slave.

Run the following commands on master:

CREATE TABLE cs1 (a INT)engine=columnstore; -- the DDL will be replicated on the slave. Verify this by running SELECT * FROM cs1; on the slave node.
INSERT INTO cs1 VALUES (123);
INSERT INTO cs1 SELECT * FROM cs1;

Now run show slave status;. Observe the value of Seconds_Behind_Master field. You will notice this value will keep on increasing indefinitely. Run show processlist;, and observe the value of Slave_SQL command. This will stay in the Executing state.

Any subsequent queries performed on the slave node will not respond. With the fix, run the above SQL queries again, and you will notice show slave status; and show processlist; outputs showing the slave is caught up with the master, and subsequent queries on the slave will now execute as expected.

Comment by Daniel Lee (Inactive) [ 2021-02-19 ]

Build verified: 5.5.2 (Drone #1685)

Reproduced the issue in 5.5.1 setting up a 2-node cluster described in the last comment.
Observed the issue that Seconds_Behind_Master value keeps increasing (from 142 to 497), and "show process" show Slave_SQL stays in Executing state. But rows inserted in the master did get replicated to the slave.

MariaDB [mytest]> show slave status\G

                                                      • 1. row ***************************
                                                        Slave_IO_State: Waiting for master to send event
                                                        Master_Host: s1pm1
                                                        Master_User: replication_user
                                                        Master_Port: 3306
                                                        Connect_Retry: 10
                                                        Master_Log_File: master1-bin.000001
                                                        Read_Master_Log_Pos: 1404
                                                        Relay_Log_File: centos-8-relay-bin.000002
                                                        Relay_Log_Pos: 924
                                                        Relay_Master_Log_File: master1-bin.000001
                                                        Slave_IO_Running: Yes
                                                        Slave_SQL_Running: Yes
                                                        Replicate_Do_DB:
                                                        Replicate_Ignore_DB:
                                                        Replicate_Do_Table:
                                                        Replicate_Ignore_Table:
                                                        Replicate_Wild_Do_Table:
                                                        Replicate_Wild_Ignore_Table:
                                                        Last_Errno: 0
                                                        Last_Error:
                                                        Skip_Counter: 0
                                                        Exec_Master_Log_Pos: 1185
                                                        Relay_Log_Space: 1455
                                                        Until_Condition: None
                                                        Until_Log_File:
                                                        Until_Log_Pos: 0
                                                        Master_SSL_Allowed: No
                                                        Master_SSL_CA_File:
                                                        Master_SSL_CA_Path:
                                                        Master_SSL_Cert:
                                                        Master_SSL_Cipher:
                                                        Master_SSL_Key:
                                                        Seconds_Behind_Master: 142
                                                        Master_SSL_Verify_Server_Cert: No
                                                        Last_IO_Errno: 0
                                                        Last_IO_Error:
                                                        Last_SQL_Errno: 0
                                                        Last_SQL_Error:
                                                        Replicate_Ignore_Server_Ids:
                                                        Master_Server_Id: 1
                                                        Master_SSL_Crl:
                                                        Master_SSL_Crlpath:
                                                        Using_Gtid: No
                                                        Gtid_IO_Pos:
                                                        Replicate_Do_Domain_Ids:
                                                        Replicate_Ignore_Domain_Ids:
                                                        Parallel_Mode: optimistic
                                                        SQL_Delay: 0
                                                        SQL_Remaining_Delay: NULL
                                                        Slave_SQL_Running_State: Executing
                                                        Slave_DDL_Groups: 1
                                                        Slave_Non_Transactional_Groups: 0
                                                        Slave_Transactional_Groups: 2
                                                        1 row in set (0.000 sec)

MariaDB [mytest]> show slave status\G

                                                      • 1. row ***************************
                                                        Slave_IO_State: Waiting for master to send event
                                                        Master_Host: s1pm1
                                                        Master_User: replication_user
                                                        Master_Port: 3306
                                                        Connect_Retry: 10
                                                        Master_Log_File: master1-bin.000001
                                                        Read_Master_Log_Pos: 2056
                                                        Relay_Log_File: centos-8-relay-bin.000002
                                                        Relay_Log_Pos: 924
                                                        Relay_Master_Log_File: master1-bin.000001
                                                        Slave_IO_Running: Yes
                                                        Slave_SQL_Running: Yes
                                                        Replicate_Do_DB:
                                                        Replicate_Ignore_DB:
                                                        Replicate_Do_Table:
                                                        Replicate_Ignore_Table:
                                                        Replicate_Wild_Do_Table:
                                                        Replicate_Wild_Ignore_Table:
                                                        Last_Errno: 0
                                                        Last_Error:
                                                        Skip_Counter: 0
                                                        Exec_Master_Log_Pos: 1185
                                                        Relay_Log_Space: 2107
                                                        Until_Condition: None
                                                        Until_Log_File:
                                                        Until_Log_Pos: 0
                                                        Master_SSL_Allowed: No
                                                        Master_SSL_CA_File:
                                                        Master_SSL_CA_Path:
                                                        Master_SSL_Cert:
                                                        Master_SSL_Cipher:
                                                        Master_SSL_Key:
                                                        Seconds_Behind_Master: 497
                                                        Master_SSL_Verify_Server_Cert: No
                                                        Last_IO_Errno: 0
                                                        Last_IO_Error:
                                                        Last_SQL_Errno: 0
                                                        Last_SQL_Error:
                                                        Replicate_Ignore_Server_Ids:
                                                        Master_Server_Id: 1
                                                        Master_SSL_Crl:
                                                        Master_SSL_Crlpath:
                                                        Using_Gtid: No
                                                        Gtid_IO_Pos:
                                                        Replicate_Do_Domain_Ids:
                                                        Replicate_Ignore_Domain_Ids:
                                                        Parallel_Mode: optimistic
                                                        SQL_Delay: 0
                                                        SQL_Remaining_Delay: NULL
                                                        Slave_SQL_Running_State: Executing
                                                        Slave_DDL_Groups: 1
                                                        Slave_Non_Transactional_Groups: 0
                                                        Slave_Transactional_Groups: 2
                                                        1 row in set (0.000 sec)

MariaDB [mytest]> show processlist;
-------------------------------------------------------------------------------------------------------------------------

Id User Host db Command Time State Info Progress

-------------------------------------------------------------------------------------------------------------------------

4 mxs maxscale:46084 NULL Sleep 0   NULL 0.000
11 root localhost mytest Query 0 starting show processlist 0.000
12 system user   NULL Slave_IO 300 Waiting for master to send event NULL 0.000
13 system user   mytest Slave_SQL 108 Executing INSERT INTO cs1 SELECT * FROM cs1 0.000

-------------------------------------------------------------------------------------------------------------------------
4 rows in set (0.000 sec)

With 5.5.2

MariaDB [mytest]> show slave status \G

                                                      • 1. row ***************************
                                                        Slave_IO_State: Waiting for master to send event
                                                        Master_Host: s2pm1
                                                        Master_User: replication_user
                                                        Master_Port: 3306
                                                        Connect_Retry: 10
                                                        Master_Log_File: master1-bin.000001
                                                        Read_Master_Log_Pos: 1135
                                                        Relay_Log_File: centos-8-relay-bin.000002
                                                        Relay_Log_Pos: 1362
                                                        Relay_Master_Log_File: master1-bin.000001
                                                        Slave_IO_Running: Yes
                                                        Slave_SQL_Running: Yes
                                                        Replicate_Do_DB:
                                                        Replicate_Ignore_DB:
                                                        Replicate_Do_Table:
                                                        Replicate_Ignore_Table:
                                                        Replicate_Wild_Do_Table:
                                                        Replicate_Wild_Ignore_Table:
                                                        Last_Errno: 0
                                                        Last_Error:
                                                        Skip_Counter: 0
                                                        Exec_Master_Log_Pos: 1135
                                                        Relay_Log_Space: 1674
                                                        Until_Condition: None
                                                        Until_Log_File:
                                                        Until_Log_Pos: 0
                                                        Master_SSL_Allowed: No
                                                        Master_SSL_CA_File:
                                                        Master_SSL_CA_Path:
                                                        Master_SSL_Cert:
                                                        Master_SSL_Cipher:
                                                        Master_SSL_Key:
                                                        Seconds_Behind_Master: 0
                                                        Master_SSL_Verify_Server_Cert: No
                                                        Last_IO_Errno: 0
                                                        Last_IO_Error:
                                                        Last_SQL_Errno: 0
                                                        Last_SQL_Error:
                                                        Replicate_Ignore_Server_Ids:
                                                        Master_Server_Id: 1
                                                        Master_SSL_Crl:
                                                        Master_SSL_Crlpath:
                                                        Using_Gtid: No
                                                        Gtid_IO_Pos:
                                                        Replicate_Do_Domain_Ids:
                                                        Replicate_Ignore_Domain_Ids:
                                                        Parallel_Mode: optimistic
                                                        SQL_Delay: 0
                                                        SQL_Remaining_Delay: NULL
                                                        Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
                                                        Slave_DDL_Groups: 1
                                                        Slave_Non_Transactional_Groups: 0
                                                        Slave_Transactional_Groups: 3
                                                        1 row in set (0.000 sec)

MariaDB [mytest]> show processlist;
-------------------------------------------------------------------------------------------------------------------------

Id User Host db Command Time State Info Progress

-------------------------------------------------------------------------------------------------------------------------

8 root localhost mytest Query 0 starting show processlist 0.000
9 system user   NULL Slave_IO 123 Waiting for master to send event NULL 0.000
10 system user   NULL Slave_SQL 20 Slave has read all relay log; waiting for more updates NULL 0.000

-------------------------------------------------------------------------------------------------------------------------
3 rows in set (0.000 sec)

Generated at Thu Feb 08 02:50:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.