Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-5093

mariadb services on column cluster nodes occasionally enters a restart loop

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 6.1.1
    • 22.08.1
    • ExeMgr, MariaDB Server

    Description

      on the Pixid ColumnStore, for a couple of days we've observed that occasionally the Mariadb service gets in a crash restart loop. It has also been observed that sometimes when this happens, this also affects the ColumnStore performance, as it puts the cluster in read-only mode or crashes the ExeMgr (with a signal 6 or 11).

      Mariadb version is: 10.6.4-1-MariaDB-enterprise-log
      Columnstore version is: 6.1.1

      Attachments

        1. Columnstore.xml_CSX1.txt
          20 kB
        2. Columnstore.xml_CSX2.xml
          21 kB
        3. Columnstore.xml_CSX3.txt
          20 kB
        4. Columnstore.xml_svmcs2.txt
          19 kB
        5. CSX1_logs.tar.gz
          955 kB
        6. mariadb_logs.tar.gz
          21 kB
        7. pixid-full-query.txt
          10 kB
        8. qberg_error_swmcs2_20062022.txt
          48 kB

        Issue Links

          Activity

            nhat.ho Nhat Ho (Inactive) added a comment - centos-pixid template -> https://drive.google.com/file/d/1dFs9XM7SG7fpCYiS1qBQrRNNf1ENzSWp/view?usp=sharing

            duplicate of MCOL-4775

            ccalender Chris Calender (Inactive) added a comment - duplicate of MCOL-4775

            also duplicate of mcol-5105

            alexey.vorovich alexey vorovich (Inactive) added a comment - also duplicate of mcol-5105

            dleeyh will see if more testing is possible

            alexey.vorovich alexey vorovich (Inactive) added a comment - dleeyh will see if more testing is possible

            Build tested: 6.4.1-1 (#4710)

            Some testing was done and logged in the duplicate ticket MCOL-5105.

            I did a large DELETE test on this build. During ColumnStore start up, rollback kept on failing and DMLProc kept on crashing in a loop.

            Table used: dbt3 lineitem
            dataset: 305,931,475 rows
            VM memory: 4 gb
            version buffer: 20gb

            1. Attempted to delete all rows in the table.

            MariaDB [tpch10]> select count(*) from lineitem;
            +-----------+
            | count(*)  |
            +-----------+
            | 305931475 |
            +-----------+
            1 row in set (1.286 sec)
             
            MariaDB [tpch10]> delete from lineitem;
            ERROR 1815 (HY000): Internal error: CAL0002: Delete Failed:  MCS-2008: The version buffer overflowed. Increase VersionBufferFileSize or limit the rows to be processed.
            

            The delete operation did not run out of memory. Instead, a version buffer flow error returned. According to the debug log, rollback operation last for 40 minutes and failed.

            debug.log

            Jun 24 18:59:48 rocky8 dmlpackageproc[7115]: 48.354018 |5|26|0| D 21 CAL0041: Start SQL statement: delete from lineitem;|tpch10|
            Jun 24 18:59:48 rocky8 ExeMgr[7103]: 48.364928 |5|0|0| D 16 CAL0041: Start SQL statement: delete from lineitem; ||
             
             
            Jun 24 19:17:47 rocky8 workernode[7033]: 47.670848 |0|0|0| C 30 CAL0000: VBBM::getBlocks(): version buffer overflow. Increase VersionBufferFileSize. Overflow occured in aged blocks. Requested NumBlocks:VbOid:vbFBO:lastFBO = 90:0:200:299 lbid locked is 955396
            Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.686747 |5|26|0| D 21 CAL0042: End SQL statement
            Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.730642 |0|0|0| E 21 CAL0002: Delete Failed:  MCS-2008: The version buffer overflowed. Increase VersionBufferFileSize or limit the rows to be processed.
            Jun 24 19:17:47 rocky8 ExeMgr[7103]: 47.748507 |5|0|0| D 16 CAL0042: End SQL statement
            Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.754052 |5|26|0| D 21 CAL0001: Start SQL statement:  ROLLBACK
             
            Jun 24 19:57:39 rocky8 workernode[7033]: 39.956957 |0|0|0| C 30 CAL0000: SlaveDBRMNode::vbRollback error. std::bad_alloc
            Jun 24 19:57:40 rocky8 DMLProc[7115]: 39.997409 |0|0|0| E 20 CAL0002: ROLLBACK failed due to: DBRM vbRollback [ UNKNOWN (255)]
            Jun 24 19:57:40 rocky8 dmlpackageproc[7115]: 40.004365 |5|26|0| D 21 CAL0001: End SQL statement
            Jun 24 19:57:40 rocky8 dmlpackageproc[7115]: 40.006629 |0|0|0| E 21 CAL0001: ROLLBACK failed due to: DBRM vbRollback [ UNKNOWN (255)]
            

            2. restart mariadb and columnstore-columnstore
            Since rollback failed in step #1, it is expected the transaction will be rollback during ColumnStore start up.

            DMLProc kept on crashing during rollback. But during the rollback attempts, SELECT COUNT() statements returned the correct count.

            [rocky8:root~]# systemctl restart mariadb mariadb-columnstore
            [rocky8:root~]# mariadb tpch10
            Reading table information for completion of table and column names
            You can turn off this feature to get a quicker startup with -A
             
            Welcome to the MariaDB monitor.  Commands end with ; or \g.
            Your MariaDB connection id is 3
            Server version: 10.6.9-5-MariaDB MariaDB Server
             
            Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
             
            Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
             
            MariaDB [tpch10]> select count(*) from lineitem;
            +-----------+
            | count(*)  |
            +-----------+
            | 305931475 |
            +-----------+
            1 row in set (16 min 56.345 sec)
             
            MariaDB [tpch10]> select count(l_comment) from lineitem;
            +------------------+
            | count(l_comment) |
            +------------------+
            |        305931475 |
            +------------------+
            1 row in set (57 min 19.110 sec)
            

            debug.log

            Jun 24 19:59:13 rocky8 controllernode[8451]: 13.745665 |0|0|0| D 29 CAL0000: DBRM Controller: Connected to DBRM_Worker1
            Jun 24 19:59:13 rocky8 controllernode[8449]: 13.746249 |0|0|0| I 29 CAL0000: ControllerNode main process has started
            Jun 24 19:59:13 rocky8 writeengineserver[8492]: 13.810300 |0|0|0| I 32 CAL0001: WriteEngine main process has started
            Jun 24 19:59:13 rocky8 ExeMgr[8505]: 13.837356 |0|0|0| I 16 CAL0008: No such file or directory
            Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.861219 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.868685 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.868717 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
             
            Jun 24 20:14:14 rocky8 ddlpackageproc[8627]: 14.056136 |0|0|0| C 23 CAL0008: DDLProc main process has started
            Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.103672 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.125678 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.125742 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 20:14:14 rocky8 writeengine[8659]: 14.179792 |0|0|0| I 19 CAL0060: dbbuilder system catalog status: System catalog appears to exist.  It will remain intact for reuse.  The database is not recreated.
            Jun 24 20:14:14 rocky8 writeengineserver[8498]: 14.446670 |0|0|0| I 32 CAL0007: RollbackTran found chk name /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf size: 483328 offset: 4358144
            Jun 24 20:14:14 rocky8 writeengineserver[8498]: 14.447356 |0|0|0| I 32 CAL0007: Rollback copied to file  /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf from file  /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf.chk
            Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.786728 |2147483651|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='tpch10' and tablename='lineitem' --columnRIDs/FE; ||
            Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.810616 |2147483651|0|0| D 16 CAL0042: End SQL statement
            Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.851410 |3|0|0| D 16 CAL0041: Start SQL statement: select count(*) from lineitem; |tpch10|
            Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.567967 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.594511 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.594584 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 20:32:10 rocky8 ExeMgr[8512]: 10.069535 |3|0|0| D 16 CAL0042: End SQL statement
            Jun 24 20:34:53 rocky8 ExeMgr[8512]: 53.033427 |3|0|0| D 16 CAL0041: Start SQL statement: select count(l_comment) from lineitem; |tpch10|
             
            Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.053499 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.076859 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.076907 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.637085 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.695066 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.695163 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.617281 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.802781 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.802881 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.615797 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.647628 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.647700 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 21:32:12 rocky8 ExeMgr[8512]: 12.090052 |3|0|0| D 16 CAL0042: End SQL statement
            Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.090422 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.122433 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.122474 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 21:59:17 rocky8 DMLProc[10183]: 17.588463 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 21:59:58 rocky8 DMLProc[10183]: 51.436649 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 22:00:27 rocky8 DMLProc[10183]: 19.968375 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 22:00:35 rocky8 controllernode[8451]: 35.666875 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
            Jun 24 22:00:35 rocky8 joblist[10183]: 35.817497 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 475 WECLIENT: There is no connection to WES and this = 0x7f0cb80096d0
            Jun 24 22:00:35 rocky8 DMLProc[10183]: 35.830536 |0|0|0| E 20 CAL0002: DDLProc cannot get clean up information from DDL log files due to There is no WriteEngineServer to send message to.
            Jun 24 22:00:35 rocky8 joblist[10183]: 35.830839 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT:  There is no connection to WES and this = 0x7f0cb80096d0
            Jun 24 22:00:36 rocky8 writeengineserver[10201]: 36.015959 |0|0|0| I 32 CAL0001: WriteEngine main process has started
            Jun 24 22:00:36 rocky8 ddlpackageproc[10179]: 36.038640 |0|0|0| C 23 CAL0008: Interrupted system call
            Jun 24 22:00:36 rocky8 ddlpackageproc[10219]: 36.073536 |0|0|0| C 23 CAL0008: DDLProc main process has started
            Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.235050 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.248034 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.248129 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 22:04:08 rocky8 Calpont[8442]: 08.537380 |0|0|0| W 00 CAL0094: Attempting to fix the BRM lock state. Diagnostic values: r=2 rwt=0 w=0 wwt=0.
            Jun 24 22:04:08 rocky8 Calpont[8442]: 08.537487 |0|0|0| W 00 CAL0095: BRM lock state appears to be functional again.
            Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.782487 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.798260 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.798293 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.305528 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.337576 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.337616 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 22:45:37 rocky8 DMLProc[10413]: 37.826611 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 22:45:38 rocky8 DMLProc[10413]: 38.118809 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 22:45:38 rocky8 DMLProc[10413]: 38.118886 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.349769 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.387448 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.387483 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.840404 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.886049 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.886099 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.361633 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.493635 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.493674 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.752505 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.780141 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.780254 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.073333 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.084897 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.084929 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 25 00:16:22 rocky8 DMLProc[10781]: 22.376086 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 25 00:17:05 rocky8 joblist[10781]: 50.357426 |0|0|0| E 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 282 Could not connect to pm1_WriteEngineServer: InetStreamSocket::connect: connection timed out
            Jun 25 00:18:00 rocky8 writeengine[10211]: 54.363513 |0|0|0| E 19 CAL0001: ReadThreadFactory::CreateReadThread: read() error InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.207575 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.223528 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26
            Jun 25 00:19:30 rocky8 controllernode[8451]: 30.474215 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
            Jun 25 00:19:30 rocky8 joblist[10781]: 30.635974 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT:  There is no connection to WES and this = 0x5585e4112d60
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.636205 |0|0|0| C 20 CAL0002:  problem with rollback transaction 26and DBRM is setting to readonly and table lock is not released: Network error occured when rolling back blocksThere is no WriteEngineServer to send message to.
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.646342 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions.
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.769315 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 26
            Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.769409 |0|0|0| C 20 CAL0002:  problem with rollback transaction 26and DBRM is setting to readonly and table lock is not released: Can't read DBRM isReadWrite [ DBRM is in READ-ONLY mode]
            Jun 25 00:19:30 rocky8 writeengineserver[10814]: 30.773146 |0|0|0| I 32 CAL0001: WriteEngine main process has started
            Jun 25 00:19:31 rocky8 DMLProc[10781]: 31.021308 |0|0|0| I 20 CAL0002: DMLProc finished rollbackAll.
            Jun 25 00:19:31 rocky8 joblist[10781]: 31.022257 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT:  There is no connection to WES and this = 0x5585e4112d60
            Jun 25 00:19:31 rocky8 ddlpackageproc[10777]: 31.443311 |0|0|0| C 23 CAL0008: Interrupted system call
            Jun 25 00:19:31 rocky8 ddlpackageproc[10833]: 31.470520 |0|0|0| C 23 CAL0008: DDLProc main process has started
            Jun 25 00:19:31 rocky8 DMLProc[10844]: 31.732514 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
            Jun 25 00:19:31 rocky8 DMLProc[10844]: 31.733110 |0|0|0| C 20 CAL0002: DMLProc failed to start due to : Rollback will be deferred due to DBRM is in read only state.
            Jun 25 00:19:31 rocky8 ddlpackageproc[10841]: 31.733644 |0|0|0| C 23 CAL0008: DMLProc main process initialization failed
            

            dleeyh Daniel Lee (Inactive) added a comment - Build tested: 6.4.1-1 (#4710) Some testing was done and logged in the duplicate ticket MCOL-5105 . I did a large DELETE test on this build. During ColumnStore start up, rollback kept on failing and DMLProc kept on crashing in a loop. Table used: dbt3 lineitem dataset: 305,931,475 rows VM memory: 4 gb version buffer: 20gb 1. Attempted to delete all rows in the table. MariaDB [tpch10]> select count(*) from lineitem; +-----------+ | count(*) | +-----------+ | 305931475 | +-----------+ 1 row in set (1.286 sec)   MariaDB [tpch10]> delete from lineitem; ERROR 1815 (HY000): Internal error: CAL0002: Delete Failed: MCS-2008: The version buffer overflowed. Increase VersionBufferFileSize or limit the rows to be processed. The delete operation did not run out of memory. Instead, a version buffer flow error returned. According to the debug log, rollback operation last for 40 minutes and failed. debug.log Jun 24 18:59:48 rocky8 dmlpackageproc[7115]: 48.354018 |5|26|0| D 21 CAL0041: Start SQL statement: delete from lineitem;|tpch10| Jun 24 18:59:48 rocky8 ExeMgr[7103]: 48.364928 |5|0|0| D 16 CAL0041: Start SQL statement: delete from lineitem; ||     Jun 24 19:17:47 rocky8 workernode[7033]: 47.670848 |0|0|0| C 30 CAL0000: VBBM::getBlocks(): version buffer overflow. Increase VersionBufferFileSize. Overflow occured in aged blocks. Requested NumBlocks:VbOid:vbFBO:lastFBO = 90:0:200:299 lbid locked is 955396 Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.686747 |5|26|0| D 21 CAL0042: End SQL statement Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.730642 |0|0|0| E 21 CAL0002: Delete Failed: MCS-2008: The version buffer overflowed. Increase VersionBufferFileSize or limit the rows to be processed. Jun 24 19:17:47 rocky8 ExeMgr[7103]: 47.748507 |5|0|0| D 16 CAL0042: End SQL statement Jun 24 19:17:47 rocky8 dmlpackageproc[7115]: 47.754052 |5|26|0| D 21 CAL0001: Start SQL statement: ROLLBACK   Jun 24 19:57:39 rocky8 workernode[7033]: 39.956957 |0|0|0| C 30 CAL0000: SlaveDBRMNode::vbRollback error. std::bad_alloc Jun 24 19:57:40 rocky8 DMLProc[7115]: 39.997409 |0|0|0| E 20 CAL0002: ROLLBACK failed due to: DBRM vbRollback [ UNKNOWN (255)] Jun 24 19:57:40 rocky8 dmlpackageproc[7115]: 40.004365 |5|26|0| D 21 CAL0001: End SQL statement Jun 24 19:57:40 rocky8 dmlpackageproc[7115]: 40.006629 |0|0|0| E 21 CAL0001: ROLLBACK failed due to: DBRM vbRollback [ UNKNOWN (255)] 2. restart mariadb and columnstore-columnstore Since rollback failed in step #1, it is expected the transaction will be rollback during ColumnStore start up. DMLProc kept on crashing during rollback. But during the rollback attempts, SELECT COUNT() statements returned the correct count. [rocky8:root~]# systemctl restart mariadb mariadb-columnstore [rocky8:root~]# mariadb tpch10 Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A   Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 3 Server version: 10.6.9-5-MariaDB MariaDB Server   Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.   Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.   MariaDB [tpch10]> select count(*) from lineitem; +-----------+ | count(*) | +-----------+ | 305931475 | +-----------+ 1 row in set (16 min 56.345 sec)   MariaDB [tpch10]> select count(l_comment) from lineitem; +------------------+ | count(l_comment) | +------------------+ | 305931475 | +------------------+ 1 row in set (57 min 19.110 sec) debug.log Jun 24 19:59:13 rocky8 controllernode[8451]: 13.745665 |0|0|0| D 29 CAL0000: DBRM Controller: Connected to DBRM_Worker1 Jun 24 19:59:13 rocky8 controllernode[8449]: 13.746249 |0|0|0| I 29 CAL0000: ControllerNode main process has started Jun 24 19:59:13 rocky8 writeengineserver[8492]: 13.810300 |0|0|0| I 32 CAL0001: WriteEngine main process has started Jun 24 19:59:13 rocky8 ExeMgr[8505]: 13.837356 |0|0|0| I 16 CAL0008: No such file or directory Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.861219 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.868685 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 19:59:13 rocky8 DMLProc[8524]: 13.868717 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26   Jun 24 20:14:14 rocky8 ddlpackageproc[8627]: 14.056136 |0|0|0| C 23 CAL0008: DDLProc main process has started Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.103672 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.125678 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 20:14:14 rocky8 DMLProc[8644]: 14.125742 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 20:14:14 rocky8 writeengine[8659]: 14.179792 |0|0|0| I 19 CAL0060: dbbuilder system catalog status: System catalog appears to exist. It will remain intact for reuse. The database is not recreated. Jun 24 20:14:14 rocky8 writeengineserver[8498]: 14.446670 |0|0|0| I 32 CAL0007: RollbackTran found chk name /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf size: 483328 offset: 4358144 Jun 24 20:14:14 rocky8 writeengineserver[8498]: 14.447356 |0|0|0| I 32 CAL0007: Rollback copied to file /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf from file /var/lib/columnstore/data1/000.dir/000.dir/012.dir/044.dir/001.dir/FILE003.cdf.chk Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.786728 |2147483651|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='tpch10' and tablename='lineitem' --columnRIDs/FE; || Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.810616 |2147483651|0|0| D 16 CAL0042: End SQL statement Jun 24 20:15:13 rocky8 ExeMgr[8512]: 13.851410 |3|0|0| D 16 CAL0041: Start SQL statement: select count(*) from lineitem; |tpch10| Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.567967 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.594511 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 20:29:14 rocky8 DMLProc[9155]: 14.594584 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 20:32:10 rocky8 ExeMgr[8512]: 10.069535 |3|0|0| D 16 CAL0042: End SQL statement Jun 24 20:34:53 rocky8 ExeMgr[8512]: 53.033427 |3|0|0| D 16 CAL0041: Start SQL statement: select count(l_comment) from lineitem; |tpch10|   Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.053499 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.076859 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 20:44:15 rocky8 DMLProc[9359]: 15.076907 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.637085 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.695066 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 20:59:15 rocky8 DMLProc[9564]: 15.695163 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.617281 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.802781 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 21:14:16 rocky8 DMLProc[9798]: 16.802881 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.615797 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.647628 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 21:29:16 rocky8 DMLProc[10020]: 16.647700 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 21:32:12 rocky8 ExeMgr[8512]: 12.090052 |3|0|0| D 16 CAL0042: End SQL statement Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.090422 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.122433 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 21:44:17 rocky8 DMLProc[10129]: 17.122474 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 21:59:17 rocky8 DMLProc[10183]: 17.588463 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 21:59:58 rocky8 DMLProc[10183]: 51.436649 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 22:00:27 rocky8 DMLProc[10183]: 19.968375 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 22:00:35 rocky8 controllernode[8451]: 35.666875 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR ) Jun 24 22:00:35 rocky8 joblist[10183]: 35.817497 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 475 WECLIENT: There is no connection to WES and this = 0x7f0cb80096d0 Jun 24 22:00:35 rocky8 DMLProc[10183]: 35.830536 |0|0|0| E 20 CAL0002: DDLProc cannot get clean up information from DDL log files due to There is no WriteEngineServer to send message to. Jun 24 22:00:35 rocky8 joblist[10183]: 35.830839 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT: There is no connection to WES and this = 0x7f0cb80096d0 Jun 24 22:00:36 rocky8 writeengineserver[10201]: 36.015959 |0|0|0| I 32 CAL0001: WriteEngine main process has started Jun 24 22:00:36 rocky8 ddlpackageproc[10179]: 36.038640 |0|0|0| C 23 CAL0008: Interrupted system call Jun 24 22:00:36 rocky8 ddlpackageproc[10219]: 36.073536 |0|0|0| C 23 CAL0008: DDLProc main process has started Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.235050 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.248034 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 22:00:36 rocky8 DMLProc[10229]: 36.248129 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 22:04:08 rocky8 Calpont[8442]: 08.537380 |0|0|0| W 00 CAL0094: Attempting to fix the BRM lock state. Diagnostic values: r=2 rwt=0 w=0 wwt=0. Jun 24 22:04:08 rocky8 Calpont[8442]: 08.537487 |0|0|0| W 00 CAL0095: BRM lock state appears to be functional again. Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.782487 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.798260 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 22:15:36 rocky8 DMLProc[10314]: 36.798293 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.305528 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.337576 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 22:30:37 rocky8 DMLProc[10363]: 37.337616 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 22:45:37 rocky8 DMLProc[10413]: 37.826611 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 22:45:38 rocky8 DMLProc[10413]: 38.118809 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 22:45:38 rocky8 DMLProc[10413]: 38.118886 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.349769 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.387448 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 23:00:38 rocky8 DMLProc[10470]: 38.387483 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.840404 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.886049 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 23:15:38 rocky8 DMLProc[10528]: 38.886099 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.361633 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.493635 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 23:30:39 rocky8 DMLProc[10593]: 39.493674 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.752505 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.780141 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 24 23:45:39 rocky8 DMLProc[10641]: 39.780254 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.073333 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.084897 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 25 00:00:40 rocky8 DMLProc[10705]: 40.084929 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 25 00:16:22 rocky8 DMLProc[10781]: 22.376086 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 25 00:17:05 rocky8 joblist[10781]: 50.357426 |0|0|0| E 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 282 Could not connect to pm1_WriteEngineServer: InetStreamSocket::connect: connection timed out Jun 25 00:18:00 rocky8 writeengine[10211]: 54.363513 |0|0|0| E 19 CAL0001: ReadThreadFactory::CreateReadThread: read() error InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR ) Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.207575 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables. Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.223528 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 26 Jun 25 00:19:30 rocky8 controllernode[8451]: 30.474215 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR ) Jun 25 00:19:30 rocky8 joblist[10781]: 30.635974 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT: There is no connection to WES and this = 0x5585e4112d60 Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.636205 |0|0|0| C 20 CAL0002: problem with rollback transaction 26and DBRM is setting to readonly and table lock is not released: Network error occured when rolling back blocksThere is no WriteEngineServer to send message to. Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.646342 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 transactions. Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.769315 |0|0|0| I 20 CAL0002: DMLProc will roll back transaction 26 Jun 25 00:19:30 rocky8 DMLProc[10781]: 30.769409 |0|0|0| C 20 CAL0002: problem with rollback transaction 26and DBRM is setting to readonly and table lock is not released: Can't read DBRM isReadWrite [ DBRM is in READ-ONLY mode] Jun 25 00:19:30 rocky8 writeengineserver[10814]: 30.773146 |0|0|0| I 32 CAL0001: WriteEngine main process has started Jun 25 00:19:31 rocky8 DMLProc[10781]: 31.021308 |0|0|0| I 20 CAL0002: DMLProc finished rollbackAll. Jun 25 00:19:31 rocky8 joblist[10781]: 31.022257 |0|0|0| D 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/writeengine/client/we_clients.cpp @ 495 WECLIENT: There is no connection to WES and this = 0x5585e4112d60 Jun 25 00:19:31 rocky8 ddlpackageproc[10777]: 31.443311 |0|0|0| C 23 CAL0008: Interrupted system call Jun 25 00:19:31 rocky8 ddlpackageproc[10833]: 31.470520 |0|0|0| C 23 CAL0008: DDLProc main process has started Jun 25 00:19:31 rocky8 DMLProc[10844]: 31.732514 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. Jun 25 00:19:31 rocky8 DMLProc[10844]: 31.733110 |0|0|0| C 20 CAL0002: DMLProc failed to start due to : Rollback will be deferred due to DBRM is in read only state. Jun 25 00:19:31 rocky8 ddlpackageproc[10841]: 31.733644 |0|0|0| C 23 CAL0008: DMLProc main process initialization failed
            drrtuy Roman added a comment -

            dleeyh Just to remind you that we limit the time systemd waits for the rollback on startup/shutdown to 15 minutes and this is what happens here. The only way to solve this is a maximum shutdown timeout for mcs-dmlproc systemd unit b/c startup timeout is bound by 20 minutes cap hardcoded in the engine code.

            drrtuy Roman added a comment - dleeyh Just to remind you that we limit the time systemd waits for the rollback on startup/shutdown to 15 minutes and this is what happens here. The only way to solve this is a maximum shutdown timeout for mcs-dmlproc systemd unit b/c startup timeout is bound by 20 minutes cap hardcoded in the engine code.
            drrtuy Roman added a comment -

            dleeyh I raised the hardcoded TO up to 2 hours as discussed so that it is possible to raise systemd startup TO up to the same limit.

            drrtuy Roman added a comment - dleeyh I raised the hardcoded TO up to 2 hours as discussed so that it is possible to raise systemd startup TO up to the same limit.

            People

              dleeyh Daniel Lee (Inactive)
              michael.amadi Michael Amadi
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.