[MCOL-2140] timeout for replication of 1 minute is too small - timing out on system with 4 nodes Created: 2019-02-05  Updated: 2023-03-20  Resolved: 2023-03-06

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: None
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 1
Labels: None
Environment:

2um 2pm with local query



 Description   

A customer reported that replication wasn't working and the slaves weren't being set up on their 2pm 2um system with local query. It turns out that the distribute request failed due to a timeout on PM1 ProcMgr waiting on UM1 ProcMon. The distribute command took longer than 1 minute (15:29:45 to 15:31:06, about 81 seconds) on a 4-node system, where it has to distribute to 3 slave nodes.

PM1

Feb 4 15:29:45 usfit-scdb6 ProcessManager[171189]: 45.342198 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module um1
Feb 4 15:30:45 usfit-scdb6 ProcessManager[171189]: 45.393626 |0|0|0| E 17 CAL0000: line: 6901 sendMsgProcMon: ProcMon Msg timeout on module um1

UM1 15:29:45 to 15:31:06

Feb 4 15:29:45 usfit-scdb1 ProcessMonitor[101017]: 45.338509 |0|0|0| I 18 CAL0000: MSG RECEIVED: Run Master DB Distribute command
Feb 4 15:29:45 usfit-scdb1 ProcessMonitor[101017]: 45.338704 |0|0|0| D 18 CAL0000: runMasterDist function called

Feb 4 15:29:45 usfit-scdb1 ProcessMonitor[101017]: 45.350897 |0|0|0| D 18 CAL0000: cmd = /usr/local/mariadb/columnstore/bin/rsync.sh 192.168.212.39 ssh /usr/local/mariadb/columnstore 1 > /scdbprd_tmp//master-dist_um2.log
Feb 4 15:30:08 usfit-scdb1 ProcessMonitor[101017]: 08.522846 |0|0|0| D 18 CAL0000: runMasterDist: Success rsync to module: um2

Feb 4 15:30:08 usfit-scdb1 ProcessMonitor[101017]: 08.522949 |0|0|0| D 18 CAL0000: cmd = /usr/local/mariadb/columnstore/bin/rsync.sh 192.168.212.47 ssh /usr/local/mariadb/columnstore 1 > /scdbprd_tmp//master-dist_pm1.log
Feb 4 15:30:38 usfit-scdb1 ProcessMonitor[101017]: 38.745679 |0|0|0| D 18 CAL0000: runMasterDist: Success rsync to module: pm1

Feb 4 15:30:38 usfit-scdb1 ProcessMonitor[101017]: 38.745774 |0|0|0| D 18 CAL0000: cmd = /usr/local/mariadb/columnstore/bin/rsync.sh 192.168.212.48 ssh /usr/local/mariadb/columnstore 1 > /scdbprd_tmp//master-dist_pm2.log
Feb 4 15:31:06 usfit-scdb1 ProcessMonitor[101017]: 06.437719 |0|0|0| D 18 CAL0000: runMasterDist: Success rsync to module: pm2

Feb 4 15:31:06 usfit-scdb1 ProcessMonitor[101017]: 06.437841 |0|0|0| I 18 CAL0000: MASTERDIST: runMasterRep - ACK back to ProcMgr return status = 0
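
The timing in the UM1 log above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of ColumnStore: the timestamps are copied from the log lines, the 60-second value is the wait observed on PM1 (15:29:45 to 15:30:45), and the names `steps`, `seconds_between`, and `PROCMGR_TIMEOUT_S` are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M:%S"
PROCMGR_TIMEOUT_S = 60  # wait observed on PM1 before "ProcMon Msg timeout"

# (target module, rsync start, rsync success), copied from the UM1 log
steps = [
    ("um2", "15:29:45", "15:30:08"),
    ("pm1", "15:30:08", "15:30:38"),
    ("pm2", "15:30:38", "15:31:06"),
]

def seconds_between(start, end):
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

# Per-module rsync duration and the end-to-end duration of the distribute
durations = {module: seconds_between(s, e) for module, s, e in steps}
total = seconds_between(steps[0][1], steps[-1][2])

for module, secs in durations.items():
    print(f"rsync to {module}: {secs}s")
print(f"total: {total}s, over the timeout by {total - PROCMGR_TIMEOUT_S}s")
```

Each rsync takes 23 to 30 seconds and they run one after another, so with three targets the total (81 seconds) exceeds the 1-minute wait even though no single rsync does.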



 Comments   
Comment by David Hill (Inactive) [ 2019-02-05 ]

Also, the timeouts in the rsync script itself need to be increased if it's taking 30 seconds.

set timeout 20
set timeout 10

Comment by Nico [ 2019-03-19 ]

This is the same issue I noticed in MCOL-1573.
The timeouts are too short.
In my version of rsync.sh (2019-01-25) I take care of this by using 600 as the timeout.
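
As a rough illustration of a longer, explicit limit, a caller could run each per-module command under a configurable timeout. This is a Python sketch only, not the actual rsync.sh or ProcMon code: `run_with_timeout` is a hypothetical helper, the 600-second default simply mirrors the value mentioned above, and the command in the demo is a stand-in.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s=600):
    """Run cmd and return (ok, detail). Hypothetical helper for illustration."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception
        return False, f"timed out after {timeout_s}s"
    return result.returncode == 0, f"exit status {result.returncode}"

if __name__ == "__main__":
    # Stand-in command; a real caller would pass the rsync.sh invocation.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

Keeping the limit as a parameter lets it be sized to the data being copied instead of being fixed at a value that only fits small systems.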

Comment by Todd Stoffel (Inactive) [ 2023-03-06 ]

This ticket was opened prior to convergence with the server. It may have been rendered obsolete. If this issue still exists in a modern version, please open a new request.
