[MCOL-814] PrimProc could not open file for OID after an outage recovery from pm2 PrimProc Created: 2017-07-14  Updated: 2023-10-26  Resolved: 2017-07-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?, ExeMgr
Affects Version/s: 1.0.9
Fix Version/s: 1.0.10

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2017-14, 2017-15

 Description   

A customer reported "PrimProc could not open file for OID" errors occurring for unknown reasons; see MCOL-801 and MCOL-804.

I was able to reproduce this error by doing the following. I wasn't sure whether it is the same problem as MCOL-801/MCOL-804, so I opened a new bug.

1. Set up a 1 UM / 2 PM system with a 50 GB tpch1 database
2. Run a script that continually executes the following query:
[root@ip-172-30-0-161 ~]# cat query.sh
#!/bin/bash
while true; do
echo "select count(*) from lineitem" | /usr/local//mariadb/columnstore/mysql/bin/mysql --defaults-extra-file=/usr/local//mariadb/columnstore/mysql/my.cnf -u root tpch100
sleep 1
done
exit 0

3. Ran pkill against PrimProc on pm2

Errors started appearing in the pm1 logs soon after the recovery was performed:

Jul 14 16:24:46 ip-172-30-0-176 PrimProc[93531]: 46.550644 |0|0|0| W 28 CAL0000: IDB-2039: Data file does not exist, please contact your system administrator for more information.
Jul 14 16:24:47 ip-172-30-0-176 IDBFile[93531]: 47.550530 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file
Jul 14 16:24:48 ip-172-30-0-176 IDBFile[93531]: 48.550839 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file
Jul 14 16:24:49 ip-172-30-0-176 IDBFile[93531]: 49.551158 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file

This file exists on pm2, so ExeMgr is sending the request to the wrong PrimProc (the one on pm1).

data2]# ll 000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf
-rw-r--r-- 1 root root 11345920 Jul 14 16:00 000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf
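The log reports a file path while the error text refers to an OID, so it can help to translate one into the other. The following is a minimal sketch, assuming the usual ColumnStore layout in which the four leading NNN.dir components are the OID bytes (most significant first), the fifth is the partition, and FILEnnn.cdf carries the segment number. The function name decode_cs_path is hypothetical and not part of ColumnStore.

```bash
#!/bin/bash
# Decode a ColumnStore segment-file path into OID, partition and segment.
# ASSUMPTION: directories 1-4 are the OID bytes (high to low), directory 5
# is the partition, and FILEnnn.cdf holds the segment number. Verify this
# against your installation before relying on it.
decode_cs_path() {
    local fields
    # Pull the six 3-digit numbers out of the tail of the path.
    fields=$(printf '%s\n' "$1" | sed -n -E \
        's#^(.*/)?([0-9]{3})\.dir/([0-9]{3})\.dir/([0-9]{3})\.dir/([0-9]{3})\.dir/([0-9]{3})\.dir/FILE([0-9]{3})\.cdf$#\2 \3 \4 \5 \6 \7#p')
    if [ -z "$fields" ]; then
        echo "unrecognized path: $1" >&2
        return 1
    fi
    # awk reads "012" as decimal 12, which avoids shell octal pitfalls.
    printf '%s\n' "$fields" | awk '{
        printf "oid=%d partition=%d segment=%d\n",
               $1 * 16777216 + $2 * 65536 + $3 * 256 + $4, $5, $6
    }'
}

# Path taken from the pm1 log excerpt above.
decode_cs_path "/000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf"
# prints: oid=3084 partition=0 segment=2
```

Under that assumption, the failing requests on pm1 concern segment 2 of OID 3084, which is the file that was found on pm2.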



 Comments   
Comment by David Hill (Inactive) [ 2017-07-24 ]

Logic Changes:

oam - There is an API that stops system queries from going through. It is now called before ExeMgr is restarted and cleared after the ExeMgr processes are active. This is an attempt to block queries while processes are restarting, so during that window you might now see the "system is not ready for queries" error instead of this error.

distributedenginecomm - I just commented out some failover code that removed the PrimProc from its local connections list. Initially I did this only as a test, to see whether keeping the PM that lost its connection in the list would stop its requests from being sent over to pm1. With this change together with the OAM change, the problem stopped. It did not stop with the OAM changes alone.

Comment by David Hill (Inactive) [ 2017-07-24 ]

in development test

Comment by David Hill (Inactive) [ 2017-07-24 ]

My test scenario

1. Start with a 1 UM / 2 PM system, shared-nothing setup
2. Create a 50 GB tpch database
3. Create a script and run this query continuously

4. On pm1 or um1, run ma getsystemi
5. On pm2 - # pkill PrimProc
6. Watch the log files and make sure no missing-file errors get reported
7. On um1 - you might see the error about not being able to query at this time. Once PrimProc is active, all processes have been restarted, and the system goes from BUSY_INIT to ACTIVE, queries will start again.
Sometimes the script hangs, for example when a query did not get a disconnect because of all the processes starting, so you might have to stop and start the script again.

B. I ran the same test on an Amazon shared-storage system
C. Ran the test with a pm2 server outage
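The "watch the log files" step in the scenario above can be partly automated. This is a rough sketch that counts the two message patterns seen in the pm1 log excerpt in this issue. The function names are hypothetical, and the log path in the usage comment is an assumption that may differ per install.

```bash
#!/bin/bash
# Count log lines that indicate PrimProc could not open a data file.
# The two patterns come from the pm1 log excerpt in this issue.
count_open_errors() {
    local logfile="$1" n
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # grep -c prints the match count but exits 1 when the count is 0,
    # so keep the number and ignore the exit status.
    n=$(grep -c -E 'IDB-2039|CAL0002: Failed to open file' "$logfile") || true
    echo "${n:-0}"
}

# Succeeds only when the log is clean, so it can gate a test run.
assert_no_open_errors() {
    local n
    n=$(count_open_errors "$1") || return 2
    if [ "$n" -gt 0 ]; then
        echo "FAIL: $n open-file errors in $1" >&2
        return 1
    fi
    echo "OK: no open-file errors in $1"
}

# Example (the log path is an assumption; adjust to your install):
#   assert_no_open_errors /var/log/mariadb/columnstore/debug.log
```

Running the check on both PMs after the system returns to ACTIVE covers the case in this issue, where the errors showed up on pm1 rather than on the PM whose PrimProc was killed.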

Comment by David Hill (Inactive) [ 2017-07-24 ]

commit 93794c9c3ffd02b5907622d38a7e5af013f2b120
Author: david hill <david.hill@mariadb.com>
Date: Mon Jul 24 10:09:04 2017 -0500

MCOL-814 - more changes for failover query handling

Comment by Daniel Lee (Inactive) [ 2017-07-25 ]

Build verified: Github source 1.0.10

1.0.10

[root@localhost mariadb-columnstore-server]# git show
commit 6e32a494b4387f3a501bc09addeffacb68eb8e99
Merge: 435972e 87f4873
Author: David.Hall <david.hall@mariadb.com>
Date: Tue Jul 25 00:25:21 2017 -0500

[root@localhost mariadb-columnstore-engine]# git show
commit aa27537874868745ea4086b0a1279191e07779df
Merge: 93794c9 7e568f5
Author: david hill <david.hill@mariadb.com>
Date: Mon Jul 24 10:09:17 2017 -0500

Comment by Daniel Lee (Inactive) [ 2017-07-25 ]

Test output. Verified that there are no "PrimProc could not open file" errors from both PMs.

299930260
count
299930260
ERROR 1815 (HY000) at line 1: Internal error: st: 0 TupleBPS::sendPrimitiveMessages() caught an exception: DistributedEngineComm::write: Broken Pipe error

ERROR 1815 (HY000) at line 1: Internal error: DistributedEngineComm::write: Broken Pipe error
ERROR 1815 (HY000) at line 1: Internal error: DistributedEngineComm::write: Broken Pipe error
ERROR 1815 (HY000) at line 1: Internal error: The system is not yet ready to accept queries
ERROR 1815 (HY000) at line 1: Internal error: The system is not yet ready to accept queries
ERROR 1815 (HY000) at line 1: Internal error: The system is not yet ready to accept queries
ERROR 1815 (HY000) at line 1: Internal error: The system is not yet ready to accept queries
ERROR 1815 (HY000) at line 1: Internal error: The system is not yet ready to accept queries
count
299930260
count
299930260
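
Output like the above can be summarized mechanically when checking a longer run. The sketch below sorts each line of client output into a category based on the message texts that appear in this issue. The category names and both function names are illustrative only, not ColumnStore terminology.

```bash
#!/bin/bash
# Classify one line of mysql client output from the query loop.
# Category names are illustrative; the patterns come from this issue.
classify_line() {
    case "$1" in
        *"not yet ready to accept queries"*) echo blocked ;;
        *"Broken Pipe error"*)               echo disconnected ;;
        *"could not open file"*|*IDB-2039*)  echo missing_file ;;
        ERROR*)                              echo other_error ;;
        *)                                   echo ok ;;
    esac
}

# Read client output on stdin and print a count per category.
summarize_run() {
    local line
    while IFS= read -r line; do
        # Skip blank lines so they are not counted as "ok".
        if [ -n "$line" ]; then
            classify_line "$line"
        fi
    done | sort | uniq -c
}

# Example, with run.log standing in for a captured output file:
#   summarize_run < run.log
```

For a run like the one shown, the blocked and disconnected counts correspond to the restart window, and a nonzero missing_file count would indicate that the original problem is back.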
