[MCOL-833]  could not open file for OID after an outage recovery from pm2 PrimProc Created: 2017-07-25  Updated: 2023-10-26  Resolved: 2017-09-01

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?, ExeMgr
Affects Version/s: 1.0.9
Fix Version/s: 1.1.0

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

A customer reported PrimProc "could not open file for OID" errors with no known cause; see MCOL-801 and MCOL-804.

I was able to reproduce this error by doing the following. I wasn't sure whether it is the same problem as MCOL-801/MCOL-804, so I opened a new bug.

1. Set up a 1 UM / 2 PM system with a 50 GB tpch1 database
2. Run a script that continually issues the following query:
[root@ip-172-30-0-161 ~]# cat query.sh
#!/bin/bash
while true; do
    echo "select count(*) from lineitem" | /usr/local//mariadb/columnstore/mysql/bin/mysql --defaults-extra-file=/usr/local//mariadb/columnstore/mysql/my.cnf -u root tpch100
    sleep 1
done
exit 0

3. Run pkill against PrimProc on pm2

Errors started appearing in the pm1 logs soon after the recovery was performed:

Jul 14 16:24:46 ip-172-30-0-176 PrimProc[93531]: 46.550644 |0|0|0| W 28 CAL0000: IDB-2039: Data file does not exist, please contact your system administrator for more information.
Jul 14 16:24:47 ip-172-30-0-176 IDBFile[93531]: 47.550530 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file
Jul 14 16:24:48 ip-172-30-0-176 IDBFile[93531]: 48.550839 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file
Jul 14 16:24:49 ip-172-30-0-176 IDBFile[93531]: 49.551158 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf, exception: unable to open Unbuffered file
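Because the same CAL0002 entry repeats once per query, it can help to reduce the log to a unique list of the files PrimProc failed to open. The following is a minimal sketch that assumes the message format shown in the log lines above; the function name `failed_paths` is made up for illustration, and the log file to feed it depends on the local syslog setup.

```bash
#!/bin/bash
# Hypothetical helper (not part of ColumnStore): read syslog lines on stdin
# and print each distinct path reported in a "Failed to open file" message.
# Assumes the format "CAL0002: Failed to open file: <path>, exception: ...".
failed_paths() {
    sed -n 's/.*CAL0002: Failed to open file: \(.*\), exception:.*/\1/p' | sort -u
}
```

Usage would be along the lines of `failed_paths < some_syslog_file`, where `some_syslog_file` is a placeholder for wherever PrimProc messages are logged on the system.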

This file exists on pm2, so ExeMgr is sending the request to the wrong PrimProc (the one on pm1).

data2]# ll 000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf
-rw-r--r-- 1 root root 11345920 Jul 14 16:00 000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf
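To tie a failing path back to the OID named in the error, the directory components can be decoded. This is only a sketch under an assumption that this ticket does not state: that the four leading `NNN.dir` components hold the four bytes of the OID (most significant first), the fifth holds the partition number, and the `FILENNN.cdf` name holds the segment number. The function name `path_to_oid` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not part of ColumnStore). ASSUMPTION: the path layout is
#   A.dir/B.dir/C.dir/D.dir/<partition>.dir/FILE<segment>.cdf
# where A..D are the OID bytes in decimal, most significant byte first.
path_to_oid() {
    local path="${1#/}"          # drop a leading slash if present
    local a b c d part seg
    IFS=/ read -r a b c d part seg <<< "$path"
    a="${a%.dir}"; b="${b%.dir}"; c="${c%.dir}"; d="${d%.dir}"
    part="${part%.dir}"
    seg="${seg#FILE}"; seg="${seg%.cdf}"
    # The 10# prefix forces base 10 so that a value like 012 is not read as octal.
    local oid=$(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
    echo "oid=$oid partition=$((10#$part)) segment=$((10#$seg))"
}
```

Under that assumption, `path_to_oid /000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf` prints `oid=3084 partition=0 segment=2`, which could then be compared against the extent map to see which PM is expected to serve that file.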



 Comments   
Comment by David Hill (Inactive) [ 2017-07-26 ]

commit 842838e5cec49d42e209cd8a9284ac4699a53d99
Author: David Hill <david.hill@mariadb.com>
Date: Wed Jul 26 15:30:01 2017 -0500

MCOL-833 - fix code merge issue

dbcon/joblist/distributedenginecomm.cpp | 2 +-
1 file changed, 1 insertion, 1 deletion

commit 26ac4aa31c98bb937c01ae9d0287df7e027c4ec3
Author: david hill <david.hill@mariadb.com>
Date: Wed Jul 26 15:03:52 2017 -0500

mcol-833 - merge code from 1.0 for missing file fix

dbcon/joblist/distributedenginecomm.cpp | 8 ++++----
procmgr/main.cpp | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------
procmgr/processmanager.cpp | 35 ++++++++++++++++++-----------------
procmon/processmonitor.cpp | 4 ++--
4 files changed, 89 insertions, 59 deletions

Test scenarios are in MCOL-814.

Comment by Daniel Lee (Inactive) [ 2017-09-01 ]

Build verified: 1.1.0 GitHub source

/root/columnstore/mariadb-columnstore-server
commit 6ed33d194819aaa5f2521c888639f44546fb7ce2
Merge: 97284ea 770537e
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Thu Aug 3 20:54:13 2017 +0100

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 44ea79d49c0354a5dd1d5c97d95ad2b8f366bc8b
Author: david hill <david.hill@mariadb.com>
Date: Thu Aug 31 11:34:17 2017 -0500
