[MCOL-814] PrimProc could not open file for OID after a outage recover from pm2 PrimProc Created: 2017-07-14 Updated: 2023-10-26 Resolved: 2017-07-25 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ?, ExeMgr |
| Affects Version/s: | 1.0.9 |
| Fix Version/s: | 1.0.10 |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hill (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Sprint: | 2017-14, 2017-15 |
| Description |
|
A customer reported "PrimProc could not open file for OID" errors with no known cause. I was able to reproduce the error with the steps below. I wasn't sure whether this is the same problem as 801/804, so I opened a new bug.

1. Set up a 1 UM / 2 PM system with a 50 GB tpch1 database.
3. Did a pkill on pm2 PrimProc.

Errors started appearing in the pm1 logs soon after the recovery was performed:

Jul 14 16:24:46 ip-172-30-0-176 PrimProc[93531]: 46.550644 |0|0|0| W 28 CAL0000: IDB-2039: Data file does not exist, please contact your system administrator for more information.

This file exists on pm2, so ExeMgr is sending the request to the wrong PrimProc (the one on pm1).

data2]# ll 000.dir/000.dir/012.dir/012.dir/000.dir/FILE002.cdf |
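When repeating this reproduction, it can help to pull the IDB-2039 warnings out of the PM logs automatically. The following is a minimal sketch, not part of ColumnStore: the regular expression is modeled only on the single log line quoted above, and the helper names `parse_line` and `missing_file_warnings` are hypothetical.

```python
import re

# Pattern modeled on the one log line quoted in this ticket. Other
# ColumnStore log lines may be laid out differently (assumption).
LOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"   # e.g. "Jul 14 16:24:46"
    r"(?P<host>\S+)\s"                           # e.g. "ip-172-30-0-176"
    r"(?P<process>\w+)\[(?P<pid>\d+)\]:\s"       # e.g. "PrimProc[93531]:"
    r".*?(?P<code>IDB-\d+):\s(?P<message>.*)$"   # e.g. "IDB-2039: Data file ..."
)


def parse_line(line):
    """Return a dict of fields for a matching log line, else None."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def missing_file_warnings(lines, code="IDB-2039"):
    """Return parsed entries whose error code equals `code`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["code"] == code]


SAMPLE = (
    "Jul 14 16:24:46 ip-172-30-0-176 PrimProc[93531]: 46.550644 |0|0|0| "
    "W 28 CAL0000: IDB-2039: Data file does not exist, please contact "
    "your system administrator for more information."
)

if __name__ == "__main__":
    for entry in missing_file_warnings([SAMPLE]):
        print(entry["host"], entry["process"], entry["pid"], entry["code"])
```

Running the helper over the logs of each PM shows which PrimProc reported the missing file, which is the symptom of a request routed to the wrong PM.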
| Comments |
| Comment by David Hill (Inactive) [ 2017-07-24 ] |
|
Logic changes:

oam - There is an API to stop system queries from going through. It is now called before ExeMgr is restarted and cleared after the ExeMgr processes are active. This is to try to block queries while a process is being restarted. During that window you might now see the "system is not ready for queries" error instead of this error.

DistributedEngineComm - I just commented out some failover code that removed the PrimProc from its local connections list. Initially I did this only as a test, to see whether not removing the PM that lost its connection would keep its requests from being sent over to pm1. With this change and the OAM change together, the problem stopped. It didn't stop with the OAM changes alone. |
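The OAM part of the change follows a block, restart, clear pattern. The sketch below illustrates that ordering only; it is not the ColumnStore code. `QueryGate`, `SystemNotReadyError`, and `restart_with_gate` are hypothetical names, and the restart and readiness callbacks are placeholders for whatever the process manager actually does.

```python
import threading


class SystemNotReadyError(RuntimeError):
    """Hypothetical stand-in for a 'system is not ready for queries' error."""


class QueryGate:
    """Hypothetical flag that rejects new queries while it is set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked = False

    def block(self):
        with self._lock:
            self._blocked = True

    def clear(self):
        with self._lock:
            self._blocked = False

    def check(self):
        """Raise if queries are currently blocked; otherwise do nothing."""
        with self._lock:
            if self._blocked:
                raise SystemNotReadyError("system is not ready for queries")


def restart_with_gate(gate, restart, wait_until_active):
    """Block queries, restart, and reopen the gate once the process is active.

    The gate is cleared only after `wait_until_active` returns, mirroring
    the ordering described in the comment above. If the restart fails,
    the gate stays closed (a choice made for this sketch).
    """
    gate.block()
    restart()
    wait_until_active()
    gate.clear()


if __name__ == "__main__":
    gate = QueryGate()

    def fake_restart():
        # A query arriving mid-restart is rejected instead of misrouted.
        try:
            gate.check()
        except SystemNotReadyError as err:
            print("rejected:", err)

    restart_with_gate(gate, fake_restart, lambda: None)
    gate.check()  # does not raise once the gate has been cleared
    print("queries accepted again")
```

The point of the ordering is that a query arriving mid-restart fails fast with a clear error rather than being sent to a PM that does not hold the data file.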
| Comment by David Hill (Inactive) [ 2017-07-24 ] |
|
in development test |
| Comment by David Hill (Inactive) [ 2017-07-24 ] |
|
My test scenario:

1. Start with a 1 UM / 2 PM system, shared-nothing setup.
4. On pm1 or um1, run ma getsystemi

B. I did the same test on an Amazon shared-storage system. |
| Comment by David Hill (Inactive) [ 2017-07-24 ] |
|
commit 93794c9c3ffd02b5907622d38a7e5af013f2b120 |
| Comment by Daniel Lee (Inactive) [ 2017-07-25 ] |
|
Build verified: GitHub source 1.0.10 1.0.10

[root@localhost mariadb-columnstore-server]# git show
[root@localhost mariadb-columnstore-engine]# git show |
| Comment by Daniel Lee (Inactive) [ 2017-07-25 ] |
|
Test output. Verified that there are no "PrimProc could not open file" errors from either PM.

299930260
ERROR 1815 (HY000) at line 1: Internal error: DistributedEngineComm::write: Broken Pipe error