[MCOL-259] intermediate regression test failures - At least one DBRoot required for that query is offline. Created: 2016-08-03  Updated: 2016-09-21  Resolved: 2016-09-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.0.2
Fix Version/s: 1.0.3

Type: Bug
Priority: Major
Reporter: David Hill (Inactive)
Assignee: David Hill (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

centos 7 nightly regression test run


Issue Links:
PartOf
is part of MCOL-280 Beta issues Closed
Relates
relates to MCOL-137 create table and queries failed after... Closed

 Description   

Test suite 299 failed on the nightly regression test. The error below was reported at the time of the failures. This error means that, after a system restart, ExeMgr either did not have a good connection with a PrimProc and thought that pm1 was down, or had an issue reading the dbroot-to-PM assignments, which led to the error. When this error occurs, all queries fail until the system is restarted.

Aug 3 07:34:37 centos7 joblist[5536]: 37.749540 |2147483653|0|0| C 05 CAL0000: IDB-2034: At least one DBRoot required for that query is offline.



 Comments   
Comment by David Hall (Inactive) [ 2016-08-10 ]

Trying:
Add code to reload the xml and retry when this occurs. It seems to help. We'll see if this issue disappears after this.

Comment by David Hall (Inactive) [ 2016-08-23 ]

The issue has been seen since this modification.

It seems that the restart.sh script used in the regression tests (101 and 299) calls mysql-columnstore stop after it calls mcsadmin restart. This causes mysqld to go into the MAN_OFFLINE state temporarily, which in turn causes the PM to go into a DEGRADED state. The current code treats DEGRADED as invalid. DEGRADED is also used when one NIC of a multi-NIC PM is down. We want to keep such a PM in the list.

Sometimes our scripts are so fast that some SQL leaks in before mysqld is rebooted but doesn't reach that point in the processing until after the PM is DEGRADED. This causes ExeMgr to leave that PM out of its cached list of active PMs. That SQL is then rejected because of the issue, and all subsequent SQL is also rejected because the PM state is cached.
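The failure mode described above can be illustrated with a small simulation. This is only a sketch and not ColumnStore code: the class `ActivePmCache`, the function `run_query`, and the representation of PM states as plain strings are all hypothetical names chosen for illustration. It assumes a cache that is filled once on first use and never refreshed.

```python
# Hypothetical model of the pre-fix behavior; not actual ExeMgr or OAM cache code.

class ActivePmCache:
    """Caches the list of PMs considered usable, built once on first use."""

    def __init__(self, get_pm_states):
        self._get_pm_states = get_pm_states  # callable returning {pm_name: state}
        self._active = None                  # None means "not loaded yet"

    def active_pms(self):
        if self._active is None:
            states = self._get_pm_states()
            # Pre-fix rule: only ACTIVE PMs are kept, so a DEGRADED PM is dropped.
            self._active = [pm for pm, st in states.items() if st == "ACTIVE"]
        return self._active


def run_query(cache, required_pm):
    """Reject the query if the PM it needs is missing from the cached list."""
    if required_pm not in cache.active_pms():
        return "IDB-2034: At least one DBRoot required for that query is offline."
    return "OK"


states = {"pm1": "DEGRADED"}
cache = ActivePmCache(lambda: dict(states))

first = run_query(cache, "pm1")   # cache is loaded while pm1 is DEGRADED
states["pm1"] = "ACTIVE"          # pm1 recovers shortly afterwards
second = run_query(cache, "pm1")  # still rejected: the stale list is reused
```

In this model both `first` and `second` carry the IDB-2034 message, which matches the observation that every query keeps failing until the system is restarted.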

By code inspection, this problem can occur when a PM is in some state other than ACTIVE and some thread in ExeMgr calls one of the OAM_CACHE functions, which might trigger an oam cache reload. If a PM is in any state but ACTIVE, the oam cache reload logic ignores that PM, which is not what we want. I believe my retry logic fails because, during these fast-moving tests, the retry happens too soon and the PM is still in another state.

I added code so that if a PM is in BUSY_INIT (not probable), MAN_INIT (also not probable) or PID_UPDATE (possible, this state was more recently added to the code), the cache reload will retry a few times over 5 seconds to see if that PM goes ACTIVE. Also added that DEGRADED is a valid state, rather than rejecting it.
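The change described above can be sketched roughly as follows. This is a hypothetical illustration and not the actual patch: the function `reload_active_pms`, its parameters, and the choice of five attempts spread evenly over the 5 seconds are assumptions. Only the state names and the overall 5-second window come from the comment.

```python
import time

# Hypothetical sketch of the described fix; names and retry count are assumptions.
USABLE_STATES = {"ACTIVE", "DEGRADED"}                       # DEGRADED is now accepted
TRANSITIONAL_STATES = {"BUSY_INIT", "MAN_INIT", "PID_UPDATE"}  # worth waiting on


def reload_active_pms(get_pm_states, total_wait=5.0, attempts=5, sleep=time.sleep):
    """Return the usable PMs, re-reading states while any PM is still transitional."""
    delay = total_wait / attempts
    states = get_pm_states()
    for _ in range(attempts):
        if not any(st in TRANSITIONAL_STATES for st in states.values()):
            break                 # nothing left to wait for
        sleep(delay)              # give the PM time to become ACTIVE
        states = get_pm_states()
    return sorted(pm for pm, st in states.items() if st in USABLE_STATES)


# Simulated status snapshots: pm1 leaves PID_UPDATE on the third read.
snapshots = iter([
    {"pm1": "PID_UPDATE", "pm2": "DEGRADED"},
    {"pm1": "PID_UPDATE", "pm2": "DEGRADED"},
    {"pm1": "ACTIVE", "pm2": "DEGRADED"},
])
result = reload_active_pms(lambda: next(snapshots), sleep=lambda seconds: None)
```

Here `result` contains both pm1 and pm2. The `sleep` parameter is injectable only so the example runs without real delays. A PM that never leaves a transitional state is still excluded once the attempts run out.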

Comment by David Hall (Inactive) [ 2016-08-25 ]

For review

Comment by Daniel Lee (Inactive) [ 2016-09-15 ]

Assigned it to Mr. Hill for regression test.

Comment by David Hill (Inactive) [ 2016-09-16 ]

test299 passed regression testing. The issue was not encountered.
