[MCOL-1128] exemgr becomes non responsive
Created: 2017-12-21  Updated: 2020-08-25  Resolved: 2018-01-22

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.1.2
Fix Version/s: 1.1.3

Type: Bug
Priority: Critical
Reporter: David Thompson (Inactive)
Assignee: Daniel Lee (Inactive)
Resolution: Fixed
Votes: 1
Labels: None

Issue Links:
Relates
relates to MCOL-1062: High concurrency can lock up PrimProc (Closed)
Sprint: 2017-25, 2018-01, 2018-02

 Description   

1 UM, 1 PM setup. Data is loaded by periodic multi-value INSERT DML statements and is also queried. This worked with InfiniDB 4.0, but after the upgrade the customer finds that the system works for a few hours and then inserts, queries, or both stop working.



 Comments   
Comment by David Hall (Inactive) [ 2017-12-22 ]

The issue that Daniel reproduced can easily be reproduced on a single-server 1.1 stack.
I have failed to reproduce it on a single-server 1.0 stack.

This implies there's a good possibility the bug was introduced in 1.1. I will attempt to ascertain the exact patch that did it.

In the meantime, the customer may want to try 1.0.12 to see whether that helps.

Comment by Andrew Hutchings (Inactive) [ 2018-01-10 ]

A git bisect shows with some certainty that the regression happened during the transition from MariaDB 10.1 to 10.2 in ColumnStore 1.1. It is hard to pinpoint exactly what triggered it, because finding matching server and engine code revisions is very difficult (and not all of them compile on my test machine).

The recommended next step is to compare what the server asks of the engine in 1.0 and in 1.1 for these queries and spot any differences.

Comment by David Hall (Inactive) [ 2018-01-11 ]

In 02/17, we added thread pooling to ExeMgr to increase performance. The connections to the connector, DMLProc, or DDLProc are thread-pooled and limited to 50 threads. Any additional connections are delayed until a thread becomes available. Removing the 50-thread limit and allowing the thread pool to grow to an unspecified number of threads clears the issue up.

DMLProc still calls ExeMgr even though there's no explicit query – it still needs to do system catalog queries.

Why waiting for an ExeMgr thread causes DMLProc to wait forever is not yet clear. However, this is the first progress we've made on this issue, so I thought a status update was needed. I should be able to clear this all up in one more day.

Comment by Andrew Hutchings (Inactive) [ 2018-01-12 ]

Since these are inserts, every one of them will flush the system catalog cache (for the extent metadata update), and since they run on the UM, they will get the system catalog via ExeMgr rather than directly from PrimProc (there are two access methods in the code).

I believe ExeMgr waits forever because it is waiting on PrimProc (MCOL-1062). This can be observed by running Poor Man's Profiler (https://poormansprofiler.org/) on PrimProc and ExeMgr and looking at the stage at which the threads are stuck. PrimProc is trying to get a thread from its own thread pool for BPP searches but cannot, because none are available and none can be freed up.

Comment by David Hall (Inactive) [ 2018-01-12 ]

A workaround is to shut the system down. Be sure it is completely down. Then add the following to Columnstore.xml:

<ServerThreads>200</ServerThreads>
<ServerQueueSize>400</ServerQueueSize>
such that it looks like this:

<Columnstore Version="V1.0.0">
  <ExeMgr1>
    <ServerThreads>200</ServerThreads>
    <ServerQueueSize>400</ServerQueueSize>
    <IPAddr>127.0.0.1</IPAddr>
    <Port>8601</Port>
    <Module>pm1</Module>
  </ExeMgr1>

Restart the system.

The defaults are 50 for ServerThreads and 100 for ServerQueueSize. You could set them higher than 200/400.
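
For anyone who prefers to script the edit, here is a minimal Python sketch that sets the two values in the ExeMgr1 section. The helper name `set_exemgr_limits` is hypothetical, the sample XML is a trimmed stand-in for a real Columnstore.xml, and `xml.etree.ElementTree` drops comments and the XML declaration when it rewrites a document, so treat this as an illustration and keep a backup of the real file.

```python
import xml.etree.ElementTree as ET


def set_exemgr_limits(xml_text, threads=200, queue_size=400, section="ExeMgr1"):
    """Return xml_text with ServerThreads/ServerQueueSize set in `section`."""
    root = ET.fromstring(xml_text)
    exemgr = root.find(section)
    if exemgr is None:
        raise ValueError(f"no <{section}> section found")

    settings = [("ServerThreads", threads), ("ServerQueueSize", queue_size)]
    for position, (tag, value) in enumerate(settings):
        node = exemgr.find(tag)
        if node is None:
            # Not present yet: add it at the top of the section.
            node = ET.Element(tag)
            exemgr.insert(position, node)
        node.text = str(value)

    return ET.tostring(root, encoding="unicode")


# Trimmed, hypothetical sample; a real Columnstore.xml has many more sections.
sample = """<Columnstore Version="V1.0.0">
<ExeMgr1>
<IPAddr>127.0.0.1</IPAddr>
<Port>8601</Port>
<Module>pm1</Module>
</ExeMgr1>
</Columnstore>"""

if __name__ == "__main__":
    print(set_exemgr_limits(sample))
```

As with the manual edit, the system should be completely down before the file is changed.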

Comment by David Hall (Inactive) [ 2018-01-12 ]

After some experiments, it became clear that the accept loop in ExeMgr must have unfettered access to new threads, so that is what I implemented.
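
To illustrate the idea only (this is not the ExeMgr change itself, which is not shown in this ticket), here is a minimal Python sketch of an accept loop that starts a fresh thread for every incoming connection instead of borrowing one from a fixed-size pool. A queue stands in for the listening socket, and the names `accept_loop` and `serve_all` are hypothetical. The handlers all wait on a shared barrier, so the run only completes if every connection gets its own thread at the same time.

```python
import queue
import threading


def accept_loop(incoming, handler):
    """Start one new thread per connection until a None sentinel arrives."""
    threads = []
    while True:
        conn = incoming.get()
        if conn is None:
            break
        # No pool and no upper bound: the loop never waits for a free thread.
        thread = threading.Thread(target=handler, args=(conn,))
        thread.start()
        threads.append(thread)
    return threads


def serve_all(connection_count):
    """Serve `connection_count` simulated connections; return the ids served."""
    incoming = queue.Queue()
    for conn_id in range(connection_count):
        incoming.put(conn_id)
    incoming.put(None)  # sentinel: no more connections

    everyone_connected = threading.Barrier(connection_count)
    served = []
    lock = threading.Lock()

    def handler(conn_id):
        # Each handler blocks until all connections are being handled at once.
        everyone_connected.wait(timeout=10)
        with lock:
            served.append(conn_id)

    for thread in accept_loop(incoming, handler):
        thread.join()
    return sorted(served)


if __name__ == "__main__":
    # 60 is chosen to exceed the old limit of 50 threads.
    print(len(serve_all(60)), "connections served")
```

The trade-off is that nothing caps the number of threads, so a design like this relies on something else (for example, connection limits upstream) to bound load.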

Comment by Daniel Lee (Inactive) [ 2018-01-22 ]

Build verified: 1.1.3-1 GitHub source

/root/columnstore/mariadb-columnstore-server
commit e0ae0d2fecf9941887478d9aa669c8b2d1092090
Merge: 21ec50194e 2490ddf50e
Author: benthompson15 <ben.thompson@mariadb.com>
Date: Fri Jan 19 12:39:05 2018 -0600

Merge pull request #84 from mariadb-corporation/MCOL-1159

MCOL-1159 Merge mariadb-10.2.12

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit c74d5de21d6571c0b0e9a12dacaf77856d332e63
Merge: 201813d6 63adbd0f
Author: benthompson15 <ben.thompson@mariadb.com>
Date: Mon Jan 22 09:42:34 2018 -0600

Merge pull request #375 from mariadb-corporation/dev-1.1-build-fix

Fix missing compiler flag from 1.0 -> 1.1 merge

The originally reported issue no longer reproduces (see the test case in comment #1).

Generated at Thu Feb 08 02:26:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.