[MCOL-5384] Plugin code leaks network connections towards ExeMgr making EM a threads leak bomb - possible mem leak Created: 2023-01-09 Updated: 2023-02-24 Resolved: 2023-01-27 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ExeMgr, MariaDB Server |
| Affects Version/s: | 6.4.6, 22.08.7 |
| Fix Version/s: | 22.08.8, 6.4.7- CS only |
| Type: | Bug | Priority: | Critical |
| Reporter: | Karol Roslaniec | Assignee: | Roman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Mariadb ver. 10.6.11/MCS ver. 6.4.6 (compiled from sources) |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Sprint: | 10.0.20 | ||||||||||||||||
| Assigned for Review: | |
||||||||||||||||
| Assigned for Testing: | |
||||||||||||||||
| Description |
|
Summary: ExeMgr spawns unlimited number of threads This problem is not new, it's already been reported as Hopefully now we can fix it, because I'm determined to find the problem, with some help from you guys. To clarify: I'm very new to Columnstore. This is really my first try to Columnstore, so I cannot say if there are other scenarios or problematic cases. Unfortunately this issue is also a showstopper for me. My scenario: What I was able to find out till now: At least this is my current understanding of the problem: Could you give me a hint, what code is responsible for creating mariadb-thread, creating TCP-connection, cleaning up connection and destroying the mariadb-thread? Best Regards, |
| Comments |
| Comment by Karol Roslaniec [ 2023-01-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have already found out that ClientRotator class is responsible for TCP-connection. Best Regards, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Karol Roslaniec [ 2023-01-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think I have found the problem. It turns out, the real problem is actually much more severe than only ExeMgr. Please find the patch to ha_mcs_impl.cpp in attachment (ha_mcs_impl.patch). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2023-01-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hey Karol, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Karol Roslaniec [ 2023-01-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm using community server. Today I've removed old patch-files and created one single patch, because just today I have noticed a possible issue in my previous fixes. It is possible, but actually all my changes are made purely on code analysis rather than "running code", so many changes in patch are theoretical issues that I'm fixing. Namely in function ha_mcs_impl_close_connection I have put:
But the problem is, I actually don't know in which thread-context this method is called, and set_fe_conn_info_ptr works only on current_thd, which could be really bad in this scenario.
But then I have realized, that implementation of set_fe_conn_info_ptr is really ignoring thd parameter, so I had to fix it as well. Overall this patch is touching many other functionalities, so it would require some testing. Best Regards, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2023-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Karol I finished the review and I will hand over the patch to our QAs for additional testing. When they are done I will propagate the patch so the next community release will have it. The patch is much appreciated! | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gagan Goel (Inactive) [ 2023-01-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Karol, Thank you very much for reporting this bug and for submitting the patch. I have confirmed that lingering TCP connections and ExeMgr threads indeed remain when an LDI (LOAD DATA (LOCAL) INFILE) statement is executed and the client connection to mariadb server is closed. Your patch does fix this problem. However, I was able to find an alternative with much lesser code change. See the patch below:
As you can see, I have retained your changes in ha_mcs_impl_close_connection() as this was indeed a legitimate mem leak. I have also retained your changes in set_fe_conn_info_ptr. The change I have introduced is in impl_external_lock(). This function is called by the server at the start and end of a query to:
Now at the end of an LDI statement when the lock_type=2, we previously did not reach the thd_set_ha_data(thd, mcs_hton, get_fe_conn_info_ptr()); code due to an early return caused by:
This caused the server to not call ha_mcs_impl_close_connection() when the client closed the session. For an LDI, ci->cal_conn_hndl is NULL. This is because for an LDI statement, mariadb server does not connect to the ExeMgr as there is no data read from the table being loaded. To fix the TCP connection/thread leak problem, I have simply changed the ordering so that the thd_set_ha_data() call executes before the if condition. Can you please try this patch in your system and let me know your thoughts? If this patch looks ok to you, I would suggest to create a PR in the engine github repo against the branch for which you are fixing: https://github.com/mariadb-corporation/mariadb-columnstore-engine and assign me as the reviewer (github user: tntnatbry). It will be much easier to perform code review there. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Karol Roslaniec [ 2023-01-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Gagan, I'm glad I could help. Frankly speaking, I don't like your approach at all. It's using kind of "side effects" (function not directly involved); it does not guarantee that all other (theoretically) possible leaks are fixed; even "visually" looks weird, because it's setting the same value multiple times, wasting CPU cycles. My solution is really very simple and logically correct in my opinion: every time when you create an object
you immediately register it for deletion with:
What is really missing in my patch, is the cleanup of code, namely all these random calls to thd_set_ha_data could be now removed. Additionally this code:
is really ugly. I'm not sure what functions ha_mcs_impl_pushdown_init() or impl_rnd_init() are doing, but probably it would be much better to create new cal_connection_info() in this place even if it's not supposed to be used later, just to ensure that thd_ha_data is always pointing to an object pointer and not to some magic numbers. Best Regards, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2023-01-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Let me add my 5 cents to this convo. Here is the test case I run:
The results divides the so-called client UDFs: mcssettrace, mcssetparams, calshowpartitions etc in two groups:
Frankly speaking, I don't like your approach at all. It's using kind of "side effects" (function not directly involved); it does not guarantee that all other (theoretically) possible leaks are fixed; even "visually" looks weird, because it's setting the same value multiple times, wasting CPU cycles.
This is my unfortunate invention that is forced by the fact that this part of our codebase has to obey MDB API's, implicit and explicit call sequence conventions, namely if the ha_data value is 0 after the return from the method that sets ha_data to 0x42 there will be no call to close_connection later. I couldn't put cal_connection_info() blindly b/c it affected the other engines. We will re-evaluate this on the next refactoring run for the plugin code though. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2023-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The problem also affects INFORMATION_SCHEMA tables, namely columnstore_columns and columnstore_tables. A client connection that accesses these tables leaves a network connection + ExeMgr client read thread behind. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roman [ 2023-01-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
4QA. The leaking connections can be detected with a simple test: The list of the offending expression is given in one of my previous comments: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kirill Perov [ 2023-01-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
updated 4QA: The leaking connections can be detected with a simple test: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kirill Perov [ 2023-01-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
verified at RockyLinux9 ss -tenp | grep 8601 shows nothing after previous steps 'updated 4QA'. But result is exactly same for 6.4.6. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kirill Perov [ 2023-01-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
tested 6.4.6 and 6.4.7 via "select * from information_schema.columnstore_columns": | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kirill Perov [ 2023-01-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
confirmed using "select * from information_schema.columnstore_columns" behavior | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Lee (Inactive) [ 2023-02-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Build verified: 22.08.8 RC2. |