[MCOL-4626] Columnstore cluster becomes non operational when running out of memory on a query Created: 2021-03-20 Updated: 2024-02-05 |
|
| Status: | Stalled |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 5.5.2 |
| Fix Version/s: | 23.10 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Todd Stoffel (Inactive) | Assignee: | Alan Mologorsky |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | rm_stability |
| Issue Links: |
| Sprint: | 2021-5, 2021-6, 2021-7, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6 |
| Description |
|
When too little memory is available in Docker, a destructive problem occurs. Originally this was a crash of PrimProc, with the subsequent inability to run any queries. Prior commits improved the situation, but not enough. It is no longer a crash but an error message, and in some cases the cluster remains operational. But not always: there are cases where the cluster is non-operational after the error message. To complete this ticket, the remaining problem needs to be corrected. Even after memory is exceeded and an error message is generated, the cluster should remain operational and be able to execute queries that fit in memory. We will continue working on this problem under Notes: Two things need to happen:
There may be deeper problems with very small settings such as 1 GB. Once we fix and verify the above, we should investigate what to do in those cases. 2. Technical description is below: To complete the solution, we would implement better management of who holds memory and allow the system to free most of the held memory. At an OOM event, the system would essentially reset the block cache and all instances holding memory, and ensure that the next query runs as if the processes had been reset, without having to restart them. |
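The technical description above can be illustrated with a minimal sketch. It is not ColumnStore code: `MemoryRegistry`, `QueryMemoryExceeded`, and the holder names are hypothetical, and a real implementation would live in the C++ processes. The sketch only shows the idea of tracking who holds memory and, at an OOM event, releasing everything so that the next query starts from a clean state without a process restart.

```python
from typing import Callable, Dict


class QueryMemoryExceeded(Exception):
    """Raised instead of crashing when a reservation would exceed the budget."""


class MemoryRegistry:
    """Hypothetical tracker of which component holds how much memory."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self._held: Dict[str, int] = {}
        self._release: Dict[str, Callable[[], None]] = {}

    def register(self, holder: str, release: Callable[[], None]) -> None:
        # Each holder supplies a callback that frees the memory it owns.
        self._held[holder] = 0
        self._release[holder] = release

    def used(self) -> int:
        return sum(self._held.values())

    def reserve(self, holder: str, nbytes: int) -> None:
        if self.used() + nbytes > self.budget_bytes:
            # OOM event: free everything, then report an error to the caller.
            self.reset_all()
            raise QueryMemoryExceeded(
                f"{holder} requested {nbytes} bytes beyond the budget"
            )
        self._held[holder] += nbytes

    def reset_all(self) -> None:
        # Release every holder so the next query sees a freshly reset state.
        for holder, release in self._release.items():
            release()
            self._held[holder] = 0


if __name__ == "__main__":
    block_cache: Dict[str, bytes] = {}
    registry = MemoryRegistry(budget_bytes=1000)
    registry.register("block_cache", block_cache.clear)
    registry.register("query_buffers", lambda: None)

    block_cache["blk1"] = b"x" * 600
    registry.reserve("block_cache", 600)
    try:
        registry.reserve("query_buffers", 500)  # exceeds the budget
    except QueryMemoryExceeded as err:
        print("error reported:", err)
    registry.reserve("query_buffers", 500)  # a query that fits now succeeds
```

The design choice illustrated here is that the failing query gets an error while all other state is dropped, which trades cache warmth for a guaranteed return to an operational state.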
| Comments |
| Comment by Todd Stoffel (Inactive) [ 2021-03-23 ] |
|
The Docker host needs a minimum of 9 GB of RAM available to the containers in order to avoid this error. PrimProc should not crash with an OOM error but I'm reducing the priority of this ticket since the cause is known and can be avoided. |
| Comment by Ben Thompson (Inactive) [ 2021-05-03 ] |
|
PrimProc will throw a critical log message and fail to become operational. PrimProc[1004]: 35.668435 |0|0|0| C 28 CAL0000: Error total memory available is less than 3GB. Attempts to interact with ColumnStore while in this state will return errors. |
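As a rough illustration of the kind of check behind that log message, the sketch below compares the memory available to a container against a 3 GB minimum. It is an assumption-laden example, not PrimProc's actual logic: the function names are hypothetical, the threshold is taken from the log text and assumed to mean 3 × 1024³ bytes, and the limit text is assumed to come from a cgroup memory limit file, where cgroup v2 uses the literal `max` for "no limit".

```python
from typing import Optional

# Assumed interpretation of the "3GB" threshold in the log message.
MIN_TOTAL_BYTES = 3 * 1024**3


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Parse the contents of a cgroup memory limit file.

    Returns None when no numeric limit is set ("max" in cgroup v2).
    """
    value = text.strip()
    if not value.isdigit():
        return None
    return int(value)


def effective_total_bytes(host_total: int, cgroup_limit: Optional[int]) -> int:
    """Memory the process can actually use: the smaller of host and container limit."""
    if cgroup_limit is None:
        return host_total
    return min(host_total, cgroup_limit)


def has_enough_memory(host_total: int, cgroup_limit_text: str) -> bool:
    """True when the effective total meets the assumed 3 GB minimum."""
    limit = parse_cgroup_limit(cgroup_limit_text)
    return effective_total_bytes(host_total, limit) >= MIN_TOTAL_BYTES
```

Keeping the check as a pure function over already-read values means any component could run it before startup and report the result without entering a failed state.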
| Comment by Gregory Dorman (Inactive) [ 2021-05-03 ] |
|
Good as the explanation may be, it is not good enough. If this is the way PrimProc does it, find someone else to do the test (ExeMgr, maybe even CMAPI, I don't know). Or teach PrimProc to do it in a more usable way. Guys - the days of Open Source attitudes are gone. We are enterprise software now. Especially in a cloud. People will not tolerate these kinds of things anymore. |
| Comment by Roman [ 2021-08-27 ] |
|
gdorman Denis has implemented the fix in develop-6 for the original case, namely INSERT..SELECT with text or long varchar columns crashes PP. |
| Comment by Manjot Singh (Inactive) [ 2022-03-24 ] |
|
Could the S3 storage engine be leveraged to maintain global metadata? |
| Comment by alexey vorovich (Inactive) [ 2022-04-05 ] |
|
As per David.Hall on today's standup, this is not going to be easy. Should we move this to the next release? |
| Comment by alexey vorovich (Inactive) [ 2022-04-06 ] |
|
moved to 641 as per Todd |