[MCOL-3983] segv from cpimport bulk load preparation Created: 2020-05-04 Updated: 2022-02-24 Resolved: 2022-02-24 |
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | cpimport, Storage Manager |
| Affects Version/s: | None |
| Fix Version/s: | 6.3.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick LeBlanc (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Environment: | SkySQL, ColumnStore 1.4.3-1 |
| Sprint: | 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17 |
| Description |
A customer ran into a problem that caused SM to continuously restart. Looking at the core file, there were 886 threads, and the ones I looked at had pretty crazy backtraces. For example, the ultimate cause of the crash, according to gdb, was an assertion failure in the string dtor, inside the metadataObject dtor, except the line it points at instantiates a metadataObject (it doesn't destroy one). That triggers fatalHandler(), which segfaults, causing fatalHandler() to run again.

My suspicion is that there is a general synchronization problem that results in memory corruption and all of the random fallout that can come from that. We need to follow up on things like Synchronizer::process(), where we use references to strings in a list (verify that the iterator can't be invalidated or the value changed during use, etc.). This ticket is for general robustification of StorageManager. We also need to figure out how they got up to 886 threads right away (or ever, for that matter).

It's unclear whether licensing restrictions prevent me from saving the core file somewhere and linking the ticket to it; I'll do that once I get the go-ahead. They were running 10.4.12-6 Enterprise with ColumnStore at 'columnstore-1.4.3-1'.

Update: I found a bug in the config listeners for Downloader and Synchronizer, where if the config file has max_concurrent_uploads/downloads = 20, those thread pools never have a limit imposed on start. That could explain the whole problem. If the OS decides we've started too many threads too fast, or we hit a thread limit, that could cause a wide range of problems. Still, a scan through the code looking for races would be justified. |
| Comments |
| Comment by Patrick LeBlanc (Inactive) [ 2020-05-06 ] |
We need to freeze for 1.4.4 today; too bad we only found this yesterday. For 1.4.4 I will change the defaults in the config file and add a note, in case we don't have time to circle back and fix the config listeners. |
| Comment by Patrick LeBlanc (Inactive) [ 2020-05-06 ] |
Also checking that into 1.5 to keep the two branches consistent. |
| Comment by Rupert Harwood (Inactive) [ 2020-05-14 ] |
I split the portion about the max_concurrency variables out into its own bug: https://jira.mariadb.org/browse/MCOL-4003. We also found it is unrelated to this issue. |
| Comment by Patrick LeBlanc (Inactive) [ 2020-05-27 ] |
We'll have to refactor our tickets for this somewhat. Maybe we make these other first-class tickets subtasks of this ticket; I don't know yet. |
| Comment by Rupert Harwood (Inactive) [ 2020-08-24 ] |
Note: this issue remains Major despite the support tickets being resolved. |
| Comment by Daniel Lee (Inactive) [ 2022-02-24 ] |
Build verified: 6.3.1-1 (b3977)

Test #1: There are two lines for each of the uploads and downloads settings. According to the developer, 20 is the default setting and 25 is the specified setting. We should make this clearer in a future release.

Test #2: Started with the 20-thread setting. Executed a large cpimport, and the number of threads maxed out at 32; some of them are threads other than upload/download threads. For each additional concurrent cpimport there is one additional thread, and when a cpimport job completes, the number of threads decreases by one. The number of threads is being capped and controlled. |