[MCOL-3983] segv from cpimport bulk load preparation Created: 2020-05-04  Updated: 2022-02-24  Resolved: 2022-02-24

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport, Storage Manager
Affects Version/s: None
Fix Version/s: 6.3.1

Type: Bug Priority: Major
Reporter: Patrick LeBlanc (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

SkySQL, ColumnStore 1.4.3-1


Attachments: Text File all-threads-backtrace.txt    
Issue Links:
Duplicate
is duplicated by MCOL-4003 Thread Concurrency Variables Not Limi... Closed
PartOf
is part of MCOL-4343 umbrella for tech debt issues Open
Sprint: 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17

 Description   

A customer ran into a problem that caused StorageManager (SM) to restart continuously. The core file showed 886 threads, and the ones I looked at had implausible backtraces. For example, according to gdb, the ultimate cause of the crash was an assertion failure in the string dtor, called from the metadataObject dtor, yet the line it points at instantiates a metadataObject (it doesn't destroy one). That failure causes fatalHandler() to run, which segfaults, causing fatalHandler() to run again.

My suspicion is that there is a general synchronization problem that results in memory corruption, along with all of the random fallout that can follow from that. We need to follow up on things like Synchronizer::process(), where we use references to strings in a list (need to verify that the iterator can't be invalidated and that the value can't be changed during use, etc.).
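To illustrate the kind of hazard suspected above, here is a minimal sketch of one common way to keep references into a shared list of strings valid while they are in use: move the pending entries into a caller-owned list under the lock, then iterate over that private snapshot. The names `PendingKeys`, `takeAll()` and `processPending()` are hypothetical and are not taken from the StorageManager source; this is not a description of what Synchronizer::process() actually does.

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a list of object keys shared between threads.
class PendingKeys
{
public:
    void add(std::string key)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        keys_.push_back(std::move(key));
    }

    // Hand every pending key to the caller while holding the lock.
    // After the swap, no other thread can reach the returned elements.
    std::list<std::string> takeAll()
    {
        std::list<std::string> taken;
        std::lock_guard<std::mutex> guard(mutex_);
        taken.swap(keys_);
        return taken;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> guard(mutex_);
        return keys_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<std::string> keys_;
};

// Work on a private snapshot: the references used in the loop stay valid
// even if other threads call add() concurrently.
std::vector<std::string> processPending(PendingKeys& pending)
{
    std::vector<std::string> processed;
    const std::list<std::string> snapshot = pending.takeAll();
    for (const std::string& key : snapshot)
        processed.push_back(key);  // placeholder for the real per-key work
    return processed;
}
```

The trade-off in this sketch is that entries added while a batch is being processed wait for the next call to `processPending()`.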

This ticket is for general robustification of StorageManager. Also need to figure out how they got up to 886 threads right away (or ever for that matter).

Unclear whether licensing restrictions prevent me from saving the core file somewhere and linking the ticket to it. I'll do that once I get the go-ahead.

They were running 10.4.12-6 enterprise with columnstore @ 'columnstore-1.4.3-1'

Update: I found a bug in the config listeners for Downloader and Synchronizer. If the config file has max_concurrent_uploads/downloads = 20, those threadpools never have a limit imposed on start. That could explain the whole problem: if the OS decides we've started too many threads too fast, or we hit a thread limit, that could cause a wide range of problems. Still, a scan through the code looking for races would be justified.
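As a rough illustration of how a limit can fail to be imposed on start, the sketch below assumes one plausible mechanism: a listener that caches the default value (20) and only pushes a new limit to the pool when the configured value differs from the cached one. That mechanism is an assumption, not something confirmed by this ticket, and `ThreadPoolStub`, `ChangeOnlyListener` and `ApplyOnStartListener` are hypothetical names rather than StorageManager classes.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical pool: uncapped until setMaxThreads() is called.
class ThreadPoolStub
{
public:
    void setMaxThreads(std::size_t n) { maxThreads_ = n; }
    std::size_t maxThreads() const { return maxThreads_; }
    bool isCapped() const
    {
        return maxThreads_ != std::numeric_limits<std::size_t>::max();
    }

private:
    std::size_t maxThreads_ = std::numeric_limits<std::size_t>::max();
};

constexpr std::size_t kDefaultMaxConcurrent = 20;

// Assumed buggy pattern: the cached value starts at the default, so a
// configured value of 20 looks like "no change" and the pool stays uncapped.
class ChangeOnlyListener
{
public:
    explicit ChangeOnlyListener(ThreadPoolStub& pool) : pool_(pool) {}

    void configChanged(std::size_t configured)
    {
        if (configured == cached_)
            return;
        cached_ = configured;
        pool_.setMaxThreads(configured);
    }

private:
    ThreadPoolStub& pool_;
    std::size_t cached_ = kDefaultMaxConcurrent;
};

// Safer pattern: impose the default at construction, then apply changes.
class ApplyOnStartListener
{
public:
    explicit ApplyOnStartListener(ThreadPoolStub& pool) : pool_(pool)
    {
        pool_.setMaxThreads(cached_);
    }

    void configChanged(std::size_t configured)
    {
        if (configured == cached_)
            return;
        cached_ = configured;
        pool_.setMaxThreads(configured);
    }

private:
    ThreadPoolStub& pool_;
    std::size_t cached_ = kDefaultMaxConcurrent;
};
```

With the first listener, calling `configChanged(20)` leaves the pool uncapped. With the second, the pool is capped at 20 from construction and a later `configChanged(25)` raises the cap to 25.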



 Comments   
Comment by Patrick LeBlanc (Inactive) [ 2020-05-06 ]

We need to freeze for 1.4.4 today; too bad we only found this yesterday. For 1.4.4 I will change the defaults in the config file and add a note in case we don't have time to circle back and fix the config listeners.

Comment by Patrick LeBlanc (Inactive) [ 2020-05-06 ]

Also checking that in to 1.5 to keep them consistent.

Comment by Rupert Harwood (Inactive) [ 2020-05-14 ]

I split out the portion about the max_concurrency variables into their own bug at: https://jira.mariadb.org/browse/MCOL-4003

Also, we found it is unrelated to this issue.

Comment by Patrick LeBlanc (Inactive) [ 2020-05-27 ]

We'll have to reorganize our tickets for this somewhat. Maybe we make these other first-class tickets subtasks of this ticket; I don't know yet.

  • The parsing errors causing the mysqld segfault are fixed in MCOL-4017
  • The specific storagemanager bug that would crash it and keep it from coming back up is fixed in MCOL-4021
  • Additional storagemanager bugs I noticed in the code, but which weren't the problem, are being fixed under MCOL-3459
  • The config / threadpool cap bug that was part of the initial findings logged in this ticket is fixed under MCOL-4003

Comment by Rupert Harwood (Inactive) [ 2020-08-24 ]

Note: The issue remains Major despite the support tickets being resolved.

Comment by Daniel Lee (Inactive) [ 2022-02-24 ]

Build verified: 6.3.1-1 (b3977)

Test #1
-------
Set max upload/download threads to 25 and verified startup value:

[centos8:root~]# ./StorageManager 
StorageManager[6112]: Using the config file found at /etc/columnstore/storagemanager.cnf
StorageManager[6112]: max_concurrent_downloads = 20
StorageManager[6112]: max_concurrent_downloads = 25
StorageManager[6112]: max_concurrent_uploads = 20
StorageManager[6112]: max_concurrent_uploads = 25
StorageManager[6112]: StorageManager started.
StorageManager[6112]: SessionManager waiting for sockets.
StorageManager[6111]: StorageManager main process has started

There are two lines for each of the uploads and downloads settings. According to the developer, 20 is the default setting and 25 is the specified setting. We should make this clearer in a future release.

Test #2
-------

Started with the 20-thread setting. Executed a large cpimport, and the number of threads maxed out at 32. Some of them are threads other than upload/download threads. Each additional concurrent cpimport adds one thread, and when a cpimport job completes, the number of threads decreases by one. The number of threads is being capped and controlled.
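The comment does not say how the thread count was observed. One way to check it on Linux is to read the `Threads:` field of `/proc/<pid>/status`. The helper below is a minimal sketch that parses that field from the file's text; `parseThreadCount` is a hypothetical name, and reading the file itself is left to the caller.

```cpp
#include <sstream>
#include <string>

// Returns the value of the "Threads:" field from the text of a
// /proc/<pid>/status file, or -1 if the field is missing or malformed.
long parseThreadCount(const std::string& statusText)
{
    const std::string field = "Threads:";
    std::istringstream lines(statusText);
    std::string line;
    while (std::getline(lines, line))
    {
        // Match lines that begin with the field name.
        if (line.compare(0, field.size(), field) == 0)
        {
            std::istringstream value(line.substr(field.size()));
            long count = -1;
            if (value >> count)
                return count;
            return -1;
        }
    }
    return -1;
}
```

Sampling this value before, during and after a cpimport run would show whether the count levels off as described.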

Generated at Thu Feb 08 02:46:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.