A customer ran into a problem that caused StorageManager (SM) to restart continuously. The core file showed 886 threads, and the ones I looked at had pretty crazy backtraces. For example, according to gdb, the ultimate cause of the crash was an assertion failure in the string dtor, called from the metadataObject dtor, except that the line it points at instantiates a metadataObject rather than destroying one. That failure causes fatalHandler() to run, which segfaults, causing fatalHandler() to run again.
My suspicion is that there is a general synchronization problem that results in memory corruption and all of the random fallout that can follow from it. Need to follow up on things like Synchronizer::process(), where we use references to strings in a list (need to verify that the iterator can't be invalidated, or the value changed, during use, etc.).
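To make the concern about references into a shared list concrete, here is a minimal sketch of one defensive pattern: take ownership of the pending entries while holding the lock, then work on the private copy. With std::list, insertions by other threads do not invalidate existing references, but an erase of the referenced element does, and unsynchronized access to the same string is a data race either way. PendingKeys, add(), drain() and size() are hypothetical names for illustration. This is not the actual Synchronizer code, which may already guard against this.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a list of object keys shared between threads.
class PendingKeys {
public:
    void add(const std::string& key) {
        std::lock_guard<std::mutex> guard(mutex_);
        keys_.push_back(key);
    }

    // Takes ownership of every pending key while holding the lock.
    // The caller works on its own strings, so concurrent add() calls
    // cannot invalidate or modify anything it is still using.
    std::vector<std::string> drain() {
        std::list<std::string> taken;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            taken.swap(keys_);
            assert(keys_.empty());  // the shared list is empty after the swap
        }
        return std::vector<std::string>(std::make_move_iterator(taken.begin()),
                                        std::make_move_iterator(taken.end()));
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return keys_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<std::string> keys_;
};
```

The trade-off is an extra move of each string in exchange for never holding a reference into the shared container outside the lock.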
This ticket is for general robustification of StorageManager. Also need to figure out how they got up to 886 threads right away (or ever for that matter).
Unclear whether licensing restrictions prevent me from saving the core file somewhere and linking the ticket to it. I'll do that once I get the go-ahead.
They were running 10.4.12-6 enterprise with columnstore @ 'columnstore-1.4.3-1'
Update: I found a bug in the config listeners for Downloader and Synchronizer. If the config file has max_concurrent_uploads/downloads = 20, those threadpools never have a limit imposed on start. That could explain the whole problem. If the OS decides we've started too many threads too fast, or we hit a thread limit, that could cause a wide range of problems. Still, a scan through the code looking for races would be justified.
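For illustration, here is one way a bug of this shape can arise. This is a guess at the mechanism, not the actual StorageManager code: it assumes 20 is the built-in default and that the listener only applies the limit when the configured value differs from a cached one. ThreadPool, BuggyListener, FixedListener, setMaxThreads() and configChanged() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a thread pool; 0 means "no limit imposed".
class ThreadPool {
public:
    void setMaxThreads(std::size_t n) { maxThreads_ = n; }
    std::size_t maxThreads() const { return maxThreads_; }
private:
    std::size_t maxThreads_ = 0;
};

// Assumed buggy pattern: the cached value starts at the default (20), and
// the limit is only pushed to the pool when the configured value changes.
class BuggyListener {
public:
    explicit BuggyListener(ThreadPool& pool) : pool_(pool) {}
    void configChanged(std::size_t configured) {
        if (configured == cached_) return;  // config says 20 at startup: nothing happens
        cached_ = configured;
        pool_.setMaxThreads(configured);
    }
private:
    ThreadPool& pool_;
    std::size_t cached_ = 20;
};

// One possible fix: apply the limit unconditionally at construction,
// then only on later changes.
class FixedListener {
public:
    FixedListener(ThreadPool& pool, std::size_t configured)
        : pool_(pool), cached_(configured) {
        assert(configured > 0);
        pool_.setMaxThreads(configured);
    }
    void configChanged(std::size_t configured) {
        if (configured == cached_) return;
        cached_ = configured;
        pool_.setMaxThreads(configured);
    }
private:
    ThreadPool& pool_;
    std::size_t cached_;
};
```

In this sketch a configured value of 20 leaves the buggy pool unlimited, while any other value gets applied, which would match the symptom of the limit being missing only for that one setting.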