MCS spawns lots of idle thread pool jobs for parallel query execution, e.g. every 2nd phase of a parallel 2-step aggregation spawns 24 threads and parallel sorting spawns 16 threads by default. The pool threads are just idle until data starts to flow from the lower parts of the executed query. Every thread uses sync primitives, e.g. mutex-es or cond_variable. When multiple queries are processed by an MCS cluster the concurrency sync primitives overhead is enormous and can reach 25% of non-virtualized CPU horsepower.
The suggested solution is to reduce a number of threads on the start down to one. EM adds more parallel threads if needed only.
Consider above mentioned 2nd step of a parallel aggregation. It pre-spawns of threads that reads data from an input queue and puts records(RowPointers to be exact) into buckets(bucket number = hash % buckets number). The thread later populates hash map with the calculated bucket number with the RowPointers and the hash calculated. The suggestion is to enable the code to detect if the input queue is filled up to a certain limit for a period of time and to add a new processing thread at this point. If it is the code must spawn another thread/-s.