[MCOL-4849] Optimize ExeMgr to reduce a number of context switches Created: 2021-08-30  Updated: 2022-03-29  Resolved: 2021-11-17

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 6.1.1
Fix Version/s: 6.2.2

Type: New Feature Priority: Major
Reporter: Roman Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: PNG File Screenshot from 2021-09-01 20-22-08.png     PNG File cs.png     File graph.svg     PNG File perf.png     Text File profile.txt    
Issue Links:
Relates
relates to MCOL-4593 Multiple concurrent queries with aggr... Stalled
Epic Link: ColumnStore Performance Improvements
Sprint: 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15

 Description   

MCS generates an enormous number of context switches, which degrades performance significantly. The screenshot cs.png demonstrates this: the cs count rises from ~80 to 21k once I run a single query (select l_orderkey, count(l_orderkey) from lineitem group by l_orderkey limit 10) in an infinite loop.
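
For reference, a per-process context switch count can also be read from inside a program with the POSIX getrusage() call. The following is a minimal sketch, assuming a Linux/POSIX system. The names ContextSwitches, readContextSwitches and countSwitchesDuring are hypothetical helpers for illustration and are not part of ColumnStore, and the numbers they report are per process, so they are not directly comparable with the figures in cs.png.

```cpp
#include <sys/resource.h>

#include <cassert>
#include <utility>

// Hypothetical helper type: context switch counters for the calling process.
struct ContextSwitches
{
  long voluntary;    // the process gave up the CPU (e.g. blocked on a lock or I/O)
  long involuntary;  // the scheduler preempted the process
};

// Reads the cumulative counters for the whole process (all threads).
inline ContextSwitches readContextSwitches()
{
  struct rusage usage{};
  const int rc = getrusage(RUSAGE_SELF, &usage);
  assert(rc == 0);
  (void)rc;  // silence the unused warning when NDEBUG disables assert
  return ContextSwitches{usage.ru_nvcsw, usage.ru_nivcsw};
}

// Runs `work` and returns how many context switches happened meanwhile.
template <typename Work>
ContextSwitches countSwitchesDuring(Work&& work)
{
  const ContextSwitches before = readContextSwitches();
  std::forward<Work>(work)();
  const ContextSwitches after = readContextSwitches();
  return ContextSwitches{after.voluntary - before.voluntary,
                         after.involuntary - before.involuntary};
}
```

Wrapping a code path in countSwitchesDuring() before and after a change gives a rough comparison. Voluntary switches are the ones most likely to be caused by blocking on poll, mutexes or condition variables.
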
According to observations made with the perf tool (collected with the command perf record --call-graph dwarf -e context-switches -p $(pidof ExeMgr)), there are a number of candidates for optimization:

  • messageqcpp::InetStreamSocket::readToMagic(), which both uses poll and reads a byte at a time, increasing the number of context switches needed.
  • joblist::TupleBPS::receiveMultiPrimitiveMessages(), which contains a loop that processes the intermediate RGData sent by PP to EM as the results of primitive requests. Both mutexes and condition variables are widely used in the code of the TupleBPS methods.
  • joblist::FIFO<rowgroup::RGData>::swapBuffers(), which uses mutexes to protect the critical section of this RGData queue from TBPS to TAS.
    Please take a look at perf.png for some details.
    The goal is to reduce the number of context switches produced by ExeMgr code. Keep in mind that other categories of queries might produce a different pattern. However, the three parts mentioned affect every query because they are essential to query processing.
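
To illustrate the first candidate, here is a minimal sketch of why scanning for a marker one byte per read is expensive compared with reading in chunks. This is not the ColumnStore implementation: FakeSocket is a hypothetical in-memory stand-in for a socket that only counts read calls, kMagic is a made-up marker value, and both function names are illustrative assumptions. On a real socket every read is a system call, possibly preceded by a poll, and each one is an opportunity to block and be switched out.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical marker; all four bytes are distinct, which keeps matching simple.
constexpr std::array<std::uint8_t, 4> kMagic{0xDE, 0xAD, 0xBE, 0xEF};

// Hypothetical stand-in for a socket: serves bytes from memory, counts read calls.
class FakeSocket
{
 public:
  explicit FakeSocket(std::vector<std::uint8_t> data) : data_(std::move(data)) {}

  std::size_t read(std::uint8_t* out, std::size_t maxLen)
  {
    ++readCalls_;
    const std::size_t n = std::min(maxLen, data_.size() - pos_);
    std::copy_n(data_.begin() + pos_, n, out);
    pos_ += n;
    return n;
  }

  std::size_t readCalls() const { return readCalls_; }

 private:
  std::vector<std::uint8_t> data_;
  std::size_t pos_ = 0;
  std::size_t readCalls_ = 0;
};

// Incremental matcher; the simple restart rule is valid because kMagic bytes are distinct.
struct MagicMatcher
{
  std::size_t matched = 0;

  bool feed(std::uint8_t b)
  {
    if (b == kMagic[matched])
      return ++matched == kMagic.size();
    matched = (b == kMagic[0]) ? 1 : 0;
    return false;
  }
};

// One read call per byte: the pattern described in the first bullet.
bool readToMagicByteWise(FakeSocket& s)
{
  MagicMatcher m;
  std::uint8_t b = 0;
  while (s.read(&b, 1) == 1)
    if (m.feed(b))
      return true;
  return false;
}

// One read call per chunk; bytes read past the marker are handed back in `leftover`.
bool readToMagicBuffered(FakeSocket& s, std::vector<std::uint8_t>& leftover)
{
  MagicMatcher m;
  std::array<std::uint8_t, 4096> buf{};
  for (;;)
  {
    const std::size_t n = s.read(buf.data(), buf.size());
    if (n == 0)
      return false;
    for (std::size_t i = 0; i < n; ++i)
      if (m.feed(buf[i]))
      {
        leftover.assign(buf.begin() + i + 1, buf.begin() + n);
        return true;
      }
  }
}
```

The trade-off of the buffered variant is that it can read past the marker, so the caller has to keep the leftover bytes and consume them before issuing the next read.
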


 Comments   
Comment by Daniel Lee (Inactive) [ 2021-11-17 ]

Build tested: 6.2.2 (#3334, #3335)

Build performance compared to release 6.1.1

Build #3334
   DBT3 performance
      Disk-run   is 10.05% faster
      Cached-run is 10.84% faster
 
   CPImport is       5.71% faster
   LDI is            1.03% faster
   insertSelect is   2.42% faster
 
Build #3335
   DBT3 performance
      Disk-run   is 10.69% faster
      Cached-run is 12.07% faster
 
   CPImport is       1.43% faster
   LDI is            1.20% faster
   insertSelect is   2.76% faster

Detailed performance test results can be found here:

https://docs.google.com/spreadsheets/d/1tznQqmpKfkbnn4HjfjYIIeowJlQpNlHj8PVI6QM-3mc/edit?usp=sharing
