thread_running is a global shared variable that is updated twice per query in dispatch_command() using an atomic add.
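To make the access pattern concrete, here is a minimal sketch of it, not the server code. The names thread_running_sketch, dispatch_command_sketch() and run_queries() are hypothetical, and the relaxed memory ordering is an assumption. Every connection thread performs two atomic read-modify-write operations on the same variable per query, so the cache line holding it has to move between cores on each update.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the global counter of running threads.
static std::atomic<int32_t> thread_running_sketch{0};

// Hypothetical stand-in for dispatch_command(): one increment on entry and
// one decrement on exit, i.e. two atomic updates of shared state per query.
static void dispatch_command_sketch() {
    thread_running_sketch.fetch_add(1, std::memory_order_relaxed);
    // ... the query itself would execute here ...
    thread_running_sketch.fetch_sub(1, std::memory_order_relaxed);
}

// Runs queries_per_thread simulated queries on each of n_threads threads and
// returns the final counter value (0 when every increment was paired).
static int32_t run_queries(unsigned n_threads, unsigned queries_per_thread) {
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([queries_per_thread] {
            for (unsigned q = 0; q < queries_per_thread; ++q)
                dispatch_command_sketch();
        });
    }
    for (auto &w : workers) w.join();
    return thread_running_sketch.load(std::memory_order_relaxed);
}
```

Calling run_queries(40, 1000000) under a profiler, with and without the two atomic updates, is one way to observe the effect in isolation. Absolute numbers from such a microbenchmark will not match a real server workload.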
It has started to show up in the OLTP RO benchmark. For example, here are the numbers for a 2-socket, 20-core, 40-thread Intel Broadwell system:
If we remove inc_thread_running() and dec_thread_running(), we get slightly better throughput and dispatch_command() drops lower in the profiler output:
So the bottleneck shifts to the global_query_id counter, which is the subject of another bug.
I expect a much larger scalability impact on more powerful hardware.