Details
Type: Task
Status: Open
Priority: Minor
Resolution: Unresolved
Description
PMP, aka the poor man's profiler, is a very good tool for understanding
bottlenecks in the server. It is especially good at detecting mutex
contention, as it profiles all threads. Most other profiling tools (oprofile,
perf) only profile running threads, so they are mostly useful for detecting
bottlenecks in CPU usage, which becomes less and less relevant as the number
of cores grows while software scalability struggles to keep up.
However, standard PMP, based on attaching GDB and obtaining stack traces that
way, suffers from a performance problem. GDB is not optimised for this
scenario and holds the target process suspended for longer than PMP needs,
causing server stalls (which can last several seconds or more); this limits
its use in a production environment.
This task is about implementing a PMP tool that provides stack traces in a
more efficient way, making it less intrusive on the server being profiled and
thereby allowing it to be used in more cases. It is based on a prototype by
knielsen: https://github.com/knielsen/knielsen-pmp. The goal is to be able to
obtain stack traces at a rate of around 1 millisecond per thread in the target
process.
The idea is to write a stand-alone C++ program that uses ptrace() to attach to
the threads of the running server, then uses libunwind to obtain stack
traces. Some effort will be spent to minimise the time that the target process
is held suspended under ptrace():
- During ptrace() we will only obtain the raw stack traces (lists of
  instruction pointers). Resolving symbols can take place after releasing
  the target process.
- Libunwind allows us to provide our own accessor methods for reading data
  from the target process. We will use this to read more efficiently. Rather
  than using ptrace(), which costs one system call for every word read, we
  will use pread() on /proc/pid/mem in pages of 4k size; this allows reading
  multiple words in a single system call. Additionally, we will cache reads,
  so that reading the same words in multiple stack traces requires only one
  physical read. Read-only maps in the target process (i.e. executable or
  library images) can even be cached between different profiling
  measurements.
- If necessary (stage 2), we can look into making libunwind itself even
  faster. Initial studies indicate that it does a number of repeated mmap()s
  of /proc/pid/maps, which could be greatly improved by caching the data.
A second goal is to provide an easy user interface, in the form of a single
executable program. E.g. it could be statically linked to allow use directly
on a production server without requiring installation of GDB, Perl, or other
dependencies:
- The default operation could be a bit like top: a continuously updated
  display listing the obtained stack traces along with their percentage of
  the total. This allows one to simply start the tool and directly see the
  top contenders.
- With an option (and perhaps when stdout is not a tty), we can instead
  output the raw stack-trace data for further analysis by external scripts
  (just like traditional PMP using GDB).
[See also email discussion from early December 2011]