*Experiments with "rr" (https://rr-project.org/) in combination with RQG.*
|
1. Combining "rr" with RQG was rather easy on my box (Ubuntu).
|
2. "rr" provides reverse execution under gdb as promised in its documentation.
|
Server developers need to have a look at the features and decide whether it is useful.
|
3. The storage space consumption of one (rather simple and finally passing) RQG run is ~ 2 GB, which is problematic.
|
Attempts to compress that data save less than 10%.
|
For comparison:
|
One conventional RQG run where the server crashed: compressed tar archive with datadir + logs + core ~ 5 MB
|
One working day of running RQG test campaigns easily results in
|
- server dedicated to testing only: > 5000 RQG runs
|
- notebook for development of tests and tools + testing: up to 2000 RQG runs
|
_Variant a_
|
Let "rr" write the traces to the SSD and immediately delete all traces of runs where the result is not of interest.
|
I fear that writing 10000 GB per working day (5000 runs * ~ 2 GB) will shorten the lifetime of the SSD considerably.
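|
The write volume behind this concern can be estimated with a few lines of Python. This is only a rough sketch: the run count and trace size come from the numbers above, while the endurance rating of 600 TBW is a purely hypothetical value that has to be replaced with the data sheet value of the actual SSD.

```python
def daily_write_gb(runs_per_day: int, trace_gb_per_run: float) -> float:
    """Data volume written per working day if every run stores its trace."""
    return runs_per_day * trace_gb_per_run


def days_until_worn_out(rated_tbw: float, gb_per_day: float) -> float:
    """Working days until the rated endurance (terabytes written) is used up."""
    return rated_tbw * 1000 / gb_per_day


if __name__ == "__main__":
    per_day = daily_write_gb(runs_per_day=5000, trace_gb_per_run=2.0)
    # ASSUMPTION: 600 TBW is a hypothetical endurance rating, not a measured value.
    lifetime = days_until_worn_out(rated_tbw=600, gb_per_day=per_day)
    print(f"{per_day:.0f} GB per day, rated endurance reached after {lifetime:.0f} working days")
```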
|
_Variant b_
|
Let "rr" write the traces into the vardir of the RQG run. Stick to the default vardir (/dev/shm/vardir which is a virtual memory based tmpfs).
|
Take care that we do not get significant paging, so that the writes happen mostly in RAM == no danger for the lifetime of the SSD.
|
Move only the remains of interesting RQG runs to the SSD.
|
Remaining and/or new problems:
|
- a tool for preventing significant paging + corresponding OS setup needs to exist
|
Exists at least for my variant of RQG + my boxes running Ubuntu
|
- a tool checking the outcome of an RQG run and deleting stuff not of interest needs to exist
|
One exists, but its functionality might not be sufficient, for example for "We need at least one sample of crash type T but not 10 or 20".
|
If some RQG test campaign harvests, let's say, 300 fails * 2 GB per fail, then the danger for the lifetime is not that big, but 600 GB might be too much
|
for a 1 - 2 TB SSD.
|
- The virtual memory consumption (~ 0.5 - 1 GB for the vardir of the RQG run alone + ~ 2 GB for "rr"), together with the restriction that paging needs to be prevented
|
(meaning ~ 2.5 to 3 GB per concurrent RQG run), is problematic when combined with the frequently met condition that the more the CPUs are overloaded, the more
|
failures we catch per elapsed runtime. Many boxes have a nice number of CPU cores but not that much RAM.
|
Example: 4 real cores * 2 (more or less regardless of whether Hyperthreading is supported) = 8 concurrent RQG runs * ~ 3 GB -> 24 GB
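|
The RAM example can be turned into a small budget calculation. The following Python sketch only restates the arithmetic from the text; the function names and the optional reserve for the OS are assumptions made for illustration.

```python
def required_ram_gb(cores: int, runs_per_core: int, gb_per_run: float) -> float:
    """RAM needed so that all concurrent RQG runs fit without paging."""
    return cores * runs_per_core * gb_per_run


def max_concurrent_runs(ram_gb: float, gb_per_run: float, reserve_gb: float = 0.0) -> int:
    """Number of concurrent RQG runs a box can host without paging.

    reserve_gb is a hypothetical safety margin for the OS and other processes.
    """
    usable = max(0.0, ram_gb - reserve_gb)
    return int(usable // gb_per_run)


if __name__ == "__main__":
    # 4 real cores * 2 runs per core * ~ 3 GB per run, as in the example above.
    print(required_ram_gb(cores=4, runs_per_core=2, gb_per_run=3.0))
    # A box with 16 GB RAM would be limited by memory, not by its CPU cores.
    print(max_concurrent_runs(ram_gb=16, gb_per_run=3.0, reserve_gb=2.0))
```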
|
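The requirement "at least one sample of crash type T but not 10 or 20" from the list above amounts to a per-type quota. A minimal Python sketch of such a selection follows; it is not the existing tool, and the representation of a run as a (run_id, crash_type) pair with None for "not of interest" is an assumption.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple


def select_runs_to_keep(
    results: Iterable[Tuple[str, Optional[str]]],
    max_per_type: int = 1,
) -> Tuple[List[str], List[str]]:
    """Split runs into (keep, delete) with at most max_per_type runs per crash type.

    A crash_type of None marks a run whose result is not of interest.
    """
    kept_per_type = defaultdict(int)
    keep: List[str] = []
    delete: List[str] = []
    for run_id, crash_type in results:
        if crash_type is not None and kept_per_type[crash_type] < max_per_type:
            kept_per_type[crash_type] += 1
            keep.append(run_id)
        else:
            delete.append(run_id)
    return keep, delete


if __name__ == "__main__":
    runs = [("run1", "T"), ("run2", "T"), ("run3", None), ("run4", "U")]
    keep, delete = select_runs_to_keep(runs, max_per_type=1)
    print("move to SSD:", keep)
    print("delete from vardir:", delete)
```

Only the runs in the keep list would then be moved from the tmpfs to the SSD, which bounds the storage per campaign by the number of distinct crash types rather than by the number of fails.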
|
State 2020-03
|
1. rqpl.pl now supports
|
--rr --> If assigned, start the DB server with "rr record" (--> lib/DBServer/MySQL/MySQLd.pm)
|
--rr_options --> If assigned, then pass its value to the call of "rr"
|
2. Two not yet pushed shell scripts which unpack the archive of some failing RQG run
|
and run an "rr replay"
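|
How the two options could influence the server start can be sketched as follows. This is not the code in lib/DBServer/MySQL/MySQLd.pm (which is Perl); the function name, the server command line and the datadir path are hypothetical and only illustrate prefixing the server command with "rr record" plus the optional extra options.

```python
import shlex
from typing import List, Optional, Sequence


def build_server_argv(
    server_argv: Sequence[str],
    use_rr: bool = False,
    rr_options: Optional[str] = None,
) -> List[str]:
    """Return the command line for starting the DB server, optionally under "rr record"."""
    if not use_rr:
        return list(server_argv)
    argv = ["rr", "record"]
    if rr_options:
        # The value of --rr_options is passed through to "rr" unchanged.
        argv += shlex.split(rr_options)
    return argv + list(server_argv)


if __name__ == "__main__":
    # ASSUMPTION: hypothetical server command line and datadir.
    server = ["mysqld", "--datadir=/dev/shm/vardir/1/data"]
    print(build_server_argv(server))
    print(build_server_argv(server, use_rr=True, rr_options="--chaos"))
```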
|
The storage space consumption is significantly less critical than described above
|
- in case only the DB server and not everything gets traced by rr
|
== This is what is currently implemented.
|
- in case the writing of core files is prevented
|
I have some experimental code doing exactly that, but a good backtrace is frequently of
|
significant value. Hence I need a shell script which generates such a backtrace based on
|
"rr replay" before enabling that.
|
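A backtrace generator based on "rr replay" might look roughly like the following Python sketch (the planned tool is a shell script; Python is used here only for illustration). It assumes that "rr replay" forwards everything after "--" to gdb and that replaying up to the fatal signal followed by a backtrace of all threads is sufficient. Both assumptions should be checked against "rr replay --help" and a real trace; the trace directory is hypothetical.

```python
import subprocess
from typing import List


def build_backtrace_argv(trace_dir: str) -> List[str]:
    """Command line that replays a trace non-interactively and prints backtraces."""
    return [
        "rr", "replay", trace_dir,
        "--",                       # ASSUMPTION: the rest is handed over to gdb
        "-batch",                   # gdb exits after processing the commands
        "-ex", "continue",          # replay until the recorded process stops
        "-ex", "thread apply all backtrace",
    ]


def generate_backtrace(trace_dir: str, timeout_s: int = 600) -> str:
    """Run the replay and return whatever gdb printed on stdout."""
    completed = subprocess.run(
        build_backtrace_argv(trace_dir),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    return completed.stdout


if __name__ == "__main__":
    # ASSUMPTION: hypothetical location of an unpacked trace.
    print(" ".join(build_backtrace_argv("/dev/shm/vardir/1/rr/mysqld-0")))
```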
|