[MDEV-15850] Develop a successor for the RQG grammar alter_online, Improve RQG Created: 2018-04-12  Updated: 2024-02-05

Status: In Progress
Project: MariaDB Server
Component/s: Tests
Fix Version/s: None

Type: Task Priority: Major
Reporter: Matthias Leich Assignee: Matthias Leich
Resolution: Unresolved Votes: 0
Labels: None


 Description   

We lack a sufficiently thorough test which focuses on concurrent execution
of any (1) table-related DDL and DML, and on nothing else (2).
(1) Even a DDL like DROP SCHEMA or TABLESPACE could affect tables.
(2) Bugs related to the optimizer or the stored program language are out of scope,
so that the test stays efficient during execution, failure analysis and replay.
Features which already exist or are in development, like
https://jira.mariadb.org/projects/MDEV/issues/MDEV-13134
will gain functional coverage from such a test.
The existing RQG grammar conf/runtime/alter_online.* would be a promising base
for such a test.



 Comments   
Comment by Matthias Leich [ 2018-04-23 ]

Developing the grammar and improving RQG requires running RQG tests frequently.
The MariaDB server version used for that is 10.3 or higher, depending on the
progress of development.

Comment by Matthias Leich [ 2020-01-06 ]

*Experiments with "rr" (https://rr-project.org/) in combination with RQG.*
1. Combining "rr" with RQG was rather easy on my box (Ubuntu).
2. rr provides reverse execution under gdb, as promised in its documentation.
   Server developers need to have a look at the features and decide whether it is useful.
3. The storage space consumption of one (rather simple and finally passing) RQG run is, at ~ 2 GB, problematic.
     Attempts to compress that data save less than 10%.
     For comparison:
     One conventional RQG run where the server crashed : Compressed tar archive with datadir+logs+core   ~ 5 MB
     One working day of running RQG test campaigns easily produces
     - on a server dedicated to testing only: > 5000 RQG runs
     - on a notebook used for development of tests and tools plus testing: up to 2000 RQG runs.
     _Variant a_
     Let "rr" write the traces to SSD and immediately delete all traces of runs whose result is not of interest.
     I fear that writing 10000 GB per working day will considerably shorten the lifetime of the SSD.
     _Variant b_
     Let "rr" write the traces into the vardir of the RQG run.  Stick to the default vardir (/dev/shm/vardir, which is a virtual-memory-based tmpfs).
     Take care that we do not get significant paging, so that the writes happen mostly in RAM == no danger for the lifetime of the SSD.
     Move only the remaining RQG runs of interest to the SSD.
     Remaining and/or new problems:
     - A tool for preventing significant paging + a corresponding OS setup needs to exist.
       It exists at least for my variant of RQG + my boxes running Ubuntu.
     - A tool which checks the outcome of an RQG run and deletes stuff not of interest needs to exist.
       One exists, but its functionality might not be sufficient for rules like "We need at least one sample of crash type T but not 10 or 20".
        In case some RQG test campaign harvests, let's say, 300 fails * 2 GB per fail, then the danger for the lifetime is not that big, but 600 GB might be too much
        for some 1 - 2 TB SSD.
     - The virtual memory consumption (~ 0.5 - 1 GB for the vardir of the RQG run alone + ~ 2 GB for "rr"), together with the restriction that paging needs to be prevented
       (means ~ 2.5 to 3 GB per concurrent RQG run), is problematic when combined with the frequently met condition that the more the CPUs are overloaded, the more
       failures we catch per elapsed runtime.  Many boxes have a nice number of CPU cores but not that much RAM.
       Example:  4 real cores * 2 concurrent RQG runs per core (it hardly matters whether Hyperthreading is supported or not) * ~ 3 GB per run -> 24 GB
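
The figures quoted above can be cross-checked with a small calculation. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the per-run sizes are the rough estimates from this comment (using the upper end of ~ 3 GB RAM per concurrent run), not measurements.

```python
# Rough budget for "rr" traces, using the estimates quoted in the comment.
TRACE_GB_PER_RUN = 2.0  # ~ 2 GB of rr trace per RQG run
RAM_GB_PER_RUN = 3.0    # upper end of ~ 2.5 - 3 GB per concurrent RQG run


def trace_volume_gb(runs, gb_per_run=TRACE_GB_PER_RUN):
    """Trace data written for the given number of RQG runs."""
    return runs * gb_per_run


def ram_needed_gb(cores, runs_per_core, gb_per_run=RAM_GB_PER_RUN):
    """RAM needed so that all concurrent runs fit into tmpfs without paging."""
    return cores * runs_per_core * gb_per_run


print(trace_volume_gb(5000))  # Variant a, dedicated server: 10000.0 GB per day
print(trace_volume_gb(300))   # 300 kept failures: 600.0 GB on the SSD
print(ram_needed_gb(4, 2))    # 4 cores, 2 runs per core: 24.0 GB
```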
 
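The retention rule mentioned above ("at least one sample of crash type T but not 10 or 20") could look roughly like the following. This is a minimal sketch and not the existing tool: the function name, the outcome labels and the (run id, outcome) input format are all assumptions made for illustration.

```python
from collections import Counter


def select_runs_to_keep(runs, max_per_type=1, uninteresting=("pass",)):
    """Pick the runs whose rr traces are worth moving from tmpfs to the SSD.

    runs: iterable of (run_id, outcome) pairs in the order the runs finished.
    Keeps at most max_per_type runs per outcome, skips uninteresting outcomes.
    """
    kept = []
    seen = Counter()
    for run_id, outcome in runs:
        if outcome in uninteresting:
            continue  # delete immediately, nothing to analyze
        if seen[outcome] < max_per_type:
            kept.append(run_id)
            seen[outcome] += 1
    return kept


# Hypothetical campaign result: two crashes of type T, one of type U.
campaign = [(1, "pass"), (2, "crash_T"), (3, "crash_T"), (4, "crash_U"), (5, "pass")]
print(select_runs_to_keep(campaign))  # [2, 4]
```

Raising max_per_type trades SSD space for more samples of the same failure.
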
State 2020-03
1. rqpl.pl now supports
    --rr    --> If assigned, start the DB server with "rr record"  (--> lib/DBServer/MySQL/MySQLd.pm)
    --rr_options  --> If assigned, pass that to the call of "rr"
2. Two not yet pushed shell scripts which unpack the archive of some failing RQG run
     and run an "rr replay"
The storage space consumption is significantly less critical than described above
- in case only the DB server and not everything gets traced by rr
  == This is what is currently implemented.
- in case the writing of core files is prevented
  I have some experimental code doing exactly that, but a good backtrace is frequently of
  significant value. Hence I need a shell script which generates such a backtrace based on
  "rr replay" before enabling that.

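How --rr and --rr_options might translate into command lines can be sketched as follows. This is not the code in lib/DBServer/MySQL/MySQLd.pm (which is Perl); the function names, the server command and the trace directory are hypothetical. The sketch relies only on the "rr record" and "rr replay" subcommands and on the _RR_TRACE_DIR environment variable, which rr uses to locate its trace directory.

```python
import os
import shlex


def rr_record_command(server_cmd, rr_options=""):
    """Prefix the server command with "rr record" plus any extra rr options."""
    return ["rr", "record", *shlex.split(rr_options), *server_cmd]


def rr_replay_command(trace_dir):
    """Command for replaying a trace unpacked from the archive of a failing run."""
    return ["rr", "replay", trace_dir]


def rr_environment(vardir):
    """Environment that makes rr write its traces below the vardir of the run."""
    env = dict(os.environ)
    env["_RR_TRACE_DIR"] = os.path.join(vardir, "rr")  # subdirectory name is an assumption
    return env


# Hypothetical server command; only the DB server is traced, not the whole RQG run.
print(rr_record_command(["mysqld", "--no-defaults"], rr_options="--chaos"))
print(rr_replay_command("/dev/shm/vardir/rr/mysqld-0"))
```
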
Comment by Roel Van de Paar [ 2020-03-18 ]

rr-project indeed looks very promising, especially for sporadic failures.

Comment by Matthias Leich [ 2020-03-20 ]

Two bad problems showed up when running certain RQG tests on the strong box "sdp".
1. New set of tests A
    Some DB servers perform a shutdown even though the main part of RQG testing is not finished.
    No part of RQG and its tools seems to have sent an "admin ... shutdown" or SIGTERM.
    The entry in the server error log gives the impression that a SIGTERM arrived.
2. New set of tests B
     Components inside RQG and also calls in the console report "Can't fork".
     ulimit -Su  reports  1541948, which should be far more than sufficient.
The tests seem to have hit a max user processes limit, which seems to be ~ 38000.
 
systemctl status user-<my id>.slice
● user-1002.slice - User Slice of mleich
   Loaded: loaded (/run/systemd/transient/user-1002.slice; transient)
Transient: yes
   Active: active since Thu 2020-03-19 14:36:01 UTC; 28min ago
    Tasks: 32 (limit: 37846)    <================ Here is this limit
 
/proc/sys/kernel/pid_max   reports   114688
The kernel and/or systemd/logind seem to set a default UserTasksMax share of 33%,
which then leads to ~ 38000 processes.
Adding the line
     UserTasksMax=66%
to /etc/systemd/logind.conf fixed both problems above.
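
The relation between pid_max, the UserTasksMax share and the task limit can be checked with a short calculation. The formula below is an assumption: applying the percentage to pid_max - 1 and rounding down happens to reproduce the 37846 shown in the systemctl output, but systemd's actual computation may differ.

```python
def user_tasks_limit(pid_max, percent):
    """Per-user task limit for a percentage-based UserTasksMax.

    Assumption: the share is applied to pid_max - 1 and rounded down.
    """
    return (pid_max - 1) * percent // 100


PID_MAX = 114688  # value of /proc/sys/kernel/pid_max on the box

print(user_tasks_limit(PID_MAX, 33))  # 37846, matches "Tasks: 32 (limit: 37846)"
print(user_tasks_limit(PID_MAX, 66))  # 75693 after setting UserTasksMax=66%
```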

Comment by Matthias Leich [ 2021-11-18 ]

Result of a testing campaign consisting of ~ 10000 RQG tests with --num-cpu-ticks=300:
Neither the number of new bugs found per campaign nor the number of replays of known bugs per campaign went up.
I cannot exclude that there might be bugs which could be replayed better with low values for num-cpu-ticks.
But they do not seem to exist in 
    origin/bb-10.6-MDEV-27058 1b0ee85b48a6a734e566c47c20cb24eae4a7afb7 2021-11-16T19:55:06+02:00

Generated at Thu Feb 08 08:24:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.