[MDEV-12084] Test timeouts on CentOS - interaction between server and client stops Created: 2017-02-19 Updated: 2020-06-04 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Scripts & Clients |
| Affects Version/s: | 10.2 |
| Fix Version/s: | 10.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Georg Richter |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | 10.2-ga | ||
| Description |
|
We are getting random (rare) timeouts in buildbot on kvm-bintar-centos5-amd64. Here is an example:
Normally the test takes ~20 seconds there. The problem is reproducible by running the test on the VM with a very high --repeat value; it happens, very roughly, once in 300-400 runs. The main part of the test runs 16K single-row INSERTs into a MyISAM table. There is no concurrency; the INSERTs are executed one by one. The problem occurs somewhere during this process, at random places, not at a specific row. When it happens, the server is still reachable, and SHOW PROCESSLIST shows the connection as idle:
(connection 11 is the one which has been performing the updates). The table is healthy and responds to SELECTs. There is no disk space problem. When it stops, it stops seemingly forever. However, running gdb on the server sometimes "wakes it up" and the flow resumes. Example of a stack trace from the server when it's in this state:
Example of a stack trace from mysqltest:
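A side note on the reproduction rate quoted in the description (roughly once in 300-400 runs, ~20 seconds per run): the sketch below estimates how large a --repeat value is needed to hit the hang with a given probability. It assumes that runs are independent and uses 1/350 as the per-run hang probability; both are assumptions made for illustration, not measurements, and the function names are hypothetical.

```python
import math


def hit_probability(p_hang, runs):
    """Chance of seeing at least one hang in `runs` independent runs."""
    return 1.0 - (1.0 - p_hang) ** runs


def runs_needed(p_hang, confidence):
    """Smallest number of runs that reaches the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))


# Assumption: midpoint of "once in 300-400 runs" from the description.
P_HANG = 1 / 350
# "~20 seconds" per run, as stated in the description.
SECONDS_PER_RUN = 20

n95 = runs_needed(P_HANG, 0.95)
hours = n95 * SECONDS_PER_RUN / 3600
print(f"P(hang within 1000 runs) = {hit_probability(P_HANG, 1000):.3f}")
print(f"runs for 95% confidence  = {n95} (about {hours:.1f} hours)")
```

Under these assumptions, a --repeat value of roughly a thousand gives about a 95% chance of hitting the problem, which means several hours of wall-clock time per attempt.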
|
| Comments |
| Comment by Elena Stepanova [ 2017-03-11 ] |
|
Possibly also related:
|
| Comment by Daniel Lee (Inactive) [ 2017-05-05 ] |
|
I also ran into this issue when I was testing TEXT/BLOB in ColumnStore. It gets triggered randomly in different tests. mysql was running at 100% CPU, but the test was stuck.
Backtrace of the mysql process:
|
| Comment by Elena Stepanova [ 2017-05-06 ] |
|
dleeyh, could you please specify the environment? |
| Comment by Elena Stepanova [ 2017-05-06 ] |
|
I tried the same test
on CentOS 5, 6, 7.3 (vm-centos5-amd64-build.qcow2, vm-centos6-amd64-build.qcow2, vm-centos73-amd64-build.qcow2), same 10.2 source tarball, same build options (cmake . && make). |
| Comment by Elena Stepanova [ 2017-05-07 ] |
|
Another curious observation. I've set testcase-timeout=3000 and suite-timeout=3000 (that is, 50 hours). When I hit the problem with these settings, the test hung for 8 hours and then continued normally. 8 hours must be the connection timeout, so there is no big mystery in the value; but it did not fail and did not disconnect, it just continued normally after the timeout, whatever that means (and later, after a number of iterations, it hit the problem again in the same run). |
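For reference, the arithmetic behind the two durations above can be sketched as follows. It assumes that the mtr timeouts are given in minutes and that "the connection timeout" is the server's wait_timeout at its default of 28800 seconds; the comment does not name the variable, so that mapping is an assumption.

```python
# Assumption: mtr's testcase-timeout and suite-timeout are given in minutes.
TESTCASE_TIMEOUT_MIN = 3000
# Assumption: the "connection timeout" is wait_timeout, default 28800 seconds.
WAIT_TIMEOUT_SEC = 28800


def minutes_to_hours(minutes):
    """Convert a duration in minutes to hours."""
    return minutes / 60


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


harness_limit_h = minutes_to_hours(TESTCASE_TIMEOUT_MIN)
idle_limit_h = seconds_to_hours(WAIT_TIMEOUT_SEC)

print(f"test harness gives up after {harness_limit_h:g} hours")
print(f"idle connection times out after {idle_limit_h:g} hours")
# The harness limit is far larger than the idle limit, so the 8-hour stall
# can end on the server side long before mtr would abort the test.
print(f"harness limit exceeds idle limit: {harness_limit_h > idle_limit_h}")
```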
| Comment by Andrew Hutchings (Inactive) [ 2017-05-07 ] |
|
Just managed to reproduce this on CentOS 7, and I can semi-easily reproduce it this way. This is my create table:
The script is test2.sh. Create the table in test, run the script, and hit Ctrl-C. This might take a few tries, but when it produces "Ctrl-C – query killed. Continuing normally." the client hangs with 100% CPU usage and that stack trace.
GDB showing the loop it is stuck on:
|
| Comment by Elena Stepanova [ 2017-05-07 ] |
|
My variation of the problem is a bit different: in my case the client does not consume any CPU, it just seems to be stalling. The stack trace is very similar. I think both variants need to be analyzed to make sure it's really the same issue. Given the new evidence that the problem is not limited to the EOL-ed CentOS 5, I think it needs to be considered high priority. |
| Comment by Georg Richter [ 2017-05-08 ] |
|
I think this issue was fixed already in commit 44a740c348544acce35f289221a945941dc31979. I also wonder why there is a buildbot running CentOS5. According to https://mariadb.com/kb/en/mariadb/deprecation-policy/ CentOS5 is not supported anymore. |
| Comment by Andrew Hutchings (Inactive) [ 2017-05-08 ] |
|
Agreed, that commit fixes the problem in my CentOS 7 tests. Unfortunately that commit isn't in the 10.2 submodule checkout yet. |
| Comment by Elena Stepanova [ 2017-05-08 ] |
|
serg, do you expect that commit 44a740c348544acce35f289221a945941dc31979 will make it into the 10.2 submodule before the release? I think it should. georg, we don't build RPM packages on CentOS 5 anymore, but there is an "old" bintar that we provide for systems with old glibc etc. Historically, it's built on CentOS 5, and it stays this way so that we don't introduce sudden differences in that bintar. I'll try to re-check the mentioned commit on CentOS 5; but better still, let's have it in the 10.2 tree, since it's already known to fix the problem on CentOS 7. |