[MDEV-13876] galera.MW-328A failed in buildbot with wrong result or timeout Created: 2017-09-22 Updated: 2018-09-14 Resolved: 2018-09-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Tests |
| Affects Version/s: | 10.1, 10.2 |
| Fix Version/s: | 10.2.18, 10.3.10, 10.1.37 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
http://buildbot.askmonty.org/buildbot/builders/kvm-deb-xenial-amd64/builds/2004
It can also time out:
|
| Comments |
| Comment by Michael Widenius [ 2017-12-06 ] | ||||||||||||||||||
|
The above happens from time to time. When applying the included patch, which is unrelated and mainly decreases the time that the LOCK_thd_kill mutex is held, this happens much more frequently. | ||||||||||||||||||
| Comment by Elena Stepanova [ 2017-12-11 ] | ||||||||||||||||||
|
Still fails in a similar fashion on the latest 10.1:
| ||||||||||||||||||
| Comment by Julius Goryavsky [ 2017-12-20 ] | ||||||||||||||||||
|
I made a new experimental version of the patch, which should fix the problems with the MW-328A test. This is a very difficult bug because it depends heavily on the speed of the machine and is rarely reproduced in my environment. In my view, it is caused by memory write reordering between the updates of mysys_var->current_mutex and mysys_var->current_cond.

In the main branch of the code (the ::awake() method in sql_class.cc) there are additional checks that should protect us from problems caused by a zero current_mutex value while current_cond is nonzero. However, this protection is somewhat heuristic and requires a loop with a sleep() call, which can cause the server to freeze suddenly and hinders real-time operation. There are also many other places in the code where a broadcast is made without these additional checks.

In addition, the update of the mysys_var->abort variable (which is checked in the mysys routines) sometimes occurs before the mutex is acquired and sometimes after. In the latter case we may also have problems with write order inversion or inconsistent memory reads on some architectures. However, the nulling of mysys_var->current_mutex and mysys_var->current_cond is already protected by a mutex. I propose to get rid of all these potential pitfalls by protecting writes to these variables with the same mutex.

I also found many places where the thd->killed flag is not checked, which can lead to a lost broadcast and may hang the server for a long time. I tried to add a check of thd->killed after enter_cond() everywhere.

My pull request for this patch is here: https://github.com/MariaDB/server/pull/520 | ||||||||||||||||||
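To illustrate the idea in the comment above, here is a minimal standalone sketch in standard C++. It is not the code from the pull request. The names `WaitSlot`, `guard`, `wait_until_ready_or_killed` and `demo_kill_wakes_waiter` are hypothetical, and `enter_cond`, `exit_cond` and `awake` only borrow the names of the server functions. The sketch assumes that both pointers are written and read under one guard mutex, and that the waiter rechecks the kill flag after registering. Unlike the server, the waiter registers before it locks the wait mutex, so the killer can take the guard and then the wait mutex without a trylock-and-sleep loop.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the per-thread mysys_var structure.
struct WaitSlot {
    std::mutex guard;                                 // protects the two pointers below
    std::mutex *current_mutex = nullptr;
    std::condition_variable *current_cond = nullptr;
    std::atomic<bool> abort{false};                   // plays the role of the kill flag
};

// Publish both pointers together, so a reader holding `guard`
// can never observe one of them set and the other one null.
void enter_cond(WaitSlot &slot, std::mutex &m, std::condition_variable &c) {
    std::lock_guard<std::mutex> g(slot.guard);
    slot.current_mutex = &m;
    slot.current_cond = &c;
}

// Clear both pointers under the same guard.
void exit_cond(WaitSlot &slot) {
    std::lock_guard<std::mutex> g(slot.guard);
    slot.current_mutex = nullptr;
    slot.current_cond = nullptr;
}

// Killer side: set the flag, then broadcast if a waiter is registered.
// Returns true if a broadcast was sent.
bool awake(WaitSlot &slot) {
    std::lock_guard<std::mutex> g(slot.guard);
    slot.abort.store(true);
    // Invariant provided by the guard: both set or both null.
    assert((slot.current_mutex == nullptr) == (slot.current_cond == nullptr));
    if (slot.current_cond == nullptr)
        return false;
    // Taking the wait mutex ensures the broadcast cannot fall between
    // the waiter's predicate check and the start of its wait.
    std::lock_guard<std::mutex> w(*slot.current_mutex);
    slot.current_cond->notify_all();
    return true;
}

// Waiter side: register first, then check the kill flag before and
// during the wait. Returns true if the wait ended because of a kill.
bool wait_until_ready_or_killed(WaitSlot &slot, std::mutex &m,
                                std::condition_variable &c, const bool &ready) {
    enter_cond(slot, m, c);
    bool killed;
    {
        std::unique_lock<std::mutex> lk(m);
        c.wait(lk, [&] { return ready || slot.abort.load(); });
        killed = slot.abort.load();
    }
    exit_cond(slot);
    return killed;
}

// The waiter waits for a condition that never becomes true; only the
// kill can release it, whichever thread happens to run first.
bool demo_kill_wakes_waiter() {
    WaitSlot slot;
    std::mutex m;
    std::condition_variable c;
    bool ready = false;
    bool killed = false;
    std::thread waiter([&] { killed = wait_until_ready_or_killed(slot, m, c, ready); });
    awake(slot);
    waiter.join();
    return killed;
}
```

If `awake` runs before the waiter has registered, no broadcast is sent, but the waiter still sees the flag when it checks the predicate after `enter_cond`. This is the lost-broadcast case that the missing thd->killed checks leave open.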
| Comment by Seppo Jaakola [ 2017-12-20 ] | ||||||||||||||||||
|
MW-328A.test also shows sporadic failures in the upstream MySQL version. The test has been made deterministic by the commits tagged MW-418, and this work should be merged into MariaDB as well. The patch by Julius may need a separate mtr test for verification. |