[MXS-559] Crash due to debug assertion in readwritesplit Created: 2016-01-26 Updated: 2016-02-01 Resolved: 2016-02-01 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | 1.3.0 |
| Fix Version/s: | 1.3.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | markus makela | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
Originally reported by engel75: It looks like the "jan21" version is still crashing. The crash occurred while dumping a DB into the cluster through IP 10.0.248.202, on which the RW split router in front of 5 MariaDB 5.5.47 nodes is listening:
/var/log/syslog from one of the Galera DB nodes:
|
| Comments |
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
The .sql dump got 4.7GB. After adding
to the beginning of the file the dump import failed again:
but Maxscale did not crash this time. So I enabled "log_info" (while "log_to_shm" is still disabled) to see which SQL statement fails the import but it did NOT fail this time. The "log_bin" filesystem size is 20GB and still got 11GB left. I'll import the dump once again to see what happens this time. | ||||||||||||||||
| Comment by markus makela [ 2016-01-26 ] | ||||||||||||||||
|
It crashes on line 1691 in readwritesplit due to a debug check failure on the master connection's DCB. The use of the master's DCB isn't required as it could be replaced with the client's DCB since both of them use the same authentication data as was found out in | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
Ok sounds like an easy fix. Could we please get a new package including the fix and debugging enabled after you are ready? | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
no crash this time but the import failed again.
| ||||||||||||||||
| Comment by markus makela [ 2016-01-26 ] | ||||||||||||||||
|
Here's a build with the fix added for Ubuntu Trusty: http://maxscale-jenkins.mariadb.com/ci-repository/release-1.3.0-jan26/ | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
great! Testing this one now... thank you! | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
same problem with the new jan26 version:
Testing the same command via haproxy now to see if there is any problem related to our mariaDB Galera configuration. | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-26 ] | ||||||||||||||||
|
success via haproxy:
The problem is definitely maxscale related. | ||||||||||||||||
| Comment by markus makela [ 2016-01-27 ] | ||||||||||||||||
|
The error that is logged is related to error handling for backend server connections. It suggests that MaxScale either lost the connection to the server or thought it had lost it because the timeout values are too low. It might be a good idea to try higher timeouts for the monitors if the database is slow during the dump (see the Monitor documentation). If that is the only relevant error message logged and it actually was caused by a timeout, there should be more messages. I'll continue investigating this and see if I can reproduce it. | ||||||||||||||||
| Comment by markus makela [ 2016-01-27 ] | ||||||||||||||||
|
One thing that could also help is to do the dump with the readconnroute module if possible. It uses connection based load balancing and is faster than the readwritesplit when all writes go to the master server. If it is possible for you to confirm that the dump is successful with the readconnroute router, we can confirm that it is a problem with the readwritesplit module. For more details about the readconnroute module, please read the documentation here: https://github.com/mariadb-corporation/MaxScale/blob/release-1.3.0/Documentation/Routers/ReadConnRoute.md | ||||||||||||||||
| Comment by markus makela [ 2016-01-27 ] | ||||||||||||||||
|
I was able to reproduce something similar to this:
Here the monitor isn't able to connect to the master because of the intensity of the dump restoration that was happening but that was explicitly logged into the error logs. | ||||||||||||||||
| Comment by markus makela [ 2016-01-28 ] | ||||||||||||||||
|
engel75 Could you try the dump again without the info or debug log levels on? That would filter out some of the extra noise and allow errors to be spotted more easily. If you can also increase the timeouts for the monitor to about 10-15 seconds by adding backend_connect_timeout=10 and backend_read_timeout=10 to the monitor definition, we can rule out the monitor timing out and closing the connection. | ||||||||||||||||
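For reference, a minimal sketch of a monitor definition with the two timeouts added. The section name, server names, credentials and monitor_interval value are hypothetical placeholders, not taken from the reporter's configuration; only the two timeout lines come from the comment above.

```ini
# Hypothetical monitor section; names and credentials are placeholders.
[Galera Monitor]
type=monitor
module=galeramon
servers=server1,server2,server3
user=maxuser
passwd=maxpwd
monitor_interval=10000
# Timeouts suggested in the comment above, in seconds.
backend_connect_timeout=10
backend_read_timeout=10
```

With these values the monitor should wait up to 10 seconds when connecting to or reading from a backend before treating it as unreachable, which helps rule out a monitor timeout as the cause of the failed import.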
| Comment by Florian Engelmann [ 2016-01-28 ] | ||||||||||||||||
|
OK - config looks like:
Same result:
Maxscale did not crash, so no core dump. | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-28 ] | ||||||||||||||||
|
btw. importing the dump via the readconn router finished successfully:
I'll test this once again to be sure... | ||||||||||||||||
| Comment by Florian Engelmann [ 2016-01-28 ] | ||||||||||||||||
|
seems like the readconn router does not cause any trouble. My 2nd import finished successfully too. | ||||||||||||||||
| Comment by markus makela [ 2016-01-28 ] | ||||||||||||||||
|
OK, that's good to hear. One thing that could help is to split the dump into separate statements and execute them one at a time, or to split the dump in two. This way we could find out whether it is always the same statement or whether it seems to be a "random" failure. For the time being, I suggest using the readconnroute to load dumps into a cluster simply because it does a lot less processing and thus is a lot faster. Although using the readwritesplit is not the fastest way to do it, it seems to be a great way to catch bugs in MaxScale. | ||||||||||||||||
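Splitting the dump in two, as suggested above, could be sketched roughly as follows. This is only an illustration: the function name `split_dump` and the file names are hypothetical, and it assumes that a line ending in `;` closes a statement, which does not hold for dumps containing `DELIMITER` blocks (stored routines, triggers).

```python
from pathlib import Path


def split_dump(src, first, second):
    """Copy src into two files, switching at the first statement end past the midpoint.

    Returns the total number of bytes copied.
    """
    half = Path(src).stat().st_size // 2
    written = 0
    in_first = True
    # Stream line by line in binary mode so a multi-GB dump is never held in memory.
    with open(src, "rb") as inp, open(first, "wb") as a, open(second, "wb") as b:
        for line in inp:
            (a if in_first else b).write(line)
            written += len(line)
            # Assumption: a line ending in ";" closes a statement.
            if in_first and written >= half and line.rstrip().endswith(b";"):
                in_first = False
    return written


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    split_dump("dump.sql", "dump_part1.sql", "dump_part2.sql")
```

Importing the two halves separately and repeating the split on whichever half fails narrows down whether one particular statement triggers the error. Note that any session settings at the top of the dump end up only in the first half, so the second half may need them prepended by hand.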
| Comment by markus makela [ 2016-02-01 ] | ||||||||||||||||
|
The crash due to debug assertion has been fixed but the dump of the database still fails. I'll create a new bug report for that issue: |