[MXS-956] Maxscale crash: Removing DCB 0x7fbf94016760 but was in state DCB_STATE_DISCONNECTED which is not legal for a call to dcb_close
Created: 2016-10-25 Updated: 2016-12-14 Resolved: 2016-12-14 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | 2.0.1, 2.0.2 |
| Fix Version/s: | 2.0.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Marlin Cremers | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | galera |
| Environment: |
root@maxscale1:~# uname -a |
| Issue Links: |
| Sprint: | 2016-21, 2016-22, 2016-23 |
| Description |
|
MaxScale crashed with the following error messages in the log:
Core dump:
|
| Comments |
| Comment by Marlin Cremers [ 2016-11-02 ] |
|
I'm getting the following stack trace:
|
| Comment by markus makela [ 2016-11-04 ] |
|
We've built a package for Ubuntu Xenial for testing purposes. The package can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-release-nov4/mariadb-maxscale/ The package was built from commit 87e94f6bc6e09d274bba70d62083b1349688ff33. |
| Comment by markus makela [ 2016-11-04 ] |
|
This package did not fix the problem: the same crash occurred with the same log output. This suggests that the cause isn't a simple case of a wrong backend server reference state but something else. |
| Comment by markus makela [ 2016-11-05 ] |
|
mcremers It would also be good to test without master_failure_mode=error_on_write to see whether the read-only functionality is the root cause of the crash. |
| Comment by markus makela [ 2016-11-11 ] |
|
We've fixed a few bugs in the reconnection logic. Here's an Ubuntu Xenial package for testing built from commit 7ef8b187b541410810fc090814de48a107d729b7: Although we haven't been able to reproduce the crash, we believe that the reason for the double closing of a DCB could be related to how a server reference got misplaced after the list of references was sorted to pick the best candidates. We believe that this has been fixed and that router_options=master_failure_mode=error_on_write should work again. mcremers, if it is possible, please test with the original configuration and the new package. |
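As an illustration only, the following C sketch shows one general way a reference can get "misplaced" when a list is sorted: a position remembered before the sort points at a different entry afterwards. This is not MaxScale code and is not confirmed to be the actual defect. Every name here (sketch_backend_ref_t, sketch_sort_refs, sketch_find_ref) and the sort key are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a backend server reference. */
typedef struct {
    const char *name;   /* server name, used here as the identity */
    int connections;    /* assumed sort key: fewer connections sorts first */
} sketch_backend_ref_t;

/* qsort comparator: ascending by connection count. */
static int sketch_by_connections(const void *a, const void *b)
{
    const sketch_backend_ref_t *x = a;
    const sketch_backend_ref_t *y = b;
    return (x->connections > y->connections) - (x->connections < y->connections);
}

/* Sort the list in place to pick the best candidates first. */
static void sketch_sort_refs(sketch_backend_ref_t *refs, size_t n)
{
    qsort(refs, n, sizeof(refs[0]), sketch_by_connections);
}

/* Look a reference up by identity instead of by remembered position.
 * Returns the current index, or -1 if the name is not in the list. */
static int sketch_find_ref(const sketch_backend_ref_t *refs, size_t n,
                           const char *name)
{
    assert(refs != NULL && name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(refs[i].name, name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

In this sketch, an index saved before calling sketch_sort_refs may refer to a different server afterwards, so closing "the entry at that index" could close the wrong connection. Looking the entry up again with sketch_find_ref after the sort avoids relying on the old position.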
| Comment by Marlin Cremers [ 2016-11-13 ] |
|
The daemon just crashed with the latest build and the original config. |
| Comment by markus makela [ 2016-11-13 ] |
|
The log message seems to tell us that it was a slave DCB, i.e. a connection, that was closed twice. With this new information, it seems that the problem could lie within the closing of DCBs in the processing of session commands and the recovery of slave connections. To test this hypothesis, please add back the disable_sescmd_history=true to the router_options parameter. If the crash does not occur, it is likely that this is the reason. Meanwhile, we will focus on investigating the new areas of code that this additional information has provided us and take a new look at the slave connection code. |
| Comment by markus makela [ 2016-11-15 ] |
|
We've identified a few problem cases with how the connections are handled. The reconnection of slave servers could be triggered even if no slave was disconnected. This could happen if the master connection was interrupted unexpectedly and the master would still be in Master state. These are now fixed by always treating the disconnection of the master the same way. We've made a special testing build, with extra log messages added where the crash can happen, which can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-nov15/mariadb-maxscale/ This was built from the 2.0-markusjm branch and commit 9556ee0f01bf12b1baeb93ad9483b87c37320c1b. mcremers Please test this with the original configuration posted. We should get very detailed information if the crash still occurs. |
| Comment by markus makela [ 2016-11-21 ] |
|
We discussed this on #maxscale on FreeNode and it seems there have been no problems with the latest package. I'll close this bug and it can be reopened if it still shows up with 2.0.2. |
| Comment by Marlin Cremers [ 2016-11-27 ] |
|
Just got this in the log.
|
| Comment by markus makela [ 2016-11-27 ] |
|
Seems we didn't fix the true cause of the problem. Can you provide the stacktrace for the crash? |
| Comment by markus makela [ 2016-11-29 ] |
|
The log messages would suggest that a DCB not related to the session (already closed) gets processed, which triggers the reconnection logic. The reconnection seems to be the thing which causes these errors. I've built a new package from commit ee3c42cff781bec3bbbf1898f5b248aaee92fefa with more detailed error messages about where the DCB was closed and where the attempt to close it again is being made. It also adds extra checks before the DCB is closed and warns if a reconnection occurs when it shouldn't happen. The packages can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-nov29/mariadb-maxscale/ |
| Comment by markus makela [ 2016-12-01 ] |
|
Based on a chat on IRC, MaxScale still crashes. I've added some extra logging about whether the DCB which is being closed has a corresponding backend server reference. I've also added a somewhat temporary fix which doesn't close the DCB in closeSession if it isn't in a valid state. If possible, please test with this new package: http://max-tst-01.mariadb.com/ci-repository/2.0-markusjm-dec1/mariadb-maxscale/ The packages were built from commit 1272cccf537aad8b824e1df978e0454e5b3b6c40. |
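The idea behind the temporary fix (skip the close when the DCB is not in a valid state) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from the commit above. The types, the two-state model and the function name are hypothetical; only the notion of a "disconnected" state comes from the error message in the issue title.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified state model; the real DCB has more states. */
typedef enum {
    SKETCH_DCB_ACTIVE,        /* connection is live and may be closed */
    SKETCH_DCB_DISCONNECTED   /* connection has already been closed */
} sketch_dcb_state_t;

typedef struct {
    sketch_dcb_state_t state;
    int close_calls;          /* counts closes that were actually performed */
} sketch_dcb_t;

/* Close the DCB only if it is still in a state where closing is legal.
 * Returns true if the close was performed, false if it was skipped. */
static bool sketch_close_if_valid(sketch_dcb_t *dcb)
{
    assert(dcb != NULL);
    if (dcb->state != SKETCH_DCB_ACTIVE) {
        /* Already closed elsewhere: skip instead of closing twice. */
        return false;
    }
    dcb->state = SKETCH_DCB_DISCONNECTED;
    dcb->close_calls++;
    return true;
}
```

With a guard like this, a second close attempt on the same DCB becomes a no-op rather than an illegal call. It hides the symptom without explaining why the second attempt happens, which is why it is described as temporary.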
| Comment by markus makela [ 2016-12-07 ] |
|
A discussion on IRC pointed out that the temporary fix does indeed prevent the crash. It again pointed to some strange behavior in the error handling but it also uncovered a bug that could've allowed masters with inconsistent state to be used. Simplifying the connection closing logic in the error handler should give us a better guarantee that the backend server references and the actual connections stay in sync. |
| Comment by markus makela [ 2016-12-12 ] |
|
We've fixed a minor bug with the error_on_write mode and some of the error handling. The code is also a bit simpler and all connections and the related backend references are closed at the same time. The latest packages for the 2.0.3 release candidate can be found here: http://max-tst-01.mariadb.com/ci-repository/2.0-release-dec12/mariadb-maxscale/ The packages were built from commit 15a8675fca53da3417b8c0155e43d91e1173f208. |
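Closing a connection and its backend reference "at the same time" could look roughly like the sketch below, where a single function updates both so they cannot drift apart. This is an assumption-laden illustration, not the 2.0.3 code. The types sketch_conn_t and sketch_backend_t and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backend connection. */
typedef struct {
    bool open;
} sketch_conn_t;

/* Hypothetical stand-in for a backend server reference. */
typedef struct {
    sketch_conn_t *conn;   /* NULL when no connection is attached */
    bool in_use;           /* true while the session uses this backend */
} sketch_backend_t;

/* Close the connection and release the reference in one step.
 * Returns true if something was closed, false if there was nothing to do. */
static bool sketch_close_backend(sketch_backend_t *ref)
{
    assert(ref != NULL);
    if (!ref->in_use || ref->conn == NULL) {
        return false;
    }
    ref->conn->open = false;  /* close the connection ... */
    ref->conn = NULL;         /* ... and detach it from the reference */
    ref->in_use = false;
    return true;
}

/* Close every backend of a session; returns how many were closed. */
static size_t sketch_close_all(sketch_backend_t *refs, size_t n)
{
    size_t closed = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_close_backend(&refs[i])) {
            closed++;
        }
    }
    return closed;
}
```

Because the reference is cleared in the same function that closes the connection, calling sketch_close_all a second time finds nothing left to close.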
| Comment by markus makela [ 2016-12-14 ] |
|
With the fix mentioned earlier, the crash is prevented and this is OK for 2.0.3. |