[MXS-351] Router error handling can cause crash by leaving dangling DCB pointer Created: 2015-09-05 Updated: 2018-01-15 Resolved: 2015-11-23 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core, mariadbbackend, readconnroute, readwritesplit, schemarouter |
| Affects Version/s: | 1.2.0 |
| Fix Version/s: | 1.3.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | martin brampton (Inactive) | Assignee: | martin brampton (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Any |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
It appears that the action of the router error handling (for example handleError in readconnroute.c line 847ff) is unsafe. When a backend "error" occurs, it is picked up by the protocol error handler in mysql_backend.c (gw_backend_hangup at line 1046ff). This calls the router error handler with various relevant data blocks. At present the protocol calls dcb_close, passing the DCB relating to the backend that has the "error". However, the router seems to need to do more work. There is a pointer in the router's session to the backend DCB, and if the DCB is to be closed, the pointer needs to be removed. Otherwise, the DCB gets closed and freed but is subsequently referred to from the link in the router session, which can cause a crash. The provisional solution, implemented in branch All line numbers refer to |
| Comments |
| Comment by martin brampton (Inactive) [ 2015-09-05 ] |
|
The attached file shows information obtained through GDB indicating the series of events that leads to the crash. All routers require amendment, as well as the solution requiring validation. It is hard to know for sure whether the fix is effective, because sometimes tests were frequently crashing, but at other times the same test can be run repeatedly without any problem. This is most likely the result of differences in the environment of the VPSs used for testing, which are outside my control. So far, running tests against the modified code does not result in a crash. |
| Comment by martin brampton (Inactive) [ 2015-09-05 ] |
|
Note that the provisional solution has been applied only to the read connection router. All routers will require modification. |
| Comment by Dipti Joshi (Inactive) [ 2015-09-07 ] |
|
johan.wikman as noted in Martin's last comment, the fix that he has done is only for read connection router - and needs to be done for all the routers. This indicates there needs to be common router functionality that all the routers inherit so that when we such universal bug across all the routers - the code should not have to be duplicated. |
| Comment by Dipti Joshi (Inactive) [ 2015-09-08 ] |
|
martin brampton Will this be closed when |
| Comment by martin brampton (Inactive) [ 2015-09-08 ] |
|
I am working on correcting all routers. It is not feasible for this to be common functionality. The specific reason the error handling is in the router is that each router may have its own techniques and data storage layout, and the protocol does not know the details of these. The code therefore cannot be common. In this particular case, each error handler shared the characteristic of having incomplete functionality, and each needs to be fixed to deal with this. The corrections will not be identical code. The time to fix is mainly involved in checking, so far as possible, that all the functionality around these error handlers is sound. Yes, this and |
| Comment by Dipti Joshi (Inactive) [ 2015-09-10 ] |
|
martin brampton Can we put this in "In Progress" status ? Thanks, |
| Comment by martin brampton (Inactive) [ 2015-09-11 ] |
|
Yes. Why doesn't JIRA do this automatically when work is logged? |
| Comment by Johan Wikman [ 2015-11-23 ] |
|
martin brampton With 329 merged to develop, can this be closed now? |
| Comment by martin brampton (Inactive) [ 2015-11-23 ] |
|
Fixed. |