[MXS-351] Router error handling can cause crash by leaving dangling DCB pointer Created: 2015-09-05  Updated: 2018-01-15  Resolved: 2015-11-23

Status: Closed
Project: MariaDB MaxScale
Component/s: Core, mariadbbackend, readconnroute, readwritesplit, schemarouter
Affects Version/s: 1.2.0
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: martin brampton (Inactive) Assignee: martin brampton (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Any


Attachments: Text File CrashingDCBtest.txt    
Issue Links:
PartOf
is part of MXS-329 The session pointer in a DCB can be n... Closed

 Description   

It appears that the action of the router error handling (for example handleError in readconnroute.c line 847ff) is unsafe. When a backend "error" occurs, it is picked up by the protocol error handler in mysql_backend.c (gw_backend_hangup at line 1046ff). This calls the router error handler with various relevant data blocks. At present the protocol calls dcb_close, passing the DCB relating to the backend that has the "error".

However, the router seems to need to do more work. There is a pointer in the router's session to the backend DCB, and if the DCB is to be closed, the pointer needs to be removed. Otherwise, the DCB gets closed and freed but is subsequently referred to from the link in the router session, which can cause a crash.

The provisional solution, implemented in branch MXS-329 (because the problem was exposed during testing of the crash described in MXS-329), is to remove the dcb_close from the protocol. The protocol cannot deal with the router session, because the structure of a router session varies from router to router. The error handler in the router needs to do this. So the error handler in the router code now (at least temporarily) aborts if the passed DCB does not match the DCB pointer in the router session. Otherwise, the pointer in the router session is set to NULL and the DCB is closed.The call to dcb_close in the protocol error handler is removed.

All line numbers refer to MXS-329 as of now.



 Comments   
Comment by martin brampton (Inactive) [ 2015-09-05 ]

The attached file shows information obtained through GDB indicating the series of events that leads to the crash.

All routers require amendment, as well as the solution requiring validation.

It is hard to know for sure whether the fix is effective, because sometimes tests were frequently crashing, but at other times the same test can be run repeatedly without any problem. This is most likely the result of differences in the environment of the VPSs used for testing, which are outside my control. So far, running tests against the modified code does not result in a crash.

Comment by martin brampton (Inactive) [ 2015-09-05 ]

Note that the provisional solution has been applied only to the read connection router. All routers will require modification.

Comment by Dipti Joshi (Inactive) [ 2015-09-07 ]

johan.wikman as noted in Martin's last comment, the fix that he has done is only for read connection router - and needs to be done for all the routers. This indicates there needs to be common router functionality that all the routers inherit so that when we such universal bug across all the routers - the code should not have to be duplicated.

Comment by Dipti Joshi (Inactive) [ 2015-09-08 ]

martin brampton Will this be closed when MXS-329 gets closed ?

Comment by martin brampton (Inactive) [ 2015-09-08 ]

I am working on correcting all routers. It is not feasible for this to be common functionality. The specific reason the error handling is in the router is that each router may have its own techniques and data storage layout, and the protocol does not know the details of these. The code therefore cannot be common. In this particular case, each error handler shared the characteristic of having incomplete functionality, and each needs to be fixed to deal with this. The corrections will not be identical code. The time to fix is mainly involved in checking, so far as possible, that all the functionality around these error handlers is sound.

Yes, this and MXS-329 will be closed at the same time.

Comment by Dipti Joshi (Inactive) [ 2015-09-10 ]

martin brampton Can we put this in "In Progress" status ?

Thanks,
Dipti

Comment by martin brampton (Inactive) [ 2015-09-11 ]

Yes. Why doesn't JIRA do this automatically when work is logged?

Comment by Johan Wikman [ 2015-11-23 ]

martin brampton With 329 merged to develop, can this be closed now?

Comment by martin brampton (Inactive) [ 2015-11-23 ]

Fixed.

Generated at Thu Feb 08 03:58:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.