[MXS-329] The session pointer in a DCB can be null unexpectedly Created: 2015-08-22 Updated: 2015-11-23 Resolved: 2015-11-23 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core |
| Affects Version/s: | 1.2.0 |
| Fix Version/s: | 1.3.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | martin brampton (Inactive) | Assignee: | martin brampton (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
All |
||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Description |
|
In the current develop branch, line 379 of mysql_backend.c is: service_refresh_users(dcb->session->service); this will naturally fail and cause a crash if dcb->session is NULL. It seems that this situation can arise. In the case in point, the DCB was in state DCB_STATE_POLLING, which implies that the DCB had not been processed by the zombie mechanism. Nor was there any indication of corruption. The big question here is whether there are any rules for when dcb->session may be null. Most of the time it isn't NULL, but obviously sometimes it is. How did it come to be NULL? Does code have to assume that dcb->session could become NULL at any time? Or are there known circumstances where it is safe to assume that it will not be NULL? |
| Comments |
| Comment by martin brampton (Inactive) [ 2015-08-23 ] |
|
Where is dcb->session set to NULL? service.c 357 - this seems unnecessary and should probably be removed. The DCB in question (port->listener) was created earlier in the function (and would then have been set to NULL by line 204 of dcb.c) and there appears to be no processing that would have set it to non-NULL. If that is correct, the statement does no harm, but is unhelpful for diagnosing problems by being an irrelevant distraction. session.c 163 - during session creation. This is in a context where session creation has failed. It looks as if session creation could be simpler and safer. At 110 a spinlock is obtained, but the session has only just been created, and nobody knows about it until in line 125 the client DCB is linked to the session. But it isn't clear that the link needs to be made so soon and if it were postponed, the lock would not be required. If the setting of the pointer from DCB to session were left until later, it would not need to be unset at line 163. Personally, I'd prefer line 170 to be return NULL rather than goto return_session. session.c 204 - still in session creation - same argument as for line 163. session.c 223 - still in session creation - unclear why the spinlock on line 216 was required if the link from the client DCB had been postponed. Otherwise as above. Line 235 might be the point at which the DCB has to be linked. Should atomic_add be used on line 237? session.c 355 - is part of the session_unlink_dcb function which may need to set the DCB session link to NULL - but this seems a dangerous operation. Why is it done and is it capable of causing problems? harness_common.c 739, 768. I've no idea what this code is, and is entirely devoid of comments tee.c 589 - I've no idea what is going on here. httpd.c 354 - Not clear why this is needed since the DCB has only been allocated a few lines earlier, and a newly allocated DCB will have session set to NULL, SFAIK. schemarouter.c 1187, 1188 - Why is the session pointer set to NULL prior to calling dcb_close? It will be set to NULL eventually as a result of the call to dcb_close. Is setting it NULL here necessary? It may be dangerous. shardrouter.c 1263, 1264 As for schemarouter. Could common code be factored out? These appear to be all the places in MaxScale where the session pointer in a DCB is set to NULL. It looks as if a number of them could be eliminated. Changing session.c obviously requires care. The session_unlink_dcb function seems to be the area where further investigation is needed, but the setting of DCB session to NULL takes place only in the call from the persistent connections logic. This cannot be the cause of crashes in released code, as persistent connections are not yet in any release. |
| Comment by martin brampton (Inactive) [ 2015-08-23 ] |
|
I may have misunderstood, but there seems to be something of a catch 22 in relation to setting up a listener. This thought derives from looking at the possibility of insisting that a DCB submitted to poll_add_dcb() should have a non-null session. The reason for this idea is that as soon as the DCB has been submitted to the poll list, it is possible that an event will be processed that refers to the DCB. If it has no session, then the event handling may fail since much code assumes that a DCB has a session. The possibly problematic code is at lines 364-397 of service.c. The first action is to start listening and includes a call to poll_add_dcb(). At this point, the DCB does not have a session link. This is then added by a call to session_alloc(). Once this is complete, the DCB does have a session link, but there is a period when it is in the poll list but has no session. Attempting to turn the code around does not work - calling session_alloc first is not possible because there is a test within this function that requires the DCB to be already listening. If poll_add_dcb has not yet been called, the DCB will not be in the "correct" state and so session_alloc fails and the listener is not created. The ramifications of changing session_alloc are not immediately obvious. I suspect that substantial change is not possible because establishing a listener involves binding to the port, and if there is no relevant entry in the poll list it may be possible to lose data (but this is based on my very limited knowledge of this area). |
| Comment by martin brampton (Inactive) [ 2015-08-23 ] |
|
I believe that the answer to the question posed in the original description is that in all normal circumstances, a DCB must have a valid session pointer. This is logical - poll events activate DCB processing, and this cannot be achieved without a session. This implies that there is no reason for code to check for dcb->session being NULL. However, there is probably a borderline flaw in the persistent connections code (even though heavy testing does not produce a crash). Persistent connections may need a mechanism similar to the existing zombie mechanism to be completely robust. There must be other faults of the kind outlined above, since crashes from the field use released code that does not have persistent connections. |
| Comment by martin brampton (Inactive) [ 2015-08-24 ] |
|
The work needed here seems to be:
This set of actions should be sufficient, if fully completed, to prevent crashes caused by dcb->session being null. |
| Comment by Johan Wikman [ 2015-08-31 ] |
|
What's the status of this one. It seems that of the current crashes/blockers, |
| Comment by martin brampton (Inactive) [ 2015-09-01 ] |
|
There seems a reasonable probability that fixing these issues will resolve a number of problems, including the ones mentioned above. This cannot be certain since we do not have precise details for most of the reported crashes, and we cannot be sure of having removed infrequent crashes until there has been extensive testing and live running. However, the intention is to harden the system against the particular class of crashes that occur as a result of the session pointer in a DCB being NULL. At worst, the modified system will report errors and skip processing of poll events that violate that condition. If that happens, we will not have a crash, we will have an explicit error message, and some clue to the cause of the problem rather than just an indirect manifestation. In addition to the work defined above, it was necessary to introduce a dummy session for use by the MySQL backend protocol in the situation where the user has not yet been authenticated. The dummy session is a single static SESSION structure. This work has been done and appears to work but requires further checking and testing. The other software changes have been implemented and subjected to some testing, but further work will be needed. |
| Comment by Dipti Joshi (Inactive) [ 2015-09-04 ] |
|
martin brampton Is this fixed now ? |
| Comment by martin brampton (Inactive) [ 2015-09-05 ] |
|
Probably. Clarifying some of the test behaviours needed a discussion with Massimiliano, who was on holiday until yesterday. Following discussion, further tests brought to light |
| Comment by Johan Wikman [ 2015-11-23 ] |
|
martin brampton I suppose this can be closed now? |
| Comment by martin brampton (Inactive) [ 2015-11-23 ] |
|
Tests that exposed the faults described in this and the related issues now run correctly. |
| Comment by yorkershi [ 2015-11-23 ] |
|
Hi, I want to know when the version 1.3.0 will be released. |