[MXS-329] The session pointer in a DCB can be null unexpectedly Created: 2015-08-22  Updated: 2015-11-23  Resolved: 2015-11-23

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 1.2.0
Fix Version/s: 1.3.0

Type: Bug Priority: Major
Reporter: martin brampton (Inactive) Assignee: martin brampton (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

All


Attachments: Text File MaxScaleMXS-329CrashReport.txt    
Issue Links:
PartOf
includes MXS-330 Insufficient error handling in httpd.c Closed
includes MXS-351 Router error handling can cause crash... Closed
Problem/Incident
causes MXS-414 Maxscale crashed every day! Closed
Relates
relates to MXS-207 MaxScale received fatal signal 11 (li... Closed
relates to MXS-415 MaxScale 1.2.1 crashed with Signal 6 ... Closed
relates to MXS-198 MaxScale received fatal signal 11 Closed

 Description   

In the current develop branch, line 379 of mysql_backend.c is: service_refresh_users(dcb->session->service); this will naturally fail and cause a crash if dcb->session is NULL. It seems that this situation can arise.

In the case in point, the DCB was in state DCB_STATE_POLLING, which implies that the DCB had not been processed by the zombie mechanism. Nor was there any indication of corruption.

The big question here is whether there are any rules for when dcb->session may be null. Most of the time it isn't NULL, but obviously sometimes it is. How did it come to be NULL?

Does code have to assume that dcb->session could become NULL at any time? Or are there known circumstances where it is safe to assume that it will not be NULL?



 Comments   
Comment by martin brampton (Inactive) [ 2015-08-23 ]

Where is dcb->session set to NULL?

service.c 357 - this seems unnecessary and should probably be removed. The DCB in question (port->listener) was created earlier in the function (and would then have been set to NULL by line 204 of dcb.c) and there appears to be no processing that would have set it to non-NULL. If that is correct, the statement does no harm, but is unhelpful for diagnosing problems by being an irrelevant distraction.

session.c 163 - during session creation. This is in a context where session creation has failed. It looks as if session creation could be simpler and safer. At 110 a spinlock is obtained, but the session has only just been created, and nobody knows about it until in line 125 the client DCB is linked to the session. But it isn't clear that the link needs to be made so soon and if it were postponed, the lock would not be required. If the setting of the pointer from DCB to session were left until later, it would not need to be unset at line 163. Personally, I'd prefer line 170 to be return NULL rather than goto return_session.

session.c 204 - still in session creation - same argument as for line 163.

session.c 223 - still in session creation - unclear why the spinlock on line 216 was required if the link from the client DCB had been postponed. Otherwise as above. Line 235 might be the point at which the DCB has to be linked. Should atomic_add be used on line 237?

session.c 355 - is part of the session_unlink_dcb function which may need to set the DCB session link to NULL - but this seems a dangerous operation. Why is it done and is it capable of causing problems?

harness_common.c 739, 768. I've no idea what this code is, and is entirely devoid of comments

tee.c 589 - I've no idea what is going on here.

httpd.c 354 - Not clear why this is needed since the DCB has only been allocated a few lines earlier, and a newly allocated DCB will have session set to NULL, SFAIK.

schemarouter.c 1187, 1188 - Why is the session pointer set to NULL prior to calling dcb_close? It will be set to NULL eventually as a result of the call to dcb_close. Is setting it NULL here necessary? It may be dangerous.

shardrouter.c 1263, 1264 As for schemarouter. Could common code be factored out?

These appear to be all the places in MaxScale where the session pointer in a DCB is set to NULL. It looks as if a number of them could be eliminated. Changing session.c obviously requires care. The session_unlink_dcb function seems to be the area where further investigation is needed, but the setting of DCB session to NULL takes place only in the call from the persistent connections logic. This cannot be the cause of crashes in released code, as persistent connections are not yet in any release.

Comment by martin brampton (Inactive) [ 2015-08-23 ]

I may have misunderstood, but there seems to be something of a catch 22 in relation to setting up a listener. This thought derives from looking at the possibility of insisting that a DCB submitted to poll_add_dcb() should have a non-null session. The reason for this idea is that as soon as the DCB has been submitted to the poll list, it is possible that an event will be processed that refers to the DCB. If it has no session, then the event handling may fail since much code assumes that a DCB has a session. The possibly problematic code is at lines 364-397 of service.c. The first action is to start listening and includes a call to poll_add_dcb(). At this point, the DCB does not have a session link. This is then added by a call to session_alloc(). Once this is complete, the DCB does have a session link, but there is a period when it is in the poll list but has no session. Attempting to turn the code around does not work - calling session_alloc first is not possible because there is a test within this function that requires the DCB to be already listening. If poll_add_dcb has not yet been called, the DCB will not be in the "correct" state and so session_alloc fails and the listener is not created. The ramifications of changing session_alloc are not immediately obvious. I suspect that substantial change is not possible because establishing a listener involves binding to the port, and if there is no relevant entry in the poll list it may be possible to lose data (but this is based on my very limited knowledge of this area).

Comment by martin brampton (Inactive) [ 2015-08-23 ]

I believe that the answer to the question posed in the original description is that in all normal circumstances, a DCB must have a valid session pointer. This is logical - poll events activate DCB processing, and this cannot be achieved without a session. This implies that there is no reason for code to check for dcb->session being NULL.

However, there is probably a borderline flaw in the persistent connections code (even though heavy testing does not produce a crash). Persistent connections may need a mechanism similar to the existing zombie mechanism to be completely robust. There must be other faults of the kind outlined above, since crashes from the field use released code that does not have persistent connections.

Comment by martin brampton (Inactive) [ 2015-08-24 ]

The work needed here seems to be:

  • Remove any statements setting dcb->session to NULL except for the single occasion in dcb_final_free where zombie DCBs are finally (safely) killed. This will guarantee that dcb->session is never set to NULL while a DCB is being processed
  • Move the removal of a DCB from the poll list from the middle stages of dcb_close into dcb_process_victim_queue, at which time we can be sure that there are no outstanding events relating to the DCB and every thread has relinquished any interest in it
  • Alter the persistent connections mechanism so that, instead of taking connections for reuse from dcb_close, they are taken from dcb_process_victim_queue. Again, this is so that we can be sure that all events relating to the DCB have been handled before it becomes a candidate for reuse
  • Put tests into the main poll handling loop so that a DCB is not processed by the modules unless it has a non-null session. If the DCB does have null session, write to the error log and ignore the event. This may cause problems, but will avoid crashes because a DCB will never be introduced into processing without a session. The problems will be noted in the error log and can be investigated.

This set of actions should be sufficient, if fully completed, to prevent crashes caused by dcb->session being null.

Comment by Johan Wikman [ 2015-08-31 ]

What's the status of this one. It seems that of the current crashes/blockers, MXS-200, MXS-198 and MXS-207 might be fixed by this.

Comment by martin brampton (Inactive) [ 2015-09-01 ]

There seems a reasonable probability that fixing these issues will resolve a number of problems, including the ones mentioned above. This cannot be certain since we do not have precise details for most of the reported crashes, and we cannot be sure of having removed infrequent crashes until there has been extensive testing and live running. However, the intention is to harden the system against the particular class of crashes that occur as a result of the session pointer in a DCB being NULL. At worst, the modified system will report errors and skip processing of poll events that violate that condition. If that happens, we will not have a crash, we will have an explicit error message, and some clue to the cause of the problem rather than just an indirect manifestation.

In addition to the work defined above, it was necessary to introduce a dummy session for use by the MySQL backend protocol in the situation where the user has not yet been authenticated. The dummy session is a single static SESSION structure. This work has been done and appears to work but requires further checking and testing.

The other software changes have been implemented and subjected to some testing, but further work will be needed.

Comment by Dipti Joshi (Inactive) [ 2015-09-04 ]

martin brampton Is this fixed now ?

Comment by martin brampton (Inactive) [ 2015-09-05 ]

Probably. Clarifying some of the test behaviours needed a discussion with Massimiliano, who was on holiday until yesterday. Following discussion, further tests brought to light MXS-351 which is being handled within the MXS-329 branch, there being little gain from creating a new branch.

Comment by Johan Wikman [ 2015-11-23 ]

martin brampton I suppose this can be closed now?

Comment by martin brampton (Inactive) [ 2015-11-23 ]

Tests that exposed the faults described in this and the related issues now run correctly.

Comment by yorkershi [ 2015-11-23 ]

Hi, I want to know when the version 1.3.0 will be released.

Generated at Thu Feb 08 03:58:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.