[MXS-414] Maxscale crashed every day! Created: 2015-10-17  Updated: 2015-11-24  Resolved: 2015-11-24

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 1.2.0, 1.2.1
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: cai sunny Assignee: Johan Wikman
Resolution: Fixed Votes: 1
Labels: None
Environment:

Linux 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


Attachments: Text File michael_error1 (1).log     Text File michael_partial-trace1.log    
Issue Links:
Problem/Incident
is caused by MXS-329 The session pointer in a DCB can be n... Closed
Relates
relates to MXS-207 MaxScale received fatal signal 11 (li... Closed

 Description   

Maxscale crashed every day!

The error message like below:
2015-10-16 22:25:19 [is_read_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:19 [check_create_tmp_table] Error: Master server DCB is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:19 [check_drop_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:29 [is_read_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:29 [check_create_tmp_table] Error: Master server DCB is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:29 [check_drop_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:39 [is_read_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:39 [check_create_tmp_table] Error: Master server DCB is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:39 [check_drop_tmp_table] Error: Master server DBC is NULL. This means that the connection to the master server is already closed while a query is still being routed.
2015-10-16 22:25:49 Fatal: MaxScale 1.2.1 received fatal signal 11. Attempting backtrace.
2015-10-16 22:25:49 Commit ID: 7bfda81b098bfd0db3c306495725d1910294d2e8 System name: Linux Release string: CentOS release 6.6 (Final) Embedded library version: 5.5.42-MariaDB
2015-10-16 22:25:49 /usr/bin/maxscale() [0x523a32]
2015-10-16 22:25:49 /lib64/libpthread.so.0(+0xf790) [0x7f952a3d0790]
2015-10-16 22:25:49 /usr/lib64/maxscale/libreadwritesplit.so(+0x5d16) [0x7f9524183d16]
2015-10-16 22:25:49 /usr/lib64/maxscale/libreadwritesplit.so(+0x6f7a) [0x7f9524184f7a]
2015-10-16 22:25:49 /usr/lib64/maxscale/libreadwritesplit.so(+0x691c) [0x7f952418491c]
2015-10-16 22:25:49 /usr/lib64/maxscale/libMySQLClient.so(+0x7a2c) [0x7f950d34ea2c]
2015-10-16 22:25:49 /usr/lib64/maxscale/libMySQLClient.so(+0x65d6) [0x7f950d34d5d6]
2015-10-16 22:25:49 /usr/bin/maxscale() [0x5394b5]
2015-10-16 22:25:49 /usr/bin/maxscale(poll_waitevents+0x634) [0x538d74]
2015-10-16 22:25:49 /lib64/libpthread.so.0(+0x7a51) [0x7f952a3c8a51]
2015-10-16 22:25:49 /lib64/libc.so.6(clone+0x6d) [0x7f9528d399ad]



 Comments   
Comment by cai sunny [ 2015-10-17 ]

Please help check how to find the root cause.

Comment by markus makela [ 2015-10-19 ]

Is there any specific types of queries you were doing through MaxScale when this happened? Did this crash happen at the same time when the master server went down?

This seems like MXS-207 which was supposedly fixed for 1.2.1 but still seems to occur. We will investigate this and come back as soon as we have some information.

Comment by cai sunny [ 2015-10-19 ]

It happened on different time.
Do you know how to get the log and find out which SQL?

Comment by markus makela [ 2015-10-23 ]

Enabling the trace log with maxadmin enable log trace should show which queries are being routed.

Comment by Michaël de groot [ 2015-11-04 ]

I experienced this as well. Please find the (full, anonimized) error log and (partial, anonimized) trace log attached. Please keep in mind the full and original logs are sent to you in our internal channel as well.

Comment by Johan Wikman [ 2015-11-05 ]

Interpreted stacktrace:

/home/ec2-user/workspace/server/core/gateway.c:362
/lib64/libpthread.so.0(+0xf790) [0x7fe67501f790]
/home/ec2-user/workspace/server/modules/routing/readwritesplit/readwritesplit.c:1704
/home/ec2-user/workspace/server/modules/routing/readwritesplit/readwritesplit.c:2203
/home/ec2-user/workspace/server/modules/routing/readwritesplit/readwritesplit.c:2055
/home/ec2-user/workspace/server/modules/protocol/mysql_client.c:1929
/home/ec2-user/workspace/server/modules/protocol/mysql_client.c:1143
/home/ec2-user/workspace/server/core/poll.c:878
/home/ec2-user/workspace/server/core/poll.c:610
/home/ec2-user/workspace/server/core/gateway.c:1842
/lib64/libc.so.6(__libc_start_main+0xfd) [0x7fe6738bed5d]
/usr/bin/maxscale() [0x5226e9]

Comment by Johan Wikman [ 2015-11-05 ]

The relevant code is:

1695   if(master_dcb == NULL || master_dcb->session == NULL)
1696   {
1697       skygw_log_write(LE,"[%s] Error: Master server DBC is NULL. "
1698 	      "This means that the connection to the master server is already "
1699 	      "closed while a query is still being routed.",__FUNCTION__);
1700       return qtype;
1701   }
1702   CHK_DCB(master_dcb);
1703 
1704   data = (MYSQL_session*)master_dcb->session->data;
1705 
1706   if(data == NULL)
1707   {
1708       skygw_log_write(LE,"[%s] Error: User data in master server DBC is NULL.",__FUNCTION__);
1709       return qtype;
1710   }

The crash occurs on line 1704.

Since it was checked on line 1695 that master_dcb and master_dcb->session are non-NULL, there are basically two options:

  • Either master_dcb->session points at unallocated memory, or
  • master_dcb->session has been set to NULL, by some other thread, after the check on line 1695.

The former would require that not only has the session object pointed to by master_dcb->session been freed, without master_dcb->session having been set to NULL, but also that the memory would have been reclaimed by the operating system. That seems unlikely.

So, the likely reason is that master_dcb->session has been set to NULL by another thread after the check was made on line 1695, but before master_dcb->session->data is accessed on line 1704.

This is the kind of issue that the fix of MXS-329 is expected to deal with.

Comment by Johan Wikman [ 2015-11-24 ]

I will close this now since the expection is that MXS-329 fixes this issue, and we are starting to wrap things up for 1.3.

Generated at Thu Feb 08 03:59:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.