[MDEV-10287] mariadb server crashes for no apparent reason Created: 2016-06-25 Updated: 2020-10-01 Resolved: 2020-10-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - Connect |
| Affects Version/s: | 10.1.14 |
| Fix Version/s: | 10.1.32 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Robert Dyas | Assignee: | Anel Husakovic |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7 on Google Compute Engine |
| Attachments: |
|
| Description |
|
The server now crashes frequently (once or twice per day) for no apparent reason. A previous crash report appeared to be linked to CONNECT engine or unixODBC issues, but I can't tell what caused this one.
|
| Comments |
| Comment by Elena Stepanova [ 2016-06-25 ] |
|
Could you please provide a larger portion of the error log, preferably from server startup through the end of the crash report? Thanks.
| Comment by Robert Dyas [ 2016-06-26 ] |
|
Results of SHOW PLUGINS below:
| Comment by Robert Dyas [ 2016-06-26 ] |
|
The log file from the last week or so is attached. It shows several crashes, most with different backtraces. When we look at the SQL statements preceding each crash (I can't share those logs because they contain customer information), they are not similar.
| Comment by Robert Dyas [ 2016-06-26 ] |
|
My.cnf file attached.
| Comment by Robert Dyas [ 2016-06-28 ] |
|
Some additional information from the last two days of testing while trying to track this down: when we modified our app to use only local tables (no CONNECT tables) and made no other changes, the server stayed up under load for 48 hours and counting. Before, it would go down after 6 to 14 hours at most. So the root cause appears to be in the CONNECT engine, unixODBC, or the Simba Salesforce ODBC driver itself. How does one track down the root cause in this situation?
| Comment by Elena Stepanova [ 2016-06-29 ] |
|
Since you mentioned earlier that you have an SQL log (which you cannot share), it might be useful to check what the allegedly guilty connection was doing before the crash. The crash did not happen during query execution, but during certain plugin-related actions that a connection performs when it exits. You can check your SQL log for the activity the connection performed during its lifetime. It must have done something related to the CONNECT engine that triggered a delayed problem, which surfaced when the connection was being closed. If the connection did not do many different things, this might help pinpoint the point of failure in CONNECT (or perhaps not in CONNECT specifically, but in plugins in general). We have similar reports, but no known root cause so far.
| Comment by Robert Dyas [ 2016-06-29 ] |
|
RE "It must have done something related to the CONNECT engine that triggered a delayed problem which popped up when the connection was being closed." I think you are on the right track. During its lifetime the connection performs a large number of queries, many of which join a CONNECT table with a regular (InnoDB) table. This seems to work fine. The statements immediately preceding the crash are often different (not necessarily involving the same tables). Note that we are using unixODBC pooled connections, so could the crash happen when those connections are cleaned up? Can you find out whether the people with similar problems are using pooled connections?
| Comment by Elena Stepanova [ 2016-07-01 ] |
|
Can you check in your logs whether the failed connection was killed by another thread (by a KILL <connection_id> command), or exited on its own (a <connection_id> Quit record is present in the general log)?
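As a rough aid for that check, the sketch below scans a general log file, collects every command issued by one connection id, and reports whether the connection ended with its own Quit record or was the target of a KILL issued by another thread. It assumes the classic MariaDB 10.1 file format (optional `YYMMDD HH:MM:SS` timestamp, thread id, command, tab, argument); the function names are hypothetical, and thread ids are reused after a restart, so it should be fed only the portion of the log since the last server startup.

```python
import re

# Assumed shape of a MariaDB 10.1 general-log *file* record (log_output=FILE):
#   [YYMMDD HH:MM:SS]<whitespace><thread_id> <Command>\t<argument>
# Continuation lines of multi-line queries do not match and are skipped.
RECORD_RE = re.compile(
    r"^(?:\d{6}\s+\d{1,2}:\d{2}:\d{2})?\s+(\d+)\s+"
    r"([A-Za-z]+(?: [A-Za-z]+)?)(?:\t(.*))?$"
)

# KILL [HARD|SOFT] [CONNECTION] <id>; KILL QUERY only aborts a statement,
# so it is deliberately not matched here.
KILL_RE = re.compile(
    r"^\s*KILL\s+(?:(?:HARD|SOFT)\s+)?(?:CONNECTION\s+)?(\d+)\s*;?\s*$",
    re.IGNORECASE,
)


def parse_general_log(lines):
    """Yield (thread_id, command, argument) for each recognizable record."""
    for line in lines:
        match = RECORD_RE.match(line.rstrip("\r\n"))
        if match:
            argument = (match.group(3) or "").strip()
            yield int(match.group(1)), match.group(2), argument


def connection_report(lines, conn_id):
    """Return (commands, ending) for one connection id.

    commands: list of (command, argument) issued by that connection.
    ending:   "quit", "killed by thread <id>", or "unknown".
    """
    commands = []
    quit_seen = False
    killed_by = None
    for thread_id, command, argument in parse_general_log(lines):
        if thread_id == conn_id:
            commands.append((command, argument))
            if command == "Quit":
                quit_seen = True
        elif command == "Query":
            kill = KILL_RE.match(argument)
            if kill and int(kill.group(1)) == conn_id:
                killed_by = thread_id
    if quit_seen:
        ending = "quit"
    elif killed_by is not None:
        ending = f"killed by thread {killed_by}"
    else:
        ending = "unknown"
    return commands, ending
```

Usage would be along the lines of `with open(path, errors="replace") as fh: connection_report(fh, 12345)`, using the connection id printed in the crash report. An "unknown" ending means neither record was found, which is itself informative if the log covers the whole lifetime of the connection.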
| Comment by Robert Dyas [ 2016-07-02 ] |
|
Elena, I don't think I have that log enabled; the queries are logged by our own software on Google Cloud with some additional context, not via the built-in MariaDB logging. Where would the general log file be by default? If it isn't enabled, what is the best way to configure it in my.cnf so that in the future we capture whatever information would be most helpful to you?
| Comment by Elena Stepanova [ 2016-07-03 ] |
|
The question about the connection quitting vs. being killed might still apply to your custom logging. If it logs queries, it will log the KILL query as well, and it quite probably also records connects and disconnects. If not, I think the best way to enable the general log temporarily until the first crash is to run SET GLOBAL general_log=1. This avoids a server restart, and after the server crashes and restarts, the general log will be disabled again automatically, which is probably what you want. Note, however, that enabling the general log can cost some performance on a busy server; so if you have high traffic and performance is critical, you may not want to do it (at least not yet).
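For completeness, here is a minimal sketch of that runtime switch done from application code. It assumes only a generic DB-API 2.0 cursor from whatever MariaDB/MySQL driver the application already uses (no specific driver is assumed), an account with the SUPER privilege, and hypothetical helper names. Reading back `general_log_file` answers the "where is it by default" question: it defaults to `<host_name>.log` in the data directory, and `log_output` must include FILE for it to be written there. To make the setting permanent instead, the same `general_log` and `general_log_file` options can go in the `[mysqld]` section of my.cnf.

```python
def enable_general_log(cursor):
    """Switch the general query log on at runtime (requires SUPER).

    `cursor` is any DB-API 2.0 cursor connected to the server; no specific
    driver is assumed. Because SET GLOBAL is not persisted, the setting is
    lost on the next restart, including a restart after a crash.
    Returns (general_log_file, log_output) so the caller knows where to look.
    """
    cursor.execute("SET GLOBAL general_log = 1")
    cursor.execute("SELECT @@global.general_log_file, @@global.log_output")
    log_file, log_output = cursor.fetchone()
    return log_file, log_output


def disable_general_log(cursor):
    """Switch the general query log off again without a restart."""
    cursor.execute("SET GLOBAL general_log = 0")
```

Keeping the switch at runtime rather than in my.cnf matches the suggestion above: the log stays on only until the next crash, which limits the performance cost on a busy server.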
| Comment by Robert Dyas [ 2016-07-25 ] |
|
So I have some more information that might help in finding the problem. If we leave everything the same but turn off connection pooling in unixODBC, we have (so far) had zero problems. Also, slightly off topic: is the new JDBC CONNECT table type ready to test in 10.1.16? And if so, does it catch exceptions thrown by the driver and just report them back as normal errors? The reason I ask is that, if so, this would eliminate the possibility that a driver bug brings down the entire database server in the future.
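Since turning pooling off is the change that made the difference, it may help to verify which setting a given host actually has. In unixODBC, pooling is controlled from odbcinst.ini (the `Pooling` key in the `[ODBC]` section), and `odbcinst -j` prints which odbcinst.ini is in use. The sketch below assumes, as in the unixODBC documentation's examples, that pooling is switched on only by `Pooling = Yes`; the driver section name and library path in the example file are placeholders, not the Simba driver's real values.

```python
import configparser

# Hypothetical odbcinst.ini with unixODBC pooling switched off.
# The driver section name and library path are placeholders only.
EXAMPLE_ODBCINST = """\
[ODBC]
Pooling = No

[SalesforceDriver]
Driver = /path/to/salesforce/odbc/driver.so
"""


def pooling_enabled(ini_text):
    """Return True if the [ODBC] section turns unixODBC pooling on.

    Assumes pooling is enabled only by `Pooling = Yes` (case-insensitive).
    """
    # interpolation=None: values may contain '%'; strict=False: tolerate
    # duplicate sections or keys instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    for section in parser.sections():
        if section.upper() == "ODBC":
            # Option names are lower-cased by configparser, so "Pooling" and
            # "pooling" are both found here.
            value = parser.get(section, "pooling", fallback="No")
            return value.strip().lower() == "yes"
    return False
```

For example, `pooling_enabled(open("/etc/odbcinst.ini").read())` checks the common system-wide location; a missing `[ODBC]` section or `Pooling` key is treated as pooling off.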
| Comment by Elena Stepanova [ 2016-07-25 ] |
|
rdyas, thanks for the update. Taking the new information into account, I'm assigning this to bertrandop for further investigation. bertrandop should also be able to answer the question about the new CONNECT table type.
| Comment by Olivier Bertrand [ 2016-07-25 ] |
|
Yes, the JDBC table type should be available in version 10.1.16. Unfortunately, the version included in MariaDB 10.1.15 is outdated (and perhaps buggy). Yes, all exceptions thrown by the drivers are handled in the Java wrappers (which interface via JNI between MariaDB and the JDBC drivers). The error messages are retrieved by CONNECT as normal errors and should not crash the server. By the way, I don't see anything involving CONNECT in the above crash report. So what makes you think it is due to ODBC tables? It would be interesting to know whether such crashes have been observed on other machines.
| Comment by Robert Dyas [ 2016-10-13 ] |
|
This appears to be the same issue as … The root cause appears to be related to ODBC connection pooling: since connection pooling was turned off, the problem has been completely gone, with zero crashes in production for several months. It also appears to be related to closing the ODBC connection (or to an attempt to close it that fails). |