[MDEV-10482] PHP FPM Connection problem Created: 2016-08-02 Updated: 2016-10-24 Resolved: 2016-10-24 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Plugin - unix_socket |
| Affects Version/s: | 10.0.25 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Dallas Clarke | Assignee: | Unassigned |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | need_feedback | ||
| Environment: |
Linux Debian (10.0.25-MariaDB-0+deb8u1 (Debian)) |
||
| Attachments: |
|
| Description |
|
I have a standard LAMP setup with PHP-FPM connecting to the MariaDB server on the local host via a Unix domain socket. For the last few weeks I've been hitting an issue that has been getting steadily worse: for some reason the MariaDB server just stops responding to the PHP processes. I can log onto the MariaDB console and run "SHOW PROCESSLIST", and I can see that all the PHP processes are connected to the database but are all in the sleeping state. All the PHP processes are in a spin lock waiting for results from the database, but the database is not responding. The PHP processes run at 100% CPU while MariaDB sits at about 5% CPU. If I stop the database, the PHP processes stop spinning. The same problem exists whether I use php5 or php7 processes. This state continues until I log into the console and type commands like "SHOW PROCESSLIST;" and "set global general_log_file = OFF;", which for some reason sparks the database into action. Restarting the database does not fix the problem; after a few transactions it locks up again. The database seems to lock and sometimes unlock itself, but blocks all PHP processes in the meantime. There is nothing in any of the log files, and I've tried changing as many of the settings as I thought would help. Nothing has fixed the issue, and I don't know which resource limit to increase. The problem is getting worse as the database has grown and the volume of queries has increased. |
| Comments |
| Comment by Daniel Black [ 2016-08-02 ] |
|
Are you actually using a unix_socket-authenticated user for the PHP-FPM connections? Are you short of RAM? Running with the general log enabled in production doesn't seem like a good idea performance-wise; however, if you tail it, are queries actually going through? Have you enabled/looked at the slow query log? Are the PHP processes starving MariaDB of CPU? The output of SHOW PROCESSLIST, SHOW ENGINE INNODB STATUS, SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES would be useful. Using https://mariadb.com/kb/en/mariadb/debugging-a-running-server-on-linux/ to get a backtrace would also be useful (even if using the packaged version and not a custom-built version). |
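A minimal sketch of the diagnostics asked for above, to be run in the mysql client while a stall is in progress. The LIKE patterns and the 1-second slow-query threshold are illustrative choices, not values from this report:

```sql
-- Run while the stall is happening; \G gives vertical output in the mysql client.
SHOW FULL PROCESSLIST;
SHOW ENGINE INNODB STATUS\G
SHOW GLOBAL STATUS LIKE 'Threads%';
SHOW GLOBAL VARIABLES LIKE 'innodb_%';

-- Enable the slow query log to catch inefficient queries (threshold illustrative).
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
```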
| Comment by Dallas Clarke [ 2016-08-03 ] |
|
The system has 30GB of RAM with about 7GB free at any one time. It has 16 CPU cores and I've limited the FPM processes to 14 to prevent starvation, plus I have dropped the process priority of the FPM processes to 39 while lifting the MariaDB process to 0. It seems to happen as the database has gotten larger and inefficient queries are executed; potentially disk I/O problems as well, as the server is about 18 months old. I had one really bad query that searched a newsletter subscriber table for an email address and, if it wasn't already there, added it. There are about 170K subscribers and no index on the field, and when multiple people did this at the same time it could have been a cause of deadlock. I have since added an index on the email field; the problem still exists, it's just that the database now gets through it in about 20 seconds. The event happens constantly: the FPM processes lock up waiting for the database, but the database is just sleeping. Then about 10-20 seconds later the DB will just jump into gear, run at 100% CPU and serve the FPM processes. Things settle down, then 10-30 seconds later we go through the same process again. |
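The check-then-insert pattern described above can be collapsed into a single atomic statement once a unique index exists, which also removes the race window between the SELECT and the INSERT. The table and column names (subscribers, email) are hypothetical stand-ins for the reporter's schema:

```sql
-- Hypothetical schema: a unique index makes duplicates impossible and the lookup fast.
ALTER TABLE subscribers ADD UNIQUE INDEX idx_email (email);

-- Insert-if-absent in one statement; duplicates are silently skipped.
INSERT IGNORE INTO subscribers (email) VALUES ('user@example.com');
```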
| Comment by Dallas Clarke [ 2016-08-03 ] |
|
Attaching the debugger locked up the entire box. |
| admin@ip-172-31-17-205:~$ sudo gdb -p 39839 /usr/bin/mysql |
| Thread 1 (process 39839): |
| Inferior 1 [process 39839] will be detached. |
| Quit anyway? (y or n) y |
| Comment by Daniel Black [ 2016-08-03 ] |
|
Attaching /tmp/mysql.log is important. Your innodb_buffer_pool_size is 128M, which is far too small; 12G may be more appropriate. It's quite possible that your innodb_log_file_size is also too small if you do lots of writes. |
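In MariaDB 10.0 both of these settings are read at startup, so the changes go in the server config and take effect after a restart. A sketch, using the 12G figure suggested above; the log file size and file path are illustrative assumptions, not recommendations from this thread:

```ini
# /etc/mysql/my.cnf (sketch; path and log size are illustrative)
[mysqld]
innodb_buffer_pool_size = 12G   # suggested above for a 30GB machine
innodb_log_file_size    = 512M  # a larger redo log helps write-heavy workloads
```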
| Comment by Dallas Clarke [ 2016-08-03 ] |
|
The only thing in the mysql.log file was: "Thread 1 (process 39839): Inferior 1 [process 39839] will be detached. Quit anyway? (y or n) Detaching from program: /usr/bin/mysql, process 39839". The entire box stops. I can't run the debugger as there are about 200 concurrent users. |
| Comment by Daniel Black [ 2016-08-03 ] |
|
You can drive gdb non-interactively with scripts. We should probably document that. |
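A sketch of what a scripted, non-interactive gdb invocation might look like; --batch makes gdb run its commands, detach and exit on its own instead of prompting, which avoids the interactive lock-up described above. The use of pidof is an assumption about how the server PID is found:

```
# Sketch: capture backtraces of all mysqld threads without an interactive session.
sudo gdb --batch \
    -ex "set pagination off" \
    -ex "thread apply all bt" \
    -p "$(pidof mysqld)" > /tmp/mysql.log 2>&1
```

Note that the transcript above attached gdb to /usr/bin/mysql, which is the client binary; the backtrace of interest is from the mysqld server process.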
| Comment by Dallas Clarke [ 2016-08-03 ] |
|
Upping the innodb_buffer_pool_size to 12GB has helped, but the issue still exists. The MariaDB process now has twice the memory and is doing far less work, but something is still blocking it. Also, the khugepaged process is being called at the start of an event. |
| Comment by Dallas Clarke [ 2016-08-04 ] |
|
After a dramatic crash of the system last night during a peak period, and after repeatedly restarting Apache, php-fpm and MariaDB, I was out of ideas on how to fix the system and in desperation rebooted it. Like a classic old Windows server, that fixed the problems and the system is now well behaved and responding promptly. Who knew that Linux is becoming more and more like Windows? So it looks like the root cause of the bug was within Linux and not MariaDB, although better error messages and warnings from MariaDB would help system administrators resolve such issues more predictably. Logging warnings about inefficient queries would also help system administrators identify problems before they become problems. Looking at it as a learning experience, I now know most of the configurable options of a LAMP server, and mine can now handle about 50% greater load. |
| Comment by Dallas Clarke [ 2016-08-15 ] |
|
After rebooting, the problem steadily returned and I'm now considering rebooting the server every night. I've also discovered this problem on a second AWS server, where process connections are randomly being blocked; with HHVM, those threads are being blocked for hours. Is there any method of automatically killing a process that has been sleeping for more than 10 seconds? |
| Comment by Elena Stepanova [ 2016-09-25 ] |
|
Probably 10 seconds is far too strict, but you might want to check and adjust the wait_timeout settings. 1) Set up a process which will collect SHOW PROCESSLIST output every few seconds, to see exactly how it progresses; |
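A sketch of the timeout check and adjustment suggested above; the 60-second value is purely illustrative, and SET GLOBAL only affects connections opened after the change:

```sql
-- Inspect the current idle-connection timeouts.
SHOW GLOBAL VARIABLES LIKE '%wait_timeout%';

-- Lower them so idle connections are closed sooner (applies to new connections only).
SET GLOBAL wait_timeout = 60;
SET GLOBAL interactive_timeout = 60;
```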
| Comment by Elena Stepanova [ 2016-10-24 ] |
|
Please comment if you have further information on the issue. |