[MDEV-24744] MariaDB unresponsive Created: 2021-01-30  Updated: 2021-05-17  Resolved: 2021-05-17

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.5.8
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Mehmet Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: crash, need_feedback, performance, regression
Environment:

Ubuntu 64 bits. CPU: Intel Xeon E5-1660v3 - 8c/16t - 3GHz /3.5GHz
RAM: 128GB DDR4 ECC 2400 MHz
Disks: SoftRaid 2x4TB


Attachments: File my.cnf    

 Description   

Mariadb stops working after 3-5 weeks, this is happening 4th times after upgrading database version to 10.5.*
and all our application servers and slave replication suspends until we restart database again

Unfortunately there is no log, and database is running but all current connections to database becomes unresponsive but i can create new connection using SSH to database and execute queries

For example i can execute this command and see results

_show processlist;_

but i can not kill process id, or when i execute this command i never get any response

 _show status where `variable_name` = 'Threads_connected';_

When i executed this command

_show open tables WHERE In_use > 0;_

I saw almost all tables had > 1000 in_use

It seems that something is locked and not allowing other queries to be executed
I saw more than 2000 queries in processlist list waiting to be executed

These are the last logs before mariadb becomes unresponsive
The first two logs are usual and always happening

Jan 30 15:15:45 ns535551 mariadbd[19731]: 2021-01-30 15:15:45 3221 [ERROR] mariadbd: Can't find record in 'MessageContent'
Jan 30 15:15:45 ns535551 mariadbd[19731]: 2021-01-30 15:15:45 3221 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table angel.MessageContent; Can't find record in 'MessageContent', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.065897, end_log_pos 52767840, Gtid 0-1-4286777278, Internal MariaDB error code: 1032

Jan 30 15:20:37 ns535551 mariadbd[19731]: 2021-01-30 15:20:37 322378 [Note] InnoDB: Number of pools: 2
Jan 30 16:07:35 ns535551 mariadbd[19731]: 2021-01-30 16:07:35 323399 [Note] InnoDB: Number of pools: 3_

and these are the first 2 queries which becomes unresponsive

select  pe.id as id,ac.name as name, pe.birthDate as birthDate,pe.sex as sex,'*' as aboutMe, '*' as aboutOpponent, pe.status as status,'*' as profileHeader,  pe.loginDate as loginDate,(select count(1) from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photoCount, 0 as videoCount,  pe.lat as lat,pe.crDate as crDate,  pe.height as height,  pe.weight as weight, (select CONVERT(group_concat(fd.id) USING 'utf8') from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photos, 0 as privatePhotoCount, pe.userAgent as userAgent,pe.lng as lng, 0 as travelOfferCount,0 as travelSeekerCount, pe.profilePhotoId as profilePhotoId,  ac.membership as membership,ac.verified as verified,  pe.currCity as currCity,pe.currCountry as currCountry,  pe.avatarId as avatarId,  100*st_distance(location,  POINT(41.1144 ,28.9967)) as cdistance , pe.distance as distance, ac.lang as lang,'*' as email,'*' as password,'*' as username, pe.currCountryCode as currCountryCode,  pe.autoLocation as autoLocation,  pe.ipAddress,  ac.phoneVerified,  ac.photoVerified, '*' as score, '*' as gift,  pe.hideMyAge,  pe.region, '*' as interests, '*' as activities, '*' as music, '*' as tvShows, '*' as movies, '*' as books, (select amount*unix_timestamp(startDate) from CoinUsage use index (coinusage_profileidservicestatus_index) where profileId=pe.id and  service='TS' and status='A' and endDate > now() limit 1) as topServiceValue  from Person as pe   use index for order by (person_statusphotologindate_index)  left join Account ac on ac.id=pe.id  where ac.membership not in  ('R','X','D','E')  and ac.id not in (select profileId from ProfileValidation)  and pe.sex='F'  and ac.id  not in (select secondPersonId from Friend where relationType=2 and firstPersonId=4607996  and secondPersonId is not null) and st_contains(ST_GEOMFROMTEXT( 'Polygon((42.01372160591873 30.190387034446076, 42.01372160591873 27.803012965553926,40.215078394081274 27.803012965553926, 40.215078394081274 30.190387034446076, 42.01372160591873 30.190387034446076))'),pe.location) order by pe.status desc, pe.hasPhoto desc,pe.loginDate desc  limit 50

select  pe.id as id,ac.name as name, pe.birthDate as birthDate,pe.sex as sex,'*' as aboutMe, '*' as aboutOpponent, pe.status as status,'*' as profileHeader,  pe.loginDate as loginDate,(select count(1) from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photoCount, 0 as videoCount,  pe.lat as lat,pe.crDate as crDate,  pe.height as height,  pe.weight as weight, (select CONVERT(group_concat(fd.id) USING 'utf8') from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photos, 0 as privatePhotoCount, pe.userAgent as userAgent,pe.lng as lng, 0 as travelOfferCount,0 as travelSeekerCount, pe.profilePhotoId as profilePhotoId,  ac.membership as membership,ac.verified as verified,  pe.currCity as currCity,pe.currCountry as currCountry,  pe.avatarId as avatarId,  100*st_distance(location,  POINT(36.75 ,3.05)) as cdistance , pe.distance as distance, ac.lang as lang,'*' as email,'*' as password,'*' as username, pe.currCountryCode as currCountryCode,  pe.autoLocation as autoLocation,  pe.ipAddress,  ac.phoneVerified,  ac.photoVerified, '*' as score, '*' as gift,  pe.hideMyAge,  pe.region, '*' as interests, '*' as activities, '*' as music, '*' as tvShows, '*' as movies, '*' as books, (select amount*unix_timestamp(startDate) from CoinUsage use index (coinusage_profileidservicestatus_index) where profileId=pe.id and  service='TS' and status='A' and endDate > now() limit 1) as topServiceValue  from Person as pe   use index for order by (person_statusphotologindate_index)  left join Account ac on ac.id=pe.id  where (ac.username = '%' or ac.name like '%')   and ac.membership not in ('X','D','E') order by pe.status desc, pe.hasPhoto desc,pe.loginDate desc  limit 50

I just had an idea about the problem, as you pay attention to following part at my last query

(ac.username = '%' or ac.name like '%') 

Database was unresponsive 4 times before and we always had this part at the last queries
since, query parameter is empty, all Account table data with other joins are loaded into memory, it is > 3 million rows and > 50 rows, I guess as a result of this, buffers are overloaded and system stops responding to other requests

We will fix query but even if we have this case, system must be responsive because we reorder results list and fetch only 50 rows from it.



 Comments   
Comment by Daniel Black [ 2021-04-14 ]

Could it be that innodb_open_files is limiting your number of open tables? Note the default is to autosize to signifcantly larger. maybe its thrashing on this.

If it happens again `show engine innodb status` may be helpful.

Having Debuginfo packages installed ahead of next time (https://downloads.mariadb.org/mariadb/repositories/#distro=Ubuntu&distro_release=focal--ubuntu_focal&mirror=tripleit&version=10.5)

And scripting a backtrace full on all threads may also provide enough useful information to see what is happening:

https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-backtraces-from-a-running-mysqld-process-with-gdb-on-linux

Comment by Mehmet [ 2021-04-14 ]

we have fixed query and we are not having this issue again
but mariadb is stopping and becoming unresponsive with this query using default configuration

select  pe.id as id,ac.name as name, pe.birthDate as birthDate,pe.sex as sex,'*' as aboutMe, '*' as aboutOpponent, pe.status as status,'*' as profileHeader,  pe.loginDate as loginDate,(select count(1) from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photoCount, 0 as videoCount,  pe.lat as lat,pe.crDate as crDate,  pe.height as height,  pe.weight as weight, (select CONVERT(group_concat(fd.id) USING 'utf8') from FileData fd where fd.personId=pe.id and fd.status='A' and  (fd.indicator='P' or fd.indicator is null) and (fd.albumId is null or fd.albumId=0))  as photos, 0 as privatePhotoCount, pe.userAgent as userAgent,pe.lng as lng, 0 as travelOfferCount,0 as travelSeekerCount, pe.profilePhotoId as profilePhotoId,  ac.membership as membership,ac.verified as verified,  pe.currCity as currCity,pe.currCountry as currCountry,  pe.avatarId as avatarId,  100*st_distance(location,  POINT(41.1144 ,28.9967)) as cdistance , pe.distance as distance, ac.lang as lang,'*' as email,'*' as password,'*' as username, pe.currCountryCode as currCountryCode,  pe.autoLocation as autoLocation,  pe.ipAddress,  ac.phoneVerified,  ac.photoVerified, '*' as score, '*' as gift,  pe.hideMyAge,  pe.region, '*' as interests, '*' as activities, '*' as music, '*' as tvShows, '*' as movies, '*' as books, (select amount*unix_timestamp(startDate) from CoinUsage use index (coinusage_profileidservicestatus_index) where profileId=pe.id and  service='TS' and status='A' and endDate > now() limit 1) as topServiceValue  from Person as pe   use index for order by (person_statusphotologindate_index)  left join Account ac on ac.id=pe.id  where ac.membership not in  ('R','X','D','E')  and ac.id not in (select profileId from ProfileValidation)  and pe.sex='F'  and ac.id  not in (select secondPersonId from Friend where relationType=2 and firstPersonId=4607996  and secondPersonId is not null) and st_contains(ST_GEOMFROMTEXT( 'Polygon((42.01372160591873 30.190387034446076, 42.01372160591873 27.803012965553926,40.215078394081274 27.803012965553926, 40.215078394081274 30.190387034446076, 42.01372160591873 30.190387034446076))'),pe.location) order by pe.status desc, pe.hasPhoto desc,pe.loginDate desc  limit 50

we have 4 million rows on Account table, when query fetches all records, database is becoming unresponsive
I have already attached our my.cnf file, you can see our configuration
It will be very helpful to reproduce this case on test environment

Comment by Daniel Black [ 2021-04-14 ]

> I have already attached our my.cnf file, you can see our configuration

I saw. I looked. I offered a suggestion. And as an additional suggestion, look at join buffer settings, and the type of query plan used (as show in EXPLAIN query), and look at the documentation https://mariadb.com/kb/en/query-optimizations/ on what system variables can assist/guide that query.

Also included was a request for information.

> It will be very helpful to reproduce this case on test environment

So what's stopping you?

I also looked at your query. Its very long, hard to read, and without table information (SHOW CREATE TABLE), or EXPLAIN query, hard to do anything with. If your remove the USE INDEX directives, particularly on Person, does that help? Can you start with a simpler query and work up to the point of difficulty?

Comment by Daniel Black [ 2021-04-14 ]

You mention regression, what was the previous version? If you use your current data in that test environment, what is the EXPLAIN query for the same query that was ok on the previous version?

Generated at Thu Feb 08 09:32:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.