[MDEV-17527] Galera cluster "slowly deteriorating" (statistics?) after latest upgrade to Ubuntu16.04(only patches)/mariadb 10.2.18 (from 10.2.9) Created: 2018-10-23  Updated: 2019-03-13  Resolved: 2019-03-13

Status: Closed
Project: MariaDB Server
Component/s: Configuration, Galera, Optimizer
Affects Version/s: 10.2.18
Fix Version/s: 10.2.23, 10.3.14, 10.4.4

Type: Bug Priority: Major
Reporter: brianr Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: galera
Environment:

Ubuntu 16.04
MariaDB 10.2.18


Attachments: Text File jira.mariadb.org_mysql_table_statistics_MDEV-17527.txt     Text File jira.mariadb.org_ticket_MDEV-17527.txt     Text File jira.mariadb.org_ticket_MDEV-17527_explain2.txt    

 Description   

I have a 3-node Galera cluster DB-4 , DB-5 and DB-6
It worked fine up until the recent upgrade on Oct 05 (apt update / apt upgrade, which all went fine with no errors).
Rebooted servers, checked that cluster formed again with 3 nodes. All fine.
Few days later, users started reporting their sessions hung.
We searched high and low for clues, and ended up shutting down all but the only node still able to respond to a certain query (a SELECT from 3-4 tables). That query still returns in 0.00 secs from DB-4.
So, only DB-4 is running now.
Then, when we wanted to restart DB-5/DB-6, they failed miserably during SST, no matter whether it was rsync or the newer mariabackup I recently installed, and nobody could find the reason. We checked network/AppArmor and all kinds of possibilities. We found out yesterday that it's because the timeout for systemd was set to 1m30s, despite me setting TimeoutSec=0 in galera.cnf. Whichever - I've now set it to "infinity" and the remaining two nodes were able to join the cluster again using SST (mariabackup).
(I think it's because all the recent smart changes to TimeoutSec for mariadb/galera doing SST has NOT gone into the systemd version 229 we still use on this 16.04 ??)
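For reference, the usual place to override that timeout is a systemd drop-in rather than galera.cnf, since systemd never reads MariaDB config files. A sketch, with the drop-in path assumed from the standard Ubuntu 16.04 layout:

```ini
# /etc/systemd/system/mariadb.service.d/timeout.conf
# Assumed path; normally created via "systemctl edit mariadb.service".
# TimeoutStartSec=0 disables the start timeout on older systemd
# versions (such as 229) that may not accept "infinity".
[Service]
TimeoutStartSec=0
```

This needs a `systemctl daemon-reload` before the next restart for systemd to pick it up.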

Yesterday, right after the SST finished successfully, I ran the troubled query as I did after the upgrade. All three nodes responded immediately like DB-4 used to do, so after running it multiple times over some time, I decided to head on home with a smile on my face

When I got in this morning, I saw that DB-5 and DB-6 are now again back to the spot where they cannot answer the query. At first they took so long that I hit CTRL+C on it.
Now, after some hours, they CAN respond, but instead of 0.00 secs, they take ~1.3 secs.

I can see the EXPLAIN for that same query is identical on the slow nodes, but different on the working DB-4 node.
Over time, it (explainplan) seems to have changed on DB-5/DB-6 so it is now "almost" the same as on DB-4, but in another sequence !?

Now my questions, in all my bewilderment, are:

Can anyone suggest/clarify what I might have done to my cluster during the upgrade (upgrade Ubuntu with newest patches to 16.04, and MariaDB upgraded from 10.2.9 -> 10.2.18)
I would say it went OK, but we did have some "you need to run mysql_upgrade" in our syslog. It was related to some system table having a column "too short" as I recall.
The mysql_upgrade was run on all three servers with no problems.

Is there anything we need to do to make sure that statistics are the same - as good on DB-5/DB-6 as on DB-4 - so the poor optimizer makes the same decisions on all three servers?

Can anyone spot where I dropped the ball, please?

Anything useful to check, to see if something inside Galera is actually not working as it should? The logfiles (syslog) indicate nothing even closely resembling any problems.



 Comments   
Comment by brianr [ 2018-10-24 ]

I have attached a file, showing more details on why I believe there's a 'yuge' issue here.

I might have set a conf param wrong, so statistics go bad after successful SST in this Galera setup, or somehow something happened (during upgrade?) to cause this situation, where Galera renders itself utterly useless for us.

I hope someone can shed some light on this, because I really liked Galera before Oct 5 2018 !

Comment by brianr [ 2018-10-24 ]

Oh , snap!

The recently attached file was created yesterday just before end of work.
Today it's messed itself up again, so the same query doesn't return results before I give up on it (+15secs)

I have attached new file "jira.mariadb.org_ticket_MDEV-17527_explain2.txt" with new explainplans, to show what it now thinks data looks like on the DB-5/DB-6 Galera servers. DB-4 still responds immediately like always.

What on earth is going on under the hood? ...

PS: ClusterControl claims there's nothing wrong with Galera, and syslog contains no signs of problems,
And each node has the same row count (SELECT COUNT(*) FROM ...) in each of the tables in the query - but 67% of the 3 servers apparently now have an optimizer on drugs or something !?

Comment by brianr [ 2018-10-24 ]

jira.mariadb.org_mysql_table_statistics_MDEV-17527.txt

Ok, there must simply be something I do not understand about Mysql/MariaDB statistics.

Please see attached, where I just found another place to check Statistics.
Almost all tables in the query are seen as empty according to statistics - on the two troubled nodes 5/6.

However, node DB-6 now suddenly responds immediately, like DB-4 always has done and still does. But DB-5 still waits 1.2 secs before responding.
I assume it's because DB-6 has loaded data to buffer ... but of course I don't know for sure.

Is there something I have forgotten in relation to Statistics in a Galera cluster, when upgrading from 10.2.9 to 10.2.18 ?

I am baffled ...

Comment by brianr [ 2018-10-25 ]

And for today's update, I can tell that I tested the usual query on all three nodes.
DB-4 responds promptly as always.
Both DB-5 and DB-6 did not respond at all; they just hung on this usual query test.

I then cancelled the queries on both 5 & 6.

Then did ANALYZE TABLE x on all 4 tables involved in the query - on only DB-6.
Then DB-6 responds immediately.

The weirdest thing is that right after this, also DB-5 responds immediately. This is more than weird, I think.

DB-4 continues to respond as it should and always has - promptly.
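For reference, the manual refresh described above looks like this (schema and table names are placeholders, not the reporter's actual ones):

```sql
-- Placeholder names; substitute the four tables from the slow query.
ANALYZE TABLE db1.t1, db1.t2, db2.t3, db2.t4;

-- Check what the optimizer now believes about row counts:
SELECT table_name, table_rows
  FROM information_schema.TABLES
 WHERE table_schema IN ('db1', 'db2');
```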

How on earth could optimizer/stats get so untrustworthy in this MariaDB/Galera setup ?
How can I get back to trusting this previously fine Galera/MariaDB cluster?

Someone, please...

Comment by brianr [ 2018-10-29 ]

Just a quick update: I ran ANALYZE TABLE on the 4 tables again on Friday.

Today, back at work, a check on those four tables shows that the stats are once more gone (apparently only in one of the two databases involved in the query) !?
0 rows in each of the tables, and when I submit ANALYZE TABLE ... it correctly updates the info with the number of rows in each table. And the test query again returns an answer immediately.

But apparently only until it "somehow" loses that knowledge again. I think the optimizer needs a new partner, to do its work here ... statistics should be more trustworthy than this I hope?

Do you know of any setting that would make statistics go bad like this ?

  • I can't find any explanation...
Comment by brianr [ 2018-10-31 ]

OK,
Just got some details on how clients are reloading these tables in the database that's showing wrong stats on its tables (involved in the troubled query).

Nightly, they just drop the tables and recreate them empty (that would explain why the nodes not running the dataload end up showing 0 rows in this database's tables)
Then their script unpacks and prepares an export (sqlfile) from another datasource, and this sqlfile is then loaded with a simple "mysql user/password < sqlfile"

Could this, in your view, somehow end up as we see it: that only the node handling the dataload is able to see/use the true statistics, while the other two nodes think the tables in this reloaded database have no rows?

I would have sworn that "mysql < sqlfile" would and should mean that Galera ensures the other nodes are equally updated on statistics when the load is done?
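As a stop-gap (not a fix for the underlying behaviour), the nightly load script could end with an explicit statistics refresh; the table names here are hypothetical:

```sql
-- Appended to the end of the nightly sqlfile, after the bulk INSERTs.
-- Recalculates persistent statistics on the node running the import.
ANALYZE TABLE t1, t2, t3, t4;
```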

Comment by brianr [ 2018-11-02 ]

To elaborate on the previous rant :

If I believe we have set all params to auto update statistics, can you please explain where I am assuming wrong when I claim that the following should result in continuously updated statistics on tables? (https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_stats_auto_recalc is set to ON )
(Stats are not updated - the statistics are wrong after the drop/reload of tables in one of the two databases.)

We have two databases in this specific SELECT. One of the databases is stuffed with new data nightly.
This worked flawlessly on previous 10.2.9, but now it acts weird in 10.2.18 (new optimizer algorithms?)

Steps:
Nightly, some tables are dropped/recreated in one of the two databases. (Statistics seem to spot this and update on the secondary nodes to "rows=0".)
Simultaneously, data is exported from remote database and copied to this server, where it's imported with a simple "mysql [login info] < sqlfile"
This results in the (SELECT) query only working on the Galera node that's running the import.
The other two nodes will read from table statistics that the tables contain 0 rows (as if stats are only updated on CREATE TABLE?)

But according to this info : https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_stats_auto_recalc , Statistics should also be updated if "+10%" data is changed. I'd say we reach +10% pretty fast, going from 0 rows to 100% with the "mysql [login info] < sqlfile"
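To verify which statistics machinery is actually in play, the relevant settings and the persisted counts can be inspected directly; the schema name below is a placeholder:

```sql
-- Confirm persistent statistics and auto-recalc are enabled:
SHOW GLOBAL VARIABLES LIKE 'innodb_stats%';

-- Persistent stats live in mysql.innodb_table_stats; n_rows is what
-- the optimizer reads, and last_update shows when it was refreshed.
SELECT database_name, table_name, n_rows, last_update
  FROM mysql.innodb_table_stats
 WHERE database_name = 'db1';  -- placeholder schema name
```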

If I am assuming wrong here, please let me know - it might very well be that I have overlooked the note, that "mysql < sqlfile" bypasses any statistics update logic, and should be followed by individual manual "optimize/analyze table" ... ?? ( But I hope you agree it would make less than sense if things work that way )
It could also be that MariaDB has worked this way always, and that Optimizer just happened to choose different strategies in 10.2.9 which made it work fine - but that the wrong statistics and possible new optimizer attitude in 10.2.18 is now painfully showing that statistics need to be manually updated after each dataload?

Hope you got this far, and have time to respond with some advice...?

Comment by Jan Lindström (Inactive) [ 2019-03-13 ]

Fixed on https://jira.mariadb.org/browse/MDEV-18577

Generated at Thu Feb 08 08:37:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.