[MDEV-17527] Galera cluster "slowly deteriorating" (statistics?) after latest upgrade to Ubuntu16.04(only patches)/mariadb 10.2.18 (from 10.2.9) Created: 2018-10-23 Updated: 2019-03-13 Resolved: 2019-03-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Configuration, Galera, Optimizer |
| Affects Version/s: | 10.2.18 |
| Fix Version/s: | 10.2.23, 10.3.14, 10.4.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | brianr | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | galera | ||
| Environment: |
Ubuntu 16.04 |
||
| Attachments: |
|
| Description |
|
I have a 3-node Galera cluster DB-4 , DB-5 and DB-6 Yesterday, right after the SST finished successfully, I ran the troubled query as I did after the upgrade. All three nodes responded immediately like DB-4 used to do, so after running it multiple times over some time, I decided to head on home with a smile on my face When I got in, this morning, I saw that DB-5 and DB-6 is now again back to the spot where they cannot answer the query. At first they took so long that I CTRL+C on it. I can see the EXPLAIN for that same query is identical on the slow nodes, but different on the working DB-4 node. Now my questions, in all my bewilderment is: Can anyone suggest/clarify what I might have done to my cluster during the upgrade (upgrade Ubuntu with newest patches to 16.04, and MariaDB upgraded from 10.2.9 -> 10.2.18) Is there anything we need to do, to make sure that statistics be the same, as good on 5/6 as on DB-4 so the poor optimizer makes the same desicions on all three servers? Can anyone spot where I dropped the ball, please Anything useful to check ?, to see if something inside Galera is actually not working as it should ... logfiles (syslog) indicate nothing even closely resembling any problems. |
| Comments |
| Comment by brianr [ 2018-10-24 ] |
|
I have attached a file, showing more details on why I believe there's a 'yuge' issue here. I might have set a conf param wrong, so statistics go bad after successful SST in this Galera setup, or somehow something happened (during upgrade?) to cause this situation, where Galera renders itself utterly useless for us. I hope someone can shed some light on this, because I really liked Galera before Oct 5 2018 |
| Comment by brianr [ 2018-10-24 ] |
|
Oh , snap! The recently attached file was created yesterday just before end of work. I have attached new file "jira.mariadb.org_ticket_ What on earth is going on under the hood? ... PS: ClusterControl claims there's nothing wrong with Galera, and syslog contains no signs of problems, |
| Comment by brianr [ 2018-10-24 ] |
|
jira.mariadb.org_mysql_table_statistics_MDEV-17527.txt Ok, there must simply be something I do not understand about Mysql/MariaDB statistics. Please see attached, where I just found another place to check Statistics. However, node DB-6 now suddenly responds immediately, like DB-4 always have done and do. But DB-5 still waits 1.2 secs before responding. Is there something I have forgotten in relation to Statistics in a Galera cluster, when upgrading from 10.2.9 to 10.2.18 ? I am baffled ... |
| Comment by brianr [ 2018-10-25 ] |
|
And for todays update, I can tell that I tested usual query on all three nodes. I then cancelled the queries on both 5 & 6. Then did ANALYZE TABLE x on all 4 tables involved in the query - on only DB-6. The weirdest thing is that right after this, also DB-5 responds immediately. This is more than weird, I think. DB-4 continues to respond as it should and always have - promptly. How on earth could optimizer/stats get so untrustworthy in this MariaDB/Galera setup ? Someone, please... |
| Comment by brianr [ 2018-10-29 ] |
|
Just a quick update, I ran ANALYZE TABLE on the 4 tables friday again. Today, back to work, a check on those four table shows that the stats are once more gone (apparently only in the one of the two databases involved in the query) !? But apparently only until it "somehow" loses that knowledge again. I think the optimizer needs a new partner, to do its work here ... statistics should be more trustworthy than this I hope? Do you know of any setting that would make statistics go bad like this ?
|
| Comment by brianr [ 2018-10-31 ] |
|
OK, Nightly, they just drop the tables and recreate them empty (that would explain why the nodes not running the dataload end up showing 0 rows in this database's tables) Could this, in your view, somehow end up as we see it: that only the node handling the dataload is able to see/use the true statistics, while the other two nodes think tables in this reloaded database has no rows? I would have sworn that the "mysql < sqlfile" would and should mean that Galera ensures that the other nodes are equally as updated on statis when the load is done? |
| Comment by brianr [ 2018-11-02 ] |
|
To elaborate on the previous rant If I believe we have set all params to auto update statistics, can you please explain where I am assuming wrong when I claim that the following should result in continuously updated statistics on tables? (https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_stats_auto_recalc is set to ON ) We have two databases in this specific SELECT. One of the databases is stuffed with new data nightly. Steps: But according to this info : https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_stats_auto_recalc , Statistics should also be updated if "+10%" data is changed. I'd say we reach +10% pretty fast, going from 0 rows to 100% with the "mysql [login info] < sqlfile" If I am assuming wrong here, please let me know - it might very well be that I have overlooked the note, that "mysql < sqlfile" bypasses any statistics update logic, and should be followed by individual manual "optimize/analyze table" ... ?? ( But I hope you agree it would make less than sense if things work that way Hope you got so far, and have time to respond with some advise...? |
| Comment by Jan Lindström (Inactive) [ 2019-03-13 ] |