Discovered on Debian 10.3, reproduced on Debian 9.12, Ubuntu 20.04, and CentOS 8.2. All systems are DigitalOcean droplets with 2 vCPU (2.3 GHz), 2 GB RAM, and SSD storage.
Description
I originally posted a question at https://dba.stackexchange.com/questions/269503/why-is-disk-io-higher-on-debian-10-mariadb-10-3-with-mysql-replication, but further investigation leads me to believe this is a regression in MariaDB, possibly in the InnoDB storage engine. I'll try to repeat the important bits here.
Initially, I had two Debian 9.12 servers running MariaDB 10.1.44 in a parallel master-master replication setup: one is the primary, and the other is used as an automatic fail-over in our application. I was looking to replace them with new Debian 10 instances running MariaDB 10.3.22. Once I had the new 10.3.22 instance going, I ran into some performance trouble, and in the process discovered roughly 6x higher disk write activity on the new server (based on bytes, rather than number of writes).
I moved all production activity back to the 10.1.44 primary. Write activity on the 10.3.22 server dropped, but was still about 3x higher, with no noticeable change on the 10.1.44 primary. In fact, when you compare the 10.1.44 primary with its matching fail-over server, there is very little difference in write activity (both in writes/s and bytes/s).
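(For anyone trying to reproduce the comparison: a rough way to split the write volume into data-file versus redo-log bytes is to sample InnoDB's cumulative status counters over a fixed interval on each server. This is only a sketch, not necessarily how the charts referenced below were produced.)
-- Sample the cumulative byte counters, wait a fixed interval, then sample again;
-- the differences split that interval's writes into data-file vs. redo-log bytes.
SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_written', 'Innodb_os_log_written');
DO SLEEP(60);
SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_written', 'Innodb_os_log_written');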
I've gone through and compared every setting that differed between MariaDB 10.1.44 and 10.3.22, and found nothing that had any impact on disk IO for our workload.
For reference, here are the settings I've tried:
binlog_annotate_row_events = OFF
binlog_checksum = NONE
binlog_format = STATEMENT
innodb_flush_method = O_DSYNC
log_slow_admin_statements = OFF
log_slow_slave_statements = OFF
replicate_annotate_row_events = OFF
innodb_adaptive_hash_index_parts = 0/1/2
innodb_purge_threads = 1
innodb_spin_wait_delay = 6
innodb_strict_mode = OFF
innodb_use_atomic_writes = OFF
innodb_flush_sync = OFF
innodb_flush_neighbors = 0
innodb_io_capacity = 2000
innodb_use_native_aio = OFF
thread_cache_size = 8
thread_pool_max_threads = 1000
thread_stack = 196608
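Such a comparison can be scripted by dumping the relevant variables on each server and diffing the output; a minimal sketch for MariaDB (MySQL 8 exposes the same data in performance_schema.global_variables):
-- Dump the InnoDB-, binlog- and slave-related settings in a stable order so the
-- output from the 10.1 and 10.3 servers can be diffed directly.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.GLOBAL_VARIABLES
WHERE VARIABLE_NAME LIKE 'innodb\_%'
   OR VARIABLE_NAME LIKE 'binlog\_%'
   OR VARIABLE_NAME LIKE 'slave\_%'
ORDER BY VARIABLE_NAME;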
I also noticed 10.3.22 was using the Barracuda file format, whereas 10.1.44 was using Antelope, so I moved the 10.1.44 primary server to Barracuda, which had no negative impact. Note that I wasn't sure how to rebuild these tables, so they remain Antelope: SYS_DATAFILES, SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_TABLESPACES.
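(For reference, a sketch of how the remaining Antelope tables can be found and how a user table can be rebuilt; the SYS_* tables above are InnoDB's internal dictionary tables, which is why they cannot be rebuilt this way. The table name below is just an example.)
-- List InnoDB tables still using an Antelope row format (Redundant or Compact)
SELECT NAME, ROW_FORMAT
FROM information_schema.INNODB_SYS_TABLES
WHERE ROW_FORMAT IN ('Redundant', 'Compact');
-- Rebuild a table into the Barracuda DYNAMIC format
-- (on 10.1 this also requires innodb_file_per_table=ON and innodb_file_format=Barracuda)
ALTER TABLE clients ROW_FORMAT=DYNAMIC;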
After all that, I wanted to rule out Debian 10 as the culprit, and I also tested various versions of MariaDB and even MySQL 8. Here are the combinations I've tried:
optimize1 Debian 9.12 MariaDB 10.1.44 (master-master with db11)
optimize2 Debian 9.12 MariaDB 10.1.44 (slave from optimize1)
db11 Debian 10.3 MariaDB 10.3.22 (master-master with optimize1)
db12 Debian 10.3 MariaDB 10.3.22 (slave from db11)
db11293 Ubuntu 20.04 MariaDB 10.3.22 (slave from 10.1)
db11294 CentOS 8.2 MariaDB 10.4.13 (slave from 10.1)
db11295 Ubuntu 20.04 MySQL 8.0.20 (slave from 10.1)
db11296 Debian 9.12 MariaDB 10.1.44 (slave from 10.1)
db11296 Debian 9.12 MariaDB 10.2.32 (slave from 10.1)
db11296 Debian 9.12 MariaDB 10.3.23 (slave from 10.1)
db11296 Debian 9.12 MariaDB 10.4.13 (slave from 10.1)
db11296 Debian 9.12 MariaDB 10.5.3 (slave from 10.1)
Note that db11296 is currently running 10.5.3, but I tested it on all those versions for at least an hour to compare it with the performance of the other active servers.
The CentOS 8.2 server performed better than the Debian/Ubuntu servers due to using the XFS filesystem, but still showed much more write IO than the 10.1.44 servers.
MySQL 8 is sometimes "normal", but then bursts for brief periods to match the 10.2+ servers.
10.2 on Debian was the worst of the bunch; 10.5 was the best, matching the performance of the CentOS server running 10.4.
The workload on the db is nothing too crazy: 5-15 updates/second, lots of reads, and the occasional insert/delete. Some of the most common queries are these:
UPDATE clients SET licensed=licensed-1,last_optimized=NOW(),to_bill=0 WHERE id=436922 AND licensed>0
UPDATE client_usage_stats SET consumed=consumed+1 WHERE client_id=436922 AND year=2020 AND month=6
UPDATE clients SET consumed=consumed+1,last_optimized=NOW(),to_bill=0 WHERE id=436922
UPDATE api_keys SET consumed=consumed+1 WHERE sub_key='xyz123...'
Note: if you look at the charts, you might notice a period when all the servers seem to have "synced up", with roughly the same amount of disk write activity. I noticed this for the first time yesterday. I thought perhaps it was due to some changes I made which were generating additional writes of this sort:
UPDATE expired_emails SET hits=hits+1 WHERE api_key='xyz...'
"Additional" in the sense that we already had some of these happening already, but now there were several other cases that could trigger the same query.
But reverting the changes did not result in any changes, and the write activity stayed synced for around 8-10 hours.
This morning when I checked, everything was back to "normal", in that the new servers all had substantially higher disk writes, and reinstating the changes from yesterday made no impact on write activity. At most, we were only seeing about 10-15 of those per minute.
I'll attach some snapshots of the charts from both periods for reference.
Shane Bishop added a comment (edited):
I forgot to provide my "stock" config for these nodes:
max_connections = 1000
sql_mode = NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
key_buffer_size = 16M
query_cache_limit = 4M
query_cache_size = 0
query_cache_type = 0
server-id = 12
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = optimizer
replicate-do-db = optimizer
report-host = db12...
log-slave-updates = true
max_binlog_size = 100M
innodb_log_file_size = 32M
innodb_buffer_pool_size = 256M
Shane Bishop added a comment:
I finally had time to test Percona (8.0.19), and discovered that it does not suffer from this same issue. Disk IO (writes specifically) is significantly better than MariaDB 10.2+, a bit better than stock MySQL 8, and the same or better than the original slave server running MariaDB 10.1.
Guillaume Lefranc added a comment:
Same issue here after upgrading. Digging into system I/O profiles, it appears that MariaDB 10.2 is writing to the redo logs intensively.
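(One way to confirm where the extra bytes are going: the LOG section of SHOW ENGINE INNODB STATUS reports "Log sequence number", a running byte count of redo generated, along with "Pages flushed up to" and "Last checkpoint at". Sampling it twice a known number of seconds apart and diffing the log sequence number gives an approximate redo write rate; this is a rough sketch, not taken from the original report.)
-- Sample twice and diff the "Log sequence number" to estimate redo bytes/s
SHOW ENGINE INNODB STATUS\G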
Guillaume Lefranc added a comment:
Found the cause of the regression: high innodb_io_capacity settings are now interpreted in a different, more aggressive fashion by the page flushing algorithm in InnoDB 5.7, it seems. I suggest lowering it to 200, and you'll see less aggressive flushing.
Thanks to Percona for the explanations in this post: https://www.percona.com/blog/2020/05/14/tuning-mysql-innodb-flushing-for-a-write-intensive-workload/
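(For anyone wanting to test that suggestion without a restart, the flushing-related knobs are dynamic variables; an illustrative sketch, with example values rather than recommendations:)
SET GLOBAL innodb_io_capacity = 200;         -- steady-state background flushing budget
SET GLOBAL innodb_io_capacity_max = 2000;    -- ceiling used when flushing falls behind
SET GLOBAL innodb_max_dirty_pages_pct = 75;  -- target ceiling for dirty pages in the buffer pool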
Shane Bishop added a comment:
If your innodb_io_capacity setting was carried over from a previous version, that could certainly explain it, as I think Debian 9/MariaDB 10.1 had a default of 2000.
However, the stock/default configs on Debian 10 already use innodb_io_capacity=200, and lowering it further has not helped. As noted in my original report, I did already try increasing innodb_io_capacity to 2000, and of course that didn't help at all.
Olivier FILIPE added a comment:
Hello, I had the same problem. Try this: https://mariadb.com/fr/resources/blog/goodbye-replication-lag/
This was the solution for me.
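(For context, that article appears to cover MariaDB's parallel replication; the corresponding knob is slave_parallel_threads, which can only be changed while the slave threads are stopped. A minimal sketch of disabling it to rule parallel apply out:)
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 0;   -- 0 = serial apply; a positive value enables parallel apply
START SLAVE;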
Shane Bishop added a comment:
Thanks for the suggestion, but I've since moved the entire thing over to a managed database so I don't have to deal with these issues anymore, so unfortunately I no longer have a way to test it. Even if I brought the servers back online, there would be no way to generate the same load they were handling before.
That said, the article was written for MariaDB 10.0.12, and 10.1 worked just fine for me, so I'm not sure it would have been much help, but at this point I may never know. I do know that switching to Percona took care of it, and it has slave_parallel_workers and replica_parallel_workers set to 0. It doesn't have a slave_parallel_threads variable to configure, so those are the closest variables I can find.