MariaDB Server / MDEV-8955

Problem with using parallel replication settings on master under heavy load

Details

    Description

      Hello,
Following my recent thread about a problem with chained replication (MDEV-8929), a different problem has appeared.

On the master I set binlog_commit_wait_count = 10 (also tested 20, 30, 100 and 1000) and binlog_commit_wait_usec = 10000 (also 20000 and the default of 1 sec).
All worked fine until the load increased. At that point something happened to MariaDB: the processlist started filling up with threads waiting to log in:

      unauthenticated user | connecting host  | NULL     | Connect        | NULL | login                                                                 | NULL

MariaDB started answering slowly. Even the 'mysql' shell command took time to execute. PHP backends went down because of the slow performance. Load average stayed low (<1), however.
I tried both the pool of threads and thread-per-connection. It works fine until I reach something like 10k qps.

The test environment: MyISAM tables without indexes, delayed inserts, and ROW replication, with logs written to a different partition (no I/O problems).

When I apply the binlog_commit_wait_count setting on the master, sometimes it takes a few seconds to apply and sometimes it applies immediately.
But the result is always the same: instead of 10k qps (or more) and around 1k threads connected from the test servers, I end up with 1-2 qps and >10k connected threads, all of them those unauthenticated users. In the end everything stops working. All I see in the DB are some sleeping and mostly unauthenticated processes.

I guess this is caused by heavy internal bookkeeping around the small number of queries (10/20/30) that need to be group-committed. If the server performs at 10k qps, each query takes about 100 microseconds, so the 10/20/30 set for binlog_commit_wait_count is a very small number that keeps MariaDB busy counting, which in turn causes problems with connections/threads and so on. Setting wait_count higher only "helps" MariaDB get stuck faster, though it's a matter of a few seconds either way. Under such load any binlog_commit_wait_usec setting seems to be a "just-in-case" limit that is largely irrelevant.

So all in all I ended up with a standard master and a slave that couldn't catch up, because its parallel threads did nothing as nothing else could be activated on the master. It seems the standard group commit isn't helping much in this case.
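The timing argument above can be sketched with a toy model (plain Python, not MariaDB internals; the qps figures and an evenly spaced commit stream are illustrative assumptions): the group-commit leader waits until either `binlog_commit_wait_count` commits have queued up or `binlog_commit_wait_usec` has elapsed, whichever comes first.

```python
def group_commit_trigger(qps, wait_count, wait_usec):
    """Toy model: which trigger fires first for a group-commit leader?

    Assumes commits arrive evenly spaced at `qps` per second. The leader
    waits until `wait_count` commits have queued or `wait_usec`
    microseconds have elapsed, whichever comes first.
    """
    gap_usec = 1_000_000 / qps                    # mean time between commits
    time_to_fill_group = (wait_count - 1) * gap_usec
    if time_to_fill_group <= wait_usec:
        return "count", time_to_fill_group
    return "timeout", wait_usec

# At 10k qps with wait_count=10, the group fills in ~0.9 ms, far below
# even a 10 ms (10000 usec) timeout, so the count trigger should fire:
print(group_commit_trigger(10_000, 10, 10_000))   # ('count', 900.0)

# At 5 qps with wait_count=10 and a 1 s timeout, filling the group would
# take ~1.8 s, so the timeout fires first:
print(group_commit_trigger(5, 10, 1_000_000))     # ('timeout', 1000000)
```

Under this model the count trigger, not the timeout, should dominate at high load, which makes the counter values reported below all the more puzzling.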

      | Binlog_commits                        | 9842017          |
      | Binlog_group_commits                  | 9757972          |

In addition, whatever settings I apply on the master, only the Binlog_group_commit_trigger_timeout counter ever gets updated (which in my opinion also indicates some problem under heavy load), even with count set to 50 queries and wait_usec kept at the default 1 sec.

      MariaDB [(none)]> show global status like 'binlog%';
      +---------------------------------------+------------------+
      | Variable_name                         | Value            |
      +---------------------------------------+------------------+
      | Binlog_commits                        | 19042854         |
      | Binlog_group_commits                  | 18832527         |
      | Binlog_group_commit_trigger_count     | 0                |
      | Binlog_group_commit_trigger_lock_wait | 0                |
      | Binlog_group_commit_trigger_timeout   | 12569            |
      | Binlog_snapshot_file                  | mysql-bin.012932 |
      | Binlog_snapshot_position              | 55357394         |
      | Binlog_bytes_written                  | 7571633389       |
      | Binlog_cache_disk_use                 | 0                |
      | Binlog_cache_use                      | 0                |
      | Binlog_stmt_cache_disk_use            | 362              |
      | Binlog_stmt_cache_use                 | 19042644         |
      +---------------------------------------+------------------+
      12 rows in set (0.07 sec)
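Simple arithmetic on the counters just printed (values copied verbatim from the SHOW GLOBAL STATUS output; no other assumptions) shows how little grouping is actually happening:

```python
# Values copied from the SHOW GLOBAL STATUS output above.
binlog_commits = 19_042_854
binlog_group_commits = 18_832_527

# Commits per group commit: close to 1.0 means almost every commit
# went into its own group, i.e. group commit achieved almost nothing.
commits_per_group = binlog_commits / binlog_group_commits
print(f"{commits_per_group:.3f} commits per group commit")  # 1.011
```

With binlog_commit_wait_count = 50, an effective ratio of roughly 1.01 commits per group suggests the count threshold is essentially never being reached.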
       
       
      MariaDB [(none)]> show global variables like 'binlog%';
      +-----------------------------------------+--------+
      | Variable_name                           | Value  |
      +-----------------------------------------+--------+
      | binlog_annotate_row_events              | OFF    |
      | binlog_cache_size                       | 32768  |
      | binlog_checksum                         | NONE   |
      | binlog_commit_wait_count                | 50     |
      | binlog_commit_wait_usec                 | 100000 |
      | binlog_direct_non_transactional_updates | OFF    |
      | binlog_format                           | ROW    |
      | binlog_optimize_thread_scheduling       | ON     |
      | binlog_stmt_cache_size                  | 32768  |
      +-----------------------------------------+--------+
      9 rows in set (1.50 sec)

      Please advise

      thanks!
      Alex


        Activity

          ShivaS Alex added a comment -

For now I've managed to more or less stabilize things by setting binlog_commit_wait_count=10 while the load is low and keeping usec at 1 sec (the default). It seems to keep working under heavy load. But if MariaDB is restarted and any of the settings are applied again, the problems begin...


elenst Elena Stepanova added a comment -

ShivaS,
regardless of the problem itself, which we will need to look into, what does your last comment about options mean? Can't you put the desired settings into the config file, so that they are picked up on MariaDB restart? Or am I misreading again?
ShivaS Alex added a comment - edited

          Hi Elena,
I am not putting the settings in my.cnf yet; I am adjusting them in real time to find the best resources/performance/speed combination for the load I'm running.
So once MariaDB is restarted and the load immediately hits it, I apply the setting(s) and watch its behavior.
So far I've managed to stabilize things by setting binlog_commit_wait_count=10 under low load; then it works fine even under high load. I haven't tried anything else yet, and I suspect that setting the time limit to a low value could be another cause, so I keep it at 1 sec (the default) for now, so that MariaDB (hopefully) never hits that limit and deals only with the 10 queries it needs to count and commit at once.

          ShivaS Alex added a comment -

Unfortunately the problem is back at 15-20k qps ;(
I can't get the master to stabilize and make commits... the slave is lagging behind...
After a restart, when all settings have been reset, I still cannot apply anything related to binlog_commit_wait_count... the server just gets stuck...

          ShivaS Alex added a comment -

          Elena,
I decided to give it a try and switched the master's tables to BLACKHOLE, hoping it might speed things up.
Well, it did not ;(
Two things I paid attention to:

1. It's easier to apply and work with the binlog settings on the master when it's using old-style thread-per-connection rather than the pool of threads.
In all my experiments I noticed that old-style connection handling is more stable with the binlog settings (unless a really huge load arrives).
2. Altering big MyISAM tables to BLACKHOLE took very long, as if it were converting to any other engine:
Query OK, 201147303 rows affected (8 min 9.60 sec)
Records: 201147303 Duplicates: 0 Warnings: 0
Is it by design, or should it work much faster? I could make it faster by either dropping and recreating, or truncating before the ALTER.

So I assume it's something about internal counters (or whatever is applied to every arriving connection/thread) that works much harder with the pool of threads (after all, once the problem happens, the processlist is full of unauthenticated users).

          ShivaS Alex added a comment -

          Another thing I've noticed:

On the master I have the slow-log threshold set to 100 msec; all tables are BLACKHOLE and all incoming queries are delayed inserts.
Whenever I apply binlog_commit_wait_count (I use 10 now as it's the most stable), I notice slow queries here and there in the slow log. All of them are around 100 msec, not higher. If binlog_commit_wait_count is disabled, the slow log is clean. The load is fairly light now, 5-10 qps.
In the binlog status I see:
Binlog_group_commit_trigger_timeout | 795751

which means it really does hit the usec timeout, which is set to 1 second. I understand that delayed insert is something that behaves in its own unique way, but it's still strange to see slow queries and to be hitting the default binlog_commit_wait_usec timeout.
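A quick sanity check on these figures (taking the reported 5-10 qps at face value; note that the earlier SHOW GLOBAL VARIABLES output showed binlog_commit_wait_usec = 100000, i.e. 100 ms, and whether that was still the value at this point is an assumption):

```python
# Figures from the comment above: ~5-10 qps of delayed inserts,
# slow-log threshold 100 ms. The earlier SHOW GLOBAL VARIABLES output
# showed binlog_commit_wait_usec = 100000 (100 ms); assuming that value:
wait_usec_ms = 100
for qps in (5, 10):
    gap_ms = 1000 / qps                 # mean time between commits
    # A commit waiting for binlog_commit_wait_count peers gives up after
    # the usec timeout if the next commit hasn't arrived yet:
    stalled_ms = min(gap_ms, wait_usec_ms)
    print(f"{qps} qps: next commit in ~{gap_ms:.0f} ms, "
          f"waiter stalls ~{stalled_ms:.0f} ms")
# At both rates the inter-commit gap (100-200 ms) is at or above the
# 100 ms timeout, so every group gives up on the timeout -- consistent
# with the growing Binlog_group_commit_trigger_timeout counter and with
# slow-log entries of ~100 ms, never higher.
```

If wait_usec were really 1 second here, the same arithmetic would predict stalls of 100-200 ms per missing peer rather than a hard 100 ms ceiling, so the 100 ms snapshot value seems the better fit for the observations.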

          ShivaS Alex added a comment -

Elena, one more thing I wanted to bring to your attention,

quoting one of my previous comments, since I've spammed a bit in this thread and it could be missed:

Altering big MyISAM tables to BLACKHOLE took very long, as if it were converting to any other engine:
Query OK, 201147303 rows affected (8 min 9.60 sec)
Records: 201147303 Duplicates: 0 Warnings: 0
Is it by design, or should it work much faster? I could make it faster by either dropping and recreating, or truncating before the ALTER.

Maybe a small fix can be introduced to make an ALTER to BLACKHOLE happen faster, regardless of the original table size?


elenst Elena Stepanova added a comment -

ShivaS, I can't say right away whether it's possible to make it faster. Since it apparently has nothing to do with parallel replication, I suggest filing a separate JIRA issue about it.

People

  Elkin Andrei Elkin
  ShivaS Alex
  Votes: 0
  Watchers: 3
