Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28407

'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236

Details

    • Bug
    • Status: Stalled (View Workflow)
    • Critical
    • Resolution: Unresolved
    • 10.6.7
    • 10.6
    • Replication
    • None

    Description

      While testing MDEV-21117, often time hit the following error while old_master got stuck at being demoted as new_slave when old_slave got promoted to new_master.

      *new_master(old_slave) error log

      2022-04-25 15:03:26 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
      2022-04-25 15:03:26 0 [Note] InnoDB: Buffer pool(s) load completed at 220425 15:03:26
      2022-04-25 15:03:30 24 [Warning] Aborted connection 24 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)
      2022-04-25 15:03:33 28 [Warning] Aborted connection 28 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)
      2022-04-25 15:03:36 23 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000010, pos: 461), semi-sync up to file , position 0.
      2022-04-25 15:03:36 23 [Note] Semi-sync replication switched OFF.
      2022-04-25 15:04:24 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
      2022-04-25 15:04:24 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000002' at position 23973467; GTID position '0-1-1036936'
      2022-04-25 15:04:24 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
      2022-04-25 15:05:24 6 [Note] Slave: connected to master 'repl@192.168.254.24:3306',replication resumed in log 'mariadb-bin.000002' at position 23973467
      2022-04-25 15:06:05 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
      2022-04-25 15:06:05 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941'
      2022-04-25 15:06:05 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
      2022-04-25 15:06:11 7 [Note] Error reading relay log event: slave SQL thread was killed
      2022-04-25 15:06:11 7 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941', master: 192.168.254.24:3306
      2022-04-25 15:06:11 6 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
      2022-04-25 15:06:11 6 [Note] Slave I/O thread exiting, read up to log 'mariadb-bin.000003', position 29316013; GTID position 0-1-1096941, master 192.168.254.24:3306
      2022-04-25 15:06:11 6 [Note] cannot connect to master to kill slave io_thread's connection
      2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
      2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
      2022-04-25 15:06:12 3312 [Note] Start binlog_dump to slave_server(1), pos(, 4), using_gtid(1), gtid('0-1-1096912')
      
      

      *new_slave(old_master) error log

      2022-04-25 15:06:12 0 [Note] Successfully truncated binlog file:./mariadb-bin.000003 from previous file size 29319436 to pos:29301832 to remove transactions starting from GTID 0-1-1096913
      2022-04-25 15:06:12 0 [Note] Server socket created on IP: '0.0.0.0'.
      2022-04-25 15:06:12 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
      2022-04-25 15:06:12 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
      2022-04-25 15:06:12 5 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
      2022-04-25 15:06:12 5 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.25', master_port='3306', master_log_file='', master_log_pos='4'.
      2022-04-25 15:06:12 5 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
      2022-04-25 15:06:12 6 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.25:3306' in log '' at position 4
      2022-04-25 15:06:12 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-1-1096912'
      2022-04-25 15:06:12 7 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-1-1096912'
      2022-04-25 15:06:12 6 [ERROR] Error reading packet from server: Out of memory while looking for GTID position in binlog (server_errno=1236)
      2022-04-25 15:06:12 6 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236
      2022-04-25 15:06:12 6 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-1-1096912, master 192.168.254.25:3306
      

      At this point, maxscale server list shows as:

      ┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ primary │ 192.168.254.2433060           │ Running         │ 0-1-1096912
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ replica │ 192.168.254.253306100         │ Master, Running │ 0-2-1302231
      └─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
      

      • slave status and rpl_semi_sync_% status -

      *master/slave_config

      [mariadb]
      bind_address = 0.0.0.0 
      log_bin      = mariadb-bin.log 
      server_id    = 2 
      log_basename = mariadb 
      binlog_format = ROW 
      log_slave_updates = ON 
      skip_name_resolve = ON 
       
      log_error 
      slow_query_log 
      long_query_time=1 
      log_warnings=2
       
      rpl_semi_sync_master_enabled=ON 
      rpl_semi_sync_master_timeout=1000000 
      rpl_semi_sync_master_wait_point=AFTER_SYNC 
       
      rpl_semi_sync_slave_enabled=ON 
       
      innodb-flush-log-at-trx-commit=2 
      sync-binlog=1 
      relay-log-purge=1 
      relay-log-recovery=1 
      innodb_rollback_on_timeout = ON 
      max_connections = 3000 
       
      slave_parallel_threads = 16 
      slave_parallel_workers = 16 
      gtid_strict_mode = ON 
      rpl_semi_sync_slave_enabled = ON 
      rpl_semi_sync_master_timeout=10000 
       
      innodb_autoinc_lock_mode=2 
      innodb_flush_method = O_DIRECT 
      innodb_lock_wait_timeout = 3 
      innodb_print_all_deadlocks = ON 
      performance_schema = ON 
      shutdown_wait_for_slaves = ON 
      slave_parallel_max_queued = 134217728 
      table_definition_cache = 15000 
      table_open_cache = 15000 
      thread_pool_size=64 
      wait_timeout=310 
      general_log = on 
      

      Attachments

        Activity

          Elkin Andrei Elkin added a comment - - edited

          Thanks for being wide-awake, Julian !

          Elkin Andrei Elkin added a comment - - edited Thanks for being wide-awake, Julian !
          Elkin Andrei Elkin added a comment -

          allen.lee@mariadb.com, salve.

          Instead of closing I suggest we first try to reproduce ourselves.
          Let me ask you to describe your testing scenario for angelique.sklavounos so she could pick up and proceed. Please reassign to her once that's done. If there's the var directory still alive, we need it too.
          Thank you for this alerting case and your efforts so far!

          Andrei.

          Elkin Andrei Elkin added a comment - allen.lee@mariadb.com , salve. Instead of closing I suggest we first try to reproduce ourselves. Let me ask you to describe your testing scenario for angelique.sklavounos so she could pick up and proceed. Please reassign to her once that's done. If there's the var directory still alive, we need it too. Thank you for this alerting case and your efforts so far! Andrei.
          Elkin Andrei Elkin added a comment -

          Thank you for this bit of info! Changes like `log-bin=name` or `log-bin-index` needed to be reported.
          They relate to the binlog index file whose reading, as I analyzed, end up with this bogus OOM. So it's very likely the binlog index got screwed somehow by your manipulations.
          The error message needs correction and that's what we're certainly going to do in this ticket.
          I'd be good to find out how exactly the binlog index got corrupted, maybe angelique.sklavounos could give a try to few tests to reveal that.
          To help us still, allen.lee@mariadb.com, do you remember which of the two options or both you played around with?

          Elkin Andrei Elkin added a comment - Thank you for this bit of info! Changes like `log-bin=name` or `log-bin-index` needed to be reported. They relate to the binlog index file whose reading, as I analyzed, end up with this bogus OOM. So it's very likely the binlog index got screwed somehow by your manipulations. The error message needs correction and that's what we're certainly going to do in this ticket. I'd be good to find out how exactly the binlog index got corrupted, maybe angelique.sklavounos could give a try to few tests to reveal that. To help us still, allen.lee@mariadb.com , do you remember which of the two options or both you played around with?

          I've not been able to reproduce this after multiple times of following the method in MDEV-28461, the .cnf above, and changing `log_bin` in the cnf in between stopping and starting the primary and replica.

          It was mentioned that 'log-bin' was manually changed but not seen to be "changed as ... changed in config file, but ... it was somewhat different name". I found the same with the above config, in that the log_bin definition is not used but instead log_basename is used. That it "was somewhat different name" makes sense from the config:

          log_bin      = mariadb-bin.log 
          server_id    = 2 
          log_basename = mariadb 
          

          from which the actual binlog will be named mariadb-bin.000001.

          angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - I've not been able to reproduce this after multiple times of following the method in MDEV-28461 , the .cnf above, and changing `log_bin` in the cnf in between stopping and starting the primary and replica. It was mentioned that 'log-bin' was manually changed but not seen to be "changed as ... changed in config file, but ... it was somewhat different name". I found the same with the above config, in that the log_bin definition is not used but instead log_basename is used. That it "was somewhat different name" makes sense from the config: log_bin = mariadb-bin.log server_id = 2 log_basename = mariadb from which the actual binlog will be named mariadb-bin.000001.

          allen.lee@mariadb.com would you happen to remember any more details to reproduce this? One thing I did notice from the logs of MDEV-28461 was that the error only occurred before the duplicate error and then wasn't seen again, so did something change?

          angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - - edited allen.lee@mariadb.com would you happen to remember any more details to reproduce this? One thing I did notice from the logs of MDEV-28461 was that the error only occurred before the duplicate error and then wasn't seen again, so did something change?

          Tried reproducing based on description and logs, but was not able to. Please reopen if issue is seen again.

          angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - Tried reproducing based on description and logs, but was not able to. Please reopen if issue is seen again.
          JIraAutomate JiraAutomate added a comment -

          Automated message:
          ----------------------------
          Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

          JIraAutomate JiraAutomate added a comment - Automated message: ---------------------------- Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

          People

            Elkin Andrei Elkin
            allen.lee@mariadb.com Allen Lee (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.