[MDEV-28407] 'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236 Created: 2022-04-25  Updated: 2023-12-05

Status: Stalled
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.6.7
Fix Version/s: 10.6

Type: Bug Priority: Critical
Reporter: Allen Lee (Inactive) Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: None

Attachments: PNG File image-2022-04-25-15-20-22-265.png    

 Description   

While testing MDEV-21117, often time hit the following error while old_master got stuck at being demoted as new_slave when old_slave got promoted to new_master.

*new_master(old_slave) error log

2022-04-25 15:03:26 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
2022-04-25 15:03:26 0 [Note] InnoDB: Buffer pool(s) load completed at 220425 15:03:26
2022-04-25 15:03:30 24 [Warning] Aborted connection 24 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)
2022-04-25 15:03:33 28 [Warning] Aborted connection 28 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)
2022-04-25 15:03:36 23 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000010, pos: 461), semi-sync up to file , position 0.
2022-04-25 15:03:36 23 [Note] Semi-sync replication switched OFF.
2022-04-25 15:04:24 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
2022-04-25 15:04:24 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000002' at position 23973467; GTID position '0-1-1036936'
2022-04-25 15:04:24 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
2022-04-25 15:05:24 6 [Note] Slave: connected to master 'repl@192.168.254.24:3306',replication resumed in log 'mariadb-bin.000002' at position 23973467
2022-04-25 15:06:05 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
2022-04-25 15:06:05 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941'
2022-04-25 15:06:05 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
2022-04-25 15:06:11 7 [Note] Error reading relay log event: slave SQL thread was killed
2022-04-25 15:06:11 7 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941', master: 192.168.254.24:3306
2022-04-25 15:06:11 6 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2022-04-25 15:06:11 6 [Note] Slave I/O thread exiting, read up to log 'mariadb-bin.000003', position 29316013; GTID position 0-1-1096941, master 192.168.254.24:3306
2022-04-25 15:06:11 6 [Note] cannot connect to master to kill slave io_thread's connection
2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2022-04-25 15:06:12 3312 [Note] Start binlog_dump to slave_server(1), pos(, 4), using_gtid(1), gtid('0-1-1096912')

*new_slave(old_master) error log

2022-04-25 15:06:12 0 [Note] Successfully truncated binlog file:./mariadb-bin.000003 from previous file size 29319436 to pos:29301832 to remove transactions starting from GTID 0-1-1096913
2022-04-25 15:06:12 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-04-25 15:06:12 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
2022-04-25 15:06:12 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
2022-04-25 15:06:12 5 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
2022-04-25 15:06:12 5 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.25', master_port='3306', master_log_file='', master_log_pos='4'.
2022-04-25 15:06:12 5 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
2022-04-25 15:06:12 6 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.25:3306' in log '' at position 4
2022-04-25 15:06:12 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-1-1096912'
2022-04-25 15:06:12 7 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-1-1096912'
2022-04-25 15:06:12 6 [ERROR] Error reading packet from server: Out of memory while looking for GTID position in binlog (server_errno=1236)
2022-04-25 15:06:12 6 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236
2022-04-25 15:06:12 6 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-1-1096912, master 192.168.254.25:3306

At this point, maxscale server list shows as:

┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ primary │ 192.168.254.2433060           │ Running         │ 0-1-1096912
├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
│ replica │ 192.168.254.253306100         │ Master, Running │ 0-2-1302231
└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘

  • slave status and rpl_semi_sync_% status -

*master/slave_config

[mariadb]
bind_address = 0.0.0.0 
log_bin      = mariadb-bin.log 
server_id    = 2 
log_basename = mariadb 
binlog_format = ROW 
log_slave_updates = ON 
skip_name_resolve = ON 
 
log_error 
slow_query_log 
long_query_time=1 
log_warnings=2
 
rpl_semi_sync_master_enabled=ON 
rpl_semi_sync_master_timeout=1000000 
rpl_semi_sync_master_wait_point=AFTER_SYNC 
 
rpl_semi_sync_slave_enabled=ON 
 
innodb-flush-log-at-trx-commit=2 
sync-binlog=1 
relay-log-purge=1 
relay-log-recovery=1 
innodb_rollback_on_timeout = ON 
max_connections = 3000 
 
slave_parallel_threads = 16 
slave_parallel_workers = 16 
gtid_strict_mode = ON 
rpl_semi_sync_slave_enabled = ON 
rpl_semi_sync_master_timeout=10000 
 
innodb_autoinc_lock_mode=2 
innodb_flush_method = O_DIRECT 
innodb_lock_wait_timeout = 3 
innodb_print_all_deadlocks = ON 
performance_schema = ON 
shutdown_wait_for_slaves = ON 
slave_parallel_max_queued = 134217728 
table_definition_cache = 15000 
table_open_cache = 15000 
thread_pool_size=64 
wait_timeout=310 
general_log = on 



 Comments   
Comment by Andrei Elkin [ 2022-04-28 ]

Thanks for being wide-awake, Julian !

Comment by Andrei Elkin [ 2022-04-29 ]

allen.lee@mariadb.com, salve.

Instead of closing I suggest we first try to reproduce ourselves.
Let me ask you to describe your testing scenario for angelique.sklavounos so she could pick up and proceed. Please reassign to her once that's done. If there's the var directory still alive, we need it too.
Thank you for this alerting case and your efforts so far!

Andrei.

Comment by Andrei Elkin [ 2022-05-06 ]

Thank you for this bit of info! Changes like `log-bin=name` or `log-bin-index` needed to be reported.
They relate to the binlog index file whose reading, as I analyzed, end up with this bogus OOM. So it's very likely the binlog index got screwed somehow by your manipulations.
The error message needs correction and that's what we're certainly going to do in this ticket.
I'd be good to find out how exactly the binlog index got corrupted, maybe angelique.sklavounos could give a try to few tests to reveal that.
To help us still, allen.lee@mariadb.com, do you remember which of the two options or both you played around with?

Comment by Angelique Sklavounos (Inactive) [ 2022-05-20 ]

I've not been able to reproduce this after multiple times of following the method in MDEV-28461, the .cnf above, and changing `log_bin` in the cnf in between stopping and starting the primary and replica.

It was mentioned that 'log-bin' was manually changed but not seen to be "changed as ... changed in config file, but ... it was somewhat different name". I found the same with the above config, in that the log_bin definition is not used but instead log_basename is used. That it "was somewhat different name" makes sense from the config:

log_bin      = mariadb-bin.log 
server_id    = 2 
log_basename = mariadb 

from which the actual binlog will be named mariadb-bin.000001.

Comment by Angelique Sklavounos (Inactive) [ 2022-05-23 ]

allen.lee@mariadb.com would you happen to remember any more details to reproduce this? One thing I did notice from the logs of MDEV-28461 was that the error only occurred before the duplicate error and then wasn't seen again, so did something change?

Comment by Angelique Sklavounos (Inactive) [ 2022-06-20 ]

Tried reproducing based on description and logs, but was not able to. Please reopen if issue is seen again.

Comment by JiraAutomate [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Generated at Thu Feb 08 10:00:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.