[MDEV-28407] 'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236 - Jira

Details

Type: Bug
Status: Stalled (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6.7
Fix Version/s: 10.6
Component/s: Replication
Labels:
None

Description

While testing MDEV-21117, often time hit the following error while old_master got stuck at being demoted as new_slave when old_slave got promoted to new_master.

*new_master(old_slave) error log

2022-04-25 15:03:26 0 [Note] /usr/sbin/mariadbd: ready for connections.

Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server

2022-04-25 15:03:26 0 [Note] InnoDB: Buffer pool(s) load completed at 220425 15:03:26

2022-04-25 15:03:30 24 [Warning] Aborted connection 24 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)

2022-04-25 15:03:33 28 [Warning] Aborted connection 28 to db: 'unconnected' user: 'maxscale' host: '192.168.254.26' (Got timeout reading communication packets)

2022-04-25 15:03:36 23 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000010, pos: 461), semi-sync up to file , position 0.

2022-04-25 15:03:36 23 [Note] Semi-sync replication switched OFF.

2022-04-25 15:04:24 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2022-04-25 15:04:24 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000002' at position 23973467; GTID position '0-1-1036936'

2022-04-25 15:04:24 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003

2022-04-25 15:05:24 6 [Note] Slave: connected to master 'repl@192.168.254.24:3306',replication resumed in log 'mariadb-bin.000002' at position 23973467

2022-04-25 15:06:05 6 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)

2022-04-25 15:06:05 6 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941'

2022-04-25 15:06:05 6 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003

2022-04-25 15:06:11 7 [Note] Error reading relay log event: slave SQL thread was killed

2022-04-25 15:06:11 7 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000003' at position 29316013; GTID position '0-1-1096941', master: 192.168.254.24:3306

2022-04-25 15:06:11 6 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read

2022-04-25 15:06:11 6 [Note] Slave I/O thread exiting, read up to log 'mariadb-bin.000003', position 29316013; GTID position 0-1-1096941, master 192.168.254.24:3306

2022-04-25 15:06:11 6 [Note] cannot connect to master to kill slave io_thread's connection

2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.

2022-04-25 15:06:11 33 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.

2022-04-25 15:06:12 3312 [Note] Start binlog_dump to slave_server(1), pos(, 4), using_gtid(1), gtid('0-1-1096912')

*new_slave(old_master) error log

2022-04-25 15:06:12 0 [Note] Successfully truncated binlog file:./mariadb-bin.000003 from previous file size 29319436 to pos:29301832 to remove transactions starting from GTID 0-1-1096913

2022-04-25 15:06:12 0 [Note] Server socket created on IP: '0.0.0.0'.

2022-04-25 15:06:12 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.

2022-04-25 15:06:12 0 [Note] /usr/sbin/mariadbd: ready for connections.

Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server

2022-04-25 15:06:12 5 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'

2022-04-25 15:06:12 5 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.25', master_port='3306', master_log_file='', master_log_pos='4'.

2022-04-25 15:06:12 5 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos

2022-04-25 15:06:12 6 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.25:3306' in log '' at position 4

2022-04-25 15:06:12 6 [Note] Slave I/O thread: connected to master 'repl@192.168.254.25:3306',replication starts at GTID position '0-1-1096912'

2022-04-25 15:06:12 7 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-1-1096912'

2022-04-25 15:06:12 6 [ERROR] Error reading packet from server: Out of memory while looking for GTID position in binlog (server_errno=1236)

2022-04-25 15:06:12 6 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Out of memory while looking for GTID position in binlog', Internal MariaDB error code: 1236

2022-04-25 15:06:12 6 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-1-1096912, master 192.168.254.25:3306

At this point, maxscale server list shows as:

┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐

│ Server  │ Address        │ Port │ Connections │ State           │ GTID        │

├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤

│ primary │ 192.168.254.24 │ 3306 │ 0           │ Running         │ 0-1-1096912 │

├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤

│ replica │ 192.168.254.25 │ 3306 │ 100         │ Master, Running │ 0-2-1302231 │

└─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘

slave status and rpl_semi_sync_% status -

*master/slave_config

[mariadb]

bind_address = 0.0.0.0

log_bin      = mariadb-bin.log

server_id    = 2

log_basename = mariadb

binlog_format = ROW

log_slave_updates = ON

skip_name_resolve = ON

log_error

slow_query_log

long_query_time=1

log_warnings=2

rpl_semi_sync_master_enabled=ON

rpl_semi_sync_master_timeout=1000000

rpl_semi_sync_master_wait_point=AFTER_SYNC

rpl_semi_sync_slave_enabled=ON

innodb-flush-log-at-trx-commit=2

sync-binlog=1

relay-log-purge=1

relay-log-recovery=1

innodb_rollback_on_timeout = ON

max_connections = 3000

slave_parallel_threads = 16

slave_parallel_workers = 16

gtid_strict_mode = ON

rpl_semi_sync_slave_enabled = ON

rpl_semi_sync_master_timeout=10000

innodb_autoinc_lock_mode=2

innodb_flush_method = O_DIRECT

innodb_lock_wait_timeout = 3

innodb_print_all_deadlocks = ON

performance_schema = ON

shutdown_wait_for_slaves = ON

slave_parallel_max_queued = 134217728

table_definition_cache = 15000

table_open_cache = 15000

thread_pool_size=64

wait_timeout=310

general_log = on

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

image-2022-04-25-15-20-22-265.png
112 kB
2022-04-25 06:20

Activity

Ascending order - Click to sort in descending order

Andrei Elkin added a comment - 2022-04-28 10:55 - edited

Thanks for being wide-awake, Julian !

Andrei Elkin added a comment - 2022-04-28 10:55 - edited Thanks for being wide-awake, Julian !

Andrei Elkin added a comment - 2022-04-29 09:32

allen.lee@mariadb.com, salve.

Instead of closing I suggest we first try to reproduce ourselves.
Let me ask you to describe your testing scenario for angelique.sklavounos so she could pick up and proceed. Please reassign to her once that's done. If there's the var directory still alive, we need it too.
Thank you for this alerting case and your efforts so far!

Andrei.

Andrei Elkin added a comment - 2022-04-29 09:32 allen.lee@mariadb.com , salve. Instead of closing I suggest we first try to reproduce ourselves. Let me ask you to describe your testing scenario for angelique.sklavounos so she could pick up and proceed. Please reassign to her once that's done. If there's the var directory still alive, we need it too. Thank you for this alerting case and your efforts so far! Andrei.

Andrei Elkin added a comment - 2022-05-06 07:35

Thank you for this bit of info! Changes like `log-bin=name` or `log-bin-index` needed to be reported.
They relate to the binlog index file whose reading, as I analyzed, end up with this bogus OOM. So it's very likely the binlog index got screwed somehow by your manipulations.
The error message needs correction and that's what we're certainly going to do in this ticket.
I'd be good to find out how exactly the binlog index got corrupted, maybe angelique.sklavounos could give a try to few tests to reveal that.
To help us still, allen.lee@mariadb.com, do you remember which of the two options or both you played around with?

Andrei Elkin added a comment - 2022-05-06 07:35 Thank you for this bit of info! Changes like `log-bin=name` or `log-bin-index` needed to be reported. They relate to the binlog index file whose reading, as I analyzed, end up with this bogus OOM. So it's very likely the binlog index got screwed somehow by your manipulations. The error message needs correction and that's what we're certainly going to do in this ticket. I'd be good to find out how exactly the binlog index got corrupted, maybe angelique.sklavounos could give a try to few tests to reveal that. To help us still, allen.lee@mariadb.com , do you remember which of the two options or both you played around with?

Angelique Sklavounos (Inactive) added a comment - 2022-05-20 15:17

I've not been able to reproduce this after multiple times of following the method in ~~MDEV-28461~~, the .cnf above, and changing `log_bin` in the cnf in between stopping and starting the primary and replica.

It was mentioned that 'log-bin' was manually changed but not seen to be "changed as ... changed in config file, but ... it was somewhat different name". I found the same with the above config, in that the log_bin definition is not used but instead log_basename is used. That it "was somewhat different name" makes sense from the config:

log_bin      = mariadb-bin.log

server_id    = 2

log_basename = mariadb

from which the actual binlog will be named mariadb-bin.000001.

Angelique Sklavounos (Inactive) added a comment - 2022-05-20 15:17 I've not been able to reproduce this after multiple times of following the method in MDEV-28461 , the .cnf above, and changing `log_bin` in the cnf in between stopping and starting the primary and replica. It was mentioned that 'log-bin' was manually changed but not seen to be "changed as ... changed in config file, but ... it was somewhat different name". I found the same with the above config, in that the log_bin definition is not used but instead log_basename is used. That it "was somewhat different name" makes sense from the config: log_bin = mariadb-bin.log server_id = 2 log_basename = mariadb from which the actual binlog will be named mariadb-bin.000001.

Angelique Sklavounos (Inactive) added a comment - 2022-05-23 08:50 - edited

allen.lee@mariadb.com would you happen to remember any more details to reproduce this? One thing I did notice from the logs of ~~MDEV-28461~~ was that the error only occurred before the duplicate error and then wasn't seen again, so did something change?

Angelique Sklavounos (Inactive) added a comment - 2022-05-23 08:50 - edited allen.lee@mariadb.com would you happen to remember any more details to reproduce this? One thing I did notice from the logs of MDEV-28461 was that the error only occurred before the duplicate error and then wasn't seen again, so did something change?

Angelique Sklavounos (Inactive) added a comment - 2022-06-20 08:19

Tried reproducing based on description and logs, but was not able to. Please reopen if issue is seen again.

Angelique Sklavounos (Inactive) added a comment - 2022-06-20 08:19 Tried reproducing based on description and logs, but was not able to. Please reopen if issue is seen again.

JiraAutomate added a comment - 2023-12-05 11:48

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

JiraAutomate added a comment - 2023-12-05 11:48 Automated message: ---------------------------- Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

People

Assignee:: Andrei Elkin

Reporter:: Allen Lee (Inactive)

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2022-04-25 06:26

Updated:: 2025-03-30 17:15

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration