[MDEV-32356] Setting gtid_slave_pos is not atomic - Jira

Details

Type: Bug
Status: Confirmed
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.4(EOL)
Fix Version/s: 10.4(EOL)
Component/s: Galera, Replication
Labels:
- foundation

Description

Consider a normal master-slave topology with gtid_strict_mode=0, where the user stops the slave and sets:

SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6';

The value may well be incorrect: there might not even be a node with domain_id 1 or 2. The command is executed through the following call stack:

#0  rpl_slave_state::record_gtid (this=0x55fb9f8f6c90, thd=0x7f2e80031480, gtid=0x7f2e89efa710, sub_id=2, in_transaction=false,
    in_statement=true, out_hton=0x7f2e89efa6f8) at /home/jan/work/mariadb/10.4/sql/rpl_gtid.cc:690
#1  0x000055fb9b9ff12e in rpl_slave_state::load (this=0x55fb9f8f6c90, thd=0x7f2e80031480, state_from_master=0x7f2e8003e053 "", len=11,
    reset=true, in_statement=true) at /home/jan/work/mariadb/10.4/sql/rpl_gtid.cc:1409
#2  0x000055fb9b81d972 in rpl_gtid_pos_update (thd=0x7f2e80031480, str=0x7f2e8003e048 "1-2-3,2-4-6", len=11)
    at /home/jan/work/mariadb/10.4/sql/sql_repl.cc:4728
#3  0x000055fb9b99469a in Sys_var_gtid_slave_pos::global_update (this=0x55fb9d1fde20 <Sys_gtid_slave_pos>, thd=0x7f2e80031480,
    var=0x7f2e8003dff8) at /home/jan/work/mariadb/10.4/sql/sys_vars.cc:1858
#4  0x000055fb9b6a8c5e in sys_var::update (this=0x55fb9d1fde20 <Sys_gtid_slave_pos>, thd=0x7f2e80031480, var=0x7f2e8003dff8)
    at /home/jan/work/mariadb/10.4/sql/set_var.cc:208
#5  0x000055fb9b6aab8e in set_var::update (this=0x7f2e8003dff8, thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/set_var.cc:837
#6  0x000055fb9b6aa7f0 in sql_set_variables (thd=0x7f2e80031480, var_list=0x7f2e80036360, free=true)
    at /home/jan/work/mariadb/10.4/sql/set_var.cc:740
#7  0x000055fb9b7db3f1 in mysql_execute_command (thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:5047
#8  0x000055fb9b7e5303 in mysql_parse (thd=0x7f2e80031480, rawbuf=0x7f2e8003de68 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", length=40,
    parser_state=0x7f2e89efb300, is_com_multi=false, is_next_command=false) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:8012
#9  0x000055fb9b7e499d in wsrep_mysql_parse (thd=0x7f2e80031480, rawbuf=0x7f2e8003de68 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", length=40,
    parser_state=0x7f2e89efb300, is_com_multi=false, is_next_command=false) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:7814
#10 0x000055fb9b7d0979 in dispatch_command (command=COM_QUERY, thd=0x7f2e80031480,
    packet=0x7f2e8004fa01 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", packet_length=40, is_com_multi=false, is_next_command=false)
    at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:1843
#11 0x000055fb9b7cf2ce in do_command (thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:1378
#12 0x000055fb9b975bbe in do_handle_one_connection (connect=0x55fb9fded720) at /home/jan/work/mariadb/10.4/sql/sql_connect.cc:1420
#13 0x000055fb9b97591a in handle_one_connection (arg=0x55fb9fded720) at /home/jan/work/mariadb/10.4/sql/sql_connect.cc:1324
#14 0x000055fb9bf1154b in pfs_spawn_thread (arg=0x55fb9f972430) at /home/jan/work/mariadb/10.4/storage/perfschema/pfs.cc:1869
#15 0x00007f2e97c97ada in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#16 0x00007f2e97d282e4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

The update is not atomic because rpl_slave_state::load contains a loop: it takes one GTID at a time and calls rpl_slave_state::record_gtid, which contains:

    if (err || (err= ha_commit_trans(thd, FALSE)))
      ha_rollback_trans(thd, FALSE);
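
The effect of committing inside the loop can be illustrated with a small standalone model. This is only a sketch and not MariaDB code: Gtid, PosTable, load_per_gtid_commit and the fail_at fault-injection parameter are hypothetical names introduced here, and a "commit" is modelled as appending a row to an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model types; these are NOT the MariaDB structures.
struct Gtid {
    uint32_t domain_id;
    uint32_t server_id;
    uint64_t seq_no;
};

// Models the durable contents of the position table (committed rows only).
struct PosTable {
    std::vector<Gtid> committed;
};

// Mirrors the shape of the loop described above: one commit per GTID.
// fail_at simulates an error or crash just before the GTID at that index
// is committed; pass gtids.size() to simulate a run without failure.
bool load_per_gtid_commit(PosTable &table, const std::vector<Gtid> &gtids,
                          std::size_t fail_at) {
    assert(fail_at <= gtids.size());
    for (std::size_t i = 0; i < gtids.size(); ++i) {
        if (i == fail_at)
            return false;  // rows committed in earlier iterations stay durable
        table.committed.push_back(gtids[i]);  // per-GTID "commit"
    }
    return true;
}
```

In this model, loading the two GTIDs 1-2-3 and 2-4-6 with fail_at set to 1 returns false but leaves 1-2-3 in the table, which is the partially applied state this report is concerned with.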

Because these GTIDs are not stored atomically, problems can arise in the following cases:

- In Galera we replicate the gtid_slave_pos table to other nodes and assume it is InnoDB. This replication is required at least in the case where a slave node is configured with skip_slave_start=0, goes down, and then starts again.
- What happens if the first GTID position has been stored and its transaction committed, and then the node crashes?
- Galera needs a Galera transaction, so one is started in rpl_slave_state::record_gtid, but that transaction is lost because of the ha_commit_trans or ha_rollback_trans call. We might be able to fix this by cleaning the Galera transaction context and starting a new transaction, but that is not optimal because the GTID position update would still not be atomic.
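
For contrast, an all-or-nothing update could look roughly like the sketch below. It is a standalone model and not a proposed patch: Gtid, PosTable, load_single_commit and fail_at are hypothetical names, and replacing the whole table on success is an assumption based only on reset=true appearing in the backtrace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model types; these are NOT the MariaDB structures.
struct Gtid {
    uint32_t domain_id;
    uint32_t server_id;
    uint64_t seq_no;
};

// Models the durable contents of the position table (committed rows only).
struct PosTable {
    std::vector<Gtid> committed;
};

// Stage every GTID first and publish them in one step, so a failure at any
// point leaves the previous position untouched.
// fail_at simulates an error just before the GTID at that index is staged;
// pass gtids.size() to simulate a run without failure.
bool load_single_commit(PosTable &table, const std::vector<Gtid> &gtids,
                        std::size_t fail_at) {
    assert(fail_at <= gtids.size());
    std::vector<Gtid> staged;
    staged.reserve(gtids.size());
    for (std::size_t i = 0; i < gtids.size(); ++i) {
        if (i == fail_at)
            return false;  // "rollback": staged rows are dropped
        staged.push_back(gtids[i]);
    }
    table.committed = std::move(staged);  // the single "commit"
    return true;
}
```

With this shape, a failure on the second GTID leaves the old position in place instead of a mix of old and new rows, and a Galera node would have exactly one transaction to replicate for the whole SET statement.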

Issue Links

causes

- MDEV-32193 Assertion `state() == s_executing || state() == s_prepared || state() == s_committing || state() == s_must_abort || state() == s_replaying' failed. (Stalled)
- MDEV-33129 Crash in wsrep::wsrep_provider_v26::replay when setting gtid_slave_pos (Closed)

People

Assignee: Kristian Nielsen

Reporter: Jan Lindström

Votes: 1

Watchers: 2

Dates

Created: 2023-10-05 09:02

Updated: 2025-02-06 07:21