Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34856

Galera Cluster upgrade from 11.0 to 11.1 fails, eventually leads to crash

Details

    • Bug
    • Status: Closed (View Workflow)
    • Blocker
    • Resolution: Fixed
    • 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4
    • 11.2.6, 11.4.4
    • Galera, Galera SST
    • None

    Description

      Description below was for original 10.11 to 11.4 failure, further testing showed that the actual same problem occurs betweein 11.0 and 11.1. Rolling upgrades from 10.11 to 11.0 and from 11.1 to 11.4 work as expected.
      So it is only the 11.0 to 11.1 step that is critical.

      I'm installing a 10.11 two node cluster with rather basic configuration, wait for a minute, then try a rolling upgrade to 11.4, starting with the 2nd node.

      The upgraded node at first fails to join the cluster with

      Sep 02 12:58:49 node-2 mariadbd[3973]: 2024-09-02 12:58:49 0 [Note] WSREP: Member 1.0 (node-1) synced with group.
      Sep 02 12:58:49 node-2 -wsrep-sst-joiner[4188]: receiving process ended without creating magic file (/tmp/tmp.IJSXAIA9T4/mariadb_backup_galera_info)
      Sep 02 12:58:50 node-2 -wsrep-sst-joiner[4146]: logger: invalid option -- 'r'
      Sep 02 12:58:50 node-2 -wsrep-sst-joiner[4146]: Try 'logger --help' for more information.
      Sep 02 12:58:50 node-2 -wsrep-sst-joiner[4195]: Removing the sst_in_progress file
      Sep 02 12:58:50 node-2 mariadbd[3973]: 2024-09-02 12:58:50 0 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
      Sep 02 12:58:50 node-2 mariadbd[3973]: 2024-09-02 12:58:50 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address 'node-2' --datadir '/var/lib/mysql/' --parent 3973 --progress 0 [...]
      

      while the donor side does not report anything unusual, showing the state transfer as successful IST, not needding a ST:

      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: Member 0.0 (node-2) requested state transfer from '*any*'. Selected 1.0 (node-1)(SYNCED) as donor.
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 6)
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 2 [Note] WSREP: Detected STR version: 1, req_len: 115, req: STRv1
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 2 [Note] WSREP: IST request: b6d18f4d-692a-11ef-abf2-8bd2c43da2fb:4-6|tcp://node-2:4568
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 2 [Note] WSREP: Server status change synced -> donor
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'donor' --address 'node-2:4444/xtrabackup_sst//1' --local-por>
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: Donor monitor thread started to monitor
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 2 [Note] WSREP: sst_donor_thread signaled with 0
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: async IST sender starting to serve tcp://node-2:4568 sending 2-6, preload starts from 2
      Sep 02 12:58:49 node-1 mariadbd[1878]: 2024-09-02 12:58:49 0 [Note] WSREP: IST sender 2 -> 6
      Sep 02 12:58:49 node-1 mariadbd[2359]: WSREP_SST: [INFO] mariabackup IST started on donor (20240902 12:58:49.919)
      Sep 02 12:58:49 node-1 mariadbd[2359]: WSREP_SST: [INFO] SSL configuration: CA='', CAPATH='', CERT='', KEY='', MODE='DISABLED', encrypt='0' (20240902 12:58:49.959)
      Sep 02 12:58:50 node-1 mariadbd[2359]: WSREP_SST: [INFO] Logging all stderr of SST/mariadb-backup to syslog (20240902 12:58:50.054)
      Sep 02 12:58:50 node-1 -wsrep-sst-donor[2543]: Streaming with mbstream
      Sep 02 12:58:50 node-1 -wsrep-sst-donor[2546]: Bypassing the SST for IST
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: SST sent: b6d18f4d-692a-11ef-abf2-8bd2c43da2fb:4
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Server status change donor -> joined
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: 1.0 (node-1): State transfer to 0.0 (node-2) complete.
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 6)
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Processing event queue:... -nan% (0/0 events) complete.
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Member 1.0 (node-1) synced with group.
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Processing event queue:... 100.0% (1/1 events) complete.
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6)
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 2 [Note] WSREP: Server node-1 synced with group
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 2 [Note] WSREP: Server status change joined -> synced
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 2 [Note] WSREP: Synchronized with group, ready for connections
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Sep 02 12:58:50 node-1 -wsrep-sst-donor[2550]: Total time on donor: 0 seconds
      Sep 02 12:58:50 node-1 -wsrep-sst-donor[2551]: mariabackup IST completed on donor
      Sep 02 12:58:50 node-1 mariadbd[1878]: 2024-09-02 12:58:50 0 [Note] WSREP: Donor monitor thread ended with total time 1 sec
      

      SystemD restarts the server process on node-2 and it runs into the same problem two more times before the restart starts to fail due to a crash instead:

      Sep 02 12:59:19 node-2 mariadbd[5558]: 2024-09-02 12:59:19 2 [Note] WSREP: Prepared IST receiver for 0-12, listening at: tcp://10.30.36.12:4568
      Sep 02 12:59:19 node-2 mariadbd[5558]: 2024-09-02 12:59:19 0 [Note] WSREP: Member 0.0 (node-2) requested state transfer from '*any*'. Selected 1.0 (node-1)(SYNCED) as donor.
      Sep 02 12:59:19 node-2 mariadbd[5558]: 2024-09-02 12:59:19 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 12)
      Sep 02 12:59:19 node-2 mariadbd[5558]: 2024-09-02 12:59:19 2 [Note] WSREP: Requesting state transfer: success, donor: 1
      Sep 02 12:59:19 node-2 mariadbd[5558]: 2024-09-02 12:59:19 2 [Note] WSREP: Resetting GCache seqno map due to different histories.
      Sep 02 12:59:19 node-2 mariadbd[5558]: munmap_chunk(): invalid pointer
      Sep 02 12:59:19 node-2 mariadbd[5558]: 240902 12:59:19 [ERROR] mysqld got signal 6 ;
      Sep 02 12:59:19 node-2 mariadbd[5558]: Sorry, we probably made a mistake, and this is a bug.
      Sep 02 12:59:19 node-2 mariadbd[5558]: Your assistance in bug reporting will enable us to fix this for the next release.
      Sep 02 12:59:19 node-2 mariadbd[5558]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      Sep 02 12:59:19 node-2 mariadbd[5558]: We will try our best to scrape up some info that will hopefully help
      Sep 02 12:59:19 node-2 mariadbd[5558]: diagnose the problem, but since we have already crashed,
      Sep 02 12:59:19 node-2 mariadbd[5558]: something is definitely wrong and this may fail.
      Sep 02 12:59:19 node-2 mariadbd[5558]: Server version: 11.4.3-MariaDB-ubu2004-log source revision: 5ab81ffe0097a22a774957df28c5223cf0201de3
      Sep 02 12:59:19 node-2 mariadbd[5558]: key_buffer_size=0
      Sep 02 12:59:19 node-2 mariadbd[5558]: read_buffer_size=131072
      Sep 02 12:59:19 node-2 mariadbd[5558]: max_used_connections=0
      Sep 02 12:59:19 node-2 mariadbd[5558]: max_threads=153
      Sep 02 12:59:19 node-2 mariadbd[5558]: thread_count=3
      Sep 02 12:59:19 node-2 mariadbd[5558]: It is possible that mysqld could use up to
      Sep 02 12:59:19 node-2 mariadbd[5558]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 336996 K  bytes of memory
      Sep 02 12:59:19 node-2 mariadbd[5558]: Hope that's ok; if not, decrease some variables in the equation.
      Sep 02 12:59:19 node-2 mariadbd[5558]: Thread pointer: 0x7f2954000c58
      Sep 02 12:59:19 node-2 mariadbd[5558]: Attempting backtrace. You can use the following information to find out
      Sep 02 12:59:19 node-2 mariadbd[5558]: where mysqld died. If you see no messages after this, something went
      Sep 02 12:59:19 node-2 mariadbd[5558]: terribly wrong...
      Sep 02 12:59:19 node-2 mariadbd[5558]: stack_bottom = 0x7f297875cd88 thread_stack 0x49000
      Sep 02 12:59:19 node-2 mariadbd[5558]: /usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x5615938a01c2]
      Sep 02 12:59:19 node-2 mariadbd[5558]: /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x56159335a875]
      Sep 02 12:59:19 node-2 mariadbd[5558]: sigaction.c:0(__restore_rt)[0x7f298056a420]
      Sep 02 12:59:19 node-2 mariadbd[5774]: addr2line: DWARF error: section .debug_info is larger than its filesize! (0x93f189 vs 0x531098)
      Sep 02 12:59:19 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f298007000b]
      Sep 02 12:59:19 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f298004f859]
      Sep 02 12:59:19 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f29800ba26e]
      Sep 02 12:59:19 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f29800c22fc]
      Sep 02 12:59:19 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(+0x9554c)[0x7f29800c254c]
      Sep 02 12:59:19 node-2 mariadbd[5558]: src/gu_deqmap.hpp:117(gu::DeqMap<long, void const*, std::allocator<void const*> >::index_back() const)[0x7f297a22083e]
      Sep 02 12:59:19 node-2 mariadbd[5558]: src/GCache_seqno.cpp:32(gcache::GCache::seqno_reset(gu::GTID const&))[0x7f297a2055da]
      Sep 02 12:59:19 node-2 mariadbd[5558]: src/replicator_str.cpp:917(galera::ReplicatorSMM::request_state_transfer(void*, int, wsrep_uuid const&, long, void const*, long))[0x7f297a0e6a10]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/replicator_smm.cpp:2722(galera::ReplicatorSMM::process_st_required(void*, int, wsrep_view_info const*))[0x7f297a0cfcc7]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/replicator_smm.cpp:3012(galera::ReplicatorSMM::process_prim_conf_change(void*, gcs_act_cchange const&, int, void*))[0x7f297a0d06ff]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/galera_gcs.hpp:186(galera::Gcs::resume_recv())[0x7f297a0d0d4d]
      Sep 02 12:59:20 node-2 mariadbd[5558]: bits/stl_vector.h:677(std::vector<gcs_act_cchange::member, std::allocator<gcs_act_cchange::member> >::~vector())[0x7f297a10236b]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/gcs_action_source.cpp:29(Release::~Release())[0x7f297a10285f]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/replicator_smm.cpp:404(galera::ReplicatorSMM::async_recv(void*))[0x7f297a0cef10]
      Sep 02 12:59:20 node-2 mariadbd[5558]: src/wsrep_provider.cpp:292(galera_recv)[0x7f297a0a2581]
      Sep 02 12:59:20 node-2 mariadbd[5558]: /usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x56159396cbf2]
      Sep 02 12:59:20 node-2 mariadbd[5558]: /usr/sbin/mariadbd(+0xdcc9a1)[0x56159362e9a1]
      Sep 02 12:59:20 node-2 mariadbd[5558]: /usr/sbin/mariadbd(_Z15start_wsrep_THDPv+0x26b)[0x56159361c3bb]
      Sep 02 12:59:20 node-2 mariadbd[5558]: /usr/sbin/mariadbd(+0xd413a6)[0x5615935a33a6]
      Sep 02 12:59:20 node-2 mariadbd[5558]: nptl/pthread_create.c:478(start_thread)[0x7f298055e609]
      Sep 02 12:59:20 node-2 mariadbd[5778]: addr2line: DWARF error: section .debug_info is larger than its filesize! (0x93f189 vs 0x531098)
      Sep 02 12:59:20 node-2 mariadbd[5558]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f298014c353]
      Sep 02 12:59:20 node-2 mariadbd[5558]: Trying to get some variables.
      Sep 02 12:59:20 node-2 mariadbd[5558]: Some pointers may be invalid and cause the dump to abort.
      Sep 02 12:59:20 node-2 mariadbd[5558]: Query (0x0): (null)
      Sep 02 12:59:20 node-2 mariadbd[5558]: Connection ID (thread ID): 2
      Sep 02 12:59:20 node-2 mariadbd[5558]: Status: NOT_KILLED
      Sep 02 12:59:20 node-2 mariadbd[5558]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on>
      Sep 02 12:59:20 node-2 mariadbd[5558]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
      Sep 02 12:59:20 node-2 mariadbd[5558]: information that should help you find out what is causing the crash.
      Sep 02 12:59:20 node-2 mariadbd[5558]: We think the query pointer is invalid, but we will try to print it anyway.
      Sep 02 12:59:20 node-2 mariadbd[5558]: Query:
      Sep 02 12:59:20 node-2 mariadbd[5558]: Writing a core file...
      

      Attachments

        Issue Links

          Activity

            Very same automated upgrade test setup works fine with starting version being 10.6.18 and target being 10.11.9.
            It fails as described with starting version being 10.11.9 and target being 11.4.3 though.

            hholzgra Hartmut Holzgraefe added a comment - Very same automated upgrade test setup works fine with starting version being 10.6.18 and target being 10.11.9. It fails as described with starting version being 10.11.9 and target being 11.4.3 though.

            Note the

            Sep 02 12:58:50 node-1 -wsrep-sst-donor[2546]: Bypassing the SST for IST
            

            message on the donor side. I'm not getting that when doing the same test for 10.11.8 to 10.11.9 upgrade.

            So there seems to be a connection between that message and

            Sep 02 12:58:49 node-2 -wsrep-sst-joiner[4188]: receiving process ended without creating magic file (/tmp/tmp.IJSXAIA9T4/mariadb_backup_galera_info)
            

            hholzgra Hartmut Holzgraefe added a comment - Note the Sep 02 12:58:50 node-1 -wsrep-sst-donor[2546]: Bypassing the SST for IST message on the donor side. I'm not getting that when doing the same test for 10.11.8 to 10.11.9 upgrade. So there seems to be a connection between that message and Sep 02 12:58:49 node-2 -wsrep-sst-joiner[4188]: receiving process ended without creating magic file (/tmp/tmp.IJSXAIA9T4/mariadb_backup_galera_info)

            Now tried latest 10.11 to latest 11.2 (the other 11.x not having reached EOL yet)

            -> pretty much the same behavior there, first failed restarts due to

            "Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script"

            then eventually crashing on

            "munmap_chunk(): invalid pointer"

            in

            src/GCache_seqno.cpp:32(gcache::GCache::seqno_reset(gu::GTID const&))

            during later restart attempts initiated by systemd service

            hholzgra Hartmut Holzgraefe added a comment - Now tried latest 10.11 to latest 11.2 (the other 11.x not having reached EOL yet) -> pretty much the same behavior there, first failed restarts due to "Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script" then eventually crashing on "munmap_chunk(): invalid pointer" in src/GCache_seqno.cpp:32(gcache::GCache::seqno_reset(gu::GTID const&)) during later restart attempts initiated by systemd service

            When starting from 10.11.7 (the last prior to upgrade to Galera 26.4.18 and the "no downgrades from now on" problem) I can do a rolling upgrade to latest 11.0, but not to 11.1 or later.

            Also trying an upgrade from 11.0 to 11.4 fails in the same "Bypassing the SST for IST" way, too.

            hholzgra Hartmut Holzgraefe added a comment - When starting from 10.11.7 (the last prior to upgrade to Galera 26.4.18 and the "no downgrades from now on" problem) I can do a rolling upgrade to latest 11.0, but not to 11.1 or later. Also trying an upgrade from 11.0 to 11.4 fails in the same "Bypassing the SST for IST" way, too.

            The actual problem lies between 11.0 and 11.1, that is the smallest step I can reproduce that first gets

            "Bypassing the SST for IST"

            on the donor and

            WSREP_SST: [ERROR] receiving process ended without creating magic file (/tmp/tmp.AfuxV3LOSo/mariadb_backup_galera_info)

            on the joiner side.

            hholzgra Hartmut Holzgraefe added a comment - The actual problem lies between 11.0 and 11.1, that is the smallest step I can reproduce that first gets "Bypassing the SST for IST" on the donor and WSREP_SST: [ERROR] receiving process ended without creating magic file (/tmp/tmp.AfuxV3LOSo/mariadb_backup_galera_info) on the joiner side.
            janlindstrom Jan Lindström added a comment - - edited

            This is caused by MDEV-18931. Problem can be seen from logs with following message:

            WSREP_SST: [ERROR] receiving process ended without creating magic file (/tmp/tmp.FtMJSFbceP/mariadb_backup_galera_info) (20240903 15:08:04.026)
            WSREP_SST: [INFO] Contents of datadir: (20240903 15:08:04.028)
            WSREP_SST: [INFO] -rw-rw---- 1 jan jan 41 Sep 3 15:08 /tmp/tmp.FtMJSFbceP/donor_galera_info -rw-rw---- 1 jan jan 56 Sep 3 15:08 /tmp/tmp.FtMJSFbceP/xtrabackup_galera_info (20240903 15:08:04.031)
            WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240903 15:08:04.033)
            

            Name of this magic file has changed from xtrabackup_galera_info to mariabackup_galera_info

            janlindstrom Jan Lindström added a comment - - edited This is caused by MDEV-18931 . Problem can be seen from logs with following message: WSREP_SST: [ERROR] receiving process ended without creating magic file (/tmp/tmp.FtMJSFbceP/mariadb_backup_galera_info) (20240903 15:08:04.026) WSREP_SST: [INFO] Contents of datadir: (20240903 15:08:04.028) WSREP_SST: [INFO] -rw-rw---- 1 jan jan 41 Sep 3 15:08 /tmp/tmp.FtMJSFbceP/donor_galera_info -rw-rw---- 1 jan jan 56 Sep 3 15:08 /tmp/tmp.FtMJSFbceP/xtrabackup_galera_info (20240903 15:08:04.031) WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240903 15:08:04.033) Name of this magic file has changed from xtrabackup_galera_info to mariabackup_galera_info

            This is fixed by https://jira.mariadb.org/browse/MDEV-30822 (+ latest SST scripts fixes)

            sysprg Julius Goryavsky added a comment - This is fixed by https://jira.mariadb.org/browse/MDEV-30822 (+ latest SST scripts fixes)

            People

              janlindstrom Jan Lindström
              hholzgra Hartmut Holzgraefe
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.