MariaDB Server / MDEV-15770

We have a three-node Galera cluster with MariaDB. The bootstrapped primary node runs, but the other two nodes cannot complete state transfer from the first node and crash partway through recovery (after roughly 15 GB of data).

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.1.31
    • Fix Version/s: 10.1 (EOL)
    • Component/s: None
    • Labels: None
    • Environment: CentOS 7.4

    Description

      We have a three-node Galera cluster with MariaDB. The bootstrapped primary node runs, but the other two nodes cannot complete state transfer from the first node and crash partway through recovery (after roughly 15 GB of data).
      ------------------------------------------

      Service Status:
      -------------------
      [root@AZABIR-ID01 data]# systemctl status mariadb.service
      ● mariadb.service - MariaDB 10.1.31 database server
      Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
      Drop-In: /etc/systemd/system/mariadb.service.d
      └─migrated-from-my.cnf-settings.conf
      Active: failed (Result: exit-code) since Wed 2018-04-04 06:09:06 CEST; 56s ago
      Docs: man:mysqld(8)
      https://mariadb.com/kb/en/library/systemd/
      Process: 25879 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
      Process: 25875 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] Loaded 'file_key_management.so' with offset 0x7f28081fb000
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Started in read only mode
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Using mutexes to ref count buffer pool pages
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: The InnoDB memory heap is disabled
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for m...y barrier
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: mariadb.service: control process exited, code=exited status=1
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: Failed to start MariaDB 10.1.31 database server.
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: Unit mariadb.service entered failed state.
      Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: mariadb.service failed.
      Hint: Some lines were ellipsized, use -l to show in full.
      [root@AZABIR-ID01 data]# /etc/init.d/mysql start
      Starting mysql (via systemctl): Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
      [FAILED]
      =======================================
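The ExecStartPre line in the status output above is what actually fails (status=1/FAILURE), before mysqld is ever launched. A sketch of its control flow, with a hypothetical `recover_position` stub standing in for `/usr/bin/galera_recovery`, shows why a failed recovery scan prevents the unit from starting at all:

```shell
# Unrolled sketch of the ExecStartPre one-liner. recover_position is a stub
# for /usr/bin/galera_recovery, which scans the datadir and prints the last
# committed wsrep position; the UUID below is an illustrative placeholder.
recover_position() {
    echo "--wsrep_start_position=00000000-0000-0000-0000-000000000000:-1"
    return 0   # flip to 'return 1' to mimic the failure seen in the status output
}

if VAR=$(recover_position); then
    # On success, systemd exports this for the main mysqld process:
    echo "set-environment _WSREP_START_POSITION=$VAR"
else
    # A non-zero exit here is exactly the status=1/FAILURE shown above;
    # systemd then refuses to start mysqld.
    exit 1
fi
```

So whenever InnoDB crash recovery inside galera_recovery aborts (as in the log below), the unit never reaches the real server start.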

      Log errors:

      ------------------------------------------------------------------------

      2018-04-04 6:01:38 140400691869952 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
      2018-04-04 6:01:38 140400691869952 [Note] InnoDB: Starting final batch to recover 272 pages from redo log
      2018-04-04 6:01:38 140400691869952 [ERROR] InnoDB: Trying to access page number 254 in space 30854 space name DB_GRPUK_P/cache_form, which is outside the tablespace bounds. Byte offset 0, len 16384 i/o type 10.
      2018-04-04 06:01:38 7fb1955d7900 InnoDB: Assertion failure in thread 140400691869952 in file ha_innodb.cc line 22015
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
      InnoDB: about forcing recovery.
      180404 6:01:38 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.1.31-MariaDB
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=0
      max_threads=2002
      thread_count=0
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4529069 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.

      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...

      stack_bottom = 0x0 thread_stack 0x48400
      mysys/stacktrace.c:268(my_print_stacktrace)[0x55df5319a1ce]
      sql/signal_handler.cc:168(handle_fatal_signal)[0x55df52cbdfb5]
      sigaction.c:0(__restore_rt)[0x7fb1951ed5e0]
      :0(__GI_raise)[0x7fb1934c61f7]
      :0(__GI_abort)[0x7fb1934c78e8]
      handler/ha_innodb.cc:22015(ib_logf(ib_log_level_t, char const*, ...))[0x55df52f0b13f]
      fil/fil0fil.cc:5982(fil_io(unsigned long, bool, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, void*, void*, unsigned long*, trx_t*, bool))[0x55df530bb323]
      buf/buf0rea.cc:263(buf_read_page_low(dberr_t*, bool, unsigned long, unsigned long, unsigned long, unsigned long, long, unsigned long, trx_t*, bool))[0x55df5307b0a0]
      buf/buf0rea.cc:1119(buf_read_recv_pages(unsigned long, unsigned long, unsigned long, unsigned long const*, unsigned long))[0x55df5307eb66]
      log/log0recv.cc:1857(recv_apply_hashed_log_recs(bool))[0x55df52f5d44f]
      srv/srv0start.cc:2663(innobase_start_or_create_for_mysql())[0x55df52ff076e]
      handler/ha_innodb.cc:4479(innobase_init(void*))[0x55df52f0f67d]
      sql/handler.cc:521(ha_initialize_handlerton(st_plugin_int*))[0x55df52cc0264]
      sql/sql_plugin.cc:1409(plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool))[0x55df52b47e70]
      sql/sql_plugin.cc:1686(plugin_init(int*, char**, int))[0x55df52b4975a]
      sql/mysqld.cc:5133(init_server_components())[0x55df52a9eb88]
      sql/mysqld.cc:5722(mysqld_main(int, char**))[0x55df52aa2630]
      /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb1934b2c05]
      //sbin/mysqld(+0x39910d)[0x55df52a9610d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
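The assertion message above points at InnoDB forced recovery. If the goal is only to salvage data from the damaged datadir, a temporary config fragment along these lines can let the node start outside the cluster; the values are illustrative (not taken from this report), and forced-recovery levels should be raised one step at a time and reverted to 0 afterwards:

```ini
# Temporary salvage settings; illustrative values, not from this report.
[mysqld]
innodb_force_recovery = 1   # raise gradually (max 6) only if 1 is not enough
wsrep_on = OFF              # keep the damaged node out of the cluster meanwhile
```

With the node up in this degraded mode, the data can be dumped and the datadir rebuilt before rejoining the cluster.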

      Attachments

        1. azabnl-id05_errors.txt
          98 kB
          Shrikant Anjankar
        2. errors_azabnl-id01_chk1.txt
          76 kB
          Shrikant Anjankar
        3. mysql.error.log
          16 kB
          Shrikant Anjankar
        4. server.cnf
          2 kB
          Shrikant Anjankar


          Activity

            shree0linux Shrikant Anjankar added a comment -

            We tested MariaDB 10.2.14 on another Galera cluster, with Galera 25.3.22, but rsync still gets stuck after 9 GB. Please give us a solution.
            seppo Seppo Jaakola added a comment -

            The error log shows that rsync failed earlier, so it is easy to see why InnoDB startup also fails. I therefore doubt this is related to MDEV-11035.

            To troubleshoot the rsync transfer issue, please attach the error log from the donor node as well.


            shree0linux Shrikant Anjankar added a comment -

            azabnl-id05 is the joiner/receiver node and azabnl-id01 is the donor node.

            Please check the logs and let us know if anything needs to be changed or a new patch needs to be applied.

            One success story to share: when we got stuck with the rsync method, we used the custom command below to sync data between two nodes and it succeeded, but it did not work in the other case.

            #mysqld --basedir=/usr --datadir=/mnt/data --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --log-error=/mnt/data/AZABNL-ID04.XXXXXX.err --pid-file=/mnt/data/AZABNL-ID04.XXXXX.pid --wsrep_start_position=53540047-107d-11e6-8b2a-9a31eea4d5df:0

            shree0linux Shrikant Anjankar added a comment -

            Does anyone have an update on the three-node Galera syncing issue?
            lax Jacques Amar added a comment - edited

            @shrikant, if you can afford the downtime, here is how I ended up fixing it:

            Stop MariaDB on a healthy, running server.
            Then do a manual rsync to the server that is out of sync and crashes because some files are too large. Make sure no older logs survive, as they will cause problems. Ensure proper permissions across the copy.
            Restart the already-functioning server (it should just start).
            Now restart the newly synced machine. It should simply start syncing up to the current state.
            If it works, do it again, this time using the newly synced machine as the source!

            I had to do this. On a 15 GB dataset the downtime was a matter of minutes. If you only have one server running, you might have to use 'galera_new_cluster' to start the first server.

            My case:

            rsync --progress --delete-after -Cave ssh <healthyserver>:/var/lib/mysql /var/lib

            Hope this helps.
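The steps above can be sketched as a script. The hostname `healthyserver`, the datadir paths, and the `DRY_RUN` guard are illustrative assumptions; with `DRY_RUN=1` (the default) it only prints the commands, since the rsync would overwrite the local datadir:

```shell
#!/bin/sh
# Sketch of the manual re-sync procedure described above (hypothetical
# hostname; review before running for real, as the rsync is destructive).
DRY_RUN=${DRY_RUN:-1}
DONOR=${DONOR:-healthyserver}

run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

run systemctl stop mariadb                    # stop the broken joiner
run ssh "$DONOR" systemctl stop mariadb       # stop the healthy donor too
run rsync --progress --delete-after -Cave ssh "$DONOR":/var/lib/mysql /var/lib
run chown -R mysql:mysql /var/lib/mysql       # fix ownership after the copy
run ssh "$DONOR" systemctl start mariadb      # donor comes back first
run systemctl start mariadb                   # joiner rejoins and catches up
```

If only one server is left in the cluster, the donor may need `galera_new_cluster` instead of a plain start, as noted above.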


            People

              Assignee: seppo Seppo Jaakola
              Reporter: shree0linux Shrikant Anjankar
              Votes: 1
              Watchers: 5

