Red Hat Enterprise Linux Server release 6.8 (Santiago)
Linux hostname 2.6.32-573.12.1.el6.x86_64 #1 SMP Mon Nov 23 12:55:32 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
RPMs all installed via yum from http://yum.mariadb.org/10.1/rhel6-amd64:
MariaDB-client-10.1.14-1.el6.x86_64
MariaDB-common-10.1.14-1.el6.x86_64
MariaDB-compat-10.1.14-1.el6.x86_64
MariaDB-server-10.1.14-1.el6.x86_64
galera-25.3.15-1.rhel6.el6.x86_64
Running on VMware ESX
3 dedicated DB hosts per cluster each with 4GB RAM.
10.2.4-1, 10.2.12, 10.1.31, 10.2.13, 10.1.32
Description
We are running a three-node Galera cluster in both production and preproduction. We recently added data-at-rest encryption to our preproduction cluster. We then found that our standard task to mysqldump production data and load it into preproduction was causing mysqld to crash on the node that was performing the import. The error log showed:
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
160620 15:57:33 mysqld_safe Number of processes running now: 0
160620 15:57:33 mysqld_safe WSREP: not restarting wsrep node automatically
160620 15:57:33 mysqld_safe mysqld from pid file /apps/data/mysqld/hostname.pid ended
I have honed the import file down to a reproducible minimum of creating and populating two tables, attached as crash1b4.sql. With a fresh database, I can cause the same crash as simply as:
ERROR 2013 (HY000) at line 23: Lost connection to MySQL server during query
My observations so far:
1. The statements in the file all succeed in isolation
2. If you reverse the order of the two tables in the file, the import succeeds. This is attached as crash1b4-reordered.sql
3. To force the crash we seem to need the long insert line with many rows; shortening this will allow the import to succeed
4. Changing encrypt-tmp-files=1 to encrypt-tmp-files=0 in my.cnf and restarting mysqld will allow the import to succeed
5. Removing the node from the cluster (by removing all wsrep_* settings from my.cnf and restarting mysqld) will allow the import to succeed, even with encrypt-tmp-files=1
My my.cnf is attached, with host and domain names replaced for privacy.
Sachin Setiya (Inactive)
added a comment - Hi, I have narrowed down the cnf file.
We can now reproduce the bug with this cnf file (start the servers with --start-and-exit, then run crash1b4.sql):
!include ../galera_2nodes.cnf
[mysqld]
encrypt-tmp-files = 1
plugin-load-add= @ENV.FILE_KEY_MANAGEMENT_SO
file-key-management
loose-file-key-management-filename= @ENV.MYSQL_TEST_DIR/std_data/keys.txt
log-bin
Sachin Setiya (Inactive)
added a comment - One more interesting finding: for this particular table `testtable`, the INSERT must contain at least 907 rows to trigger the crash. With fewer rows it does not crash; with 907 or more it crashes every time. So the failure is deterministic, and there is no race condition in this case.
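To bisect a row-count threshold like this one, a small generator for the long INSERT can help. The following is a minimal sketch and is not taken from crash1b4.sql: the column names `id` and `payload` and the helper `build_insert` are hypothetical. The threshold presumably depends on the byte size of the statement rather than on the row count as such, so the count that triggers the crash with this made-up schema may differ from 907.

```python
def build_insert(table: str, n_rows: int) -> str:
    """Return one multi-row INSERT statement with n_rows value tuples.

    The columns (id, payload) are hypothetical; adapt them to the real
    schema of the table under test.
    """
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    # One "(id,'rowN')" tuple per row, joined into a single long statement.
    values = ",".join(f"({i},'row{i}')" for i in range(1, n_rows + 1))
    return f"INSERT INTO `{table}` (id, payload) VALUES {values};"


if __name__ == "__main__":
    # Print the statement for a chosen row count; redirect to a file and
    # feed it to the mysql client to test one candidate threshold.
    print(build_insert("testtable", 907))
```

Generating the statement for 906 and 907 rows against the real schema would be one way to confirm the boundary reported above.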
Sachin Setiya (Inactive)
added a comment - I tried with serg's fix for MDEV-14868, but this test case still fails.
Structure of crash1b4.sql:
Two INSERT statements:
the first is long enough that IO_CACHE needs to use a tmp file
the second is just a small INSERT
Observations so far:
1. If we remove Galera, crash1b4.sql (let's call it crash.sql) passes.
2. The issue is not in Galera, although the crash happens inside Galera. The reason is that MYSQL_BIN_LOG::write_cache() fails, which signals to Galera that something went wrong, while on the other node the transaction has succeeded.
3. What I do not yet understand is that we read from the trans_cache twice:
1st, for wsrep_run_wsrep_commit, which copies the trans_cache to the Galera buffer; we get no error in this copy.
2nd, from MYSQL_BIN_LOG::write_cache(), which writes the trans cache to the binlog; in this read we get an error.
The difference between these two reads: in the first read we do not call my_read, which means the data is already in the buffer and we do not need to look into the tmp file.
In the second read we do call my_read. We are reading the same data, so why do we need to look into the tmp file this time?
Sachin Setiya (Inactive)
added a comment - Finally found the issue: the first read of the IO_CACHE (my_b_fill from wsrep_write_cache) sets info->read_end equal to info->buffer, which makes the next read of the IO_CACHE (my_b_fill from MYSQL_BIN_LOG::write_cache) fail.
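The general pattern described here, where two consumers read the same cache in sequence and the first leaves the read window empty, can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyCache`, `fill`, `reinit_for_read`, and `drain` are hypothetical names, this is not MariaDB's IO_CACHE code, and it does not model encryption. In this toy the second pass simply comes back empty, whereas the comment above reports an error on the real second read.

```python
import tempfile


class ToyCache:
    """Toy spill-to-file cache with a single shared read window (hypothetical)."""

    def __init__(self, chunk_size: int = 8) -> None:
        self._file = tempfile.TemporaryFile()
        self.chunk_size = chunk_size
        self.buffer = b""
        self.read_pos = 0   # start of the unread window inside buffer
        self.read_end = 0   # end of the unread window inside buffer
        self.file_pos = 0   # next file offset to load from

    def write(self, data: bytes) -> None:
        # Always append at the end of the temp file.
        self._file.seek(0, 2)
        self._file.write(data)

    def reinit_for_read(self) -> None:
        # Reset all reader state so the next fill() starts at offset 0.
        self.buffer = b""
        self.read_pos = self.read_end = 0
        self.file_pos = 0

    def fill(self) -> int:
        # Load the next chunk; at end of file the window becomes empty,
        # i.e. read_end ends up equal to the start of the buffer.
        self._file.seek(self.file_pos)
        self.buffer = self._file.read(self.chunk_size)
        self.file_pos += len(self.buffer)
        self.read_pos = 0
        self.read_end = len(self.buffer)
        return self.read_end


def drain(cache: ToyCache) -> bytes:
    """Consume everything fill() still yields from the current reader state."""
    out = bytearray()
    while cache.fill():
        out += cache.buffer[cache.read_pos:cache.read_end]
    return bytes(out)


if __name__ == "__main__":
    cache = ToyCache()
    cache.write(b"x" * 20)
    cache.reinit_for_read()
    first = drain(cache)        # first consumer sees all the data
    second = drain(cache)       # second consumer inherits an empty window
    cache.reinit_for_read()
    third = drain(cache)        # after a reset the data is readable again
    print(len(first), len(second), len(third))
```

The reset in the toy is only there to show that the data itself is intact and the reader state is what differs between the two passes; it is not a claim about how the server was actually fixed.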