[MDEV-7631] Invalid WSREP_SST rows appear in mysqld-bin.index file Created: 2015-02-25  Updated: 2015-09-04  Resolved: 2015-06-25

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.0.16-galera
Fix Version/s: 10.0.20-galera

Type: Bug Priority: Critical
Reporter: Ragnar Rova Assignee: Nirbhay Choubey (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Attachments: File cloexec.patch     Text File db1-twice.log     Text File freopen-stdin-devnull.patch     File mysqld_fd2_usage.scap.gz     File mysqld_fd2_usage.sysdig.txt.gz     File repro-galera-invalid-mysqld-bin-index.tar.gz    
Issue Links:
Relates
relates to MDEV-8743 galera in SST keeps innodb files open Closed
Sprint: 10.1.6-1

 Description   

Invalid rows appear in mysqld-bin.index file, causing server to fail starting. This occurs when WSREP galera replication was enabled. The rows do not point to a valid file but are instead output of wsrep sst.

Example of contents of mysqld-bin.index:

mysqld-bin.000001
./mysqld-bin.000002
./mysqld-bin.000003
WSREP_SST: [INFO] Preparing binlog files for transfer: (20150224 21:17:07.205)
mysqld-bin.000004
./mysqld-bin.000005
./mysqld-bin.000006
./mysqld-bin.000007
WSREP_SST: [INFO] Bypassing state dump. (20150224 21:20:00.648)

on #maria irc channel was mentioned that:

"looks like when mariadb forks for a sst it should close all fds for the sst"



 Comments   
Comment by Ragnar Rova [ 2015-02-25 ]

Sample logfile, contains three consecutive runs of 'service mysql bootstrap'. Server fails to start on third attempt.

Comment by Daniel Black [ 2015-02-26 ]

I add a look and it looks like the process class used to run the sst script - class in sql/wsrep_utils.cc, could under some circumstances assume that STDIN_FD is still open.

Comment by Nirbhay Choubey (Inactive) [ 2015-02-26 ]

That's interesting! How would 'bootstrap' trigger SST? Did you happen to edit the wsrep_sst_rsync script? never mind.

Comment by Ragnar Rova [ 2015-02-26 ]

I run bootstrap like this:

service mysql bootstrap --init-file=/var/lib/mysql/reset_password.sql

reset_password.sql:

update mysql.user set password=PASSWORD('%(password)s') where User='root';
delete from mysql.user where User='';
flush privileges;

Comment by Nirbhay Choubey (Inactive) [ 2015-02-27 ]

rrva We tried reproducing this issue, but haven't been able to repeat it so far. Can you share config files of nodes in the cluster? and other details like OS, etc?

Comment by Ragnar Rova [ 2015-03-03 ]

See README inside attached file repro-galera-invalid-mysqld-bin-index.tar.gz. Docker environment and fabric scripts which reproduce the bug. Docker probably not needed but included for completeness.

Comment by Ragnar Rova [ 2015-03-04 ]

For now, I'm using the following workaround in my start scripts

sed -i '/^\./mysqld-bin./!d' /var/lib/mysql/mysqld-bin.index

Comment by Ragnar Rova [ 2015-03-05 ]

Were you able to reproduce using the scripts/Dockerfiles I provided?

We cannot use an ugly workaround like the above on a production system. We might even remove rows which partially contain a valid mysqld-bin.index filename (when the extra output is interleaved with the correct one).

This is blocking us from using MariaDB + wsrep for now, and we'd really like to. Preferably a version with the wsrep_dirty_reads fix.

Comment by Ragnar Rova [ 2015-04-09 ]

Any updates on this issue?

Comment by Sergei Golubchik [ 2015-04-09 ]

Yes, sorry. We've repeated this issue using your files. Thanks! There is no fix yet, but there is still time till the next 10.0 release.

Comment by Ragnar Rova [ 2015-05-19 ]

Anything I can do to help with this one? Unfortunately it was moved to the next release...

Comment by Ragnar Rova [ 2015-06-16 ]

I've done some debugging using sysdig, and it seems that when the script wsrep_rsync_sst runs, file descriptor 2 points to /var/lib/mysql/mysqld-bin.index.
Attaching a sysdig trace from all processes of the first container during reproduction of the bug. The very last event in the file is the write which corrupts mysqld-bin.index.

Excerpt of sysdig capture (can provide whole file with events from all processes):

4150411 23:56:55.247850446 2 wsrep_sst_rsync (21871) > dup fd=2(<f>/var/lib/mysql/mysqld-bin.index)
4150412 23:56:55.247851085 2 wsrep_sst_rsync (21871) < dup res=1(<f>/var/lib/mysql/mysqld-bin.index)
4150413 23:56:55.247851321 2 wsrep_sst_rsync (21871) > fcntl fd=2(<f>/var/lib/mysql/mysqld-bin.index) cmd=2(F_GETFD)
4150414 23:56:55.247851565 2 wsrep_sst_rsync (21871) < fcntl res=0(<f>/var/lib/mysql/mysqld-bin.index)
4150415 23:56:55.247855405 2 wsrep_sst_rsync (21871) > write fd=1(<f>/var/lib/mysql/mysqld-bin.index) size=79
4150416 23:56:55.247865714 2 wsrep_sst_rsync (21871) < write res=79 data=WSREP_SST: [INFO] Preparing binlog files for transfer: (20150615 21:56:55.246).

Comment by Ragnar Rova [ 2015-06-18 ]

Could it be the following

mysqld closes stderr (fd=2) at some point, so this fd becomes free for new open() calls
/var/lib/mysql/mysqld-bin.index is opened at some point, and gets assigned fd=2
fd=2 is not closed by mysqld in some scenario before spawning wsrep_sst_rsync
wsrep_sst_rsync uses fd=2 explicitly, thinking it's stderr

? (just speculation)

Comment by Sergei Golubchik [ 2015-06-18 ]

I don't think mysqld closes stderr. But it recently started to close stdin. If it — at some point — freopen's stderr, then stderr can get fd=0, thus making fd=2 free.

Wait… You have log-error in your my.cnf. This means, indeed, freopen for stderr! Try to remove it and see if that helps.

Comment by Daniel Black [ 2015-06-18 ]

looks like it.

./scripts/wsrep_sst_common.sh includes writing to FD 2 for for stderr so this implicitly forms part of the calling convention for sst scripts (should be documented http://galeracluster.com/documentation-webpages/scriptablesst.html)

doesn't look like sql/wsrep_utils.cc wsp::process or wsrep_sst.c::sst_joiner_thread do anything special like making the FD 2 mean stderr

Comment by Ragnar Rova [ 2015-06-18 ]

Removing log-error in my.cnf does not help.

If I patch wsrep_sst_rsync with the following, I get rid of the corruption of mysqld-bin.index at least

-- wsrep_sst_rsync.orig        2015-06-18 16:01:41.793419088 +0000
+++ wsrep_sst_rsync     2015-06-18 15:57:22.983406211 +0000
@@ -1,4 +1,5 @@
 #!/bin/bash -ue
+exec 2>/var/lib/mysql/wsrep_sst.err
 
 
 # Copyright (C) 2010-2014 Codership Oy
 #

What do you think of the above as a workaround? Is output to stderr from the sst script for informative reasons or is it processed somehow so its expected to end up elsewhere?

Using the patch above, it looks like this now.

[root@db1 mysql]# cat mysqld-bin.index
./mysqld-bin.000001
./mysqld-bin.000002
./mysqld-bin.000003
./mysqld-bin.000004
./mysqld-bin.000005
[root@db1 mysql]# cat wsrep_sst.err
WSREP_SST: [INFO] Preparing binlog files for transfer: (20150618 15:57:51.450)
mysqld-bin.000005

I think a fix which guarantees that fd=0, 1, 2 never point to anything else than desired before invoking external commands is needed.

Comment by Sergei Golubchik [ 2015-06-18 ]

I think a fix which guarantees that fd=0, 1, 2 never point to anything else than desired before invoking external commands is needed.

Yes, we cannot expect scripts not to use these descriptors. I'll do something along these lines.

Comment by Ragnar Rova [ 2015-06-18 ]

Attaching sysdig capture file mysqld_fd2_usage.scap.gz and mysqld_fd2_usage.sysdig.txt.gz from the run of the already attached Dockerfile testcase.

read it with sysdig -r or use the text output.

evt.num=91048 and evt.num=91049 is the corrupting write call. there is one later at evt.num=333506 as well.

backtracking from those events, maybe the management of fd=2 and/or mysqld-bin.index gives some clues.

mysqld-bin.index after test run:

./mysqld-bin.000001
./mysqld-bin.000002
./mysqld-bin.000003
WSREP_SST: [INFO] PrWSREP_SST: [INFO] Preparing binlog files for transfer: (20150618 17:35:38.287)
mysqld-bin.000005

Comment by Ragnar Rova [ 2015-06-18 ]

Oh, sorry maybe you don't need those sysdig files, your fix sounds good. Thanks!

Comment by Daniel Black [ 2015-06-19 ]

> I think a fix which guarantees that fd=0, 1, 2 never point to anything else than desired before invoking external commands is needed.

more so closing every mysqld FD excluding stdout/stderr that is inherited in the subprocess (FD_CLOEXEC; see fcntl(2)) or something explicit.

I believe the dup2 syscall is what you are after to ensure the log-error fd is assigned the correct number.

Comment by Ragnar Rova [ 2015-06-19 ]

Ok. O_CLOEXEC when using open() on mysqld-bin.index is even better?

man open

 O_CLOEXEC (since Linux 2.6.23)
              Enable the close-on-exec flag for the new file descriptor.
              Specifying this flag permits a program to avoid additional
              fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
 
              Note that the use of this flag is essential in some
              multithreaded programs, because using a separate fcntl(2)
              F_SETFD operation to set the FD_CLOEXEC flag does not suffice
              to avoid race conditions where one thread opens a file
              descriptor and attempts to set its close-on-exec flag using
              fcntl(2) at the same time as another thread does a fork(2)
              plus execve(2).  Depending on the order of execution, the race
              may lead to the file descriptor returned by open() being
              unintentionally leaked to the program executed by the child
              process created by fork(2).  (This kind of race is in
              principle possible for any system call that creates a file
              descriptor whose close-on-exec flag should be set, and various
              other Linux system calls provide an equivalent of the
              O_CLOEXEC flag to deal with this problem.)

Comment by Ragnar Rova [ 2015-06-19 ]

I also found this http://danwalsh.livejournal.com/53603.html

Comment by Sergei Golubchik [ 2015-06-19 ]

Okay, it turned out to be a CentOS 6 issue.
This test program:

#include <stdio.h>
int main() {
  fclose(stdin);
  freopen("err.txt", "w", stderr);
  printf("err=%d\n", fileno(stderr));
  return 0;
}

prints 2 for me on centos 7, gentoo, freebsd 9.2, ubuntu trusty and vivid. But on centos 6 it prints 0.
This seems to be a bug, because glibc defines STRERR_ERRNO as 2: http://www.gnu.org/software/libc/manual/html_node/Descriptors-and-Streams.html

There are also symbolic constants defined in unistd.h for the file descriptors belonging to the standard streams stdin, stdout, and stderr; see Standard Streams.

STDIN_FILENO

This macro has value 0, which is the file descriptor for standard input.
STDOUT_FILENO

This macro has value 1, which is the file descriptor for standard output.
STDERR_FILENO

This macro has value 2, which is the file descriptor for standard error output.

Comment by Sergei Golubchik [ 2015-06-19 ]

By the way, normally MariaDB reopens stdout/stderr first and then closes stdin. But you're sending it SIGHUP, and on SIGHUP it closes and reopens log files. So, as a workaround you can stop doing kill -HUP.

Comment by Ragnar Rova [ 2015-06-19 ]

Also a RHEL 6.5 issue. Test program prints err=0 there too.

Comment by Ragnar Rova [ 2015-06-19 ]

Where am I doing kill -HUP?

Comment by Ragnar Rova [ 2015-06-19 ]

The following fix works. Patch needs to be conditional on if O_CLOEXEC is available but otherwise reading the Dan Walsh blog I think it is correct behavior, possibly even the default for mysql_file_open?

 
diff --git a/sql/log.cc b/sql/log.cc
index 12b6cf0..81872da 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -3313,7 +3313,7 @@ bool MYSQL_BIN_LOG::open_index_file(const char *index_file_name_arg,
             ".index", opt);
   if ((index_file_nr= mysql_file_open(m_key_file_log_index,
                                       index_file_name,
-                                      O_RDWR | O_CREAT | O_BINARY,
+                                      O_RDWR | O_CREAT | O_BINARY | O_CLOEXEC,
                                       MYF(MY_WME))) < 0 ||
        mysql_file_sync(index_file_nr, MYF(MY_WME)) ||
        init_io_cache(&index_file, index_file_nr,

Comment by Sergei Golubchik [ 2015-06-19 ]

I don't know. But I see in your sysdig trace that mysqld is doing what it only does for SIGHUP (prints lots of information to stdout and then reloads privilege tables, then reopens stdout/stderr).

I think I'll do O_CLOSEXEC — this seems like a good thing to have. But in that case wsrep_sst logs will be lost, because fd=2 will be closed. So, while O_CLOSEXEC is good in general, the correct fix in your case is to preserve stderr file descriptor. Alas, CentOS 6 (and RHEL 6, not surprisingly) doesn't do it.

Comment by Sergei Golubchik [ 2015-06-19 ]

nirbhay_c, I've attached cloexec.patch, you can apply it to 10.0-galera tree.

Comment by Ragnar Rova [ 2015-06-19 ]

Found some info regarding freopen()

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295859
https://sourceware.org/bugzilla/show_bug.cgi?id=682

man stderr states:

On program startup, the integer file descriptors  associated  with  the
streams  stdin,  stdout, and stderr are 0, 1, and 2, respectively.  The
preprocessor symbols STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO are
defined  with  these values in <unistd.h>.  (Applying freopen(3) to one
of these streams can change the file descriptor number associated  with
the stream.)

Comment by Ragnar Rova [ 2015-06-19 ]

Thanks a lot for the cloexec patch. Any side-effects?

In addition to the cloexec.patch, which resolved my issue, I am thinking that a change in

sql/log.cc reopen_fstreams

Is needed, where dup2 is called to force fd=2 for stderr and fd=1 for stdout after freopen. This way subprocesses could rely on the fd numbers being the expected ones. (Or maybe fd=2 is already occupied by something else at this point in time?)

For now, I am adding

exec 2>/var/lib/mysql/wsrep_sst.err

At the top of /usr/bin/wsrep_sst_rsync.

Comment by Sergei Golubchik [ 2015-06-19 ]

About freopen — this bugreport is from 2005. It was originally rejected not because it was wrong to preserve file descriptors, but because it was difficult to do, see this comment.

And while man stdin was changed to say that freopen may change the descriptor, all recent distributions have fixed freopen too. Because otherwise freopen becomes almost useless, note that man freopen says

The primary use of the freopen() function is to change the file associated with a standard text stream (stderr, stdin, or stdout).

And freopen that changes file descriptors cannot be used for its primary use if the process forks children.

Comment by Daniel Black [ 2015-06-20 ]

also for the socket and accept syscalls there is a SOCK_CLOEXEC option with the same meaning.

mysql_socket_socket/mysql_socket_accept sql/mysqld.cc or include/mysql/psi/mysql_socket.h ( inline_mysql_socket_socket/inline_mysql_socket_accept)

Comment by Ragnar Rova [ 2015-06-22 ]

Like @serg mentioned:

I think I'll do O_CLOSEXEC — this seems like a good thing to have. But in that case wsrep_sst logs will be lost, because fd=2 will be closed. So, while O_CLOSEXEC is good in general, the correct fix in your case is to preserve stderr file descriptor. Alas, CentOS 6 (and RHEL 6, not surprisingly) doesn't do it.

I agree. Since RHEL6 is such a common platform, I think that we must have a fix for preserving fd=0,1,2 of freopen() in MariaDB, since they will not patch glibc (they are on glibc-2.12 and that will stay for quite some time). Otherwise log output of wsrep_sst_rsync is lost. The cloexec patch is a good one nevertheless.

Comment by Ragnar Rova [ 2015-06-22 ]

Attaching freopen-stdin-devnull.patch made by @serg which made logging of wsrep_sst_rsync output work again. Tested with the attached docker based testcase (which is CentOS 6.6). Also tested in combination with the cloexec.patch. Both work and and are good. With these two patches, my issue is resolved.

Comment by Ragnar Rova [ 2015-06-22 ]

Needs some ifdefs for portability, fallback to fclose(stdin) when no dev/null or freopen fails

Comment by Daniel Black [ 2015-06-23 ]

I still think you should mark the socket/accept calls with the flag SOCK_CLOEXEC to avoid writing to open connections. https://mariadb.atlassian.net/browse/MDEV-7631?focusedCommentId=72512&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-72512

Comment by Nirbhay Choubey (Inactive) [ 2015-06-25 ]

https://github.com/MariaDB/server/commit/0f4478105f5027ce589a0f08c6019d906d632029
https://github.com/MariaDB/server/commit/9f00950d2ad9be5533e66a5fa87a795a9574585f

Comment by Daniel Black [ 2015-09-03 ]

On a SST (rsync method) donor on a 10.0.21 instances I still see a lot of files open for tar generating binlogs.

COMMAND  PID  USER   FD   TYPE  DEVICE    SIZE/OFF    NODE NAME
tar     3825 mysql  cwd    DIR   253,0       40960 2997037 /var/lib/mysql
tar     3825 mysql  rtd    DIR   253,0        4096       2 /
tar     3825 mysql  txt    REG   253,0      390616  261280 /bin/tar
tar     3825 mysql  mem    REG   253,0       65928  651690 /lib64/libnss_files-2.12.so
tar     3825 mysql  mem    REG   253,0      142640  651563 /lib64/libpthread-2.12.so
tar     3825 mysql  mem    REG   253,0       18712  651585 /lib64/libattr.so.1.1.0
tar     3825 mysql  mem    REG   253,0       19536  651672 /lib64/libdl-2.12.so
tar     3825 mysql  mem    REG   253,0     1921216  651539 /lib64/libc-2.12.so
tar     3825 mysql  mem    REG   253,0       43880  651724 /lib64/librt-2.12.so
tar     3825 mysql  mem    REG   253,0       31280  651601 /lib64/libacl.so.1.1.0
tar     3825 mysql  mem    REG   253,0      122040  651577 /lib64/libselinux.so.1
tar     3825 mysql  mem    REG   253,0      154664  651609 /lib64/ld-2.12.so
tar     3825 mysql    0r   CHR     1,3         0t0    3871 /dev/null
tar     3825 mysql    1w   REG   253,0    70925586 1045768 /var/log/mysql/error.log
tar     3825 mysql    2w   REG   253,0    70925586 1045768 /var/log/mysql/error.log
tar     3825 mysql    3u   CHR     5,0         0t0    5338 /dev/tty
tar     3825 mysql    4u   REG   253,0         110 2998607 /var/lib/mysql/grastate.dat
tar     3825 mysql    5w   REG   253,0 21542461440 2998605 /var/lib/mysql/wsrep_sst_binlog.tar
tar     3825 mysql    6u   REG     0,9           0    3869 [eventpoll]
tar     3825 mysql    7u   REG     0,9           0    3869 [timerfd]
tar     3825 mysql    8u   REG     0,9           0    3869 [eventfd]
tar     3825 mysql    9u   REG     0,9           0    3869 [eventpoll]
tar     3825 mysql   10u   REG     0,9           0    3869 [timerfd]
tar     3825 mysql   11u   REG     0,9           0    3869 [eventfd]
tar     3825 mysql   12r   REG   253,0 35721262759 2998690 /var/lib/mysql/mysql-bin.000061
tar     3825 mysql   14u   REG   253,0   297795584 2997039 /var/lib/mysql/ibdata1
tar     3825 mysql   15u   REG   253,0        5459  651645 /tmp/iblvW3eD (deleted)
tar     3825 mysql   16u   REG   253,0           0  651667 /tmp/ibIUPTrD (deleted)
tar     3825 mysql   17u   REG   253,0           0  651669 /tmp/ib55RJED (deleted)
tar     3825 mysql   18u   REG   253,0           0  651671 /tmp/ibWRMt4E (deleted)
tar     3825 mysql   19u   REG   253,0    67108864 2998611 /var/lib/mysql/ib_logfile0
tar     3825 mysql   20u   REG   253,0    67108864 2998612 /var/lib/mysql/ib_logfile1
tar     3825 mysql   21u   REG   253,0    67108864 2998613 /var/lib/mysql/ib_logfile2
tar     3825 mysql   22u   REG   253,0    67108864 2998614 /var/lib/mysql/ib_logfile3
tar     3825 mysql   23u   REG   253,0    67108864 2998615 /var/lib/mysql/ib_logfile4
tar     3825 mysql   24u   REG   253,0    67108864 2998616 /var/lib/mysql/ib_logfile5
tar     3825 mysql   25u   REG   253,0    67108864 2998617 /var/lib/mysql/ib_logfile6
tar     3825 mysql   26u   REG   253,0    67108864 2998618 /var/lib/mysql/ib_logfile7
tar     3825 mysql   27u   REG   253,0      589824 3258475 /var/lib/mysql/drupal_radio_prod/field_data_field_date_expires.ibd
tar     3825 mysql   28u   REG   253,0           0  651686 /tmp/ibbShLHG (deleted)
tar     3825 mysql   29u   REG   253,0      212992 3258333 /var/lib/mysql/drupal_radio_prod/field_data_field_file_image_title_text.ibd

Comment by Ragnar Rova [ 2015-09-03 ]

Hmm, these files must be opened through different functions than the patched ones in mysys/

{my_open, my_fopen}

?

Comment by Daniel Black [ 2015-09-04 ]

seems so. I'll do a new bug report.

Comment by Ragnar Rova [ 2015-09-04 ]

I did not actually test that there were no open files, just that the corruption of mysqld-bin.index did not occur, which the fix in this ticket did address. /var/lib/mysql/grastate.dat is opened by galera plugin, so they should get a report to add O_CLOEXEC as well I guess. If there are places were O_CLOEXEC cannot be added, one could consider also adding the closing of all currently open fds as well, but O_CLOEXEC was still good since it is free from race conditions. I found some methods here which are OS-specific: http://stackoverflow.com/questions/899038/getting-the-highest-allocated-file-descriptor/918469#918469

Comment by Daniel Black [ 2015-09-04 ]

rrva totallly agree. MDEV-8743 is the new issue for the other files being open. As you notice above the bin.index file isn't in the list of opened ones so the essence of this bug is complete even if other fds are left open and are relying on the good behaviour of sst scripts and their spawn not to corrupt database files.

Generated at Thu Feb 08 07:21:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.