[MDEV-4033] Unable to use slave's temporary directory /tmp - Can't create/write to file '/tmp/SQL_LOAD-' (Errcode: 17 "File exists") Created: 2013-01-14  Updated: 2013-04-14  Resolved: 2013-04-14

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.0
Fix Version/s: 10.0.2

Type: Bug Priority: Major
Reporter: Gordan Bobic Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None
Environment:

Linux (RHEL6)



 Description   

The original unmodified description of the problem can be found at the bottom of the field

On SQL thread start, a slave checks permissions for slave_load_tmpdir by creating (and immediately deleting) a dummy file there:

sql/slave.cc:

3387
3388  /*
3389    Check permissions to create a file.
3390   */
3391  if ((fd= mysql_file_create(key_file_misc,
3392                             tmp_file, CREATE_MODE,
3393                             O_WRONLY | O_BINARY | O_EXCL | O_NOFOLLOW,
3394                             MYF(MY_WME))) < 0)
3395  DBUG_RETURN(1);
3396
3397  /*
3398    Clean up.
3399   */
3400  mysql_file_close(fd, MYF(0));
3401  mysql_file_delete(key_file_misc, tmp_file, MYF(0));
3402
3403  DBUG_RETURN(0);
3404}

The resulting file is <slave_load_tmpdir>/SQL_LOAD-. If the file exists, slave fails to start. The draft test case below shows that, it fails for any version.
Usually it's not a problem, since the slave deletes the file immediately. However, with multiple slaves, if they on some reason restart simultaneously, I suppose it's possible to hit a race condition when one slave has created the file but not deleted it yet, and another slave attempts to create it too and fails.

Test case below shows how slave fails to start when the file exists
(it uses $MYSQLTEST_VARDIR/tmp because that's what MTR sets for slave_load_tmpdir):

--source include/master-slave.inc
 
--save_master_pos
--connection slave
stop slave;
 
--echo # Writing $MYSQLTEST_VARDIR/tmp/SQL_LOAD-
--write_file $MYSQLTEST_VARDIR/tmp/SQL_LOAD-
test
EOF
 
start slave;
--sync_with_master 0
show slave status;

=================================
Original description

Multi-source replication seems to randomly stop with
Last_SQL_Error: Unable to use slave's temporary directory /tmp - Can't create/write to file '/tmp/SQL_LOAD-' (Errcode: 17 "File exists")

Only see it occassionally on a huge replication volume (100s of GB/day).

Is the master name perhaps missing off the end of that file name?
Is there any data that would be useful to capture when this happens to help debugging?



 Comments   
Comment by Gordan Bobic [ 2013-01-14 ]

Looking at some of the other, older error reports on the internet, it seems the file name is missing a suffix of some sort, and it looks like the multiple replication slave threads are trying to write to the same suffixless file.

We are using row-based (rather than statement based or mixed) replication.

Comment by Elena Stepanova [ 2013-01-16 ]

Hi Gordan,

I am able to reproduce the failure, but the curious thing is that, as far as I can see, it should only be happening if your multiple slave threads stop and start during your workload (particularly, SQL slave threads). Can it be so? Is there any indication of SQL threads restart in the error log? I think if there is, and if it's not intentional, it also deserves some attention.

Comment by Gordan Bobic [ 2013-01-16 ]

Not 100% sure that something didn't happen here that explicitly stopped the threads. I will keep an eye on it if it happens again.

Comment by Elena Stepanova [ 2013-01-16 ]

Okay, then for now I assume that they indeed restarted, and update the description accordingly.
If you discover it to be not true, and hence we need to find another reason for the problem, please add a comment.

Comment by Kristian Nielsen [ 2013-04-14 ]

I was able to easily repeat this problem in eg. test multi_source.info_logs by
adding a sleep() in slave.cc:

=== modified file 'sql/slave.cc'
— sql/slave.cc 2013-04-04 15:38:10 +0000
+++ sql/slave.cc 2013-04-14 05:58:17 +0000
@@ -3723,6 +3723,7 @@ int check_temp_dir(char* tmp_file)
Clean up.
*/
mysql_file_close(fd, MYF(0));
+my_sleep(1000000);
mysql_file_delete(key_file_misc, tmp_file, MYF(0));

DBUG_RETURN(0);

Comment by Michael Widenius [ 2013-04-14 ]

I have fixed this by caching the result of check_temp_dir()
I also changed the test to use O_TRUNC instead of O_EXCL so that things will not fail if the file would
happen to be left around from and old run.

Comment by Michael Widenius [ 2013-04-14 ]

Pushed into 10.0-base

Generated at Thu Feb 08 06:53:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.