[MDEV-8315] recurrently we got LOST_EVENTS on slave Created: 2015-06-13  Updated: 2015-08-24  Resolved: 2015-08-24

Status: Closed
Project: MariaDB Server
Component/s: Documentation, Replication
Affects Version/s: 10.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: bulepage Assignee: Ian Gilfillan
Resolution: Fixed Votes: 0
Labels: replication
Environment:

Ubuntu 12.04.5 LTS
Master-slave replication, without GTID; parallel replication and row-based replication


Attachments: Text File dump.txt     File my.cnf    

 Description   

After we upgraded to 10.0.19-MariaDB-1~precise-log our replication stopped weekly.

We got on slave

The incident LOST_EVENTS occured on the master. Message: error writing to the binary log

We use parallel replication without GTID.

Relay_Log_File: mysqld-relay-bin.002349
                Relay_Log_Pos: 588779

#150612 22:47:09 server id 1  end_log_pos 804903        Query   thread_id=292902        exec_time=0     error_code=0
SET TIMESTAMP=1434142029/*!*/;
COMMIT
/*!*/;
# at 805188
#150612 22:19:29 server id 1  end_log_pos 804956
# Incident: LOST_EVENTS
RELOAD DATABASE; # Shall generate syntax error
# at 805241
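For context, output like the excerpt above comes from decoding the binlog with mysqlbinlog (e.g. `mysqlbinlog bin-log.008287 > decoded.txt`); the incident can then be located by scanning the decoded text. A minimal sketch, using the excerpt from this ticket as the sample:

```python
# Scan decoded binlog text for Incident events.
# The sample below is the excerpt quoted in this ticket's description.
sample = """\
# at 805188
#150612 22:19:29 server id 1  end_log_pos 804956
# Incident: LOST_EVENTS
RELOAD DATABASE; # Shall generate syntax error
# at 805241
"""

incidents = [line for line in sample.splitlines() if line.startswith("# Incident:")]
print(incidents)  # ['# Incident: LOST_EVENTS']
```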



 Comments   
Comment by bulepage [ 2015-06-13 ]

Attached our mysql-relay log.

Comment by Elena Stepanova [ 2015-06-13 ]

Hi,

From which version did you upgrade to 10.0.19?
Did you upgrade both master and slave?
Did you make any configuration changes on master along with or after the upgrade?
Are you getting any (new) errors while executing your workflow on the master?

Please attach your cnf file(s) from the master.

If possible, it would also be useful to enable and capture a general log on the master until the next failure like that occurs; it would help us understand which query causes the problem with the binlog. However, please be aware that in a busy environment general log can cause some performance degradation and require quite a lot of disk space.

Comment by bulepage [ 2015-06-15 ]

Hi,
Yes, the master and the slave are the same version (10.0.19), and the OS is the same.
Our configs haven't changed this year.

Comment by bulepage [ 2015-06-15 ]

The event happens every Friday night. I will try to enable the general log for that time, but we have high traffic then.

Comment by bulepage [ 2015-06-20 ]

The LOST_EVENTS incident occurred again at the same time, but this time I had enabled the general log (SET GLOBAL log_output = 'file'; SET GLOBAL general_log_file = 'queries.log'; SET GLOBAL general_log = 1;).
On the master there were a lot of parallel jobs (the master load average was above 4, and the slave lag was very high, about 2 hours).
On the slave I got :

             Master_Log_File: bin-log.008290
          Read_Master_Log_Pos: 7704772
               Relay_Log_File: mysqld-relay-bin.002854
                Relay_Log_Pos: 87255152
        Relay_Master_Log_File: bin-log.008287
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1590
                   Last_Error: The incident LOST_EVENTS occured on the master. Message: error writing to the binary log
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3652410010
              Relay_Log_Space: 6120556589

There was no LOST_EVENTS in mysqld-relay-bin.002854, nor in bin-log.008287.
The master binlog files are very big, 3.5 GB to 5.2 GB (max_binlog_size = 2G):

-rw-rw---- 1 root  root  5498566258 Jun 19 22:13 bin-log.008284
-rw-rw---- 1 root  root  3695867885 Jun 19 22:17 bin-log.008285
-rw-rw---- 1 root  root  3662761083 Jun 19 22:32 bin-log.008286
-rw-rw---- 1 root  root  3652410051 Jun 19 *22:36* bin-log.008287
-rw-rw---- 1 root  root   897893240 Jun 20 *00:01* bin-log.008288

The time may be important!
In bin-log.008288 I found the LOST_EVENTS.

#150619 22:46:35 server id 1  end_log_pos 263954        Query   thread_id=363384        exec_time=0     error_code=0
SET TIMESTAMP=1434746795/*!*/;
COMMIT
/*!*/;
# at 263954
#150619 22:17:18 server id 1  end_log_pos 264007
# Incident: LOST_EVENTS
RELOAD DATABASE; # Shall generate syntax error
# at 264007
#150619 22:46:36 server id 1  end_log_pos 264045        GTID 0-1-986036473
 

I examined queries.log but did not find any interesting query.
queries.log is big (65 MB gzipped).

Comment by Elena Stepanova [ 2015-06-21 ]

Can you upload queries.log to our ftp.askmonty.org/private?
If you can also compress and upload bin-log.008287 and bin-log.008288, it would be even better.
Thanks.

Comment by bulepage [ 2015-06-21 ]

I uploaded MDEV-8315_queris.log.gz MDEV-8315_bin-log.008288.gz MDEV-8315_bin-log.008287.gz

Comment by Elena Stepanova [ 2015-06-23 ]

Thanks for the data.

My theory is that the problem was caused by a huge event which exceeds your 8G max_binlog_stmt_cache_size (you have a smaller max_binlog_cache_size, but I don't know if exceeding it can ever cause LOST_EVENTS error; exceeding max_binlog_stmt_cache_size certainly can).

My main suspect is connection 363340. I don't know if I can paste the query, but you can find it yourself in the query log. The connection did only 3 things:

  • connected at 21:50:09 (log time) and immediately called a procedure;
  • quit at 22:46:36, right when the LOST_EVENTS errors occurred.

So, maybe the procedure that it called in some circumstances produces a really, really big event. Naturally, it would be slow, so it could take time (~25 min in this case) to prepare the DML, then it would take time (~30 min) to execute, and eventually it fails to write the event.
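For what it's worth, the two timestamps line up with that estimate; the session was open for just under an hour. A quick check (the date is assumed from the binlog excerpt above):

```python
# Session timeline from queries.log (date assumed from the binlog: 2015-06-19).
from datetime import datetime

connected = datetime(2015, 6, 19, 21, 50, 9)   # connect + procedure call
quit_time = datetime(2015, 6, 19, 22, 46, 36)  # quit, when LOST_EVENTS appeared
print(quit_time - connected)  # 0:56:27 -- roughly the ~25 min + ~30 min above
```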

Could you maybe inspect the procedure to see if it's possible? (Maybe it does not produce any binlog-related events at all, which will immediately show that the theory is wrong).

Comment by bulepage [ 2015-06-23 ]

At that time a stored procedure, p_data_from_cegek_teszt, was running; it copies a lot of big tables from one database to another.
The way it copies tables is interesting:

DROP TABLE IF EXISTS dst_db.dst_table;
CREATE TABLE dst_db.dst_table LIKE src_db.src_table;
ALTER TABLE dst_db.dst_table ENGINE = MyISAM ROW_FORMAT = DYNAMIC;
INSERT INTO dst_db.dst_table SELECT * FROM src_db.src_table;

I examined our process logs, and p_data_from_cegek_teszt started at 21:50:09 and ended at 22:46:35.
To be precise, this process first calls p_data_from_cegek_teszt and then p_make_tablacsod_napi(), but p_make_tablacsod_napi() moves less data than p_data_from_cegek_teszt.

Comment by Elena Stepanova [ 2015-06-23 ]

So, do you think the explanation is plausible?
If so, there are 2 ways to confirm the theory and get rid of the problem:

  • increase max_binlog_stmt_cache_size value (why did you set it in the config file at all? Did you have some issues which you were trying to solve?)
    or
  • modify the procedure so that it does not do such a huge job in a single statement.

In the latter case, it might happen that you have some other procedures that cause the same problem, then the exercise will need to be repeated for them too.
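Elena's second option amounts to replacing the single INSERT ... SELECT with bounded chunks, so that no one statement's row events approach max_binlog_stmt_cache_size. A hypothetical sketch of generating such chunked statements (the table names and the `id` key column are assumptions, not taken from the ticket's actual procedure):

```python
# Sketch: split one huge INSERT ... SELECT into bounded chunks, so each
# statement produces a binlog event of manageable size.
# dst_db.dst_table, src_db.src_table and the `id` column are hypothetical.
def chunk_statements(min_id, max_id, chunk=100_000):
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk - 1, max_id)
        yield (f"INSERT INTO dst_db.dst_table "
               f"SELECT * FROM src_db.src_table WHERE id BETWEEN {lo} AND {hi}")
        lo = hi + 1

stmts = list(chunk_statements(1, 250_000))
print(len(stmts))  # 3 bounded statements instead of one giant event
```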

Comment by bulepage [ 2015-06-23 ]

The p_data_from_cegek_teszt procedure is very old and unnecessary; it can be disabled. But we have a lot of other processes (stored procedures)
with big transactions (data migrations).

Earlier (on MySQL 5.5, in 2013) we often got the error
ERROR 1705 (HY000) at line 1: Multi-row statements required more than 'max_binlog_stmt_cache_size' bytes of storage; increase this mysqld variable and try again
and the solution was to set max_binlog_stmt_cache_size=8302149632 and binlog_stmt_cache_size=191299584.

I think increasing max_binlog_stmt_cache_size is a good solution, but I worry about max_binlog_stmt_cache_size,
because the description ( https://mariadb.com/kb/en/mariadb/replication-and-binary-log-server-system-variables/#max_binlog_stmt_cache_size )
says "binary log positions cannot go beyond 4GB".

I decided to increase max_binlog_stmt_cache_size tomorrow, to 16 777 216.

You agree?
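For reference, the "4GB" concern can be checked against figures already in this ticket (a quick sanity-check sketch; sizes copied from the file listing and SHOW SLAVE STATUS output above):

```python
# Check the KB's "binary log positions cannot go beyond 4GB" note against
# figures reported earlier in this ticket.
FOUR_GIB = 4 * 1024**3                 # 4294967296 bytes
binlog_008284_size = 5498566258        # bin-log.008284, from the listing above
exec_master_log_pos = 3652410010       # from SHOW SLAVE STATUS above

print(binlog_008284_size > FOUR_GIB)   # True: this server already has binlog
                                       # positions past 4 GiB
print(exec_master_log_pos > FOUR_GIB)  # False
```

That this server already writes binlog files past 4 GiB is one hint that the KB's 4GB note may be stale.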

Comment by Elena Stepanova [ 2015-06-23 ]

(I assume you mean you'll set it to 16G, not to 16M as above).
I think it's worth trying.

The variable description in the KB does not look quite accurate. At the very least, the default value is 18446744073709547520 and not 4G as it says; so, I don't know if the rest is true, maybe it was meant for 32-bit systems.
knielsen, can you advise on this?
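Incidentally, the quoted default is not arbitrary: it appears to be the 64-bit maximum rounded down to a multiple of 4096 (these cache-size variables are rounded to IO-block multiples). A quick check:

```python
# The quoted default equals 2**64 minus one 4096-byte block.
default_max_binlog_stmt_cache_size = 18446744073709547520
print(default_max_binlog_stmt_cache_size == 2**64 - 4096)  # True
```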

Comment by bulepage [ 2015-06-24 ]

I changed global max_binlog_stmt_cache_size to 17179869184

Comment by bulepage [ 2015-06-24 ]

Shall I enable general_log on Friday night?

Comment by Elena Stepanova [ 2015-06-24 ]

If you can, please do, just in case the problem occurs again.

Comment by bulepage [ 2015-06-29 ]

The event did not occur.

Comment by bulepage [ 2015-07-25 ]

Can I help?

Comment by Elena Stepanova [ 2015-08-01 ]

bulepage, did it ever happen again after you had increased the cache size?
If not, I assume the theory was correct, and it was indeed the reason.

Comment by bulepage [ 2015-08-03 ]

The event did not occur again.
I think the theory is correct.

Comment by Elena Stepanova [ 2015-08-20 ]

In this case it is a configuration problem – naturally, if an event exceeds hard limits, it gets lost.
Also, apparently the max_binlog_stmt_cache_size description in the KB needs some adjustments (default value and limitations), so I am assigning it to Ian as a documentation issue.

Comment by Ian Gilfillan [ 2015-08-24 ]

The documentation has been updated (it was based on an old MySQL version and a Windows bug limiting the size, neither of which is relevant to MariaDB).

Generated at Thu Feb 08 07:26:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.