|
ponsuresh.pandians: It could be just a configuration issue.
Notice
waiting for room in worker queue status which must be explained by too small value of
slave_parallel_max_queued | 131072
Please try increasing it and let us know.
|
|
Hi Andrei,
I have already increased upto 128MB but no luck the lag was increased.
|
|
ponsuresh.pandians, howdy. Could you please provide Show-Slave-Status and Show-Processlist with the 128 MB slave_parallel_max_queued?
Adding to that, naturally a few periodical (hourly?) show reports could be very useful
in this (growing?) delay scenario.
|
|
Hi Andrei,
Sure.. Please find the requested outputs. After I stopped the sysbench on master server the lag is increasing upto 867 sec..
MariaDB [(none)]> show global variables like 'slave_parallel_max_queued';
|
+---------------------------+-----------+
|
| Variable_name | Value |
|
+---------------------------+-----------+
|
| slave_parallel_max_queued | 134217728 |
|
+---------------------------+-----------+
|
1 row in set (0.001 sec)
|
|
MariaDB [(none)]> select 134217728/1024/1024;
|
+---------------------+
|
| 134217728/1024/1024 |
|
+---------------------+
|
| 128.00000000 |
|
+---------------------+
|
1 row in set (0.000 sec)
|
|
MariaDB [(none)]> show slave status\G
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 172.31.24.106
|
Master_User: repl
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mysql-bin.000004
|
Read_Master_Log_Pos: 889720716
|
Relay_Log_File: ip-172-31-17-237-relay-bin.000002
|
Relay_Log_Pos: 889012661
|
Relay_Master_Log_File: mysql-bin.000003
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 889012575
|
Relay_Log_Space: 1963532221
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 165
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Slave_Pos
|
Gtid_IO_Pos: 201-1-52064
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: optimistic
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Waiting for room in worker thread event queue
|
Slave_DDL_Groups: 0
|
Slave_Non_Transactional_Groups: 5462
|
Slave_Transactional_Groups: 0
|
1 row in set (0.000 sec)
|
|
MariaDB [(none)]> show full processlist;
|
+----+-------------+-----------+------+--------------+------+-----------------------------------------------+-----------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+----+-------------+-----------+------+--------------+------+-----------------------------------------------+-----------------------+----------+
|
| 67 | system user | | NULL | Slave_IO | 513 | Waiting for master to send event | NULL | 0.000 |
|
| 69 | system user | | NULL | Slave_worker | 0 | Waiting for prior transaction to commit | NULL | 0.000 |
|
| 70 | system user | | NULL | Slave_worker | 0 | Waiting for prior transaction to commit | NULL | 0.000 |
|
| 71 | system user | | NULL | Slave_worker | 0 | closing tables | NULL | 0.000 |
|
| 72 | system user | | NULL | Slave_worker | 0 | Waiting for prior transaction to commit | NULL | 0.000 |
|
| 68 | system user | | NULL | Slave_SQL | 92 | Waiting for room in worker thread event queue | NULL | 0.000 |
|
| 73 | root | localhost | NULL | Query | 0 | starting | show full processlist | 0.000 |
|
+----+-------------+-----------+------+--------------+------+-----------------------------------------------+-----------------------+----------+
|
7 rows in set (0.000 sec)
|
|
|
ponsuresh.pandians, howdy.
In the latest report we have 3 workers waiting for the 4th which the following state:
71 | system user | | NULL | Slave_worker | 0 | closing tables
Having just one show-report, it's not clear, whether this state takes some time (hardly, but still needs attention).
We can also suspect that 71 has been processing an about128 MB size event.
Could you please check (with mysqlbinlog) four (as the number of workers) groups of events starting at mysql-bin.000003:889012575 to report all event sizes.
Just paste the first line of each event's output, e.g
# at 4334
|
#010909 4:46:40 server id 1 end_log_pos 4376 CRC32 0x66c7e680 GTID 0-1-16 ddl
|
And secondly, like I said in the previous comment, I need few measurements, say separated from each other for 5 mins.
I hope they can be provided.
Thanks.
Andrei
|
|
Hi Andrei,
I have decoded the binlog for both master & slave servers. Please check it. Also I have attached the periodical report for the show full processlist & show slave status output.
From Master :
-------------
[root@ip-172-31-24-106 mysql]# less /tmp/t.txt | grep -i 889012575 -A 5 -B10
|
### SET
|
### @1=8316 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=1008847 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='59074100786-30395673519-92582365862-85036053474-42511661822-80049943387-22751241741-93066455078-33849578458-19833344165' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='19961618934-29000529160-79561458616-08529793478-20871744749' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### INSERT INTO `sbtest`.`sbtest2`
|
### SET
|
### @1=8317 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=1000411 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='00942787768-04589456858-56363613859-26954360214-03885137841-90790223243-49118401991-63723734666-20265439903-69406001720' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='27567816782-90768939453-74039568820-61130096963-88901257584' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### INSERT INTO `sbtest`.`sbtest2`
|
### SET
|
### @1=8318 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=991417 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='85156321622-04195155158-81842030504-15530669746-15248374975-59125065042-70965205688-11740408230-33044241354-55067656456' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
--
|
### SET
|
### @1=1602903 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=997980 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='19245401165-61634791196-08558032160-75446215334-11287894282-62158317689-33315747696-83647671351-85003256312-82116001844' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='15427896165-07882275704-42136503939-69999184949-70109204507' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### INSERT INTO `sbtest`.`sbtest2`
|
### SET
|
### @1=1602904 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=851972 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='67961218718-57512777613-81282639629-28761376616-47038664074-13311110657-90397944965-29132245173-82793777758-84313735692' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='01331952066-24282735035-73039394138-67544492870-22889012575' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### INSERT INTO `sbtest`.`sbtest2`
|
### SET
|
### @1=1602905 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=998372 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='57036828254-39203173382-99948836467-28245829044-30374712994-72642974076-41577532325-01878141374-11473509824-50658041341' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
--
|
### @2=1008548 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='41765883594-35283064806-69400677627-91786190235-82678384612-28175692335-21016147975-36461569069-80759003260-74951359859' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='92588648459-11114954858-79032460487-15440339861-13263701619' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### INSERT INTO `sbtest`.`sbtest4`
|
### SET
|
### @1=278627 /* INT meta=0 nullable=0 is_null=0 */
|
### @2=1002246 /* INT meta=0 nullable=0 is_null=0 */
|
### @3='97587198511-00696167561-74482020591-20340139370-76086500251-57467691776-53001417267-52915161341-99186397045-99945265376' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
### @4='71593207192-21634929961-54476645761-82885980510-70652887516' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
# at 889012548
|
#210811 9:10:54 server id 1 end_log_pos 889012575 Xid = 5539
|
COMMIT/*!*/;
|
# at 889012575
|
#210811 9:10:54 server id 1 end_log_pos 889012613 GTID 201-1-5470
|
/*!100001 SET @@session.gtid_seq_no=5470*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 889012613
|
From Slave :
------------
[root@ip-172-31-17-237 mysql]# less /tmp/t.txt | grep -i 889012575
|
### @4='27567816782-90768939453-74039568820-61130096963-88901257584' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
### @4='01331952066-24282735035-73039394138-67544492870-22889012575' /* STRING(60) meta=65084 nullable=0 is_null=0 */
|
|
|
ponsuresh.pandians, thank your for trying, but sorry, this attachment is not something that I asked. I can not see the event sizes from it.
To repeat, I need two lines, I paste them over again:
# at 4334
|
#010909 4:46:40 server id 1 end_log_pos 4376 ...
|
|
for each event in those 4 trx:s that I specified.
The two lines say the beginning and the end of the event, that is its size.
(On the 2nd line of the first event of each trx we will Gtid event so it should be clear
where the trx boundaries are).
|
|
Hi Andrei,
Please find the below output. This is for (mysql-bin.000003) binlog and (889012575) position output.
# at 889012548
|
#210811 9:10:54 server id 1 end_log_pos 889012575 Xid = 5539
|
COMMIT/*!*/;
|
# at 889012575
|
#210811 9:10:54 server id 1 end_log_pos 889012613 GTID 201-1-5470
|
/*!100001 SET @@session.gtid_seq_no=5470*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 889012613
|
And I re ran the test again, this is for (mysql-bin.000007) binlog and (846206111) position
[root@ip-172-31-24-106 mysql]# less /tmp/t.txt | grep -i 846206111
|
### @3='24739921552-92859805172-35456153230-10715366393-84620611105-45177215237-91470530765-11754233503-02242541633-98239803580' /* STRING(120) meta=65144 nullable=0 is_null=0 */
|
#210811 12:19:06 server id 1 end_log_pos 846206111 Xid = 1662
|
# at 846206111
|
MariaDB [(none)]> show slave status\G
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 172.31.24.106
|
Master_User: repl
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mysql-bin.000008
|
Read_Master_Log_Pos: 874659436
|
Relay_Log_File: ip-172-31-17-237-relay-bin.000002
|
Relay_Log_Pos: 846206197
|
Relay_Master_Log_File: mysql-bin.000007
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 846206111
|
Relay_Log_Space: 1948470902
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 146
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Slave_Pos
|
Gtid_IO_Pos: 201-1-61512
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: optimistic
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Waiting for room in worker thread event queue
|
Slave_DDL_Groups: 0
|
Slave_Non_Transactional_Groups: 1631
|
Slave_Transactional_Groups: 0
|
1 row in set (0.000 sec)
|
|
|
Sorry again. I thought I made it clear that for analysis I need the sizes of every event in 4 event groups that started with
GTID 201-1-5470.
May I ask what was unclear in my request?
|
|
Hi Andrei,
I can't get the event size. So here i have attached the decoded binlog (mysql-bin.000007) please check it.
Link - https://drive.google.com/file/d/1zwIWbpJQU-KIfy2E7GFPAr0Qzp1zysMS/view?usp=sharing
|
|
ponsuresh.pandians, hi. Why did you send me a different file now? I don't think it makes me easier to analyze.
To extract the info I asked simply run
mysqlbinlog --start-position=889012575 --stop-position=`find yourself to count 4 consequent GTID events. The fifth is where to stop` master-bin.000003 | grep '^#'
|
To
> I can't get the event size.
is not an explanation why you could not do it. Next time please don't shy to ask specifically, I can help you with constructing a grep invocation etc.
|
|
Hi Andrei,
Unfortunately I have terminated that AWS instance. So I created new instance and re produced this scenario.
Here I mentioned requested logs Please check it.
[root@ip-172-31-72-170 mysql]# mysqlbinlog --start-position=196084134 --stop-position=196085486 mysql-bin.000001 | grep '^#'
|
# at 4
|
#210815 11:16:45 server id 1 end_log_pos 248 Start: binlog v 4, server v 10.0.38-MariaDB created 210815 11:16:45 at startup
|
# Warning: this binlog is either in use or was not closed properly.
|
# at 196084134
|
#210815 11:32:38 server id 1 end_log_pos 196084172 GTID 201-1-4034
|
# at 196084172
|
# at 196084227
|
#210815 11:32:38 server id 1 end_log_pos 196084227 Table_map: `sbtest`.`sbtest4` mapped to number 80
|
#210815 11:32:38 server id 1 end_log_pos 196084445 Write_rows: table id 80 flags: STMT_END_F
|
# at 196084445
|
#210815 11:32:38 server id 1 end_log_pos 196084472 Xid = 4082
|
# at 196084472
|
#210815 11:32:38 server id 1 end_log_pos 196084510 GTID 201-1-4035
|
# at 196084510
|
# at 196084565
|
#210815 11:32:38 server id 1 end_log_pos 196084565 Table_map: `sbtest`.`sbtest4` mapped to number 80
|
#210815 11:32:38 server id 1 end_log_pos 196084783 Write_rows: table id 80 flags: STMT_END_F
|
# at 196084783
|
#210815 11:32:38 server id 1 end_log_pos 196084810 Xid = 4083
|
# at 196084810
|
#210815 11:32:38 server id 1 end_log_pos 196084848 GTID 201-1-4036
|
# at 196084848
|
# at 196084903
|
#210815 11:32:38 server id 1 end_log_pos 196084903 Table_map: `sbtest`.`sbtest4` mapped to number 80
|
#210815 11:32:38 server id 1 end_log_pos 196085121 Write_rows: table id 80 flags: STMT_END_F
|
# at 196085121
|
#210815 11:32:38 server id 1 end_log_pos 196085148 Xid = 4084
|
# at 196085148
|
#210815 11:32:38 server id 1 end_log_pos 196085186 GTID 201-1-4037
|
# at 196085186
|
# at 196085241
|
#210815 11:32:38 server id 1 end_log_pos 196085241 Table_map: `sbtest`.`sbtest4` mapped to number 80
|
#210815 11:32:38 server id 1 end_log_pos 196085459 Write_rows: table id 80 flags: STMT_END_F
|
# at 196085459
|
#210815 11:32:38 server id 1 end_log_pos 196085486 Xid = 4085
|
# End of log file
|
|
|
ponsuresh.pandians, salute.
I need all bits of information belonging on any existing instance of your configuration.
As it's about a grown latency that includes few periodic reports (from START SLAVE till a moment you will observe a
significantly grown value, do that once in a min or tens of sec:s - depends on your load) from
- SHOW SLAVE STATUS
- show processlist
- save binary logs from which I will need further details once I process the 2 SHOW:s.
- the same for the error logs
My plan is to find the starting point of any (hopefully) sharp jump of SBM, and having that
look into the binlog for data that may have played some role.
As you can see the last paste is not meaningful to the plan.
I am waiting for such way collected data.
Thank you.
|
|
Hi Andrei,
Here I have attached the show slave status & show processlist outputs. This stats was collected on every 10 sec.
Also Attached the MariaDB error log from slave server. Please check it.
Binlogs - https://drive.google.com/drive/folders/1r8Yy4LvmQ73ocamT2Al24L0recgahAm-?usp=sharing
|
|
ponsuresh.pandians, hi. Thank you for a good piece of reporting!
I see that SMB constantly increases and most of time by the value greater than your 10 sec watching period.
Processlist shows that there's "always" a worker in closing tables state. This may not relate to the parallel replication and should be explained why it apparently takes some long time. Maybe that deals with the fact that
your load is apparently not on transactional (Innodb) tables:
Slave_Non_Transactional_Groups: 4397
|
Slave_Transactional_Groups: 0
|
Notice that in the non-transactional case the parallel replication may not scale up at all.
The simplest I can recommend is to set the table engine to Innodb and check.
You can also dwell yourself into profiling - see SHOW PROFILE ALL and how to set it on.
At this point I really needed not the binlog attachment. Thank you once more for providing that!
|
|
Hi Andrei,
I am using sysbench tool to simulated the load. So all the tables are INNODB (transactional) tables. After enabling the parallel threads, replication lag was increasing this behaviour is completely strange for me ..
MariaDB [sbtest]> show create table sbtest1\G
|
*************************** 1. row ***************************
|
Table: sbtest1
|
Create Table: CREATE TABLE `sbtest1` (
|
`id` int(11) NOT NULL AUTO_INCREMENT,
|
`k` int(11) NOT NULL DEFAULT 0,
|
`c` char(120) NOT NULL DEFAULT '',
|
`pad` char(60) NOT NULL DEFAULT '',
|
PRIMARY KEY (`id`),
|
KEY `k_1` (`k`)
|
) ENGINE=InnoDB AUTO_INCREMENT=2287661 DEFAULT CHARSET=latin1
|
1 row in set (0.000 sec)
|
|
MariaDB [sbtest]> show create table sbtest2\G
|
*************************** 1. row ***************************
|
Table: sbtest2
|
Create Table: CREATE TABLE `sbtest2` (
|
`id` int(11) NOT NULL AUTO_INCREMENT,
|
`k` int(11) NOT NULL DEFAULT 0,
|
`c` char(120) NOT NULL DEFAULT '',
|
`pad` char(60) NOT NULL DEFAULT '',
|
PRIMARY KEY (`id`),
|
KEY `k_2` (`k`)
|
) ENGINE=InnoDB AUTO_INCREMENT=2286165 DEFAULT CHARSET=latin1
|
1 row in set (0.000 sec)
|
|
|
Re-opening this issue, as the feedback was provided and we're able to reproduce on our side. I'm hoping it just closed inadvertently, but if you do need further information from Support, please let us know. This is a customer-facing bug.
|
|
julien.fritsch, I would not mind that at all. But first my questions need to be attended.
|
|
Hi Julien Fritsch ,
Sure...Once again I will re create my test environment and get back to you with more details...
|
|
ponsuresh.pandians: In your last Show-Slave-Status reports I had to suspect non-transactional (not innodb) load was run on your slave.
Now when you report on slave again could you please make sure these two (I am pasting what was the last time)
Slave_Non_Transactional_Groups: 4397
Slave_Transactional_Groups: 0
attributes are present?
Your show-create-table, sorry, does not reveal on what server they were done. Even if the def:s are from the slave server (please confirm), I still need the Slave_{,Non}_Transactional_Groups in order to proceed.
|
|
Hi Andrei,
Thanks for your help. As per your suggestion I did below steps,
I have installed 10.4.19 & configured the master slave GTID based replication.
And loaded some data using sys bench tool. I checked the "show slave status" output. The "Slave_Transactional_Groups" was increasing.
Slave_Non_Transactional_Groups: 0
|
Slave_Transactional_Groups: 4492
|
|
At the same I installed 10.0.38 as a master & 10.5.11 as a slave then loaded some data using sys bench tool. I checked the "show slave status" output. In this case "Slave_Non_Transactional_Groups" was increasing.
Slave_Non_Transactional_Groups: 68809
|
Slave_Transactional_Groups: 0
|
On both scenario my master & slave tables are in INNODB engine.
MariaDB [sbtest]> select TABLE_SCHEMA,TABLE_NAME,ENGINE from information_schema.TABLES where TABLE_SCHEMA='sbtest';
|
+--------------+------------+--------+
|
| TABLE_SCHEMA | TABLE_NAME | ENGINE |
|
+--------------+------------+--------+
|
| sbtest | sbtest2 | InnoDB |
|
| sbtest | sbtest3 | InnoDB |
|
| sbtest | sbtest1 | InnoDB |
|
| sbtest | sbtest4 | InnoDB |
|
| sbtest | sbtest5 | InnoDB |
|
+--------------+------------+--------+
|
5 rows in set (0.000 sec)
|
|
|
After more discussion and some analysis it turned out the 10.0 master just can't provide parallelizable load.
More to that even if it's transactional the slave interprets it in its status as the opposite (to explain the observed stats).
|
|
salle, thanks for the heads up! Won't fix is done to it.
|