[MDEV-16952] Introduce SET GLOBAL innodb_max_purge_lag_wait - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7
Component/s: Storage Engine - InnoDB, Tests
Labels:
None

Description

http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/755

gcol.innodb_virtual_debug_purge 'innodb' w3 [ fail ]

        Test ended at 2018-08-10 21:48:13

CURRENT_TEST: gcol.innodb_virtual_debug_purge

--- /usr/share/mysql-test/suite/gcol/r/innodb_virtual_debug_purge.result	2018-08-10 19:08:32.000000000 +0000

+++ /dev/shm/var/3/log/innodb_virtual_debug_purge.reject	2018-08-10 21:48:13.389484458 +0000

@@ -200,7 +200,7 @@

 disconnect prevent_purge;

 connection default;

 SET DEBUG_SYNC='now WAIT_FOR purge_start';

-InnoDB		1 transactions not purged

+InnoDB		2 transactions not purged

 SET DEBUG_SYNC='now SIGNAL release';

 SET GLOBAL debug_dbug=@old_dbug;

 connection truncate;

mysqltest: Result content mismatch

 - skipping '/dev/shm/var/3/log/gcol.innodb_virtual_debug_purge-innodb/'

Retrying test gcol.innodb_virtual_debug_purge, attempt(2/3)...

Attachments

Issue Links

causes

MDEV-31813 SET GLOBAL innodb_max_purge_lag_wait=… hangs if innodb_read_only=ON

Closed

relates to

MDEV-15912 InnoDB: Failing assertion: purge_sys.tail.commit <= purge_sys.rseg->last_commit upon upgrade from 10.0 or 10.1 to 10.3

Closed

MDEV-17745 innodb.innodb_stats_persistent failed in buildbot with wrong result

Closed

MDEV-22958 innodb.instant_alter_debug fails in buildbot with wrong result

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2018-08-23 10:01

Earlier in the test, we successfully got '0 transactions not purged'.
The problem ought to be somewhere between these lines in the test:

--source ../../innodb/include/wait_all_purged.inc

SET DEBUG_SYNC= 'now SIGNAL purged';

connection default;

--echo /* connection default */ ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=SHARED;

--reap

SHOW CREATE TABLE t1;

SELECT * FROM t1;

DROP TABLE t1;

# Test adding index on existing virtual column

CREATE TABLE t1 (a INT, b INT, c INT GENERATED ALWAYS AS(a+b));

INSERT INTO t1(a, b) VALUES (1, 1), (2, 2), (3, 3), (4, 4);

connection con1;

--echo # disable purge

BEGIN; SELECT * FROM t0;

connection default;

DELETE FROM t1 WHERE a = 1;

UPDATE t1 SET a = 2, b = 2 WHERE a = 5;

INSERT INTO t1(a, b) VALUES (6, 6);

SET DEBUG_SYNC= 'inplace_after_index_build SIGNAL uncommitted WAIT_FOR purged';

send ALTER TABLE t1 ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=NONE;

connection con1;

SET DEBUG_SYNC= 'now WAIT_FOR uncommitted';

DELETE FROM t1 WHERE a = 3;

UPDATE t1 SET a = 7, b = 7 WHERE a = 4;

INSERT INTO t1(a, b) VALUES (8, 8);

--echo # enable purge

COMMIT;

--echo # wait for purge to process the deleted/updated records.

let $wait_all_purged=1;

--source ../../innodb/include/wait_all_purged.inc

Unfortunately, I am unable to repeat this, and there were no additional messages in the server error log.

Marko Mäkelä added a comment - 2018-08-23 10:01 Earlier in the test, we successfully got '0 transactions not purged'. The problem ought to be somewhere between these lines in the test: --source ../../innodb/include/wait_all_purged.inc SET DEBUG_SYNC= 'now SIGNAL purged' ; connection default ; --echo /* connection default */ ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=SHARED; --reap SHOW CREATE TABLE t1; SELECT * FROM t1; DROP TABLE t1; # Test adding index on existing virtual column CREATE TABLE t1 (a INT , b INT , c INT GENERATED ALWAYS AS (a+b)); INSERT INTO t1(a, b) VALUES (1, 1), (2, 2), (3, 3), (4, 4); connection con1; --echo # disable purge BEGIN ; SELECT * FROM t0; connection default ; DELETE FROM t1 WHERE a = 1; UPDATE t1 SET a = 2, b = 2 WHERE a = 5; INSERT INTO t1(a, b) VALUES (6, 6); SET DEBUG_SYNC= 'inplace_after_index_build SIGNAL uncommitted WAIT_FOR purged' ; send ALTER TABLE t1 ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=NONE; connection con1; SET DEBUG_SYNC= 'now WAIT_FOR uncommitted' ; DELETE FROM t1 WHERE a = 3; UPDATE t1 SET a = 7, b = 7 WHERE a = 4; INSERT INTO t1(a, b) VALUES (8, 8); --echo # enable purge COMMIT ; --echo # wait for purge to process the deleted/updated records. let $wait_all_purged=1; --source ../../innodb/include/wait_all_purged.inc Unfortunately, I am unable to repeat this, and there were no additional messages in the server error log.

Marko Mäkelä added a comment - 2020-10-27 13:14

I suspect that the test sporadically fails due to the CI system being overloaded. The client-side polling is much less efficient than server-side polling would be. The client would request a large amount of expensive-to-compute information and throw out most of the SHOW ENGINE INNODB STATUS output, 10 times per second, for 60 seconds.

I think that it is more efficient and more useful to implement server-side waiting:

SET GLOBAL innodb_max_purge_lag_wait=0;

If purge is not allowed to proceed due to an active read view that is held by some connection, potentially including the current one, then the wait may be interrupted to normal statement or test case timeout. But it would no longer limited to the 60 seconds that the wait_all_purged.inc implemented.

This new dummy global variable would also assist in upgrades from 10.2 or earlier to 10.3 or later, avoiding ~~MDEV-15912~~ that was caused by the undo log format change that ~~MDEV-12288~~ introduced in 10.3.

Marko Mäkelä added a comment - 2020-10-27 13:14 I suspect that the test sporadically fails due to the CI system being overloaded. The client-side polling is much less efficient than server-side polling would be. The client would request a large amount of expensive-to-compute information and throw out most of the SHOW ENGINE INNODB STATUS output, 10 times per second, for 60 seconds. I think that it is more efficient and more useful to implement server-side waiting: SET GLOBAL innodb_max_purge_lag_wait=0; If purge is not allowed to proceed due to an active read view that is held by some connection, potentially including the current one, then the wait may be interrupted to normal statement or test case timeout. But it would no longer limited to the 60 seconds that the wait_all_purged.inc implemented. This new dummy global variable would also assist in upgrades from 10.2 or earlier to 10.3 or later, avoiding MDEV-15912 that was caused by the undo log format change that MDEV-12288 introduced in 10.3.

Manuel Arostegui added a comment - 2020-11-20 14:33

Marko, any chances this flag can come with some regression on the replication thread?.
We are testing 10.4.17 (it was running 10.4.15 before) on a host that receives quite a bunch of REPLACEs. Same hardware with a host running 10.4.15 shows no lag.
The only difference I have found is that by default

innodb_max_purge_lag_wait

was set to 4294967295 (the default) when we upgraded it to 10.4.17.
I have set it back to 0 and restarted mariadb, and so far it looks like the host is catching up. Early to say this was the culprit as we have some strange writes pattern, but I would like to know if this could be the reason of this sudden lag?

Manuel Arostegui added a comment - 2020-11-20 14:33 Marko, any chances this flag can come with some regression on the replication thread?. We are testing 10.4.17 (it was running 10.4.15 before) on a host that receives quite a bunch of REPLACEs. Same hardware with a host running 10.4.15 shows no lag. The only difference I have found is that by default innodb_max_purge_lag_wait was set to 4294967295 (the default) when we upgraded it to 10.4.17. I have set it back to 0 and restarted mariadb, and so far it looks like the host is catching up. Early to say this was the culprit as we have some strange writes pattern, but I would like to know if this could be the reason of this sudden lag?

Marko Mäkelä added a comment - 2020-11-20 15:45

marostegui, this is a special variable whose value should always read as a constant, even after SET GLOBAL was executed.

The SET GLOBAL statement is special: as long as the connection that executed as is active, that connection will wait until the history list length is at most the desired value. If that connection is holding an open read view, then it is possible that the history list length will keep growing. Say, if you executed

START TRANSACTION WITH CONSISTENT SNAPSHOT;

SET GLOBAL innodb_max_purge_lag_wait=0;

then that would be almost guaranteed to kill performance. Only in the special case that there was nothing to purge, we would get through. In any other case, the read view that we opened before initiating the wait, would prevent purge of any transactions that were committed after the read view was created.

The idea was that this SET GLOBAL statement would typically only be executed while the server is idle, and no new write transactions are arriving. It should be possible to use KILL QUERY or KILL CONNECTION to abort the wait. Or just disconnect the client.

If SELECT @@GLOBAL.innodb.max_purge_lag_wait; is returning different values, that is a bug that could be filed and fixed. My intention was that it would always return a constant. I did not test that.

Note: Specifying an initial value of innodb_max_purge_lag_wait in the start-up configuration has no effect.

Marko Mäkelä added a comment - 2020-11-20 15:45 marostegui , this is a special variable whose value should always read as a constant, even after SET GLOBAL was executed. The SET GLOBAL statement is special: as long as the connection that executed as is active, that connection will wait until the history list length is at most the desired value. If that connection is holding an open read view, then it is possible that the history list length will keep growing. Say, if you executed START TRANSACTION WITH CONSISTENT SNAPSHOT; SET GLOBAL innodb_max_purge_lag_wait=0; then that would be almost guaranteed to kill performance. Only in the special case that there was nothing to purge, we would get through. In any other case, the read view that we opened before initiating the wait, would prevent purge of any transactions that were committed after the read view was created. The idea was that this SET GLOBAL statement would typically only be executed while the server is idle, and no new write transactions are arriving. It should be possible to use KILL QUERY or KILL CONNECTION to abort the wait. Or just disconnect the client. If SELECT @@GLOBAL.innodb.max_purge_lag_wait; is returning different values, that is a bug that could be filed and fixed. My intention was that it would always return a constant. I did not test that. Note: Specifying an initial value of innodb_max_purge_lag_wait in the start-up configuration has no effect.

Manuel Arostegui added a comment - 2020-11-20 16:10

Thank you for the detailed answer. We will keep looking for the culprit of these lags!
Thanks again

Manuel Arostegui added a comment - 2020-11-20 16:10 Thank you for the detailed answer. We will keep looking for the culprit of these lags! Thanks again

Larry Adams added a comment - 2023-07-17 21:29

Is there any good documentation for this yet guys? I've got a massive history list due to very large number of update/insert's over the day, and then at the end of the day I rename that table. I wonder if that creates some sort of issue. But the history list is in the billions and my ibdata1 is 3+ TB. Not sure how to allow MariaDB to catch up if ever.

Larry Adams added a comment - 2023-07-17 21:29 Is there any good documentation for this yet guys? I've got a massive history list due to very large number of update/insert's over the day, and then at the end of the day I rename that table. I wonder if that creates some sort of issue. But the history list is in the billions and my ibdata1 is 3+ TB. Not sure how to allow MariaDB to catch up if ever.

Larry Adams added a comment - 2023-07-17 21:31

Make that hundreds of millions.

MariaDB [cacti]> show global status like '%history%';

+----------------------------+-----------+

| Variable_name              | Value     |

+----------------------------+-----------+

| Innodb_history_list_length | 176100483 |

+----------------------------+-----------+

1 row in set (0.007 sec)

Larry Adams added a comment - 2023-07-17 21:31 Make that hundreds of millions. MariaDB [cacti]> show global status like '%history%'; +----------------------------+-----------+ | Variable_name | Value | +----------------------------+-----------+ | Innodb_history_list_length | 176100483 | +----------------------------+-----------+ 1 row in set (0.007 sec)

People

Assignee:: Marko Mäkelä

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2018-08-11 23:44

Updated:: 2023-08-24 07:20

Resolved:: 2020-10-27 13:48

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server