[MDEV-33213] History list is not shrunk unless there is a pause in the workload - Jira

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4
Fix Version/s: 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2, 11.4.1
Component/s: Storage Engine - InnoDB
Labels:
- performance

Description

While testing MDEV-32050 and its follow-up changes MDEV-33009 and MDEV-33112, axel pointed out that the InnoDB history list is not being shrunk during a workload.

Some initial debugging indicates that the purge_truncation_task (which was added in MDEV-32050) is being invoked but it is not doing anything. Also, the purge_sys.view is advancing; there are no old read views that would block the purge of history. Throttling the Sysbench workload for a few seconds would allow the history list to shrink immediately.

Attachments

- 10.6-MDEV-33213.pdf (107 kB, 2024-01-15 13:34)
- 12x5.pdf (58 kB, 2024-01-12 09:20)
- 24x5_high_threads_pausing.pdf (59 kB, 2024-01-12 09:20)
- 24x5_high_threads.pdf (56 kB, 2024-01-12 09:20)
- 24x5.pdf (59 kB, 2024-01-12 09:20)
- no_pause.png (36 kB, 2024-01-10 13:08)
- with_pause.png (47 kB, 2024-01-10 13:08)

Issue Links

causes

- MDEV-33464 Crash when innodb_max_undo_log_size is set to innodb_page_size*4294967296 (Closed)

relates to

- MDEV-34259 Optimization in row_purge_poss_sec Function for Undo Purge Process (Closed)
- MDEV-30628 10.6 performance regression with sustained high-connection write-only OLTP workload (55-80% degradation) (Closed)
- MDEV-31676 Innodb history length keeps growing (Closed)
- MDEV-32050 UNDO logs still growing for write-intensive workloads (Closed)
- MDEV-33009 Server hangs for a long time with innodb_undo_log_truncate=ON (Closed)
- MDEV-33112 innodb_undo_log_truncate=ON is blocking page writes (Closed)
- MDEV-33315 InnoDB history length and undo tablespace files keep growing (Closed)

Activity

Larry Adams added a comment - 2024-05-22 17:17

Marko,

I've upgraded to 10.6.17-13 this morning. I'm not seeing any increased history flushing, but at least it's flushing now. I've opened a ticket with support to see if there is a way to speed it up though as the number keeps going up incrementally.

Larry

Larry Adams added a comment - 2024-05-22 17:40

I tried

SET GLOBAL innodb_max_purge_lag_wait=0;

But that command just hung for almost 10 minutes before I killed it.

Marko Mäkelä added a comment - 2024-05-23 06:01

Hi Larry, let’s follow up on this in the support ticket. One thing that we need to be aware of is that when SHOW ENGINE INNODB STATUS shows any ACTIVE or PREPARED transactions with a transaction identifier or with a read view, then purge will be unable to proceed beyond that, and SET GLOBAL innodb_max_purge_lag_wait=0 will hang. In the worst case, someone starts a connection with START TRANSACTION WITH CONSISTENT SNAPSHOT and walks away.

That said, it is possible that there are no active transactions or read views, and the purge is really taking extremely long to clean up the queue. Implementing MDEV-17598 should help in case the reason is secondary indexes.
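The check described above (look for ACTIVE or PREPARED transactions before blaming purge itself) could be scripted roughly as follows. This is only a minimal sketch: the sample text is made up for illustration, the line formats it matches are an assumption about what the TRANSACTIONS section of SHOW ENGINE INNODB STATUS typically looks like, and the helper names are hypothetical. It does not distinguish transactions that hold a read view from those that do not.

```python
import re

# Made-up excerpt for illustration; real output differs between versions.
SAMPLE_STATUS = """\
------------
TRANSACTIONS
------------
Trx id counter 5012
History list length 184220
---TRANSACTION 4987, ACTIVE 5321 sec
---TRANSACTION 5011, ACTIVE (PREPARED) 2 sec
---TRANSACTION 421988233437184, not started
"""

# Assumed line formats (not taken from the ticket).
HISTORY_RE = re.compile(r"^History list length (\d+)", re.MULTILINE)
TRX_RE = re.compile(
    r"^---TRANSACTION (\d+), (ACTIVE(?: \(PREPARED\))?) (\d+) sec",
    re.MULTILINE,
)


def history_list_length(status_text):
    """Return the reported history list length, or None if not found."""
    match = HISTORY_RE.search(status_text)
    return int(match.group(1)) if match else None


def open_transactions(status_text):
    """Return (trx_id, state, age_seconds) for ACTIVE or PREPARED transactions."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in TRX_RE.finditer(status_text)
    ]


if __name__ == "__main__":
    print("history list length:", history_list_length(SAMPLE_STATUS))
    # Oldest transactions first: these are the ones most likely to hold purge back.
    for trx_id, state, age in sorted(
        open_transactions(SAMPLE_STATUS), key=lambda t: -t[2]
    ):
        print(f"trx {trx_id}: {state} for {age} s")
```

If the list of open transactions is empty while the history list length keeps growing, that would point to the second case above, where purge is simply slow to work through the queue.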

chad ambrosius added a comment - 2024-08-05 19:12

I have recently upgraded many servers that had been running on 10.3.12 to 10.6.15, at which time I noticed that the history list was not shrinking as it had always done on 10.3.12. I found this bug, which seems to describe the situation pretty well. I upgraded to 10.6.17 and tried many variations of the innodb_* parameters described above and on other tickets. None seemed to have any effect. I also tried versions 10.11.8 and 11.4.2. The InnoDB undo logs still do not seem to be getting purged in those versions. If I stop non-replication traffic, the history list length reduces very rapidly. Only the busiest hosts in my infrastructure can't keep up with the undo log purging. However, on 10.3.12 this was never an issue. It feels like the rows in the undo log must already be delete-marked, or query performance would be really bad at such large history lengths. So the only impact I'm feeling right now is unbounded disk space growth (or the hassle of juggling hosts in and out of rotation, including the primary).

Marko, are you sure this issue is fully resolved? Thank you!!

Marko Mäkelä added a comment - 2024-08-06 05:37

ch701, unfortunately MariaDB 10.6.16 and 10.6.17 are affected by MDEV-33819, which was introduced in MDEV-32050. However, the fix for MDEV-33819 should be present in the 10.11.8 and 11.4.2 releases, which you also tested.

MariaDB Server 10.3 is missing some correctness fixes such as MDEV-31355. It is possible that history is purged prematurely there, while some old transactions could still access the history. I believe that the bug could lead to a situation where the undo log page has been reused and trx_undo_rec_copy() would attempt to copy data at an incorrect offset, and ultimately cause a crash. MDEV-11044 is an example of that, although this particular problematic undo page access was not in an MVCC read but in the purge of history itself. Perhaps that bug was actually fixed by MDEV-22388.

I am currently working on a performance regression in this area, MDEV-34515. The test case is artificial: a frequently updated single-row table with a secondary index. But there clearly is some room for improvement.
