Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.1.20
-
None
-
10.1.29
Description
ALTER TABLE ... DISCARD TABLESPACE takes a lot of time with large buffer pool (>128G).
Steps to reproduce:
1. drop / create an InnoDB table
2. discard the table space
3. increase the buffer pool size.
4. Re-try once you have a 128G+ buffer pool, and re-run #1 & #2.
The discard time should be increasing.
Attachments
- perf_10g.data
- 316 kB
- perf_5g.data
- 510 kB
Issue Links
- causes
-
MDEV-16759 InnoDB: Assertion failure in thread 139946191502080 in file row0ins.cc line 285
-
- Closed
-
- is part of
-
MDEV-16283 ALTER TABLE...DISCARD TABLESPACE still takes long on a large buffer pool
-
- Closed
-
- relates to
-
MDEV-14263 FLUSH TABLES FOR EXPORT may write the same pages multiple times
-
- Confirmed
-
-
MDEV-9459 Truncate table causes innodb stalls
-
- Closed
-
-
MDEV-14310 Possible corruption by table-rebuilding or index-creating ALTER TABLE…ALGORITHM=INPLACE
-
- Closed
-
Activity
I am not getting any difference for the given scenario neither on current 10.1 nor on 10.1.20. There must be more to it.
Please collect and provide complete cnf file(s), an example of the exact CREATE TABLE statement, and information whether the table is supposed to be empty.
DROP PROCEDURE IF EXISTS pr; |
DELIMITER $$
|
CREATE PROCEDURE pr(n INT) |
BEGIN
|
DECLARE i INT DEFAULT 0; |
WHILE i < n
|
DO
|
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (i INT); |
ALTER TABLE t1 DISCARD TABLESPACE; |
SET i = i+1; |
END WHILE; |
END$$ |
DELIMITER ;
|
128 MB |
MariaDB [test]> SELECT @@innodb_buffer_pool_size; |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 134217728 |
|
+---------------------------+ |
1 row in set (0.00 sec) |
|
MariaDB [test]> CALL pr(20);
|
Query OK, 0 rows affected (0.09 sec) |
16 GB |
MariaDB [test]> SELECT @@innodb_buffer_pool_size; |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 17179869184 |
|
+---------------------------+ |
1 row in set (0.00 sec) |
|
MariaDB [test]> CALL pr(20);
|
Query OK, 0 rows affected (0.09 sec) |
160 GB |
MariaDB [test]> SELECT @@innodb_buffer_pool_size; |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 171798691840 |
|
+---------------------------+ |
1 row in set (0.00 sec) |
|
MariaDB [test]> CALL pr(20);
|
Query OK, 0 rows affected (0.09 sec) |
Actually, it is not only sufficient for the buffer pool to be big, but it also needs to be full of pages ...
– run some artificial load like sysbench so that the buffer pool is full of pages
drop database if exists test_discard;
create database test_discard;
show global variables like 'innodb_buffer_pool_size';
use test_discard;
create table t(id int primary key)engine=InnoDB;
alter table t discard tablespace;
Do you mean that it actually needs to be full?
Cannot the slowness be caused by the mere existence of 128+G of tables?
Yes, it must be full, according to the reporter.
Note that if they lower innodb_buffer_pool_size to, say 16G, then it is fast. So just having 128G+ stored in tables is not sufficient.
With full buffer pool, I do see some difference depending on the size, but it does not seem huge. What kind of slowness we are talking about, is it more like a few seconds, or more like minutes? Maybe there is something more to it, server configuration, data specifics or the scenario?
I've created ~80 tables, each with 1,000,000 rows. Then every test (with different buffer pool size) goes like this:
- start server with buffer pool size X;
- run SELECT COUNT from every table, to populate buffer pool;
- check buffer pool size and status;
- run ALTER TABLE .. DISCARD TABLESPACE on one table.
The fact that after every test there would be one table less is irrelevant, because in all cases the buffer pool is sufficiently populated.
Results are below. In short, I'm getting ~0.5 seconds with 16G buffer pool, and ~3 seconds with 160G. And as far as I can tell, there is no jump after 128G, the time just grows along with the buffer pool size.
128 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+---------+ |
| Variable_name | Value |
|
+-------------------------------+---------+ |
| Innodb_buffer_pool_pages_data | 8380407 |
|
+-------------------------------+---------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 137438953472 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t79 DISCARD TABLESPACE; |
Query OK, 0 rows affected (3.01 sec) |
16 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+---------+ |
| Variable_name | Value |
|
+-------------------------------+---------+ |
| Innodb_buffer_pool_pages_data | 1040375 |
|
+-------------------------------+---------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 17179869184 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t78 DISCARD TABLESPACE; |
Query OK, 0 rows affected (0.51 sec) |
130 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+---------+ |
| Variable_name | Value |
|
+-------------------------------+---------+ |
| Innodb_buffer_pool_pages_data | 8511479 |
|
+-------------------------------+---------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 139586437120 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t77 DISCARD TABLESPACE; |
Query OK, 0 rows affected (2.06 sec) |
160 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+----------+ |
| Variable_name | Value |
|
+-------------------------------+----------+ |
| Innodb_buffer_pool_pages_data | 10477559 |
|
+-------------------------------+----------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 171798691840 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t76 DISCARD TABLESPACE; |
Query OK, 0 rows affected (2.67 sec) |
100 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+---------+ |
| Variable_name | Value |
|
+-------------------------------+---------+ |
| Innodb_buffer_pool_pages_data | 6545371 |
|
+-------------------------------+---------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 107374182400 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t75 DISCARD TABLESPACE; |
Query OK, 0 rows affected (1.97 sec) |
160 G |
+-------------------------------+-------+ |
| Variable_name | Value |
|
+-------------------------------+-------+ |
| Innodb_buffer_pool_pages_free | 8192 |
|
+-------------------------------+-------+ |
+-------------------------------+----------+ |
| Variable_name | Value |
|
+-------------------------------+----------+ |
| Innodb_buffer_pool_pages_data | 10477559 |
|
+-------------------------------+----------+ |
+---------------------------+ |
| @@innodb_buffer_pool_size |
|
+---------------------------+ |
| 171798691840 |
|
+---------------------------+ |
MariaDB [test]> ALTER TABLE t74 DISCARD TABLESPACE; |
Query OK, 0 rows affected (3.58 sec) |
They are seeing in the 10+ second range. Here is an example from one of their servers:
MariaDB [(none)]> create database test_discard;
|
Query OK, 1 row affected (0.02 sec)
|
|
MariaDB [(none)]> show global variables like 'innodb_buffer_pool_size';
|
+-------------------------+--------------+
|
| Variable_name | Value |
|
+-------------------------+--------------+
|
| innodb_buffer_pool_size | 549755813888 |
|
+-------------------------+--------------+
|
1 row in set (0.02 sec)
|
|
MariaDB [(none)]> use test_discard;
|
Database changed
|
MariaDB [test_discard]> create table t(id int primary key)engine=InnoDB;
|
Query OK, 0 rows affected (0.03 sec)
|
|
MariaDB [test_discard]> alter table t discard tablespace;
|
Query OK, 0 rows affected (10.62 sec)
|
Customer believes that it is related to this bug, which it very well may be:
https://bugs.mysql.com/bug.php?id=68184
In which case Marko may have good idea about it. And it was fixed in MySQL 8.0.
Okay, I don't have 512G to try, but i suppose if I get from 16G/0.5 sec to 100G/2 sec and 160G/3 sec, then I can reasonably extrapolate to 512G/10 sec. Thus I'll consider it confirmed.
I did two tests using 100x LinkBench database that amounts ~1T. I discarded the largest table i.e. linktable ~763G and in these tests I did not use partitioning. Used several range selects to linktable to get pages to buffer pool. Note that machine has 24G main memory and problem seems to be that mysql command line tool does not much know how to restrict its memory size.
- Using 5G buffer pool discard took < 1s. Perf data is as follows:
+ 63.95% 18.10% mysqld mysqld [.] buf_LRU_block_remove_hashed
+ 44.79% 0.13% mysqld [kernel.kallsyms] [k] page_fault
+ 44.71% 0.12% mysqld [kernel.kallsyms] [k] do_page_fault
+ 44.45% 0.24% mysqld [kernel.kallsyms] [k] __do_page_fault
+ 43.63% 0.72% mysqld [kernel.kallsyms] [k] handle_mm_fault
+ 42.22% 3.16% mysqld [kernel.kallsyms] [k] do_numa_page
+ 36.18% 0.14% mysqld [kernel.kallsyms] [k] migrate_misplaced_page
+ 32.04% 0.17% mysqld [kernel.kallsyms] [k] migrate_pages
+ 15.36% 0.33% mysqld [kernel.kallsyms] [k] move_to_new_page
+ 15.36% 11.15% mysqld mysqld [.] buf_LRU_flush_or_remove_pages
+ 12.05% 0.00% mysqld [kernel.kallsyms] [k] migrate_page
+ 10.11% 10.11% mysqld [kernel.kallsyms] [k] copy_page_rep
+ 8.18% 0.00% mysqld [unknown] [k] 0000000000000000
+ 5.89% 0.30% mysqld [kernel.kallsyms] [k] putback_lru_page
+ 5.30% 0.00% mysqld libaio.so.1.0.1 [.] 0xffff806b830a7644
+ 5.15% 0.00% mysqld [kernel.kallsyms] [k] lru_cache_add
+ 5.15% 0.38% mysqld [kernel.kallsyms] [k] __lru_cache_add
+ 5.06% 0.05% mysqld [kernel.kallsyms] [k] system_call_fastpath
- Using 10G buffer pool discard took < 1s. Perf data as follows:
58.67% 32.41% mysqld mysqld [.] buf_LRU_block_remove_hashed
+ 32.98% 22.19% mysqld mysqld [.] buf_LRU_flush_or_remove_pages
+ 28.21% 0.23% mysqld [kernel.kallsyms] [k] page_fault
+ 27.91% 0.46% mysqld [kernel.kallsyms] [k] do_page_fault
+ 27.27% 1.15% mysqld [kernel.kallsyms] [k] __do_page_fault
+ 24.26% 1.87% mysqld [kernel.kallsyms] [k] handle_mm_fault
+ 21.70% 8.62% mysqld [kernel.kallsyms] [k] do_numa_page
+ 7.81% 7.81% mysqld [kernel.kallsyms] [k] irq_return
+ 7.19% 0.00% mysqld [kernel.kallsyms] [k] migrate_misplaced_page
+ 6.80% 0.07% mysqld [kernel.kallsyms] [k] migrate_pages
Perf files attached. Note the function buf_LRU_flush_or_remove_pages.
The entry point to these functions in ALTER TABLE…DISCARD TABLESPACE is fil_discard_tablespace(), which invokes fil_delete_tablespace(id, BUF_REMOVE_ALL_NO_WRITE), which in turn invokes buf_LRU_flush_or_remove_pages(id, BUF_REMOVE_ALL_NO_WRITE, NULL).
That will invoke buf_LRU_drop_page_hash_for_tablespace() and buf_LRU_remove_pages(buf_pool, id, BUF_REMOVE_ALL_NO_WRITE, NULL) on every buffer pool. That is where the time is consumed. The intention is to remove all traces of the discarded tablespace ID from the buffer pool.
For DROP TABLE, there were performance fixes in MySQL 5.5, especially MySQL Bug #51325.
Like I remembered, the fix to DROP TABLE performance issues was to defer the eviction of the pages from the buffer pool, and instead let the buffer pool LRU replacement mechanism get rid of the orphan pages. MySQL Bug #68184 probably extended this approach to TRUNCATE TABLE.
A similar approach could indeed be applied to DISCARD TABLESPACE.
Note: If there are adaptive hash index (AHI) entries for the table that is being dropped, truncated or discarded, these must be dropped. The design constraint of the AHI is that it must represent a subset of the index contents (all AHI entries must correspond to existing records). Furthermoer, the AHI depends on the dict_index_t objects of the table, and these would be freed at least by DROP TABLE.
Side note: While looking for the above-mentioned DROP TABLE performance bug, I found two other bugs that were fixed back then, related to ROW_FORMAT=COMPRESSED: MySQL Bug #35077, MySQL Bug #61188.
Side note: the refactored TRUNCATE TABLE in MySQL 5.7 (WL#6501) and in MariaDB 10.2.2+ probably cannot use this approach of deferred deletion, because no new tablespace ID would be assigned to the table. This TRUNCATE is not compatible with hot backup (see MDEV-13564) and has introduced some performance issues. If I remember correctly, one of the issues was that some mutex was being held while dropping the adaptive hash index.
If DISCARD TABLESPACE does not evict the pages from the buffer pool, then IMPORT TABLESPACE must do it. It would not be a useful fix to simply move the call to the IMPORT TABLESPACE step.
I tried simply removing the eviction in DISCARD TABLESPACE. If the DISCARD is soon followed by IMPORT, the IMPORT would adjust the imported file to the pre-existing tablespace ID. If the old pages were not evicted from the buffer pool at DISCARD, then after IMPORT we could incorrectly read old pages from the buffer pool, instead of reading the pages that exist in the imported file.
Import is initially bypassing the buffer pool when reading pages for the adjustment phase. A possible fix is that in the adjustment phase, if a page exists in the buffer pool, we would replace it with the page from the imported file. This would allow quick DISCARD TABLESPACE even for large tables, and a slightly slower IMPORT TABLESPACE (the slowness being proportional to the file size).
I have pushed this fix to 10.0 and merged to 10.1 so far.
I spent some time merging this to 10.2.
In 10.2, there is a merge conflict for the TRUNCATE TABLE code, as anticipated.
And the solution ought to be good news for TRUNCATE performance: There should be no need to evict adaptive hash index entries or old pages from the buffer pool during TRUNCATE. I believe that it suffices to ensure that buf_page_create() or equivalent is being called when initializing new pages after TRUNCATE, and that TRUNCATE will edit the dict_index_t objects in place.
I need to re-open this one.
After testing with 10.1.29, we are still seeing slowness with 512G buffer pool size:
MariaDB [test]> drop table if exists t;
|
Query OK, 0 rows affected, 1 warning (0.00 sec)
|
|
MariaDB [test]> create table t(id int) engine=InnoDB;
|
Query OK, 0 rows affected (0.00 sec)
|
|
MariaDB [test]> alter table t discard tablespace;
|
Query OK, 0 rows affected (7.34 sec)
|
|
MariaDB [test]> drop table if exists t;
|
Query OK, 0 rows affected (0.22 sec)
|
|
MariaDB [test]> select @@version;
|
+-----------------+
|
| @@version |
|
+-----------------+
|
| 10.1.29-MariaDB |
|
+-----------------+
|
1 row in set (0.01 sec)
|
|
MariaDB [test]> select @@innodb_buffer_pool_size;
|
+---------------------------+
|
| @@innodb_buffer_pool_size |
|
+---------------------------+
|
| 549755813888 |
|
+---------------------------+
|
1 row in set (0.00 sec)
|
Is there anything further that can be done here?
ccalender, I filed MDEV-16283 for what I believe to be the remaining issue. With the smaller buffer pool that I used, the DISCARD TABLESPACE was still rather fast.
Let us track the progress in MDEV-16283.
The original fix (MDEV-13328) was that we would not evict the pages of the discarded tablespace from the buffer pool; we would do that on a subsequent IMPORT TABLESPACE only. But, we would still scan the buffer pool for adaptive hash index entries, to drop them. It actually suffices to drop the adaptive hash index entries on IMPORT TABLESPACE or on DROP TABLE or DROP INDEX. And there is no need to scan the buffer pool if index->search_info->ref_count==0 for all indexes of the table. MDEV-16283 implemented these fixes.
https://jira.mariadb.org/browse/MDEV-11658
https://jira.mariadb.org/browse/MDEV-7894
These look potentially related at a quick glance (just ones we ran across while searching the bugs db for an existing bug about this).