[MDEV-14602] Reduce malloc()/free() usage in InnoDB Created: 2017-12-07  Updated: 2024-01-18

Status: Confirmed
Project: MariaDB Server
Component/s: Data Manipulation - Insert
Fix Version/s: None

Type: Task Priority: Major
Reporter: Sergey Chernomorets Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: memory, performance

Attachments: Text File gdb.txt     Text File status.txt     Text File vars.txt    
Issue Links:
Relates
relates to MDEV-515 innodb bulk insert Closed
relates to MDEV-21225 Remove ut_align() and use aligned_mal... Closed
relates to MDEV-32050 UNDO logs still growing for write-int... Closed
relates to MDEV-30889 Memory leak issues with MariaDB 10.6.... Stalled

 Description   

I load backup to mariadb and break it. In processlist still present one killed thread:
{code}
show processlist;
---------------------------------------------------------------------------------------------------------

Id User Host db Command Time State Info Progress

---------------------------------------------------------------------------------------------------------

85 loader 10.10.16.216:43971 phonebook_storage Killed 1845 Unlocking tables NULL 0.000
184 root localhost NULL Query 0 init show processlist 0.000

---------------------------------------------------------------------------------------------------------
{code}
In same time I run strace for mysqld and see many calls to madvise:
{code}

  1. strace -ff -p 33080
    ...
    [pid 60412] madvise(0x7f36cb638000, 331776, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
    [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
    {code}

Table which been loading:
{code}
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `contact` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_phonebook` int(10) unsigned NOT NULL DEFAULT '0',
`id_user` int(10) unsigned NOT NULL DEFAULT '0',
`firstname` varchar(255) NOT NULL DEFAULT '',
`lastname` varchar(255) NOT NULL DEFAULT '',
`local_key` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `id_phonebook` (`id_phonebook`) USING BTREE,
KEY `id_user` (`id_user`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=133028725 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=2;
/*!40101 SET character_set_client = @saved_cs_client */;
{code}



 Comments   
Comment by Sergey Chernomorets [ 2017-12-07 ]

madvise calls repeated:

# strace -p 33080 -ff 2>&1 | tee /tmp/aaa
# grep 'pid 60412' /tmp/aaa  | grep madvise | sort | uniq -c  | sort -nr     
  10491 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0
   7979 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED) = 0
   7701 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
   6551 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED) = 0
   6464 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED) = 0
   6091 [pid 60412] madvise(0x7f36cb617000, 335872, MADV_DONTNEED) = 0
   1668 [pid 60412] madvise(0x7f36cb449000, 135168, MADV_DONTNEED) = 0
   1436 [pid 60412] madvise(0x7f36cb638000, 466944, MADV_DONTNEED) = 0
   1186 [pid 60412] madvise(0x7f36cb63a000, 466944, MADV_DONTNEED) = 0
   1184 [pid 60412] madvise(0x7f36cb638000, 331776, MADV_DONTNEED) = 0
    997 [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
    866 [pid 60412] madvise(0x7f36cb63a000, 471040, MADV_DONTNEED) = 0
    749 [pid 60412] madvise(0x7f36cb638000, 471040, MADV_DONTNEED) = 0
    398 [pid 60412] madvise(0x7f36cb44a000, 135168, MADV_DONTNEED) = 0
    121 [pid 60412] <... madvise resumed> )     = 0
     69 [pid 60412] madvise(0x7f36cb63a000, 331776, MADV_DONTNEED) = 0
     34 [pid 60412] madvise(0x7f36cb63a000, 335872, MADV_DONTNEED) = 0
     22 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED <unfinished ...>
     22 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED <unfinished ...>
     17 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED <unfinished ...>
     13 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED <unfinished ...>
     13 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED <unfinished ...>
     10 [pid 60412] madvise(0x7f36cb4df000, 143360, MADV_DONTNEED) = 0
....
 
# grep 'pid 60412' /tmp/aaa  | grep madvise | sort | uniq -c  | wc -l
46
# grep 'pid 60412' /tmp/aaa  | grep madvise   | wc -l
22397

46 different calls from 22397 collected

Comment by Marko Mäkelä [ 2017-12-12 ]

chernomorets, can you please provide some more information? What was the SQL statement that you attempted to kill? Can you show some stack traces of the statement execution? For example, use Poor man’s profiler or Quickstack, or attach gdb to the mysqld process.

As far as I can tell, MariaDB 10.1 calls madvise() from MyISAM and Aria, not from InnoDB. Assuming that the table definition was not interpreted as something else (say, ENGINE=InnoDB could have been interpreted as ENGINE=MyISAM), then the other usage of MyISAM or Aria would be due to internal temporary tables for query execution, say, INSERT INTO…SELECT…ORDER BY non_indexed_column.

It would seem to me that some loop is forgetting the thd_killed() check. Inside InnoDB, that check is only done in special long-running operations, such as CHECK TABLE. Otherwise it is supposed to be checked by the SQL executor.

Comment by Marko Mäkelä [ 2017-12-12 ]

Sorry, I should have looked at the gdb.txt. The madvise() is internally called by the jemalloc implementation of free():

#0  0x00007f52696dbae7 in madvise () from /lib64/libc.so.6
#1  0x00007f526aad562e in je_pages_purge () from /lib64/libjemalloc.so.1
#2  0x00007f526aad0a92 in arena_purge () from /lib64/libjemalloc.so.1
#3  0x00007f526aad218d in je_arena_dalloc_large () from /lib64/libjemalloc.so.1
#4  0x00007f522fc8c8bd in mem_heap_block_free (heap=heap@entry=0x7f36cb617000, block=0x7f36cb617000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/mem/mem0mem.cc:521
#5  0x00007f522fcac43f in mem_heap_free_func (file_name=0x7f522fd6eae0 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc", line=1476, heap=0x7f36cb617000)
    at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/include/mem0mem.ic:510
#6  page_zip_compress (page_zip=page_zip@entry=0x7f41bd5a6110, page=page@entry=0x7f42101e0000 "\204\354\r\222", index=index@entry=0x7f36cb4323e8, level=<optimized out>, mtr=mtr@entry=0x7f4ba1869300)
    at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:1476
#7  0x00007f522fcadf8f in page_zip_reorganize (block=block@entry=0x7f41bd5a6100, index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300)
    at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:4683
#8  0x00007f522fca5aa9 in page_copy_rec_list_start (new_block=0x7f41bd5a6100, block=block@entry=0x7f450a67eb60, rec=0x7f456f7d4070 "supremum", index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300)
    at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0page.cc:855
#9  0x00007f522fb5e20c in btr_compress (cursor=cursor@entry=0x7f4ba1869220, adjust=0, mtr=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/btr/btr0btr.cc:3963

This stack trace is part of the transaction rollback inside InnoDB.

Unfortunately, this is working as designed. Sure, we should try to use fewer mem_heap_create() and mem_heap_free() operations, but the real issue is that ROLLBACK in InnoDB is slow. On the other hand, COMMIT is fast, because all changes have been already made into the persistent data structures, and all that needs to be done is to ensure that all redo log up to the transaction state change has been persistently written to the file.

If this was an insert into an empty table or partition, then MDEV-515 would fix this particular problem, by simply writing less undo log, so that the ROLLBACK would be internally implemented as a TRUNCATE operation.

Comment by Marko Mäkelä [ 2017-12-12 ]

I only know of one work-around to this problem: Kill the server, delete the tablename.ibd file (not tablename.frm), and restart the server. The rollback will still be executed after restart, but it will be much faster, because the data will be missing. I think that DROP TABLE should still work, removing the .frm file and the InnoDB internal data dictionary entries. Of course, this advice should be tried with caution. Test in a staging environment first.

Comment by Sergey Chernomorets [ 2017-12-12 ]

Thank for answer!

I loaded many "INSERT INTO...VALUES" from mysqldump.

This case is not very important for me (it was interrupted experiment with backup recovery), but such state of mysqld seems very strange for me, so I create this bug.

Comment by Sergey Chernomorets [ 2017-12-12 ]

Marko, I see 10491 calls of "madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0" and many duplicates with several other arguments, is it correct behaviour?

Comment by Marko Mäkelä [ 2017-12-12 ]

chernomorets, the madvise() system calls are apparently how jemalloc lets the operating system reclaim the memory in free(). I assume that malloc() would allocate private anonymous pages with mmap(). Maybe other memory allocation libraries use munmap() or something else; I have not checked lately.

That InnoDB uses an excessive amount of heap memory allocations is a known problem. Fixing it would require massive changes to the code, and it is not feasible to do in a GA release.

Furthermore, I think that the usefulness of ROW_FORMAT=COMPRESSED is limited. When I designed and implemented it in 2005, data file fragmentation was considered to be an issue, and also FALLOC_FL_PUNCH_HOLE did not exist. With flash-based storage, PAGE_COMPRESSED=YES ought to work work better.

Comment by Sergey Chernomorets [ 2017-12-12 ]

Thanks a lot, Marko!

Comment by Marko Mäkelä [ 2019-03-29 ]

I retitled the bug, because while I consider my implementation of ROW_FORMAT=COMPRESSED in InnoDB obsolete by now, something definitely needs to be done about the frequent memory heap operations in InnoDB.

Comment by Marko Mäkelä [ 2023-12-11 ]

In MDEV-32050, some copying of undo log records during transaction rollback, a special case of commit (MDEV-15250) and purge was removed. We will still copy undo log records from the buffer pool to heap during MVCC reads.

Generated at Thu Feb 08 08:14:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.