Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-14602

Reduce malloc()/free() usage in InnoDB

Details

    Description

      I load backup to mariadb and break it. In processlist still present one killed thread:
      {code}
      show processlist;
      ---------------------------------------------------------------------------------------------------------

      Id User Host db Command Time State Info Progress

      ---------------------------------------------------------------------------------------------------------

      85 loader 10.10.16.216:43971 phonebook_storage Killed 1845 Unlocking tables NULL 0.000
      184 root localhost NULL Query 0 init show processlist 0.000

      ---------------------------------------------------------------------------------------------------------
      {code}
      In same time I run strace for mysqld and see many calls to madvise:
      {code}

      1. strace -ff -p 33080
        ...
        [pid 60412] madvise(0x7f36cb638000, 331776, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
        [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
        {code}

      Table which been loading:
      {code}
      /*!40101 SET @saved_cs_client = @@character_set_client */;
      /*!40101 SET character_set_client = utf8 */;
      CREATE TABLE `contact` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `id_phonebook` int(10) unsigned NOT NULL DEFAULT '0',
      `id_user` int(10) unsigned NOT NULL DEFAULT '0',
      `firstname` varchar(255) NOT NULL DEFAULT '',
      `lastname` varchar(255) NOT NULL DEFAULT '',
      `local_key` varchar(255) NOT NULL DEFAULT '',
      PRIMARY KEY (`id`),
      KEY `id_phonebook` (`id_phonebook`) USING BTREE,
      KEY `id_user` (`id_user`) USING BTREE
      ) ENGINE=InnoDB AUTO_INCREMENT=133028725 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=2;
      /*!40101 SET character_set_client = @saved_cs_client */;
      {code}

      Attachments

        1. gdb.txt
          7 kB
        2. status.txt
          12 kB
        3. vars.txt
          16 kB

        Issue Links

          Activity

            madvise calls repeated:

            # strace -p 33080 -ff 2>&1 | tee /tmp/aaa
            # grep 'pid 60412' /tmp/aaa  | grep madvise | sort | uniq -c  | sort -nr     
              10491 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0
               7979 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED) = 0
               7701 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0
               6551 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED) = 0
               6464 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED) = 0
               6091 [pid 60412] madvise(0x7f36cb617000, 335872, MADV_DONTNEED) = 0
               1668 [pid 60412] madvise(0x7f36cb449000, 135168, MADV_DONTNEED) = 0
               1436 [pid 60412] madvise(0x7f36cb638000, 466944, MADV_DONTNEED) = 0
               1186 [pid 60412] madvise(0x7f36cb63a000, 466944, MADV_DONTNEED) = 0
               1184 [pid 60412] madvise(0x7f36cb638000, 331776, MADV_DONTNEED) = 0
                997 [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0
                866 [pid 60412] madvise(0x7f36cb63a000, 471040, MADV_DONTNEED) = 0
                749 [pid 60412] madvise(0x7f36cb638000, 471040, MADV_DONTNEED) = 0
                398 [pid 60412] madvise(0x7f36cb44a000, 135168, MADV_DONTNEED) = 0
                121 [pid 60412] <... madvise resumed> )     = 0
                 69 [pid 60412] madvise(0x7f36cb63a000, 331776, MADV_DONTNEED) = 0
                 34 [pid 60412] madvise(0x7f36cb63a000, 335872, MADV_DONTNEED) = 0
                 22 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED <unfinished ...>
                 22 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED <unfinished ...>
                 17 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED <unfinished ...>
                 13 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED <unfinished ...>
                 13 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED <unfinished ...>
                 10 [pid 60412] madvise(0x7f36cb4df000, 143360, MADV_DONTNEED) = 0
            ....
             
            # grep 'pid 60412' /tmp/aaa  | grep madvise | sort | uniq -c  | wc -l
            46
            # grep 'pid 60412' /tmp/aaa  | grep madvise   | wc -l
            22397
            

            46 different calls from 22397 collected

            chernomorets Sergey Chernomorets added a comment - madvise calls repeated: # strace -p 33080 -ff 2>&1 | tee /tmp/aaa # grep 'pid 60412' /tmp/aaa | grep madvise | sort | uniq -c | sort -nr 10491 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0 7979 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED) = 0 7701 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED) = 0 6551 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED) = 0 6464 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED) = 0 6091 [pid 60412] madvise(0x7f36cb617000, 335872, MADV_DONTNEED) = 0 1668 [pid 60412] madvise(0x7f36cb449000, 135168, MADV_DONTNEED) = 0 1436 [pid 60412] madvise(0x7f36cb638000, 466944, MADV_DONTNEED) = 0 1186 [pid 60412] madvise(0x7f36cb63a000, 466944, MADV_DONTNEED) = 0 1184 [pid 60412] madvise(0x7f36cb638000, 331776, MADV_DONTNEED) = 0 997 [pid 60412] madvise(0x7f36cb638000, 335872, MADV_DONTNEED) = 0 866 [pid 60412] madvise(0x7f36cb63a000, 471040, MADV_DONTNEED) = 0 749 [pid 60412] madvise(0x7f36cb638000, 471040, MADV_DONTNEED) = 0 398 [pid 60412] madvise(0x7f36cb44a000, 135168, MADV_DONTNEED) = 0 121 [pid 60412] <... madvise resumed> ) = 0 69 [pid 60412] madvise(0x7f36cb63a000, 331776, MADV_DONTNEED) = 0 34 [pid 60412] madvise(0x7f36cb63a000, 335872, MADV_DONTNEED) = 0 22 [pid 60412] madvise(0x7f36cb617000, 471040, MADV_DONTNEED <unfinished ...> 22 [pid 60412] madvise(0x7f36cb617000, 466944, MADV_DONTNEED <unfinished ...> 17 [pid 60412] madvise(0x7f36cb4df000, 135168, MADV_DONTNEED <unfinished ...> 13 [pid 60412] madvise(0x7f36cb617000, 331776, MADV_DONTNEED <unfinished ...> 13 [pid 60412] madvise(0x7f36cb441000, 135168, MADV_DONTNEED <unfinished ...> 10 [pid 60412] madvise(0x7f36cb4df000, 143360, MADV_DONTNEED) = 0 ....   # grep 'pid 60412' /tmp/aaa | grep madvise | sort | uniq -c | wc -l 46 # grep 'pid 60412' /tmp/aaa | grep madvise | wc -l 22397 46 different calls from 22397 collected

            chernomorets, can you please provide some more information? What was the SQL statement that you attempted to kill? Can you show some stack traces of the statement execution? For example, use Poor man’s profiler or Quickstack, or attach gdb to the mysqld process.

            As far as I can tell, MariaDB 10.1 calls madvise() from MyISAM and Aria, not from InnoDB. Assuming that the table definition was not interpreted as something else (say, ENGINE=InnoDB could have been interpreted as ENGINE=MyISAM), then the other usage of MyISAM or Aria would be due to internal temporary tables for query execution, say, INSERT INTO…SELECT…ORDER BY non_indexed_column.

            It would seem to me that some loop is forgetting the thd_killed() check. Inside InnoDB, that check is only done in special long-running operations, such as CHECK TABLE. Otherwise it is supposed to be checked by the SQL executor.

            marko Marko Mäkelä added a comment - chernomorets , can you please provide some more information? What was the SQL statement that you attempted to kill? Can you show some stack traces of the statement execution? For example, use Poor man’s profiler or Quickstack , or attach gdb to the mysqld process. As far as I can tell, MariaDB 10.1 calls madvise() from MyISAM and Aria, not from InnoDB. Assuming that the table definition was not interpreted as something else (say, ENGINE=InnoDB could have been interpreted as ENGINE=MyISAM ), then the other usage of MyISAM or Aria would be due to internal temporary tables for query execution, say, INSERT INTO…SELECT…ORDER BY non_indexed_column . It would seem to me that some loop is forgetting the thd_killed() check. Inside InnoDB, that check is only done in special long-running operations, such as CHECK TABLE . Otherwise it is supposed to be checked by the SQL executor.

            Sorry, I should have looked at the gdb.txt. The madvise() is internally called by the jemalloc implementation of free():

            #0  0x00007f52696dbae7 in madvise () from /lib64/libc.so.6
            #1  0x00007f526aad562e in je_pages_purge () from /lib64/libjemalloc.so.1
            #2  0x00007f526aad0a92 in arena_purge () from /lib64/libjemalloc.so.1
            #3  0x00007f526aad218d in je_arena_dalloc_large () from /lib64/libjemalloc.so.1
            #4  0x00007f522fc8c8bd in mem_heap_block_free (heap=heap@entry=0x7f36cb617000, block=0x7f36cb617000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/mem/mem0mem.cc:521
            #5  0x00007f522fcac43f in mem_heap_free_func (file_name=0x7f522fd6eae0 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc", line=1476, heap=0x7f36cb617000)
                at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/include/mem0mem.ic:510
            #6  page_zip_compress (page_zip=page_zip@entry=0x7f41bd5a6110, page=page@entry=0x7f42101e0000 "\204\354\r\222", index=index@entry=0x7f36cb4323e8, level=<optimized out>, mtr=mtr@entry=0x7f4ba1869300)
                at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:1476
            #7  0x00007f522fcadf8f in page_zip_reorganize (block=block@entry=0x7f41bd5a6100, index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300)
                at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:4683
            #8  0x00007f522fca5aa9 in page_copy_rec_list_start (new_block=0x7f41bd5a6100, block=block@entry=0x7f450a67eb60, rec=0x7f456f7d4070 "supremum", index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300)
                at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0page.cc:855
            #9  0x00007f522fb5e20c in btr_compress (cursor=cursor@entry=0x7f4ba1869220, adjust=0, mtr=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/btr/btr0btr.cc:3963
            

            This stack trace is part of the transaction rollback inside InnoDB.

            Unfortunately, this is working as designed. Sure, we should try to use fewer mem_heap_create() and mem_heap_free() operations, but the real issue is that ROLLBACK in InnoDB is slow. On the other hand, COMMIT is fast, because all changes have been already made into the persistent data structures, and all that needs to be done is to ensure that all redo log up to the transaction state change has been persistently written to the file.

            If this was an insert into an empty table or partition, then MDEV-515 would fix this particular problem, by simply writing less undo log, so that the ROLLBACK would be internally implemented as a TRUNCATE operation.

            marko Marko Mäkelä added a comment - Sorry, I should have looked at the gdb.txt . The madvise() is internally called by the jemalloc implementation of free() : #0 0x00007f52696dbae7 in madvise () from /lib64/libc.so.6 #1 0x00007f526aad562e in je_pages_purge () from /lib64/libjemalloc.so.1 #2 0x00007f526aad0a92 in arena_purge () from /lib64/libjemalloc.so.1 #3 0x00007f526aad218d in je_arena_dalloc_large () from /lib64/libjemalloc.so.1 #4 0x00007f522fc8c8bd in mem_heap_block_free (heap=heap@entry=0x7f36cb617000, block=0x7f36cb617000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/mem/mem0mem.cc:521 #5 0x00007f522fcac43f in mem_heap_free_func (file_name=0x7f522fd6eae0 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc", line=1476, heap=0x7f36cb617000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/include/mem0mem.ic:510 #6 page_zip_compress (page_zip=page_zip@entry=0x7f41bd5a6110, page=page@entry=0x7f42101e0000 "\204\354\r\222", index=index@entry=0x7f36cb4323e8, level=<optimized out>, mtr=mtr@entry=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:1476 #7 0x00007f522fcadf8f in page_zip_reorganize (block=block@entry=0x7f41bd5a6100, index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0zip.cc:4683 #8 0x00007f522fca5aa9 in page_copy_rec_list_start (new_block=0x7f41bd5a6100, block=block@entry=0x7f450a67eb60, rec=0x7f456f7d4070 "supremum", index=index@entry=0x7f36cb4323e8, mtr=mtr@entry=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/page/page0page.cc:855 #9 0x00007f522fb5e20c in btr_compress (cursor=cursor@entry=0x7f4ba1869220, adjust=0, mtr=0x7f4ba1869300) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.1.29/storage/innobase/btr/btr0btr.cc:3963 This stack trace is part of the transaction rollback inside InnoDB. Unfortunately, this is working as designed. Sure, we should try to use fewer mem_heap_create() and mem_heap_free() operations, but the real issue is that ROLLBACK in InnoDB is slow. On the other hand, COMMIT is fast, because all changes have been already made into the persistent data structures, and all that needs to be done is to ensure that all redo log up to the transaction state change has been persistently written to the file. If this was an insert into an empty table or partition, then MDEV-515 would fix this particular problem, by simply writing less undo log, so that the ROLLBACK would be internally implemented as a TRUNCATE operation.

            I only know of one work-around to this problem: Kill the server, delete the tablename.ibd file (not tablename.frm), and restart the server. The rollback will still be executed after restart, but it will be much faster, because the data will be missing. I think that DROP TABLE should still work, removing the .frm file and the InnoDB internal data dictionary entries. Of course, this advice should be tried with caution. Test in a staging environment first.

            marko Marko Mäkelä added a comment - I only know of one work-around to this problem: Kill the server, delete the tablename.ibd file (not tablename.frm ), and restart the server. The rollback will still be executed after restart, but it will be much faster, because the data will be missing. I think that DROP TABLE should still work, removing the .frm file and the InnoDB internal data dictionary entries. Of course, this advice should be tried with caution. Test in a staging environment first.

            Thank for answer!

            I loaded many "INSERT INTO...VALUES" from mysqldump.

            This case is not very important for me (it was interrupted experiment with backup recovery), but such state of mysqld seems very strange for me, so I create this bug.

            chernomorets Sergey Chernomorets added a comment - Thank for answer! I loaded many "INSERT INTO...VALUES" from mysqldump. This case is not very important for me (it was interrupted experiment with backup recovery), but such state of mysqld seems very strange for me, so I create this bug.
            chernomorets Sergey Chernomorets added a comment - - edited

            Marko, I see 10491 calls of "madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0" and many duplicates with several other arguments, is it correct behaviour?

            chernomorets Sergey Chernomorets added a comment - - edited Marko, I see 10491 calls of "madvise(0x7f36cb617000, 466944, MADV_DONTNEED) = 0" and many duplicates with several other arguments, is it correct behaviour?

            chernomorets, the madvise() system calls are apparently how jemalloc lets the operating system reclaim the memory in free(). I assume that malloc() would allocate private anonymous pages with mmap(). Maybe other memory allocation libraries use munmap() or something else; I have not checked lately.

            That InnoDB uses an excessive amount of heap memory allocations is a known problem. Fixing it would require massive changes to the code, and it is not feasible to do in a GA release.

            Furthermore, I think that the usefulness of ROW_FORMAT=COMPRESSED is limited. When I designed and implemented it in 2005, data file fragmentation was considered to be an issue, and also FALLOC_FL_PUNCH_HOLE did not exist. With flash-based storage, PAGE_COMPRESSED=YES ought to work work better.

            marko Marko Mäkelä added a comment - chernomorets , the madvise() system calls are apparently how jemalloc lets the operating system reclaim the memory in free() . I assume that malloc() would allocate private anonymous pages with mmap() . Maybe other memory allocation libraries use munmap() or something else; I have not checked lately. That InnoDB uses an excessive amount of heap memory allocations is a known problem. Fixing it would require massive changes to the code, and it is not feasible to do in a GA release. Furthermore, I think that the usefulness of ROW_FORMAT=COMPRESSED is limited. When I designed and implemented it in 2005, data file fragmentation was considered to be an issue, and also FALLOC_FL_PUNCH_HOLE did not exist. With flash-based storage, PAGE_COMPRESSED=YES ought to work work better.

            Thanks a lot, Marko!

            chernomorets Sergey Chernomorets added a comment - Thanks a lot, Marko!

            I retitled the bug, because while I consider my implementation of ROW_FORMAT=COMPRESSED in InnoDB obsolete by now, something definitely needs to be done about the frequent memory heap operations in InnoDB.

            marko Marko Mäkelä added a comment - I retitled the bug, because while I consider my implementation of ROW_FORMAT=COMPRESSED in InnoDB obsolete by now, something definitely needs to be done about the frequent memory heap operations in InnoDB.

            In MDEV-32050, some copying of undo log records during transaction rollback, a special case of commit (MDEV-15250) and purge was removed. We will still copy undo log records from the buffer pool to heap during MVCC reads.

            marko Marko Mäkelä added a comment - In MDEV-32050 , some copying of undo log records during transaction rollback, a special case of commit ( MDEV-15250 ) and purge was removed. We will still copy undo log records from the buffer pool to heap during MVCC reads.

            For the record, I just quickly checked the following test:

            ./mtr --rr --skip-ssl --mysqld=--innodb-stats-persistent=0 innodb.insert_into_empty,4k
            

            A significant portion of heap memory allocation operations during this test, which is inserting data into empty tables is originating from the internal SQL interpreter in InnoDB, which is used for updating persistent statistics. The above command disables that, as well as the use of SSL, which is another notable source of malloc() calls. There is quite a bit of heap memory allocation going on also outside storage engines, for example related to acquiring metadata locks.

            Furthermore, while working on MDEV-29445 I recently learned that MADV_DONTNEED actually requests the memory to be unmapped immediately. For workloads that frequently free and allocate memory, it would seem to make sense to defer the release of memory to the operating system kernel, or to request the operating system to defer the modification of the MMU mappings (MADV_FREE).

            marko Marko Mäkelä added a comment - For the record, I just quickly checked the following test: ./mtr --rr --skip-ssl --mysqld=--innodb-stats-persistent=0 innodb.insert_into_empty,4k A significant portion of heap memory allocation operations during this test, which is inserting data into empty tables is originating from the internal SQL interpreter in InnoDB, which is used for updating persistent statistics. The above command disables that, as well as the use of SSL, which is another notable source of malloc() calls. There is quite a bit of heap memory allocation going on also outside storage engines, for example related to acquiring metadata locks. Furthermore, while working on MDEV-29445 I recently learned that MADV_DONTNEED actually requests the memory to be unmapped immediately. For workloads that frequently free and allocate memory, it would seem to make sense to defer the release of memory to the operating system kernel, or to request the operating system to defer the modification of the MMU mappings ( MADV_FREE ).

            People

              marko Marko Mäkelä
              chernomorets Sergey Chernomorets
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.