[MDEV-24051] Remapping .text and .data application segments to huge pages Created: 2020-10-29  Updated: 2023-09-19

Status: Open
Project: MariaDB Server
Component/s: Server
Fix Version/s: 10.6

Type: Task Priority: Major
Reporter: Dmitriy Philimonov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: foundation, performance

Attachments: File mmap_ksm.c     File remapping_text_and_data_to_huge_pages.v10_6_0.git.diff.v1    

 Description   

Applications usually benefit from remapping the .text and .data ELF sections to huge pages. The performance speedup comes from a significant reduction in iTLB and dTLB misses. Of course, the approach isn't new; example implementations at the moment are:

libhugetlbfs uses huge pages, while Google/Facebook rely on transparent huge pages. We decided to follow the approach used by libhugetlbfs, since it depends less on the particular kernel allocation/defragmentation algorithm and therefore gives more consistent results.

We tried libhugetlbfs, but it currently has four major drawbacks:
1. A bug with position-independent executables (linked with the '--pie' parameter): https://github.com/libhugetlbfs/libhugetlbfs/issues/49
2. It may unmap the heap segment, which immediately follows the data segment on popular systems (e.g. Linux).
3. It supports remapping of at most 3 ELF segments.
4. No integration with the target application: it works silently during startup.

So a custom implementation is provided, well adjusted to the MySQL code base:
1. No issues with position-independent code / additional virtual memory randomization.
2. Tested with the lld/gold/bfd linkers.
3. Preserves the heap segment from unmapping; tested with the standard glibc and jemalloc allocators.
4. Since it's now part of the mysqld code, any number of segments can be specified (currently 16).
5. Integration with the MySQL code base: a configuration variable turns the functionality on, and the current logging system is used for error/notification messages.

Performance increase is up to 9% in sysbench OLTP_PS.

Restrictions:
1. Currently works with 2 MB huge pages only.
2. Needs to be linked in a specific way (additional alignment for ELF segments).
3. Support is provided only for Linux systems (tested on kernels >= 3.10).

For more information, refer to the documentation inside sql/huge.cc (contained in the patch).

The patch is tested against the commit "5d4599f9750140f92cfdbbe4d292ae1b8dd456f8" (v10.6.0).

I submit this contribution under the New BSD License (in compliance with https://mariadb.org/easier-licensing-for-mariadb-contributors).



 Comments   
Comment by Daniel Black [ 2020-11-12 ]

dmitriy.philimonov thanks for your contribution. I'm reading it through and am impressed with the 7% TLB miss reduction. Do you have a breakdown of iTLB vs dTLB? Was there a QPS speedup, a drop in latency, or less query speed jitter? What was done in testing to exhaust the TLB cache and raise its miss rate?

I'm not sure if you noted my previous work in mysys/my_largepage.c that exposes multiple page sizes and a my_next_large_page_size iterator. Is this usable to allow a page size other than 2 MB? As you know, some hardware doesn't support this size. I understand part of the 2 MB requirement is the linker flags, but I think the executable could detect a suitable page size from the alignment and size of its ELF segments and use a different size.

I'm a bit concerned about allowing write flags on the text segment. With a ~30 MB text segment and 2 MB (or even 16 MB on other arches) large page granularity, I think there's enough room to keep the text in its own huge pages with r-xp, even if that wastes up to most of a single huge page.

So I see you've used the hugetlbfs mount point and an explicit temp file to grab a hugetlb page. Would a my_large_malloc be sufficient to grab an anonymous mmap of a hugetlb page? I'm happy to alter this routine to ensure that it is a huge page; otherwise it's not worth copying. If that works, is there still a need for hugetlbfs mounts? Removing that requirement was a goal in supporting large pages for data (which still isn't documented; notes are in MDEV-22135).

Other questions:

  • How does it play with ASAN / UBSAN?
  • An additional segment like MDEV-21145 is probably only a minor change?

more detailed:

  • Not sure that large-pages-for-code needs an enforce option; a warning if there are insufficient large pages might be enough
  • CMake - I think MY_CHECK_AND_SET_LINKER_FLAG can be split to expose a check for the linker flags
  • sql/huge.cc -> mysys/my_elfremap (or something; I'm mainly concerned that the sql directory really contains SQL-focused bits). This allows reuse in things like mariabackup
  • a few pieces of your patch, like memory_logger, string buffers, readers, and MY_ALIGN, duplicate existing work. As much as possible I'd like to see existing functions used, or extended if they don't suit the needs.
  • I hope with anon mmap the hugetlbfs parsing is not needed.

I am interested in this and will work with you to get a polished version merged.

Comment by Daniel Black [ 2020-11-16 ]

Is there any work being done on the glibc/Linux kernel loader to do this directly on the first load into memory, rather than each application having to do the move itself?

Comment by Dmitriy Philimonov [ 2020-11-20 ]

Good day, Daniel

1. We got a 9% TPS speedup. iTLB and dTLB misses were reduced by 38x and 5x respectively.

2. Latency became lower, and the jitter in TPS reduced significantly. The typical noise dropped to 0.05% (TPS/OLTP_PS).

3. We didn't exhaust the TLB cache manually; the perf results are sufficient for us.

4. I used your work in mysys/my_largepage.c to support huge pages for the buffer pool. Thanks a lot, nice job. However, huge pages for the buffer pool give us no more than a 1.5% TPS speedup (x86-64/1GB pages and aarch64/16GB pages for a 64GB buffer pool). By the way, could you share your benchmark results for the buffer pool with huge pages?
Of course, I'm aware of multiple page size support and the fact that some hardware doesn't support 2 MB huge pages, and we thought a lot about calculating HUGEPAGE_SIZE on the fly depending on the current hardware. We depend on the linker flags for security reasons:

  • You're absolutely correct: we don't want to assign write flags to the .text segment. Security matters.
  • It depends on the linker how many sections are produced. E.g. ld.gold generates 2 sections (.text r-x and .data rw-), while ld.lld generates 4 sections (.text r--, .text r-x, .data rw-, .data rw-).
  • We use linker flags to align each section to 2 MB.
  • We remap each section to its own memory mapping with its own memory protection flags (mostly .text: r-x, .data: rw-).
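The link-time alignment mentioned above can be requested with standard ld options (this is an illustrative invocation, not the patch's actual CMake wiring; the `-z` flag names themselves are real GNU ld/gold/lld options):

```shell
# Align ELF LOAD segments to the 2 MB huge page boundary at link time.
gcc -o mysqld_example main.o \
    -Wl,-z,max-page-size=0x200000 \
    -Wl,-z,common-page-size=0x200000

# Verify the resulting segment alignment (look at the Align column):
readelf -lW mysqld_example | grep LOAD
```

The cost of the larger alignment is padding in the on-disk binary and in the address space, which is why the page size used here and HUGEPAGE_SIZE in the source have to agree.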

So both HUGEPAGE_SIZE and the linker flags in CMake should ideally be the same. There are a couple of ideas:

  • align each segment to a bigger value (e.g. 32 MB) and then use the smallest available / most appropriate huge page for the current segment, taking its size into consideration. However, the binary size then becomes (32*N) MB (N = number of sections), which is unacceptable in most cases. This means the alignment page size has to be picked as a trade-off between efficiency and binary size. In any case, it should be carefully tested.
  • detect the available huge pages during CMake configuration, then choose the smallest possible one (>= some threshold) and use it for source configuration and linker flags. Again, this should be tested.
  • is it reliable enough to use the alignment of the ELF section as the HUGEPAGE_SIZE value?

We don't have CPUs with "exotic" small huge pages to test these assumptions on, so we provide you with the code that is tested.

5. We partially remap the [heap] segment if it follows the ELF sections too closely (otherwise the data at the starting addresses of the [heap] segment would be lost). During this procedure, we append rw- flags to the last program segment. All linkers I've tested generate the .text segment first, then the .data segment. Moreover, it's quite common on Linux for the .bss segment to reside at the beginning of the [heap] segment (from the Linux point of view). So it's almost certain we don't assign the write flag to the text segment. Additionally, see paragraph 4.

6. Of course, we first tried anonymous huge pages via mmap(nullptr, ... MAP_FIXED | MAP_ANON | MAP_HUGETLB | MAP_HUGE_2MB). However, with this approach you must eventually use mremap() to substitute the virtual addresses where the .text and .data sections reside. Unfortunately, mremap() doesn't work with huge pages; see the kernel sources (https://github.com/torvalds/linux/blob/master/mm/mremap.c):

if (is_vm_hugetlb_page(vma))
  return ERR_PTR(-EINVAL);

So we introduced a workaround with hugetlbfs, where you additionally have a file descriptor associated with the mapping. If you know how to get by with anonymous huge pages, please tell us. Keeping an accessible, correct hugetlbfs mount point is inconvenient for production systems.

7. We didn't test it with ASAN/UBSAN.

8. After applying our patch, the problem described in MDEV-21145 might be mitigated (this should be checked). In any case, if you add an additional LOAD section to the binary, our patch should work correctly.

9. We introduced ENFORCE mode for:

  • production systems: we believe the people responsible for deploying and operating software systems at scale may prefer their systems to fail fast rather than silently fall back to some suboptimal/reduced mode;
  • automated benchmarks: we want to ensure our benchmarks use the correct configuration with maximum performance.

10. No objections to the rename: sql/huge.cc -> mysys/my_elfremap.

11. There are two major reasons for the additional "util" (memory_logger/string buffers/readers) inside huge.cc:

  • major: this "util" works with the stack only; it doesn't use heap memory. I work with brk/sbrk directly and manage raw addresses, and I'm wary of race conditions here. So I chose to be on the safe side during refactorings and reviews: even logging doesn't use malloc/operator new, explicitly or implicitly. Otherwise the code could become too complex, with unpredictable crashes.
  • minor: this code now looks like a library and can be used in any project, not only the MySQL family.

12. We aren't aware of any work being done by glibc/kernel developers to automate remapping ELF sections to huge pages.

Comment by Daniel Black [ 2020-12-15 ]

Hi,

I'm not going to be able to test this further this year.

Looking at the kernel reference provided for mremap, it seems the no-huge-pages restriction has been there for at least 11 years. Based on this implementation we can both conclude that huge pages can be executable memory, so maybe the constraint isn't necessary any longer. Could you try removing it from the kernel and see whether mremap of anonymous huge page memory works in a general system test?

Having done a similar removal of a wrong huge pages limitation, I can attest that the Linux kernel people are quite friendly and will listen to well-reasoned/tested changes. As with my change below (my first kernel patch, and it was accepted), the linux-mm mailing list should be your first stop; also look closely over the kernel documentation for submitting patches or RFCs.
https://marc.info/?l=linux-mm&m=153828641520732&w=2

Seek assistance from the Google and Facebook people who have committed similar things. The https://www.spinics.net/lists/linux-api/ list will also be a good place to communicate with libc and kernel developers on getting a general solution for loading applications into huge pages.

I will look at this again next year. However, I think a closer look at getting the generic functions into the kernel and loaders will help the general case, place less of a burden on application developers to maintain large invasive patches like this, and put the functionality with the people best able to understand it.

Comment by Dmitriy Philimonov [ 2020-12-15 ]

Good day, Daniel

I wish I could invest more time in the Linux kernel research; regrettably, I've already spent a huge amount of time on this feature, much more than was originally planned. At the moment I'm obliged to switch to other priorities. Moreover, in industry we often need a solution right here and now; waiting for the same functionality to be implemented elsewhere is unfortunately not an option.

Thank you for sharing your experience and for the links. If our priorities change, I'll follow your example.

Comment by Dmitriy Philimonov [ 2020-12-18 ]

Good day, Daniel.

We provide a bug fix to the current contribution. It changes the huge page mapping flag from MAP_SHARED to MAP_PRIVATE.
So, after the remapping to huge pages is done, it:

  • fixes sporadic SIGSEGVs in child and parent processes if fork() is called after the remapping is done;
  • lets gdb attach and service breakpoints again;
  • makes gdb work correctly with produced core dumps.

Updated patch is added to the attachments.
I submit this under the New BSD License (in compliance with https://mariadb.org/easier-licensing-for-mariadb-contributors).

Comment by Daniel Black [ 2021-02-11 ]

I asked the libc folks, and they suggested just implementing the linker flags and relying on THP to get the segments in order.

https://sourceware.org/pipermail/libc-alpha/2021-February/122334.html

So I'm going to do that as a first cut and see how much it reduces the TLB misses. If there's still a case for more afterwards, we can take it up with the libc folks. Doing it there (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-map-segments.h;hb=HEAD) means you wouldn't need the remap and it would be comparatively straightforward; if you're interested, however, ask for approval first.

I'm sorry: I was so amazed at the implementation that I was a bit slow to communicate how far beyond the skill set of a database userspace developer maintaining this code would be.

Comment by Dmitriy Philimonov [ 2022-01-10 ]

Happy New Year!

We shared our experience remapping code segments to huge pages in this article and open-sourced the code on GitHub. The published code differs significantly from the patch I shared a year ago: it became simpler and more robust. Since the ticket is still open, I think it would be useful for your project.

P.S. For Russian speaking people there's a Russian blog post on habr.

Comment by Daniel Black [ 2022-01-11 ]

Happy new year dmitriy.philimonov. FWIW, I was looking at the LD_PRELOAD path with mmap_ksm.c, with the intent of modifying the appropriate mmap flags, as a model both for this and for KSM (multiple MariaDB instances). I haven't quite got it working. I'll look into your code too.

Comment by Daniel Black [ 2022-11-23 ]

FYI From glibc-2.35

  • On Linux, a new tunable, glibc.malloc.hugetlb, can be used either to
    make malloc issue madvise with MADV_HUGEPAGE on mmap and sbrk memory,
    or to use huge pages directly via mmap calls with the MAP_HUGETLB
    flag. The former can improve performance when Transparent Huge Pages
    is set to 'madvise' mode, while the latter uses the system-reserved
    huge pages.
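Illustrative usage of that tunable (the server binary name and value shown here are an example, not a recommendation from this ticket):

```shell
# glibc >= 2.35:
#   glibc.malloc.hugetlb=1  madvise(MADV_HUGEPAGE) on malloc's mmap/sbrk
#                           memory (needs THP in 'madvise' mode)
#   glibc.malloc.hugetlb=2  MAP_HUGETLB mappings from reserved huge pages
GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ./mariadbd
```

Note this covers malloc'd memory only; the .text/.data remapping discussed in this ticket is still a separate problem.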
Generated at Thu Feb 08 09:27:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.