[MDEV-24051] Remapping .text and .data application segments to huge pages Created: 2020-10-29 Updated: 2023-09-19 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Server |
| Fix Version/s: | 10.6 |
| Type: | Task | Priority: | Major |
| Reporter: | Dmitriy Philimonov | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | foundation, performance | ||
| Attachments: |
|
| Description |
|
Applications usually benefit from remapping the .text and .data ELF sections to huge pages. The performance speedup comes from a significant reduction in iTLB and dTLB misses. Of course, the approach isn't new; the example implementations at the moment are:
libhugetlbfs uses huge pages, while Google/Facebook rely on transparent huge pages. We decided to follow the approach used by libhugetlbfs, since it depends less on the particular kernel allocation/defragmentation algorithm and so provides more consistent results. We tried libhugetlbfs, but it currently has four major drawbacks: So a custom implementation is provided, well adjusted to the MySQL code base: The performance increase is up to 9% in sysbench OLTP_PS. Restrictions: For more information refer to the documentation inside sql/huge.cc (contained in the patch). The patch is tested against commit "5d4599f9750140f92cfdbbe4d292ae1b8dd456f8" (v10.6.0). I submit this contribution under the New BSD License (in compliance with https://mariadb.org/easier-licensing-for-mariadb-contributors). |
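The first step of any such remapper is locating the live .text segment in the running process. A minimal sketch of that step, assuming Linux and /proc; `find_text_segment` is a hypothetical helper name, not the patch's actual API:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan /proc/self/maps for the first executable
 * (r-xp) mapping -- the text segment a remapper would need to move
 * onto huge pages. Returns 1 and fills [start,end) on success. */
static int find_text_segment(unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f)) {
        unsigned long lo, hi;
        char perms[5];
        /* each maps line starts with "LO-HI PERMS ..." */
        if (sscanf(line, "%lx-%lx %4s", &lo, &hi, perms) == 3 &&
            strcmp(perms, "r-xp") == 0) {
            *start = lo;
            *end = hi;
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A real implementation would also round the range to the huge page size and distinguish the main binary from shared libraries.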
| Comments |
| Comment by Daniel Black [ 2020-11-12 ] | ||
|
dmitriy.philimonov thanks for your contribution. I'm reading it through and am impressed with the 7% TLB miss reduction. Do you have a breakdown of iTLB vs dTLB? Was there a QPS speedup, a drop in latency, or less query speed jitter? What was done in testing to exhaust the TLB cache and raise its miss rate?

I'm not sure if you noted my previous work in mysys/my_largepage.c that exposes multiple page sizes and a my_next_large_page_size iterator. Is this usable to allow a page size other than 2MB? As you know, some hardware doesn't support this size. I understand part of the 2MB is the linker flags, but I think the executable code could detect a suitable page size from the alignment and size of the ELF segments and use a different size.

I'm a bit concerned about allowing write flags on the text segment. With a ~30M text segment and 2MB (or even 16MB on other arches) large page granularity, I think there's enough to keep the text in its own huge pages with r-xp, even if it does waste up to most of a single huge page.

So I see you've used the hugetlbfs mount point and an explicit temp file to grab a hugetlb page. Would a my_large_malloc be sufficient to grab an anonymous mmap of a hugetlb page? I'm happy to alter this routine to ensure that it is a huge page, otherwise it's not worth copying. If that works, is there still a need for hugetlbfs mounts? Removing that requirement was a goal in supporting large pages for data (which still isn't documented - notes are in MDEV-22135). Other questions:
In more detail:
I am interested in this and will work with you to get a polished version merged. | ||
| Comment by Daniel Black [ 2020-11-16 ] | ||
|
Is there any work being done on a glibc/Linux kernel loader to perform this remapping directly on first load into memory, rather than each application having to do the move itself? | ||
| Comment by Dmitriy Philimonov [ 2020-11-20 ] | ||
|
Good day, Daniel

1. We got a 9% TPS speedup. iTLB and dTLB misses were reduced by factors of 38 and 5 respectively.
2. Latency became lower, and the jitter in TPS was reduced significantly. The typical noise dropped to 0.05% (TPS/OLTP_PS).
3. We didn't exhaust the TLB cache manually; the perf results are sufficient for us.
4. I used your work in mysys/my_largepage.c to support huge pages for the buffer pool. Thanks a lot, nice job. However, the performance improvement from huge pages for the buffer pool gives us no more than 1.5% TPS speedup (x86-64/1GB pages and aarch64/16GB pages for a 64GB buffer pool). By the way, could you share your benchmark results for the buffer pool using huge pages?
So both HUGEPAGE_SIZE and the linker flags in cmake should ideally be the same. There are a couple of ideas:
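On the page-size question discussed above, the sizes the running kernel supports can be probed at runtime instead of being hard-coded. A sketch under the assumption of Linux sysfs layout; both function names are hypothetical:

```c
#include <stdio.h>
#include <dirent.h>

/* Parse a /sys/kernel/mm/hugepages entry such as "hugepages-2048kB"
 * into a size in bytes; returns 0 if the name doesn't match.
 * (Loose parse for a sketch: the trailing "kB" is not strictly checked.) */
static unsigned long hugepage_dir_to_bytes(const char *name)
{
    unsigned long kb;
    if (sscanf(name, "hugepages-%lukB", &kb) != 1)
        return 0;
    return kb * 1024UL;
}

/* Enumerate supported huge page sizes into out[]; returns the count. */
static int list_hugepage_sizes(unsigned long *out, int max)
{
    DIR *d = opendir("/sys/kernel/mm/hugepages");
    struct dirent *e;
    int n = 0;

    if (!d)
        return 0;
    while (n < max && (e = readdir(d)) != NULL) {
        unsigned long b = hugepage_dir_to_bytes(e->d_name);
        if (b)
            out[n++] = b;
    }
    closedir(d);
    return n;
}
```

A build that aligns segments to 2MB could then pick, at startup, the largest supported size that divides the segment alignment.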
We don't have CPUs with "exotic" small huge pages to test our assumptions, so we provide you with the code that we have actually tested.

5. We partially remap the [heap] segment if it follows the ELF sections too closely (otherwise the data at the starting addresses of the [heap] segment is lost). During this procedure, we append rw- flags to the last program segment. All linkers I've tested generate the .text segment first, then the .data segment. Moreover, it's quite common on Linux for the .bss segment to reside at the beginning of the [heap] segment (from the Linux point of view). So it's almost certain we don't assign the write flag to the text segment. Additionally, see paragraph 4.

6. Of course, we first tried anonymous huge pages via mmap(nullptr, ... MAP_FIXED | MAP_ANON | MAP_HUGETLB | MAP_HUGE_2MB). However, with this approach you must eventually use mremap() to substitute the virtual addresses where the .text and .data sections reside. Unfortunately, mremap() doesn't work with huge pages, see the kernel sources: https://github.com/torvalds/linux/blob/master/mm/mremap.c:
So we introduced a workaround with hugetlbfs, where you additionally have a file descriptor associated with the mapping. If you know how to get by with anonymous huge pages, please tell us. Keeping an accessible, correctly configured hugetlbfs mount point is inconvenient for production systems.

7. We didn't test it with ASAN/UBSAN.
8. After applying our patch, the problem described in MDEV-21145 might be mitigated (this should be checked). In any case, if you add an additional LOAD section to the binary, our patch should work correctly.
9. We introduced ENFORCE mode for:
10. No objections to the rename: sql/huge.cc -> mysys/my_elfremap.
11. There are two major reasons for the additional "util" code (memory_logger/string buffers/readers) inside huge.cc:
12. We aren't aware of any work being done by glibc/kernel developers to automate remapping ELF sections to huge pages. | ||
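Regarding point 6 above (needing a file descriptor without a hugetlbfs mount): since Linux 4.14, memfd_create() accepts MFD_HUGETLB, which yields a hugetlb-backed fd with no mount point at all. A hedged sketch (not part of the patch; `hugetlb_memfd` is a hypothetical name, and mapping such an fd still requires reserved huge pages):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U /* Linux >= 4.14 */
#endif

/* Obtain an anonymous hugetlb-backed fd, no hugetlbfs mount required.
 * Falls back to a regular memfd when MFD_HUGETLB is unsupported.
 * Sets *is_huge to 1/0 accordingly (-1 on total failure). */
static int hugetlb_memfd(const char *name, int *is_huge)
{
    /* raw syscall to avoid depending on the glibc wrapper version */
    int fd = (int) syscall(SYS_memfd_create, name, MFD_HUGETLB);
    if (fd >= 0) {
        *is_huge = 1;
        return fd;
    }
    fd = (int) syscall(SYS_memfd_create, name, 0U);
    *is_huge = (fd >= 0) ? 0 : -1;
    return fd;
}
```

Whether this helps here depends on whether the mremap()-style address substitution works on such a mapping, which is exactly the open question in the thread.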
| Comment by Daniel Black [ 2020-12-15 ] | ||
|
Hi, I'm not going to be able to test this further this year.

Looking at the kernel reference provided for mremap, it seems that the no-huge-pages restriction has been there for at least 11 years. Based on this implementation we can both conclude that huge pages can be executable memory, so maybe the huge pages constraint isn't necessary any longer. Could you try removing it from the kernel and see if mremap of anonymous memory works in a general system test? Having done a similar removal of wrong huge page limitations, I can attest that the Linux kernel people are quite friendly and will listen to well reasoned/tested changes. As with my change, the linux-mm email list (my first kernel patch, and accepted) should be your first stop; look closely over the kernel documentation for submitting patches or RFCs. Seek assistance from the Google and Facebook people who have done commits doing similar things. The https://www.spinics.net/lists/linux-api/ list will also be a good place to communicate with libc and kernel developers on getting a general solution for loading applications into huge memory.

I will look at this again next year; however, I think a closer look at getting the generic functions into the kernel and loaders will help the general case and place less burden on application developers to maintain large invasive patches like this, putting the functionality with people capable of understanding it better. | ||
| Comment by Dmitriy Philimonov [ 2020-12-15 ] | ||
|
Good day, Daniel. I wish I could invest more time in the Linux kernel research; regrettably, I've already spent a huge amount of time on this feature, much more than was originally planned. At the moment I'm obliged to switch to other priorities. Moreover, in industry we often need a solution right here and now; waiting for the same functionality to be implemented elsewhere is unfortunately not an option. Thank you for sharing your experience and for the links. If our priorities change, I'll follow your example. | ||
| Comment by Dmitriy Philimonov [ 2020-12-18 ] | ||
|
Good day, Daniel. We provide a bug fix to the current contribution. It changes the huge page mapping flag from MAP_SHARED to MAP_PRIVATE.
Updated patch is added to the attachments. | ||
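Why the MAP_SHARED to MAP_PRIVATE change matters can be demonstrated with ordinary pages: writes through a private file mapping are copy-on-write and never reach the backing file, which is the behaviour one wants for a remapped text segment. A self-contained illustration (not code from the patch):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Show that a write through a MAP_PRIVATE mapping is copy-on-write:
 * the underlying file must remain unchanged. Returns 1 if the file
 * is untouched after the write, 0 if it changed, -1 on setup error. */
static int private_write_is_cow(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";
    char buf[4] = {0};
    int fd = mkstemp(path);
    char *p;

    if (fd < 0)
        return -1;
    if (write(fd, "AAAA", 4) != 4) { close(fd); unlink(path); return -1; }
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); unlink(path); return -1; }
    p[0] = 'B';                       /* COW: only our copy changes */
    if (pread(fd, buf, 4, 0) != 4) {  /* re-read the file itself */
        munmap(p, 4096); close(fd); unlink(path); return -1;
    }
    munmap(p, 4096);
    close(fd);
    unlink(path);
    return buf[0] == 'A';
}
```

With MAP_SHARED the same store would be written back to the file, which for a hugetlbfs-backed code segment means one process's patched pages leaking into every other mapping of that file.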
| Comment by Daniel Black [ 2021-02-11 ] | ||
|
I asked the libc folks and they suggested just implementing the linker flags and relying on THP to get the segments in order: https://sourceware.org/pipermail/libc-alpha/2021-February/122334.html So I'm going to do that as a first cut and see how much it reduces the TLB misses. If there's still a case afterwards we can take it up with the libc folks. Doing it there https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-map-segments.h;hb=HEAD means that you don't need the remap, and it would be comparatively straightforward; if interested, however, ask for approval first. I'm sorry - I was so amazed at the implementation that I was a bit slow to communicate how far beyond the skill set of a database userspace developer maintaining this code would be. | ||
| Comment by Dmitriy Philimonov [ 2022-01-10 ] | ||
|
Happy New Year! We shared our experience in remapping code segments to huge pages in this article and open-sourced the code on GitHub. The published code differs significantly from the patch I shared a year ago: it became simpler and more robust. Since the ticket is still open, I think it would be useful for your project. P.S. For Russian-speaking people there's a Russian blog post on habr. | ||
| Comment by Daniel Black [ 2022-01-11 ] | ||
|
Happy new year dmitriy.philimonov. FWIW, I was looking at the LD_PRELOAD path with mmap_ksm.c, with the intent of modifying the appropriate mmap flags as a model both for this and for KSM (multiple mariadb instances). I haven't quite got it working. I'll look into your code too. | ||
| Comment by Daniel Black [ 2022-11-23 ] | ||
|
FYI From glibc-2.35
|