|
Taking an idle, freshly initialized MariaDB instance with no data loaded.
Note: InnoDB fails (endless spin) when 0 chunks are allocated - [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)! 21 failed attempts to flush a page! Consider increasing innodb_buffer_pool_size. Pending flushes (fsync) log: 0; buffer pool: 0. 166 OS file reads, 2 OS file writes, 2 OS fsyncs.
Probably an easy fix.
--innodb-buffer-pool-chunk-size=1M on default 10.6 build:
|
10.6-default
|
$ ps -o pid,tid,class,rtprio,stat,vsz,rss,comm -p $(pidof mysqld)
|
PID TID CLS RTPRIO STAT VSZ RSS COMMAND
|
599335 599335 TS - Sl+ 1272820 72184 mysqld
|
|
|
10.6 only 4 chunks allocated
|
$ ps -o pid,tid,class,rtprio,stat,vsz,rss,comm -p $(pidof mysqld)
|
PID TID CLS RTPRIO STAT VSZ RSS COMMAND
|
599054 599054 TS - Sl+ 1145584 69832 mysqld
|
As we scale up with --innodb-buffer-pool-chunk-size=128M --innodb-buffer-pool-size=12G:
|
10.6 default 12G allocation 128M chunk size
|
$ ps -o pid,tid,class,rtprio,stat,vsz,rss,comm -p $(pidof mysqld)
|
PID TID CLS RTPRIO STAT VSZ RSS COMMAND
|
599807 599807 TS - Sl+ 14004060 335648 mysqld
|
|
Patch to limit allocation to 1 chunk:
|
diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
|
index 24e6e549da1..df29e079f22 100644
|
--- a/storage/innobase/buf/buf0buf.cc
|
+++ b/storage/innobase/buf/buf0buf.cc
|
@@ -1169,6 +1169,7 @@ bool buf_pool_t::create()
|
curr_size= 0;
|
auto chunk= chunks;
|
|
+ n_chunks= 1;
|
do
|
{
|
if (!chunk->create(chunk_size))
|
|
10.6 default 12G allocation 128M chunk size - hack to 1 chunk allocation
|
$ ps -o pid,tid,class,rtprio,stat,vsz,rss,comm -p $(pidof mysqld)
|
PID TID CLS RTPRIO STAT VSZ RSS COMMAND
|
600020 600020 TS - Sl+ 1452408 177692 mysqld
|
So in this 12G case, 47% of RSS memory was saved by not allocating or initializing chunks. Overall the RSS/VSZ ratio is still 1-2%.
Numbers overall are small (vsz/rss are in KB), a 162M difference; however, it will add up in cloud environments or at hosting providers where many more than one instance is running (so multiplied by applications or users).
|
|
MDEV-25341 currently starts with the allocated but unused pages marked as MADV_FREE.
|
|
The buffer page descriptors (buf_block_t) could be allocated at predetermined offsets within the contiguous address range of the buffer pool. For example, at offset N*innodb_page_size, have descriptors for the next N buffer page frames. There is a pointer-to-descriptor reverse mapping buf_pool_t::block_from_ahi() that will need to be adjusted for this.
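As a hedged illustration of that reverse mapping, here is a minimal sketch. The layout constants and names below (PAGE_SIZE, FRAMES_PER_GROUP, block_from_ptr) are assumptions for the sketch, not InnoDB's actual values or identifiers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: assume every group in the contiguous buffer pool is one page
// of descriptors followed by FRAMES_PER_GROUP page frames. The reverse
// mapping that buf_pool_t::block_from_ahi() performs then becomes pure
// pointer arithmetic instead of a search over chunks.
constexpr std::size_t PAGE_SIZE = 4096;      // innodb_page_size (assumed)
constexpr std::size_t FRAMES_PER_GROUP = 3;  // illustrative only
constexpr std::size_t GROUP_BYTES = (1 + FRAMES_PER_GROUP) * PAGE_SIZE;

struct descriptor { std::byte *frame; };     // stand-in for buf_block_t

// Map a pointer anywhere inside a page frame back to its descriptor.
descriptor *block_from_ptr(std::byte *pool, const std::byte *ptr)
{
  std::size_t off = static_cast<std::size_t>(ptr - pool);
  std::size_t group = off / GROUP_BYTES;
  std::size_t frame = (off % GROUP_BYTES) / PAGE_SIZE - 1; // skip desc page
  return reinterpret_cast<descriptor *>(pool + group * GROUP_BYTES) + frame;
}
```

The point of the sketch is only that with descriptors at predetermined offsets, the lookup is a handful of arithmetic operations with no per-chunk iteration.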
As far as I can tell, buf_pool_t::chunk_t::create() could rely on everything having been zero-initialized by the operating system. After initialization, the nonzero fields should be buf_page_t::frame and the buf_pool.free list membership links. We would have to adjust buf_block_alloc() to support lazy initialization. If the buf_pool.free list is empty (and with this lazy initialization it would initially be), we would maintain a pointer to the currently last allocated frame in the buffer pool. If that pointer is less than the current size of the buffer pool, we will use that block, calculate the address of its descriptor, and initialize the descriptor as well.
We do not strictly need a buf_page_t::frame field. It could be replaced with a member function: if this points outside the buffer pool, we know that the descriptor was returned by malloc() and we would return nullptr. Else, we would calculate the frame address based on the address of this.
Finally, I think that we should allocate the buffer pool in a single chunk, corresponding to the maximum value of innodb_buffer_pool_size. We would MADV_FREE (or equivalent) the unused part to the operating system, and we would never access any memory beyond the configured innodb_buffer_pool_size. On operating systems other than Linux or Microsoft Windows, we should try to find a suitable alternative, so that SET GLOBAL innodb_buffer_pool_size will continue to work.
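A minimal Linux-specific sketch of the single-chunk idea (reserve_pool is an invented helper, not MariaDB code): reserve address space for the maximum innodb_buffer_pool_size once, and hand the unused tail back to the kernel with MADV_FREE.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Sketch only: the mapping covers the maximum pool size, but only the part
// below used_size is ever touched. The rest is returned to the OS with
// MADV_FREE; the mapping stays valid, so growing the pool later just means
// touching those pages again -- no new chunks are ever allocated.
void *reserve_pool(std::size_t max_size, std::size_t used_size)
{
  void *pool = mmap(nullptr, max_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (pool == MAP_FAILED)
    return nullptr;
  // Give the unused part beyond the configured size back to the kernel.
  if (used_size < max_size)
    madvise(static_cast<char *>(pool) + used_size, max_size - used_size,
            MADV_FREE);
  return pool;
}
```

MADV_FREE requires Linux 4.5 or later; on other systems an equivalent (MADV_DONTNEED, or VirtualAlloc with MEM_RESET on Windows) would be needed, as noted above.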
|
|
Implementing the lazy initialization of the InnoDB buffer pool in buf_pool_t::create() and buf_pool_t::chunk_t::create() is a rather small change and in my opinion best done as a bug fix. We would initialize only one page descriptor in each buffer pool chunk, and buf_LRU_get_free_only() would initialize further ones, until all allocated blocks are in use.
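The allocation path described above can be sketched as follows. This is a hedged toy model (lazy_chunk, page_descriptor and alloc are invented names, not the real buf_LRU_get_free_only()): the chunk arrives zero-filled from the OS, and a high-water mark tracks how many descriptors have ever been initialized.

```cpp
#include <cstddef>
#include <vector>

// Sketch only: allocation prefers the free list, then lazily initializes
// one more descriptor on demand, until all blocks of the chunk are in use.
struct page_descriptor { std::byte *frame = nullptr; };

struct lazy_chunk
{
  std::vector<page_descriptor> descriptors;  // stands in for zeroed memory
  std::vector<std::byte> frames;
  std::size_t page_size, initialized = 0;
  std::vector<page_descriptor *> free_list;

  lazy_chunk(std::size_t psz, std::size_t n)
    : descriptors(n), frames(psz * n), page_size(psz) {}

  page_descriptor *alloc()
  {
    if (!free_list.empty())
    {
      page_descriptor *d = free_list.back();
      free_list.pop_back();
      return d;
    }
    if (initialized == descriptors.size())
      return nullptr;                             // everything is in use
    page_descriptor *d = &descriptors[initialized];
    d->frame = &frames[initialized * page_size];  // first-touch initialization
    ++initialized;
    return d;
  }

  void free(page_descriptor *d) { free_list.push_back(d); }
};
```

The useful property is that a freshly created chunk only ever initializes descriptors that are actually handed out, so startup cost is proportional to the working set, not to innodb_buffer_pool_size.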
MDEV-29445 has been filed for the remaining part, to remove buf_pool.chunks and to allocate a contiguous range of virtual memory addresses for a single buffer pool. That would simplify the buffer pool resizing and all related calculations.
|
|
Results on origin/bb-10.6-MDEV-30100-MDEV-25340 38ee8129d85e867c86b16404a3906030ddab7129 2023-08-11T17:01:57+03:00
|
|
1. An assertion failure while the DB server is under concurrent SQL load
|
---------------------------------------------------------------------------------------------------------
|
# 2023-08-11T13:18:32 [2090518] | 2023-08-11 13:18:26 0x7f46817fa640 InnoDB: Assertion failure in file /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/btr/btr0sea.cc line 1017
|
# 2023-08-11T13:18:32 [2090518] | InnoDB: Failing assertion: ptrdiff_t(offs) < chunk->blocks_end - chunk->blocks
|
#4 <signal handler called>
|
#5 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139940797064768) at ./nptl/pthread_kill.c:44
|
#6 __pthread_kill_internal (signo=6, threadid=139940797064768) at ./nptl/pthread_kill.c:78
|
#7 __GI___pthread_kill (threadid=139940797064768, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
|
#8 0x00007f47de931476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
|
#9 0x00007f47de9177f3 in __GI_abort () at ./stdlib/abort.c:79
|
#10 0x000055663b3f900e in ut_dbg_assertion_failed (expr=expr@entry=0x55663bea0908 "ptrdiff_t(offs) < chunk->blocks_end - chunk->blocks",
|
file=file@entry=0x55663bea0850 "/data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/btr/btr0sea.cc", line=line@entry=1017) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/ut/ut0dbg.cc:60
|
#11 0x000055663b3fca7e in buf_pool_t::block_from_ahi (this=<optimized out>, ptr=0x7f47d0c2008d "") at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/btr/btr0sea.cc:1017
|
#12 btr_search_guess_on_hash (index=<optimized out>, info=info@entry=0x7f476416b2a8, tuple=tuple@entry=0x7f46540496c8, mode=mode@entry=4, latch_mode=latch_mode@entry=2, cursor=cursor@entry=0x55663e9b5558, mtr=0x7f46817f93a0)
|
at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/btr/btr0sea.cc:1113
|
#13 0x000055663baaaad5 in btr_cur_t::search_leaf (this=this@entry=0x55663e9b5558, tuple=tuple@entry=0x7f46540496c8, mode=mode@entry=PAGE_CUR_LE, latch_mode=BTR_MODIFY_LEAF, latch_mode@entry=1050367320, mtr=0x7f46817f93a0,
|
mtr@entry=0x7f46540496c8) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/btr/btr0cur.cc:1101
|
#14 0x000055663ba642ba in btr_pcur_open (mtr=0x7f46540496c8, cursor=0x55663e9b5558, latch_mode=1050367320, mode=PAGE_CUR_LE, tuple=0x7f46540496c8)
|
at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/include/btr0pcur.h:431
|
#15 row_search_on_row_ref (pcur=pcur@entry=0x55663e9b5558, mode=mode@entry=BTR_MODIFY_LEAF, table=<optimized out>, ref=0x7f46540496c8, mtr=mtr@entry=0x7f46817f93a0)
|
at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0row.cc:1219
|
#16 0x000055663ba5f035 in row_purge_reposition_pcur (mode=mode@entry=BTR_MODIFY_LEAF, node=node@entry=0x55663e9b54b8, mtr=mtr@entry=0x7f46817f93a0)
|
at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:83
|
#17 0x000055663ba5f350 in row_purge_remove_clust_if_poss_low (node=0x55663e9b54b8, mode=<optimized out>) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:139
|
#18 0x000055663ba60902 in row_purge_remove_clust_if_poss (node=0x55663e9b54b8) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:259
|
#19 row_purge_del_mark (node=0x55663e9b54b8) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:635
|
#20 row_purge_record_func (node=0x55663e9b54b8, undo_rec=0x7f47a8047f40 "", thr=0x55663e9b5418, updated_extern=<optimized out>) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:1215
|
#21 0x000055663ba60f39 in row_purge (thr=<optimized out>, undo_rec=<optimized out>, node=0x55663e9b54b8) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:1276
|
#22 row_purge_step (thr=thr@entry=0x55663e9b5418) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/row/row0purge.cc:1339
|
#23 0x000055663ba34d4b in que_thr_step (thr=0x55663e9b5418) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/que/que0que.cc:588
|
#24 que_run_threads_low (thr=<optimized out>) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/que/que0que.cc:644
|
#25 que_run_threads (thr=thr@entry=0x55663e9b5418) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/que/que0que.cc:664
|
#26 0x000055663ba7de8d in srv_task_execute () at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/srv/srv0srv.cc:1598
|
#27 purge_worker_callback () at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/storage/innobase/srv/srv0srv.cc:1853
|
#28 0x000055663bb32b36 in tpool::task_group::execute (this=0x55663cd4ef00 <purge_task_group>, t=0x55663cd1c2a0 <purge_worker_task>) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/tpool/task_group.cc:55
|
#29 0x000055663bb311c7 in tpool::thread_pool_generic::worker_main (this=0x55663e8e1780, thread_var=0x55663e8e2fb0) at /data/Server/bb-10.6-MDEV-30100-MDEV-25340/tpool/tpool_generic.cc:580
|
#30 0x00007f47decfc2b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
|
#31 0x00007f47de983b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
|
#32 0x00007f47dea15a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
|
|
sdp:/data1/results/1691783127/TBR-2031$ gdb -c ./1/data/core.2090952 /data/Server_bin/bb-10.6-MDEV-30100-MDEV-25340_RelWithDebInfo/bin/mariadbd
|
|
2. The restart of the server after an intentional crash fails
|
---------------------------------------------------------------------------------
|
The server error log contains no remarkable detail about corruption or similar.
|
The server does not seem to have written a core file.
|
But there is a file copy, made after the intentional crash and before the restart attempt:
|
sdp:/data1/results/1691783127/Restart_fail/1/fbackup
|
|
|
|
|
I tested the start-up time before and after this change, using innodb_buffer_pool_size=1g and a tiny datadir on /dev/shm:
Before:
real 0m1.202s
|
user 0m0.239s
|
sys 0m0.775s
|
After:
real 0m1.106s
|
user 0m0.088s
|
sys 0m0.829s
|
This saved about 63% of user time on my system. The bulk of the time is being spent in the Linux kernel counterpart of my_large_malloc(). I am seeing 75% of the CPU samples being attributed to clear_page_erms(), which is filling a 4096-byte page 1 byte at a time:
SYM_FUNC_START(clear_page_erms)
|
movl $4096,%ecx
|
xorl %eax,%eax
|
rep stosb
|
RET
|
SYM_FUNC_END(clear_page_erms)
|
EXPORT_SYMBOL_GPL(clear_page_erms)
|
Edit: The erms flag is set in my /proc/cpuinfo, indicating that "Enhanced REP MOVSB/STOSB" is supported by the processor. This is an optimized implementation.
What might be done better is to make the kernel use copy-on-write of already zero-filled pages in my_large_malloc().
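A small demonstration of the zero-fill behaviour discussed above (a sketch, not MariaDB's my_large_malloc(); sum_first_bytes is an invented helper): anonymous mmap() memory reads as zero via copy-on-write mappings of the kernel's shared zero page, so the expensive clearing only happens on the first write to each page.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Touch one byte per page of an anonymous mapping, read-only. The reads
// fault in the shared zero page; nothing has to run clear_page_erms until
// a page is written. All bytes are guaranteed to read as zero.
unsigned long sum_first_bytes(std::size_t len)
{
  void *m = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (m == MAP_FAILED)
    return ~0UL;
  const unsigned char *p = static_cast<const unsigned char *>(m);
  unsigned long sum = 0;
  for (std::size_t i = 0; i < len; i += 4096)
    sum += p[i];
  munmap(m, len);
  return sum;
}
```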
|
|
Without the lazy initialization, I observed that 9.83% of CPU cycles are consumed in buf_pool_t::chunk_t::create().
With the lazy initialization in place, I see 6.9% of CPU cycles attributed to __memset_avx2_unaligned_erms_rtm, which in turn is spending 99.62% of samples on a loop similar to the one in the kernel's clear_page_erms.
I tried to produce call stacks of that under rr record. There, the function __memset_avx2_unaligned_erms will be chosen instead, because rr's CPUID faulting disallows the use of TSX-NI instructions, a.k.a. the rtm ISA extension. Those calls are part of static initialization, for things like this:
|
10.6 0be4781428a4044b13b085965820a995652bb0e9
|
#0 __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:159
|
#1 0x00007fe71625c20f in __new_exitfn (listp=0x7fe7163f1840 <__exit_funcs>) at ./stdlib/cxa_atexit.c:112
|
#2 0x00007fe71625c298 in __internal_atexit (func=0x560686214c90 <Item_bool_static::~Item_bool_static()>, arg=0x560686c891e0 <Item_false>, d=0x560686bb9148, listp=<optimized out>) at ./stdlib/cxa_atexit.c:44
|
#3 0x000056068621ba3c in __static_initialization_and_destruction_0 () at /mariadb/10.6/sql/item.cc:59
|
#4 _GLOBAL__sub_I_item.cc(void) () at /mariadb/10.6/sql/item.cc:10951
|
#5 0x00007fe7162457f6 in call_init (env=<optimized out>, argv=0x7ffd9676d438, argc=5) at ../csu/libc-start.c:145
|
#6 __libc_start_main_impl (main=0x5606861b9f00 <main(int, char**)>, argc=5, argv=0x7ffd9676d438, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd9676d428)
|
at ../csu/libc-start.c:347
|
#7 0x00005606861b9e31 in _start ()
|
As far as I understand, the choice of this implementation is intentional for small sizes. But it is not ideal. Well-written C++ should simply rely on the zero-initialized BSS and not generate any additional initialization.
I was curious if the following patch would help, but the size of the huge function __static_initialization_and_destruction_0 was unaffected by this:
diff --git a/sql/item.h b/sql/item.h
|
index 0f59b38cd38..250aa976892 100644
|
--- a/sql/item.h
|
+++ b/sql/item.h
|
@@ -1114,13 +1114,15 @@ class Item :public Value_source,
|
*/
|
Item(THD *thd, Item *item);
|
Item(); /* For const item */
|
+#ifdef EXTRA_DEBUG
|
virtual ~Item()
|
{
|
-#ifdef EXTRA_DEBUG
|
name.str= 0;
|
name.length= 0;
|
-#endif
|
} /*lint -e1509 */
|
+#else
|
+ virtual ~Item() = default;
|
+#endif
|
void set_name(THD *thd, const char *str, size_t length, CHARSET_INFO *cs);
|
void set_name(THD *thd, String *str)
|
{
|
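For what it is worth, the effect can be reproduced with toy types (hypothetical names, not the Item hierarchy). Only a trivially destructible type can sit in zero-initialized BSS with no destructor bookkeeping; a static object of any type with a non-trivial destructor gets a __cxa_atexit registration. Notably, a virtual destructor is never trivial, even when defaulted, which may explain why the patch above did not shrink __static_initialization_and_destruction_0:

```cpp
#include <type_traits>

// Toy types illustrating when a static object needs __cxa_atexit.
struct plain_default { int tag = 0; ~plain_default() = default; };
struct user_dtor     { int tag = 0; ~user_dtor() { tag = 0; } };
struct virtual_dtor  { int tag = 0; virtual ~virtual_dtor() = default; };

static_assert(std::is_trivially_destructible<plain_default>::value,
              "lives in BSS, no atexit registration");
static_assert(!std::is_trivially_destructible<user_dtor>::value,
              "each static instance is registered with __cxa_atexit");
// A virtual destructor is never trivial, even when defaulted:
static_assert(!std::is_trivially_destructible<virtual_dtor>::value,
              "defaulting a virtual destructor does not make it trivial");
```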
|
|
mleich, I hope to see some core dumps or rr replay traces of any failures. I checked the recovery failure that you mentioned:
10.6/sql/mariadbd --plugin-dir /dev/shm/10.6/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=/dev/shm/encryption_keys.txt --innodb-page-size=32k --datadir /dev/shm/data
|
This will fail with
2023-08-14 16:07:40 0 [Note] InnoDB: Retry with innodb_force_recovery=5
|
2023-08-14 16:07:40 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1484] with error Data structure corruption
|
The immediate reason for that is that buf_read_page() for the undo log page 0x2e in undo001 fails. That page is all-zero:
od -Ax -t x1 -j 0x170000 -N 0x8000 data/undo001
|
170000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
*
|
178000
|
I did not find any copy of this page in the doublewrite buffer either, not even after checking buf_dblwr_t::recover() in an rr replay of a trace that I saved of the failed recovery.
Without an rr replay trace of the previously killed server, it is very difficult to tell what exactly went wrong here and if it could in any way be related to the changes that were being tested.
|
|
Reproducing any of the problems mentioned above under rr is extremely unlikely.
|
More than 40000 RQG tests using the setup from above were executed.
|
|
Other findings:
|
==1==
|
Scenario:
|
1. Start server (innodb_file_per_table=0, innodb-encrypt-log, innodb-encrypt-tables,
|
innodb-immediate-scrub-data-uncompressed=1)
|
2. Concurrent DDL/DML load including modifying
|
innodb_undo_log_truncate, innodb_purge_rseg_truncate_frequency, innodb_immediate_scrub_data_uncompressed
|
3. During 2. is ongoing intentional crash
|
4. Restart the server with success == A session is able to connect.
|
5. SHOW KEYS FROM `test`.`B` harvested 1932: Table 'test.B' doesn't exist in engine. RQG reacts by crashing the server.
|
sdp:/data1/results/1692031142/DD-DIFF$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
|
The server error log sdp:/data1/results/1692031142/DD-DIFF/1/mysql.err does not contain suspicious entries.
|
|
==2==
|
Scenario like in ==1==
|
SHOW KEYS FROM `test`.`t2` harvested 1030: Got error 194 "Tablespace is missing for a table".
|
sdp:/data1/results/1692042723/TABLESPACE_MISSING1$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
|
There are suspicious entries in the server error log even before the line
|
... mariadbd: ready for connections
|
|
==3==
|
Scenario like in ==1==
|
sdp:/data1/results/1692042723/TABLESPACE_MISSING2$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
|
|
MDEV-31826 was hit several times. rr traces for that are available too.
|
|
|
|
I realized that when I tested the performance of server startup (actually server bootstrap), I forgot to specify an innodb_buffer_pool_chunk_size equal to innodb_buffer_pool_size. This could have reduced the overhead of clear_page_erms or clear_huge_page in the kernel.
Some performance tests of actual workload suggest that we’d better bundle this with MDEV-29445, to allocate a single contiguous virtual address range for the buffer pool. I am guessing that calls to buf_pool_t::lazy_allocate_size() can seriously hurt performance when buf_pool.n_chunks is large.
I checked one of the rr replay traces that mleich provided. Something was definitely broken in MDEV-30100, which is included in the same branch. The table test.B was not found by InnoDB, because some time before killing the server, the statement CREATE OR REPLACE TABLE B LIKE `t7` /* E_R Thread15 QNO 134 CON_ID 30 */ had been executed. InnoDB returned control to the SQL layer even though it did not advance log_sys.flushed_to_disk_lsn to a logical point of time where the table would be created. In fact, the InnoDB log was never written out before the server was killed.
|
|
I was not using hugepages when first testing this.
echo 96 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
|
rm -fr /dev/shm/data
|
mkdir /dev/shm/data
|
time sql/mariadbd --datadir /dev/shm/data --bootstrap --innodb-log-file-size=5m --innodb-buffer-pool-size=96g --innodb-buffer-pool-chunk-size=96g < /dev/null
|
For this server startup, and without any lazy initialization implemented on a 10.6 branch, I see the following:
real 0m0.967s
|
user 0m0.165s
|
sys 0m0.634s
|
With the lazy initialization (initializing just one block descriptor of the single buffer pool chunks) we save some 0.33 seconds real time, mostly due to saving CPU time in the Linux kernel:
real 0m0.639s
|
user 0m0.177s
|
sys 0m0.278s
|
The perf report for some top functions looks like this:
61,01% mariadbd-old [kernel.kallsyms] [k] clear_page_erms
|
11,67% mariadbd-old mariadbd-old [.] buf_pool_t::chunk_t::create(unsigned long)
|
5,20% mariadbd-old libc.so.6 [.] __memset_avx2_unaligned_erms_rtm
|
2,98% mariadbd-old mariadbd-old [.] dict_sys_t::close()
|
1,17% mariadbd-old mariadbd-old [.] buf_flush_list_holding_mutex(unsigned long, unsigned long)
|
With the lazy initialization:
62,28% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
4,57% mariadbd mariadbd [.] dict_sys_t::close()
|
3,65% mariadbd libc.so.6 [.] __memset_avx2_unaligned_erms_rtm
|
1,70% mariadbd mariadbd [.] buf_flush_list_holding_mutex(unsigned long, unsigned long)
|
The CPU cycles spent on clear_page_erms were not actually reduced, either in relative or absolute terms: 0.6129*1208991150 < 0.6228*1206764213, about 741M vs 752M events. But we should remember that the CPU cores may be starved for memory bus bandwidth.
In addition to the default perf record --event=cycles I also measured bus-cycles:
62,08% mariadbd-old [kernel.kallsyms] [k] clear_page_erms
|
13,31% mariadbd-old mariadbd-old [.] buf_pool_t::chunk_t::create(unsigned long)
|
4,74% mariadbd-old mariadbd-old [.] dict_sys_t::close()
|
4,37% mariadbd-old libc.so.6 [.] __memset_avx2_unaligned_erms_rtm
|
1,09% mariadbd-old [vdso] [.] __vdso_clock_gettime
|
0,98% mariadbd-old mariadbd-old [.] buf_flush_list_holding_mutex(unsigned long, unsigned long)
|
0,66% mariadbd-old [kernel.kallsyms] [k] __rmqueue_pcplist
|
0,61% mariadbd-old [kernel.kallsyms] [k] free_tail_page_prepare
|
0,47% mariadbd-old [kernel.kallsyms] [k] asm_exc_page_fault
|
0,46% mariadbd-old mariadbd-old [.] my_timer_milliseconds
|
0,45% mariadbd-old ld-linux-x86-64.so.2 [.] do_lookup_x
|
0,37% mariadbd-old [kernel.kallsyms] [k] get_page_from_freelist
|
0,34% mariadbd-old [kernel.kallsyms] [k] free_unref_page_prepare
|
0,30% mariadbd-old [kernel.kallsyms] [k] check_preemption_disabled
|
0,29% mariadbd-old [kernel.kallsyms] [k] shmem_add_to_page_cache
|
0,25% mariadbd-old [kernel.kallsyms] [k] memset_orig
|
Thanks to the lazy initialization, the kernel should not have to zero-fill the 95 allocated but untouched 1GiB hugepages. We do not see such dramatic savings in the kernel, but we do see the share of clear_page_erms shrinking both in absolute and relative terms:
60,23% mariadbd [kernel.kallsyms] [k] clear_page_erms
|
10,59% mariadbd libc.so.6 [.] __memset_avx2_unaligned_erms_rtm
|
3,58% mariadbd mariadbd [.] dict_sys_t::close()
|
2,82% mariadbd mariadbd [.] buf_flush_list_holding_mutex(unsigned long, unsigned long)
|
1,85% mariadbd [vdso] [.] __vdso_clock_gettime
|
1,07% mariadbd mariadbd [.] my_timer_milliseconds
|
0,97% mariadbd [kernel.kallsyms] [k] asm_exc_page_fault
|
0,67% mariadbd [kernel.kallsyms] [k] __list_del_entry_valid
|
0,66% mariadbd [kernel.kallsyms] [k] memset_orig
|
|
|
I had made a mistake again. Because I had forgotten the parameter --large-pages=1, conventional 4096-byte pages were being allocated. This is the correct invocation of the server:
sql/mariadbd --large-pages=1 --datadir /dev/shm/data --innodb-log-file-size=5m --bootstrap --innodb-buffer-pool-size=96g --innodb-buffer-pool-chunk-size=96g
|
I had to allocate some additional 2MiB pages to avoid a warning about some other large buffer allocation on server startup:
echo 96 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
|
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
|
When initializing a buffer pool of 96*1GiB huge pages, the start-up takes about 1 second, or 0.1 second less when using the lazy initialization. Almost all the savings come from the user-space code (0.13s instead of 0.26s). Starting up the server using a single 1GiB huge page takes between 0.52 and 0.54 seconds, and there is no clear difference between the two.
|
|
For the record, bootstrapping MariaDB Server 10.5 with innodb_buffer_pool_size=96g will take 5× to 9× the time it takes on 10.6:
|
mariadb-10.5.22
|
2023-08-18 14:17:06 0 [Note] InnoDB: Initializing buffer pool, total size = 103079215104, chunk size = 103079215104
|
2023-08-18 14:17:08 0 [Note] InnoDB: Completed initialization of buffer pool
|
The Bash built-in time reported the following for the bootstrap:
real 0m5.722s
|
user 0m2.628s
|
sys 0m2.808s
|
Without --large-pages it was slightly faster:
real 0m5.392s
|
user 0m1.955s
|
sys 0m3.175s
|
MariaDB Server 10.6 and later would avoid calls like this:
4,73% mariadbd libc.so.6 [.] malloc
|
3,96% mariadbd libc.so.6 [.] malloc_consolidate
|
3,96% mariadbd libc.so.6 [.] pthread_cond_destroy@@GLIBC_2.3.2
|
3,89% mariadbd mariadbd [.] rw_lock_free_func(rw_lock_t*)
|
2,63% mariadbd libc.so.6 [.] _int_free
|
2,10% mariadbd mariadbd [.] rw_lock_create_func(rw_lock_t*, char const*, unsigned int)
|
1,46% mariadbd libc.so.6 [.] pthread_mutex_init@@GLIBC_2.2.5
|
1,32% mariadbd mariadbd [.] OSMutex::destroy()
|
1,15% mariadbd mariadbd [.] os_event_destroy(os_event*&)
|
0,98% mariadbd libc.so.6 [.] unlink_chunk.constprop.0
|
0,96% mariadbd mariadbd [.] os_event_create(char const*)
|
These savings ought to be mainly thanks to MDEV-24142. A buffer page latch no longer encapsulates 2 pthread_mutex_t and pthread_cond_t objects, wrapped inside os_event and OSMutex, plus the 80-byte rw_lock_t. Instead, we rely on implicit zero-initialization of the 24-byte buf_page_t::lock.
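The design point can be sketched with a toy latch (tiny_latch is invented; the real buf_page_t::lock is a richer 24-byte latch): a latch built on atomic words needs no constructor or destructor, because all-zero memory is already a valid unlocked latch, whereas pthread mutex/cond based locks require the explicit init/destroy calls that dominate the 10.5 profile above.

```cpp
#include <atomic>

// Sketch only: 0 == unlocked, so zero-initialized storage (e.g. BSS, or a
// freshly mapped buffer pool chunk) is directly usable with no os_event /
// OSMutex style create/destroy pairs.
struct tiny_latch
{
  std::atomic<unsigned> word{0};

  bool try_lock()
  {
    unsigned expected = 0;
    return word.compare_exchange_strong(expected, 1,
                                        std::memory_order_acquire);
  }
  void unlock() { word.store(0, std::memory_order_release); }
};
```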
|
|
danblack, do you think that the current startup time with 10.6 is acceptable?
I would fix the rest as part of MDEV-29445.
|
|
Was getting 3 seconds on 20G with memlock, only some of it huge pages.
|
10.6-4bd94afbd3cecdcca9edb87834b41de5494fbbea
|
# echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
|
# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
|
201
|
# ulimit -l 20000000
|
# sql/mariadbd --no-defaults --datadir=/tmp/${PWD##*/}-datadir --socket=/tmp/${PWD##*/}.sock --plugin-dir=${PWD}/mysql-test/var/plugins/ --verbose --skip-networking --large-pages=1 --innodb-buffer-pool-size=20G --memlock --user=root
|
2023-12-01 13:16:42 0 [Note] Starting MariaDB 10.6.17-MariaDB source revision 4bd94afbd3cecdcca9edb87834b41de5494fbbea as process 185716
|
2023-12-01 13:16:42 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
|
2023-12-01 13:16:42 0 [Note] InnoDB: Number of pools: 1
|
2023-12-01 13:16:42 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
|
2023-12-01 13:16:42 0 [Note] InnoDB: Using liburing
|
2023-12-01 13:16:42 0 [Note] InnoDB: Initializing buffer pool, total size = 21474836480, chunk size = 134217728
|
2023-12-01 13:16:42 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-12-01 13:16:42 0 [Note] InnoDB: Setting O_DIRECT on file ./ibdata1 failed
|
2023-12-01 13:16:42 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-12-01 13:16:42 0 [Note] InnoDB: Creating shared tablespace for temporary tables
|
2023-12-01 13:16:42 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
|
2023-12-01 13:16:42 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
|
2023-12-01 13:16:42 0 [Note] InnoDB: 10.6.17 started; log sequence number 55985; transaction id 23
|
2023-12-01 13:16:42 0 [Note] Plugin 'FEEDBACK' is disabled.
|
2023-12-01 13:16:42 0 [Note] InnoDB: Loading buffer pool(s) from /tmp/build-mariadb-server-10.6-datadir/ib_buffer_pool
|
2023-12-01 13:16:42 0 [Note] InnoDB: Buffer pool(s) load completed at 231201 13:16:42
|
2023-12-01 13:16:45 0 [Note] sql/mariadbd: ready for connections.
|
Version: '10.6.17-MariaDB' socket: '/tmp/build-mariadb-server-10.6.sock' port: 0 Source distribution
|
With 38ee8129d85e867c86b16404a3906030ddab7129
# sql/mariadbd --no-defaults --datadir=/tmp/${PWD##*/}-datadir --socket=/tmp/${PWD##*/}.sock --plugin-dir=${PWD}/mysql-test/var/plugins/ --verbose --skip-networking --large-pages=1 --innodb-buffer-pool-size=20G --memlock --user=root
|
2023-12-01 13:32:49 0 [Note] Starting MariaDB 10.6.17-MariaDB source revision 4bd94afbd3cecdcca9edb87834b41de5494fbbea as process 186718
|
2023-12-01 13:32:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
|
2023-12-01 13:32:49 0 [Note] InnoDB: Number of pools: 1
|
2023-12-01 13:32:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
|
2023-12-01 13:32:49 0 [Note] InnoDB: Using liburing
|
2023-12-01 13:32:49 0 [Note] InnoDB: Initializing buffer pool, total size = 21474836480, chunk size = 134217728
|
2023-12-01 13:32:49 0 [Note] InnoDB: Completed initialization of buffer pool
|
2023-12-01 13:32:49 0 [Note] InnoDB: Setting O_DIRECT on file ./ibdata1 failed
|
2023-12-01 13:32:49 0 [Note] InnoDB: 128 rollback segments are active.
|
2023-12-01 13:32:49 0 [Note] InnoDB: Creating shared tablespace for temporary tables
|
2023-12-01 13:32:49 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
|
2023-12-01 13:32:49 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
|
2023-12-01 13:32:49 0 [Note] InnoDB: 10.6.17 started; log sequence number 56081; transaction id 23
|
2023-12-01 13:32:49 0 [Note] InnoDB: Loading buffer pool(s) from /tmp/build-mariadb-server-10.6-datadir/ib_buffer_pool
|
2023-12-01 13:32:49 0 [Note] Plugin 'FEEDBACK' is disabled.
|
2023-12-01 13:32:49 0 [Note] InnoDB: Buffer pool(s) load completed at 231201 13:32:49
|
2023-12-01 13:32:51 0 [Note] sql/mariadbd: ready for connections.
|
Version: '10.6.17-MariaDB' socket: '/tmp/build-mariadb-server-10.6.sock' port: 0 Source distribution
|
Gained a second (1/3) on smallish 20G pools. Good gain. Thanks.
|