[MDEV-12213] NUMA support Created: 2017-03-09  Updated: 2023-09-19

Status: Open
Project: MariaDB Server
Component/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: Daniel Black Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: gsoc17

Issue Links:
Relates
relates to MDEV-5774 Enable numa interleaving by default w... Open

 Description   

NUMA hardware is becoming more common. Accessing RAM that is not local to a CPU's node is more expensive than accessing local RAM. MariaDB should implement mechanisms that keep the CPUs of a node working on memory local to that node.

Example NUMA architecture:

$ numactl  --hardware
available: 2 nodes (0,8)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72
node 0 size: 130705 MB
node 0 free: 80310 MB
node 8 cpus: 80 88 96 104 112 120 128 136 144
node 8 size: 130649 MB
node 8 free: 81152 MB
node distances:
node   0   8
  0:  10  40
  8:  40  10
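The node-distance matrix above (10 = local, 40 = remote) is what an allocation policy would consult. A minimal sketch, not from the ticket, of choosing the cheapest node for a thread running on a given local node, assuming a two-node SLIT-style matrix like the one `numactl --hardware` prints:

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: given the distance matrix from `numactl --hardware`,
// pick the cheapest node to allocate on for a thread on `local_node`.
// kNodes and DistanceMatrix are illustrative names, not MariaDB code.
constexpr std::size_t kNodes = 2;
using DistanceMatrix = std::array<std::array<int, kNodes>, kNodes>;

std::size_t cheapest_node(const DistanceMatrix &dist, std::size_t local_node) {
    std::size_t best = 0;
    for (std::size_t n = 1; n < kNodes; ++n)
        if (dist[local_node][n] < dist[local_node][best])
            best = n;
    return best;
}
```

With the matrix quoted above, a thread on node 0 allocates on node 0 and a thread on node 8 allocates on node 8, since remote access costs 4x the local distance.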

Components of the implementation include:

  • A meaningful configuration that makes conflicts with existing settings obvious
  • Each InnoDB buffer pool instance constrained to a NUMA node
  • SQL threads allocated to a node by a user-configurable map of one or more of: user, connecting host, default database (based on the initial connection)
  • The user SQL thread pinned to the CPUs associated with its node
  • InnoDB accesses by the SQL thread directed to the node-local buffer pool instance first
  • Accounting of CPU/memory utilization per mapping identifier, to enable automated or configuration-based assignment of a node to that identifier
  • InnoDB background threads made per-node, so that each node's buffer pool instance is processed locally
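The third bullet can be sketched in code. This is a hypothetical illustration, not the proposed implementation: derive a NUMA node for a new SQL thread from attributes of the initial connection (here, user and default database) by hashing and folding onto the node count. The function name and attribute choice are assumptions for illustration; the real mapping would be user-configurable.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical sketch: assign a NUMA node to a new SQL thread based on
// attributes of the initial connection. The configurable map described in
// the ticket would decide which attributes participate; here we hash a
// fixed (user, default db) pair and fold onto the node count.
std::size_t node_for_connection(const std::string &user,
                                const std::string &db,
                                std::size_t n_nodes) {
    std::size_t h = std::hash<std::string>{}(user + "/" + db);
    return h % n_nodes;
}
```

The point of hashing connection attributes is that all sessions of the same user/database land on the same node, so they share node-local buffer pool pages.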

(Marko, Jan, et al. please edit with important design/implementation details)

I'm willing to mentor this (with help).



 Comments   
Comment by Sergei Golubchik [ 2017-03-09 ]

Thanks! This sounds interesting and useful, a good project.
It'll need to be much more clearly defined though, unless you expect a student
to fill in all the blanks (a valid assumption, but, in my opinion, a bit
optimistic).

A couple of thoughts:

  • What kind of meaningful configuration? Example?
  • Pinned SQL threads - ok, and any thread-local allocation should use the
    appropriate NUMA node.
Comment by Daniel Black [ 2017-03-10 ]

Implementation plan from a configuration point of view.

  • numa={off,on} - a read-only system variable and mysqld start option to enable NUMA; defaults to off
  • numa_scheduler={user,host,db} - one or more of these elements, by which the server will allocate a node
  • numa_scheduler_host_mask - a CIDR mask applied to the connecting host for the purposes of NUMA scheduling (IPv6?)
  • thread_handling=one-thread-per-connection - thread cache entries will have a NUMA node assigned; a cached thread with the desired NUMA affinity will be reused before the affinity of an existing thread is altered
  • thread_handling=pool-of-threads (Unix only) - thread_pool_size will be limited to multiples of the number of NUMA nodes; each thread has affinity to the CPUs of its NUMA node
  • innodb_buffer_pool_instances - will start off as one per node when NUMA is enabled; can expand if time permits
  • innodb_read_io_threads and innodb_write_io_threads - default to two threads per node, both affinity-bound
  • innodb_page_cleaners - one per NUMA node
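Pulling the plan together, a hypothetical my.cnf fragment for the two-node machine in the description might look like the following. None of these NUMA variables exist yet; names and values are illustrative only, taken from the proposal above:

```ini
[mysqld]
# Hypothetical settings from the plan above -- none of these exist yet.
numa                         = ON
numa_scheduler               = user,db
numa_scheduler_host_mask     = 255.255.255.0   # mask applied to the client host
thread_handling              = pool-of-threads
thread_pool_size             = 16              # a multiple of the 2 NUMA nodes
innodb_buffer_pool_instances = 2               # one per node
innodb_read_io_threads       = 4               # two per node
innodb_write_io_threads      = 4               # two per node
innodb_page_cleaners         = 2               # one per node
```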

Best guesses so far:

  • The mysqld client thread loop will be bound to a node, as will innodb_encryption_threads

Unsure how to handle:

  • slave_parallel_threads - map per domain, per (master) connection, or just group them
  • innodb_ft_sort_pll_degree
  • innodb_mtflush_threads
  • innodb_purge_threads

Threads in general will have sched_setaffinity/SetThreadAffinityMask applied with the CPU set corresponding to their NUMA node.
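On the Linux side, that affinity call could look roughly like the sketch below: bind the calling thread to an explicit CPU list (e.g. node 0's CPUs 0,8,16,... from the topology in the description). This is an illustrative helper, not MariaDB code; the Windows equivalent would use SetThreadAffinityMask as noted.

```cpp
#include <sched.h>
#include <vector>

// Hypothetical sketch: pin the calling thread to the CPUs of one NUMA node
// using sched_setaffinity (glibc, Linux-only). Returns true on success.
bool bind_self_to_cpus(const std::vector<int> &cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus)
        CPU_SET(cpu, &set);
    // pid 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

An abstraction layer, as the comment suggests, would hide this behind one call that dispatches to sched_setaffinity on Linux and SetThreadAffinityMask on Windows.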

The NUMA implementation will be abstracted and will also support the equivalent Windows NUMA functions - https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx

Eventually: a persistent table of mappings

Out of scope:

  • MyISAM - key cache segments maybe eventually
  • All other storage engines
Comment by Sergey Vojtovich [ 2017-05-05 ]

Eventually we'll have to bind table cache instances and MDL instances (not yet implemented) to NUMA nodes. As well as PFS counters and some status variables. Please keep this in mind.

Comment by Daniel Black [ 2017-05-05 ]

Thanks svoj. All tips gratefully received.

GSOC approved with Sumit Lakra as the student. Mentors: jplindst and me.

Comment by Daniel Black [ 2017-05-29 ]

Tip from IRC worthy of consideration at some stage: the MEMORY engine is a good candidate.

Comment by Daniel Black [ 2021-02-01 ]

futex2 - designed for numa

Comment by Marko Mäkelä [ 2021-02-16 ]

innodb_mtflush_threads and its replacement were removed, and in MDEV-23855 the single page cleaner thread was simplified. MDEV-16264 refactored many of the InnoDB background threads into tasks.

I think that it would be very challenging to make all users of the buffer pool aware of NUMA (say, actively migrate execution threads to the NUMA node that owns most of the data that is likely to be addressed). I wonder if it could make sense to partition the buf_pool.page_hash in such a way that pages would be mapped to NUMA nodes by some simple formula like page_id.raw()%N_NUMA. All entries of a buf_pool_numa[i].page_hash would point to buffer pool block descriptors and blocks that reside in that NUMA node. I think that we should keep a global buf_pool.LRU and buf_pool.flush_list in any case.
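The suggested formula can be sketched as follows. This is a toy illustration of the `page_id.raw()%N_NUMA` partitioning idea only; `page_id_t` here is a simplified stand-in for InnoDB's page identifier (tablespace id plus page number packed into 64 bits), and `N_NUMA` is an assumed node count.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the partitioning formula suggested in the comment above:
// a page maps to NUMA node (and buf_pool_numa[i].page_hash shard)
// page_id.raw() % N_NUMA. Simplified stand-in types, not InnoDB code.
constexpr std::size_t N_NUMA = 2;  // assumed number of NUMA nodes

struct page_id_t {
    uint32_t space;    // tablespace id
    uint32_t page_no;  // page number within the tablespace
    uint64_t raw() const { return (uint64_t{space} << 32) | page_no; }
};

std::size_t numa_partition(const page_id_t &id) {
    return id.raw() % N_NUMA;
}
```

With two nodes this simply sends even page numbers to node 0 and odd ones to node 1 within a tablespace, so consecutive pages interleave across nodes while the global buf_pool.LRU and buf_pool.flush_list stay shared, as the comment proposes.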

Generated at Thu Feb 08 07:55:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.