Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      NUMA hardware is becoming more common. Access to RAM that is not local to a CPU's node is more expensive than local access. MariaDB should implement mechanisms that optimize the workload so that the CPUs of a node access their local memory.

      Example NUMA architecture:

      $ numactl  --hardware
      available: 2 nodes (0,8)
      node 0 cpus: 0 8 16 24 32 40 48 56 64 72
      node 0 size: 130705 MB
      node 0 free: 80310 MB
      node 8 cpus: 80 88 96 104 112 120 128 136 144
      node 8 size: 130649 MB
      node 8 free: 81152 MB
      node distances:
      node   0   8
        0:  10  40
        8:  40  10
      

      Components of the implementation include:

      • A meaningful configuration that makes conflicts with existing settings obvious
      • Each InnoDB buffer pool instance to be constrained to a NUMA node
      • SQL threads to be allocated to a node by a user-configurable mapping based on one or more of user, connecting host, and default database (taken from the initial connection)
      • The user SQL thread will be pinned to the CPUs associated with its node (see the sketch after this list)
      • InnoDB accesses by the SQL thread will go to/from the node-local InnoDB buffer pool instances first
      • Accounting of CPU/memory utilization per mapping identifier, to enable automated or configuration-based assignment of a node to that mapping identifier
      • InnoDB background threads to be per-node, so that each InnoDB buffer pool instance is processed locally
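
      As a rough illustration of the thread-pinning point above, here is a minimal sketch using libnuma and sched_setaffinity(); the function name bind_thread_to_node() is hypothetical and only shows the mechanism, not the eventual MariaDB interface.

      // Hypothetical sketch: pin the calling thread to the CPUs of one NUMA node.
      // Linux only; needs _GNU_SOURCE and libnuma (link with -lnuma).
      #define _GNU_SOURCE
      #include <sched.h>
      #include <numa.h>

      static int bind_thread_to_node(int node)
      {
        if (numa_available() < 0)
          return -1;                          /* no NUMA support on this host */
        struct bitmask *cpus= numa_allocate_cpumask();
        if (numa_node_to_cpus(node, cpus))    /* CPUs belonging to this node */
        {
          numa_free_cpumask(cpus);
          return -1;
        }
        cpu_set_t set;
        CPU_ZERO(&set);
        for (unsigned cpu= 0; cpu < cpus->size; cpu++)
          if (numa_bitmask_isbitset(cpus, cpu))
            CPU_SET(cpu, &set);
        numa_free_cpumask(cpus);
        /* pid 0 means the calling thread */
        return sched_setaffinity(0, sizeof set, &set);
      }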

      (Marko, Jan, et al. please edit with important design/implementation details)

      I'm willing to mentor this (with help).

          Activity

            serg Sergei Golubchik added a comment -

            Thanks! This sounds interesting and useful, a good project.
            It'll need to be much more clearly defined though, unless you expect a student
            to fill in all the blanks (a valid assumption, but, in my opinion, a bit
            optimistic).

            A couple of thoughts:

            • What kind of meaningful configuration? Example?
            • pinned SQL threads - ok, and any thread local allocation should use the
              appropriate NUMA node.
            danblack Daniel Black added a comment -

            Implementation plan from a configuration point of view.

            • numa=off|on - a read-only system variable and mysqld start option to enable NUMA support; defaults to off
            • numa_scheduler={user,host,db} - one or more of these elements, by which the server will allocate a node to a connection (see the sketch after this list)
            • numa_scheduler_host_mask - a CIDR mask applied to the connecting host's address for the purposes of NUMA scheduling (IPv6 handling is an open question)
            • thread_handling=one-thread-per-connection - with thread cache entries, threads will have a NUMA node assigned. Cached threads that already have the desired NUMA affinity will be reused before altering the affinity of an existing thread
            • thread_handling=pool-of-threads (Unix only) - thread_pool_size will be limited to multiples of the number of NUMA nodes. Each thread has affinity to the CPUs of its NUMA node.
            • innodb_buffer_pool_instances - will start off as one per node when NUMA is enabled; can expand if time permits
            • innodb_read_io_threads and innodb_write_io_threads - default to two threads per node, both affinity-bound.
            • innodb_page_cleaners - one per NUMA node
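
            To make the numa_scheduler and numa_scheduler_host_mask items above concrete, here is a minimal sketch of how a connection might be assigned a node; the pick_node() and fnv1a() names and the hash-modulo scheme are hypothetical, illustrating one possible selection approach rather than a settled design.

            // Hypothetical sketch: derive a NUMA node from (user, masked host, db),
            // honouring whichever elements numa_scheduler enables.  IPv4 only; IPv6
            // remains an open question as noted above.
            #include <stdint.h>
            #include <stddef.h>
            #include <string>

            struct NumaSchedulerConf
            {
              bool by_user, by_host, by_db;  // numa_scheduler= {user,host,db}
              uint32_t host_mask;            // numa_scheduler_host_mask as a bit mask
            };

            static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
            {
              const unsigned char *p= static_cast<const unsigned char*>(data);
              for (size_t i= 0; i < len; i++)
                h= (h ^ p[i]) * 1099511628211ULL;
              return h;
            }

            static unsigned pick_node(const NumaSchedulerConf &conf,
                                      const std::string &user,
                                      uint32_t host_ipv4,
                                      const std::string &db,
                                      unsigned n_nodes)
            {
              uint64_t h= 14695981039346656037ULL;            // FNV offset basis
              if (conf.by_user)
                h= fnv1a(h, user.data(), user.size());
              if (conf.by_host)
              {
                uint32_t masked= host_ipv4 & conf.host_mask;  // apply the CIDR mask
                h= fnv1a(h, &masked, sizeof masked);
              }
              if (conf.by_db)
                h= fnv1a(h, db.data(), db.size());
              return static_cast<unsigned>(h % n_nodes);      // node for this connection
            }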

            Best guesses so far:

            • The mysqld client thread loop will be bound to a node, as will innodb_encryption_threads

            Unsure how to handle:

            • slave_parallel_threads - mapping per domain, per (master) connection or just group them
            • innodb_ft_sort_pll_degree
            • innodb_mtflush_threads
            • innodb_purge_threads

            Threads generally will have sched_setaffinity()/SetThreadAffinityMask() applied with the CPU set corresponding to their NUMA node.

            The NUMA implementation will be abstracted and will support the equivalent Windows NUMA functions - https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
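
            A minimal sketch of what such an abstraction might look like, assuming the Win32 GetNumaNodeProcessorMask()/SetThreadAffinityMask() calls on Windows and libnuma elsewhere; bind_current_thread_to_node() is an assumed name, not an existing MariaDB function.

            // Hypothetical sketch of a platform abstraction for binding the current
            // thread to a NUMA node's CPUs.  The Windows branch uses the legacy
            // 64-CPU APIs; hosts with processor groups would need
            // GetNumaNodeProcessorMaskEx()/SetThreadGroupAffinity() instead.
            #ifdef _WIN32
            #include <windows.h>
            static bool bind_current_thread_to_node(unsigned node)
            {
              ULONGLONG mask;
              if (!GetNumaNodeProcessorMask(static_cast<UCHAR>(node), &mask))
                return false;
              return SetThreadAffinityMask(GetCurrentThread(),
                                           static_cast<DWORD_PTR>(mask)) != 0;
            }
            #else
            #include <numa.h>
            static bool bind_current_thread_to_node(unsigned node)
            {
              /* numa_run_on_node() restricts the calling thread to the node's CPUs */
              return numa_available() >= 0 &&
                     numa_run_on_node(static_cast<int>(node)) == 0;
            }
            #endif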

            Eventually - persistent table of mappings

            Out of scope:

            • MyISAM - key cache segments maybe eventually
            • All other storage engines

            svoj Sergey Vojtovich added a comment -

            Eventually we'll have to bind table cache instances and MDL instances (not yet implemented) to NUMA nodes, as well as PFS counters and some status variables. Please keep this in mind.

            danblack Daniel Black added a comment -

            Thanks svoj. All tips gratefully received.

            GSoC approved with Sumit Lakra as the student. Mentors: jplindst and me.

            danblack Daniel Black added a comment -

            Tip from IRC, worth considering at some stage: the MEMORY engine is a good candidate.

            danblack Daniel Black added a comment -

            futex2 - designed for NUMA

            marko Marko Mäkelä added a comment -

            innodb_mtflush_threads and its replacement were removed, and in MDEV-23855 the single page cleaner thread was simplified. MDEV-16264 refactored many of the InnoDB background threads into tasks.

            I think that it would be very challenging to make all users of the buffer pool aware of NUMA (say, actively migrate execution threads to the NUMA node that owns most of the data that is likely to be addressed). I wonder if it could make sense to partition the buf_pool.page_hash in such a way that pages would be mapped to NUMA nodes by some simple formula like page_id.raw()%N_NUMA. All entries of a buf_pool_numa[i].page_hash would point to buffer pool block descriptors and blocks that reside in that NUMA node. I think that we should keep a global buf_pool.LRU and buf_pool.flush_list in any case.
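
            A minimal sketch of the page_hash partitioning idea above, using the page_id.raw()%N_NUMA formula and the buf_pool_numa[] name from the comment; the simplified page_id_t and the partition_for() helper are stand-ins for illustration only, not actual InnoDB code.

            // Hypothetical sketch: route each page to a per-NUMA-node page_hash
            // partition chosen by page_id.raw() % N_NUMA, as proposed above.
            #include <stdint.h>
            #include <unordered_map>

            static const unsigned N_NUMA= 2;   /* number of NUMA nodes (assumed) */

            struct page_id_t                   /* simplified stand-in for the real type */
            {
              uint32_t space, page_no;
              uint64_t raw() const { return (uint64_t{space} << 32) | page_no; }
            };

            struct buf_block_t;                /* block descriptor resident on the node */

            struct buf_pool_numa_t
            {
              /* page_id.raw() -> descriptor of a block allocated on this node */
              std::unordered_map<uint64_t, buf_block_t*> page_hash;
            };

            static buf_pool_numa_t buf_pool_numa[N_NUMA];

            /* Which node's partition owns this page? */
            static buf_pool_numa_t &partition_for(const page_id_t &id)
            {
              return buf_pool_numa[id.raw() % N_NUMA];
            }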


            People

              Assignee: Unassigned
              Reporter: danblack Daniel Black
              Votes: 1
              Watchers: 9
