Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34699

mhnsw: support aarch64 SIMD instructions

Details

    Description

      support aarch64 SIMD instructions in mhnsw algorithm. This means the distance calculation function and the bloom filter.

      Attachments

        Issue Links

          Activity

            There are two possible implementation of dot_product() vmull+vmull_high and vmull+vmlal2_high implementations. There seem to be no performance difference between them.

              d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1),
                                                       vget_low_s16(p2)), p1, p2));
            

              d+= vaddlvq_s32(vmull_s16(vget_low_s16(p1), vget_low_s16(p2))) +
                  vaddlvq_s32(vmull_high_s16(p1, p2));
            

            svoj Sergey Vojtovich added a comment - There are two possible implementation of dot_product() vmull+vmull_high and vmull+vmlal2_high implementations. There seem to be no performance difference between them. d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1), vget_low_s16(p2)), p1, p2)); d+= vaddlvq_s32(vmull_s16(vget_low_s16(p1), vget_low_s16(p2))) + vaddlvq_s32(vmull_high_s16(p1, p2));

            ok to push after adding two comments as noted in the PR

            serg Sergei Golubchik added a comment - ok to push after adding two comments as noted in the PR

            branch bb-11.8-MDEV-34699-vector-arm

            serg Sergei Golubchik added a comment - branch bb-11.8- MDEV-34699 -vector-arm
            elenst Elena Stepanova added a comment - - edited

            I have no objections against pushing the feature as of c3f901d1a9 into the main branch and releasing it with 11.8.1.

            However, given that it's supposed to be a performance tuning, proper performance tests could make sense here.

            I ran concurrent random tests targeted for stability and different variations of MTR tests for functionality on a cloud machine equipped with

            Architecture:             aarch64
              CPU op-mode(s):         32-bit, 64-bit
              Byte Order:             Little Endian
            CPU(s):                   32
              On-line CPU(s) list:    0-31
            Vendor ID:                ARM
              Model name:             Neoverse-N1
                Model:                1
                Thread(s) per core:   1
                Core(s) per socket:   32
                Socket(s):            1
                Stepping:             r3p1
                BogoMIPS:             243.75
                Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
                                       ssbs
            

            The patch coverage is full, both in concurrent tests and MTR, so the environment is apparently suitable. I don't have access to other aarch64 flavors, e.g. those with the literal neon if there are any. The assumption is that the change will be applicable to all ARMs v8 and up, and it will be properly skipped for v7 as it's not aarch64.

            Neither concurrent tests nor MTR revealed any patch-specific problems.
            Neither old nor new buildbot also shows anything specific to the feature branch.

            I also tried to run some MTR tests, mostly main.vector, with high --repeat, on the patch vs baseline (release builds), as a poor man's performance check to see if it shows any difference in the execution time, but didn't get anything noticeable.

            elenst Elena Stepanova added a comment - - edited I have no objections against pushing the feature as of c3f901d1a9 into the main branch and releasing it with 11.8.1. However, given that it's supposed to be a performance tuning, proper performance tests could make sense here. I ran concurrent random tests targeted for stability and different variations of MTR tests for functionality on a cloud machine equipped with Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: ARM Model name: Neoverse-N1 Model: 1 Thread(s) per core: 1 Core(s) per socket: 32 Socket(s): 1 Stepping: r3p1 BogoMIPS: 243.75 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs The patch coverage is full, both in concurrent tests and MTR, so the environment is apparently suitable. I don't have access to other aarch64 flavors, e.g. those with the literal neon if there are any. The assumption is that the change will be applicable to all ARMs v8 and up, and it will be properly skipped for v7 as it's not aarch64. Neither concurrent tests nor MTR revealed any patch-specific problems. Neither old nor new buildbot also shows anything specific to the feature branch. I also tried to run some MTR tests, mostly main.vector, with high --repeat , on the patch vs baseline (release builds), as a poor man's performance check to see if it shows any difference in the execution time, but didn't get anything noticeable.

            People

              svoj Sergey Vojtovich
              serg Sergei Golubchik
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.