I have no objections to pushing the feature as of c3f901d1a9 into the main branch and releasing it with 11.8.1.
However, given that it's meant as a performance improvement, proper performance tests could make sense here.
I ran concurrent random tests targeting stability, and different variations of MTR tests for functionality, on a cloud machine equipped with:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
The patch coverage is full in both concurrent tests and MTR, so the environment is apparently suitable. I don't have access to other aarch64 flavors, e.g. ones reporting the literal 'neon' flag, if such exist. The assumption is that the change is applicable to all ARMv8 and up, and that it is properly skipped for ARMv7 as it's not aarch64.
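For the record, the kind of compile-time dispatch I'd expect here (this is my assumption, not necessarily the exact check in the patch) is a guard on __aarch64__, which is only defined for 64-bit ARMv8 and up, so 32-bit ARMv7 builds fall through to the scalar path:

/* Hypothetical guard, shown only to illustrate the v7/v8 split;
   the actual patch may use a different condition */
#if defined(__aarch64__)          /* ARMv8+ in 64-bit mode */
#include <arm_neon.h>
/* ... NEON dot product ... */
#else
/* ... scalar fallback, also taken on 32-bit ARMv7 ... */
#endif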
Neither concurrent tests nor MTR revealed any patch-specific problems.
Neither the old nor the new buildbot shows anything specific to the feature branch either.
I also ran some MTR tests, mostly main.vector, with a high --repeat value on the patch vs. the baseline (release builds), as a poor man's performance check to see whether there is any difference in execution time, but didn't notice anything.
There are two possible implementations of dot_product(), vmull+vmlal_high and vmull+vmull_high (both shown below); there seems to be no performance difference between them.
/* vmull + vmlal_high: widening multiply of the low halves, then
   multiply-accumulate of the high halves, one horizontal add */
d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1),
                                         vget_low_s16(p2)), p1, p2));

/* vmull + vmull_high: two widening multiplies, two horizontal adds */
d+= vaddlvq_s32(vmull_s16(vget_low_s16(p1), vget_low_s16(p2))) +
    vaddlvq_s32(vmull_high_s16(p1, p2));
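For completeness, here is a minimal self-contained sketch of how such a kernel fits into a full dot product loop; the function name, the int16_t element type, and the assumption that the length is a multiple of 8 are mine and not necessarily what the patch does:

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: the real code may use a different name, element type
   and tail handling; len is assumed to be a multiple of 8 */
static int64_t dot_product_i16_neon(const int16_t *v1, const int16_t *v2,
                                    size_t len)
{
  int64_t d= 0;
  for (size_t i= 0; i < len; i+= 8)
  {
    int16x8_t p1= vld1q_s16(v1 + i);
    int16x8_t p2= vld1q_s16(v2 + i);
    /* the vmull + vmlal_high variant from above */
    d+= vaddlvq_s32(vmlal_high_s16(vmull_s16(vget_low_s16(p1),
                                             vget_low_s16(p2)), p1, p2));
  }
  return d;
}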