Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-40145

Speed up brute-force VEC_DISTANCE_EUCLIDEAN()/VEC_DISTANCE_COSINE() with a multi-accumulator reduction

    XMLWordPrintable

Details

    Description

      Problem

      calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path — no vector index on the column, or direct evaluation (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).

      Both accumulate into a single double:
      double d= 0;
      for (size_t i= 0; i < v_len; i+, v1, v2+)

      { double dist= get_float(v1) - get_float(v2); d+= dist * dist; }

      The server is built with strict FP (no -ffast-math), so the compiler must preserve summation order. The single accumulator is a serial dependency chain it cannot reorder or vectorize — even though it already vectorizes the surrounding subtract / float→double widen / multiply. The loop is latency-bound on the scalar fadd chain.

      Fix

      Use several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). This breaks the dependency chain so the compiler emits a vectorized reduction on every architecture — no intrinsics, no #ifdef, portable C. A scalar tail handles the remainder.

      Benchmark

      Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):

      ┌────────────────────────┬─────────┬─────────┬─────────┐
      │ metric │ before │ after │ speedup │
      ├────────────────────────┼─────────┼─────────┼─────────┤
      │ VEC_DISTANCE_EUCLIDEAN │ 7.37 ms │ 3.98 ms │ 1.85× │
      ├────────────────────────┼─────────┼─────────┼─────────┤
      │ VEC_DISTANCE_COSINE │ 9.05 ms │ 6.94 ms │ 1.30× │
      └────────────────────────┴─────────┴─────────┴─────────┘

      (End-to-end per-query on Apple M4; the kernel is ~3-4×, diluted by row iteration + ORDER BY ... LIMIT.) Cross-platform kernel A/B (real -O2 flags), euclidean: ~2.0× AMD Zen/AVX2, ~1.8× Intel Xeon/AVX-512 — the win is portable.

      Correctness

      Summation order changes, so results differ only in the last ~1 ULP; rankings unaffected. Verified: all main.vector* tests pass unchanged (no result-file edits); standalone check over every dimension 1..2050 (all loop-tail remainders) → max rel-err 1.3e-14 euclidean, 9e-9 cosine; top-k result sets identical on the benchmark data.

      Scope

      Only the float32 brute-force path. Indexed (HNSW) search uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc — already SIMD (AVX2/AVX-512/NEON/POWER) and unchanged; measuring showed the indexed path isn't sensitive to this code.

      Attachments

        Activity

          People

            serg Sergei Golubchik
            tonuonu Tõnu Samuel
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.