[MDEV-40145] Speed up brute-force VEC_DISTANCE_EUCLIDEAN()/VEC_DISTANCE_COSINE() with a multi-accumulator reduction - Jira

XML

Word

Printable

Details

Type: Task
Status: In Review (View Workflow)
Priority: Critical
Resolution: Unresolved
Fix Version/s: 13.1
Component/s: None
Labels:
- contribution
- performance

Description

Problem

calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path — no vector index on the column, or direct evaluation (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).

Both accumulate into a single double:
double d= 0;
for (size_t i= 0; i < v_len; i+, v1, v2+)

{ double dist= get_float(v1) - get_float(v2); d+= dist * dist; }

The server is built with strict FP (no -ffast-math), so the compiler must preserve summation order. The single accumulator is a serial dependency chain it cannot reorder or vectorize — even though it already vectorizes the surrounding subtract / float→double widen / multiply. The loop is latency-bound on the scalar fadd chain.

Fix

Use several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). This breaks the dependency chain so the compiler emits a vectorized reduction on every architecture — no intrinsics, no #ifdef, portable C. A scalar tail handles the remainder.

Benchmark

Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):

┌────────────────────────┬─────────┬─────────┬─────────┐
│ metric │ before │ after │ speedup │
├────────────────────────┼─────────┼─────────┼─────────┤
│ VEC_DISTANCE_EUCLIDEAN │ 7.37 ms │ 3.98 ms │ 1.85× │
├────────────────────────┼─────────┼─────────┼─────────┤
│ VEC_DISTANCE_COSINE │ 9.05 ms │ 6.94 ms │ 1.30× │
└────────────────────────┴─────────┴─────────┴─────────┘

(End-to-end per-query on Apple M4; the kernel is ~3-4×, diluted by row iteration + ORDER BY ... LIMIT.) Cross-platform kernel A/B (real -O2 flags), euclidean: ~2.0× AMD Zen/AVX2, ~1.8× Intel Xeon/AVX-512 — the win is portable.

Correctness

Summation order changes, so results differ only in the last ~1 ULP; rankings unaffected. Verified: all main.vector* tests pass unchanged (no result-file edits); standalone check over every dimension 1..2050 (all loop-tail remainders) → max rel-err 1.3e-14 euclidean, 9e-9 cosine; top-k result sets identical on the benchmark data.

Scope

Only the float32 brute-force path. Indexed (HNSW) search uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc — already SIMD (AVX2/AVX-512/NEON/POWER) and unchanged; measuring showed the indexed path isn't sensitive to this code.

Attachments

Activity

People

Assignee:: Sergei Golubchik

Reporter:: Tõnu Samuel

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 3 days ago 12:04

Updated:: 2 days ago 07:47

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.