Details
-
Task
-
Status: In Review (View Workflow)
-
Critical
-
Resolution: Unresolved
-
None
Description
Problem
calc_distance_euclidean() and calc_distance_cosine() in sql/item_vectorfunc.cc compute the float32 distance used by VEC_DISTANCE_EUCLIDEAN(), VEC_DISTANCE_COSINE() and VEC_DISTANCE() on the brute-force path — no vector index on the column, or direct evaluation (full-scan ORDER BY VEC_DISTANCE(...) LIMIT k, re-ranking, ad-hoc distance).
Both accumulate into a single double:
double d= 0;
for (size_t i= 0; i < v_len; i+, v1, v2+)
The server is built with strict FP (no -ffast-math), so the compiler must preserve summation order. The single accumulator is a serial dependency chain it cannot reorder or vectorize — even though it already vectorizes the surrounding subtract / float→double widen / multiply. The loop is latency-bound on the scalar fadd chain.
Fix
Use several independent partial-sum accumulators, combined after the loop (euclidean: 8; cosine: 4 each for dot / |a|² / |b|²). This breaks the dependency chain so the compiler emits a vectorized reduction on every architecture — no intrinsics, no #ifdef, portable C. A scalar tail handles the remainder.
Benchmark
Real server build, brute-force query (no vector index), 20000 rows × 512-dim, warm buffer pool, server-side query time (median of 40):
┌────────────────────────┬─────────┬─────────┬─────────┐
│ metric │ before │ after │ speedup │
├────────────────────────┼─────────┼─────────┼─────────┤
│ VEC_DISTANCE_EUCLIDEAN │ 7.37 ms │ 3.98 ms │ 1.85× │
├────────────────────────┼─────────┼─────────┼─────────┤
│ VEC_DISTANCE_COSINE │ 9.05 ms │ 6.94 ms │ 1.30× │
└────────────────────────┴─────────┴─────────┴─────────┘
(End-to-end per-query on Apple M4; the kernel is ~3-4×, diluted by row iteration + ORDER BY ... LIMIT.) Cross-platform kernel A/B (real -O2 flags), euclidean: ~2.0× AMD Zen/AVX2, ~1.8× Intel Xeon/AVX-512 — the win is portable.
Correctness
Summation order changes, so results differ only in the last ~1 ULP; rankings unaffected. Verified: all main.vector* tests pass unchanged (no result-file edits); standalone check over every dimension 1..2050 (all loop-tail remainders) → max rel-err 1.3e-14 euclidean, 9e-9 cosine; top-k result sets identical on the benchmark data.
Scope
Only the float32 brute-force path. Indexed (HNSW) search uses the separate int16 quantized dot_product in sql/vector_mhnsw.cc — already SIMD (AVX2/AVX-512/NEON/POWER) and unchanged; measuring showed the indexed path isn't sensitive to this code.