[MDEV-35244] Vector-related system variables could use better names - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: N/A
Fix Version/s: 11.7.1
Component/s: Variables, Vector search
Labels: None

Description

We discussed the possibility of renaming the variables before. If it is to be done, it should be done before the release; after that it gets more complicated.

Currently we have 4 system variables:

7fce19bd215ac0671855044520092aa4210049d1
+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| mhnsw_cache_size         | 16777216  |
| mhnsw_distance_function  | euclidean |
| mhnsw_max_edges_per_node | 6         |
| mhnsw_min_limit          | 20        |
+--------------------------+-----------+

Considerations:

The presence of HNSW in the name suggests there may be other algorithms in the future; if so, I think it would be more user-friendly to group all vector-related variables together by giving them a common prefix. vector_ is the first that comes to mind, for further use as vector_mhnsw_xxx, vector_lsh_xxx, etc., but maybe there are better ideas.
I don't know whether there is already a vision of how it will be configured when there are alternative algorithms, e.g. whether it would make sense to have a separate distance_function variable for each algorithm; that seems too cumbersome given that the function can also be set in the table definition. If, however, any of the options will be shared among different algorithms, they should lose the algorithm prefix already now, e.g. be not [vector_]mhnsw_distance_function or some [vector_]lsh_distance_function in the future, but just vector_distance_function.
It was also discussed that min_limit and max_edges_per_node as such are not very meaningful names and could be improved. I don't have suggestions for better names for them, though.

Attachments

Issue Links

is caused by

MDEV-34939 vector search in 11.7

Closed

Activity



Sergei Golubchik added a comment - 2024-10-28 19:04

Also, perhaps mhnsw_cache_size should be max cache size? It doesn't allocate all that memory at once; instead, memory usage grows until it reaches that value.


Elena Stepanova added a comment - 2024-10-28 20:05

If you think it's more consistent with other similar variables, sure.
I thought when "max" is used for such purposes in MariaDB, it usually means that a query which hits it will actually fail complaining that it cannot be executed. But I don't actually have any statistics to support it, it was just a subjective impression.


Sergei Golubchik added a comment - 2024-10-29 12:52

I think these are better at least, although the term "quality" is very subjective. For somebody, the main "quality" of an approximate search will be the correctness, for others the performance. Maybe mhnsw_search_precision_level? For the index, "quality" may even be all right.

mhnsw_search_precision_level is a bit too long to my taste, but, of course, it's not a deciding factor.

Having "quality" in both highlights that they're complementary, both improve results when increased and improve speed when decreased, and one can increase one and compensate by decreasing the other. So I'd suggest to have the same suffix for both.

Search/index precision level? Or maybe "accuracy"? So the candidate pairs are:

mhnsw_index_precision_level — mhnsw_search_precision_level
mhnsw_index_precision — mhnsw_search_precision
mhnsw_index_quality — mhnsw_search_quality
mhnsw_index_accuracy — mhnsw_search_accuracy


Elena Stepanova added a comment - 2024-10-29 13:16

Right, we can lose "level", it doesn't mean anything anyway, and given the allowed ranges (e.g. no "level 1" for the index) it can even be confusing.
From the above, "accuracy" sounds most universal to me, but I rarely represent the majority in such matters.


Sergei Golubchik added a comment - 2024-10-29 14:50 - edited

Another consideration (from cvicentiu) is that users mainly use vector stores through an AI framework; hardly anyone does it directly. Meaning, it's much less important whether these variables are intuitively understandable by end users than whether they're intuitively understandable by people writing vector store connectors for AI frameworks. And for that we should use names "same as everyone else". That is "ef" (or "ef_search") and "M". And maybe simply "distance" without the "_function" part, for brevity.

Thus, an alternative proposal is

SET @@mhnsw_default_m=16;
SET @@mhnsw_default_distance=euclidean;
SET @@mhnsw_ef_search=30;

CREATE TABLE t1 (
  v VECTOR(10),
  VECTOR INDEX (v) M=24 DISTANCE=COSINE
)
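To illustrate the connector-author argument, here is a minimal sketch of how a framework connector might translate its usual HNSW parameters (m, ef_search, distance) into the statements proposed above. The function names build_session_settings and build_vector_ddl are hypothetical, and the variable and option names are taken from the proposal in this comment only; nothing here confirms that they are the names that were finally shipped.

```python
# Hypothetical connector helper: maps framework-style HNSW parameters onto the
# variable and index-option names proposed in this thread (an assumption, not a
# confirmed server interface).

# Only the two distance values that appear in this thread are accepted.
ALLOWED_DISTANCES = {"euclidean", "cosine"}


def _check_distance(distance):
    """Return the lowercased distance name, or raise if it is not recognised."""
    value = distance.lower()
    if value not in ALLOWED_DISTANCES:
        raise ValueError(f"unsupported distance: {distance!r}")
    return value


def build_session_settings(ef_search=None, default_m=None, default_distance=None):
    """Build SET statements for whichever parameters the caller supplied."""
    statements = []
    if default_m is not None:
        statements.append(f"SET @@mhnsw_default_m={int(default_m)};")
    if default_distance is not None:
        statements.append(
            f"SET @@mhnsw_default_distance={_check_distance(default_distance)};"
        )
    if ef_search is not None:
        statements.append(f"SET @@mhnsw_ef_search={int(ef_search)};")
    return statements


def build_vector_ddl(table, column, dims, m=None, distance=None):
    """Build a CREATE TABLE statement with optional per-index M and DISTANCE.

    Identifiers are interpolated as-is; a real connector would need to quote them.
    """
    index = f"VECTOR INDEX ({column})"
    if m is not None:
        index += f" M={int(m)}"
    if distance is not None:
        index += f" DISTANCE={_check_distance(distance).upper()}"
    return f"CREATE TABLE {table} (\n  {column} VECTOR({int(dims)}),\n  {index}\n)"


if __name__ == "__main__":
    for statement in build_session_settings(
        ef_search=30, default_m=16, default_distance="euclidean"
    ):
        print(statement)
    print(build_vector_ddl("t1", "v", 10, m=24, distance="cosine"))
```

With names like these, the connector's own parameters map one-to-one onto the server's, which is the point of the "same as everyone else" argument.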
