MariaDB Server / MDEV-35244

Vector-related system variables could use better names

Details

    Description

      We discussed the possibility of renaming the variables before. If it is to be done, it should be done before the release; after that it gets more complicated.

      Currently we have 4 system variables:

      7fce19bd215ac0671855044520092aa4210049d1

      +--------------------------+-----------+
      | Variable_name            | Value     |
      +--------------------------+-----------+
      | mhnsw_cache_size         | 16777216  |
      | mhnsw_distance_function  | euclidean |
      | mhnsw_max_edges_per_node | 6         |
      | mhnsw_min_limit          | 20        |
      +--------------------------+-----------+
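
      A listing like this one can be produced with SHOW VARIABLES. A minimal sketch, assuming a server build in which the variables still carry the mhnsw prefix shown in the table:

      ```sql
      -- List every system variable whose name starts with "mhnsw".
      -- Without GLOBAL or SESSION, SHOW VARIABLES reports the session values.
      SHOW VARIABLES LIKE 'mhnsw%';
      ```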
      

      Considerations:

      • the presence of HNSW in the name suggests there may be other algorithms in the future; if so, I think it would be more user-friendly to group all vector-related variables together by giving them a common prefix. vector_ is the first that comes to mind, for further use as vector_mhnsw_xxx, vector_lsh_xxx, etc., but maybe there are better ideas.
      • I don't know whether there is already a vision of how this will be configured once there are alternative algorithms, e.g. whether it would make sense to have a separate distance_function variable for each algorithm. That seems too cumbersome, given that the function can also be set in the table definition. If, however, any of the options will be shared among different algorithms, they should lose the algorithm prefix now: not [vector_]mhnsw_distance_function, with some [vector_]lsh_distance_function in the future, but just vector_distance_function.
      • it was also discussed that min_limit and max_edges_per_node as such are not very meaningful names and could be improved. I don't have suggestions for better names for them, though.
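
      To make the first two considerations concrete, the query below lays out one possible mapping from the current names to prefixed ones. Every name in the second column is hypothetical: vector_distance_function and the vector_mhnsw_ prefix come from the considerations above, while the remaining names are extrapolated here purely for illustration and do not exist in any server. The sketch also assumes, again only for illustration, that the distance function would be shared across algorithms while the other three options stay HNSW-specific.

      ```sql
      -- Illustration only: the right-hand names are hypothetical, not real variables.
      SELECT 'mhnsw_cache_size' AS current_name, 'vector_mhnsw_cache_size' AS hypothetical_name
      UNION ALL
      SELECT 'mhnsw_distance_function', 'vector_distance_function'  -- shared option: no algorithm prefix
      UNION ALL
      SELECT 'mhnsw_max_edges_per_node', 'vector_mhnsw_max_edges_per_node'
      UNION ALL
      SELECT 'mhnsw_min_limit', 'vector_mhnsw_min_limit';
      ```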

          Activity

            serg Sergei Golubchik added a comment:

            Also, perhaps mhnsw_cache_size should be a max cache size? It doesn't allocate all that memory at once; instead, memory usage grows until it reaches that value.

            elenst Elena Stepanova added a comment:

            If you think it's more consistent with other similar variables, sure.
            I thought that when "max" is used for such purposes in MariaDB, it usually means that a query which hits the limit will actually fail, complaining that it cannot be executed. But I don't have any statistics to support that; it was just a subjective impression.

            serg Sergei Golubchik added a comment:

            I think these are better at least, although the term "quality" is very subjective. For some, the main "quality" of an approximate search will be correctness; for others, performance. Maybe mhnsw_search_precision_level? For the index, "quality" may even be all right.

            mhnsw_search_precision_level is a bit too long for my taste, but, of course, that's not a deciding factor.

            Having "quality" in both highlights that they're complementary: both improve results when increased and improve speed when decreased, and one can increase one and compensate by decreasing the other. So I'd suggest having the same suffix for both.

            Search/index precision level? Or maybe "accuracy". So the candidate pairs are:

            • mhnsw_index_precision_level / mhnsw_search_precision_level
            • mhnsw_index_precision / mhnsw_search_precision
            • mhnsw_index_quality / mhnsw_search_quality
            • mhnsw_index_accuracy / mhnsw_search_accuracy

            elenst Elena Stepanova added a comment:

            Right, we can lose "level"; it doesn't mean anything anyway, and given the allowed ranges (e.g. no "level 1" for the index) it can even be confusing.
            From the above, "accuracy" sounds the most universal to me, but I rarely represent the majority in such matters.
            serg Sergei Golubchik added a comment (edited):

            Another consideration (from cvicentiu) is that users mainly use vector stores through an AI framework; hardly anyone does it directly. This means it matters much less whether these variables are intuitively understandable by end users than whether they are intuitively understandable by people writing vector store connectors for AI frameworks. For that we should use the same names as everyone else, that is, "ef" (or "ef_search") and "M". And maybe simply "distance" without the "_function" part, for brevity.

            Thus, an alternative proposal is

            SET @@mhnsw_default_m=16;
            SET @@mhnsw_default_distance=euclidean;
            SET @@mhnsw_ef_search=30;

            CREATE TABLE t1 (
              v VECTOR(10),
              VECTOR INDEX (v) M=24 DISTANCE=COSINE
            );

            People

              serg Sergei Golubchik
              elenst Elena Stepanova