Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35244

Vector-related system variables could use better names

Details

    Description

      We discussed the possibility of renaming the variables before. If it is to be done, it should be done before the release, after that it gets more complicated.

      Currently we have 4 system variables:

      7fce19bd215ac0671855044520092aa4210049d1

      +--------------------------+-----------+
      | Variable_name            | Value     |
      +--------------------------+-----------+
      | mhnsw_cache_size         | 16777216  |
      | mhnsw_distance_function  | euclidean |
      | mhnsw_max_edges_per_node | 6         |
      | mhnsw_min_limit          | 20        |
      +--------------------------+-----------+
      

      Considerations:

      • the presence of HNSW in the name suggests there may be other algorithms in the future; if so, I think it would be more user-friendly to group all vector-related variables together, by giving them a common prefix. vector_ is first that comes to mind, for further use as vector_mhnsw_xxx, vector_lsh_xxx, etc., but maybe there are better ideas.
      • I don't know whether there is already a vision how it will be configured when there are alternative algorithms, e.g. whether it would make sense to have, for example, distance_function variable for each algorithm separately, it seems too cumbersome given that the function can also be set in the table definition. If, however, any of the options will be shared among different algorithms, they should lose the algorithm prefix already now, e.g. be not [vector_]mhnsw_distance_function or some [vector_]lsh_distance_function in the future, but just vector_distance_function.
      • it was also discussed that min_limit and max_edges_per_node as such are not very meaningful name and could be improved. I don't have suggestions for the better naming for them, though.

      Attachments

        Issue Links

          Activity

            • Storage engine variables don't start with engine_innodb_ or engine_myisam_. And fulltext parser variables don't start with ftparser_simple_ or ftparser_ngram_. And so on. That is, I think, mhnsw_ is enough, and there's no need to start from the index type.
            • This isn't easily doable within the current framework. And there can be a use case for different distances per algorithm, so I'd suggest to keep it this way too.
            • I agree here, but don't have suggestions either yet
            serg Sergei Golubchik added a comment - Storage engine variables don't start with engine_innodb_ or engine_myisam_ . And fulltext parser variables don't start with ftparser_simple_ or ftparser_ngram_ . And so on. That is, I think, mhnsw_ is enough, and there's no need to start from the index type. This isn't easily doable within the current framework. And there can be a use case for different distances per algorithm, so I'd suggest to keep it this way too. I agree here, but don't have suggestions either yet
            serg Sergei Golubchik added a comment - - edited

            "min limit" is reasonably easy to explain.

            SET STATEMENT mhnws_min_limit=20 SELECT id FROM t1 ORDER BY vec_distance_euclidean(vec, @q) LIMIT 10
            

            should return exactly the same result as

            SELECT * FROM (SELECT id FROM t1 ORDER BY vec_distance_euclidean(vec, @q) LIMIT 20) t LIMIT 10
            

            for simplicity let's ignore the fact that outer select result is unordered
            except that it avoids the unnecessary work of retrieving the extra 10 rows and creating an intermediate result set. That is "min limit" is, indeed, a minimal value of LIMIT that the search is performed with.

            number of edges per node is a degree:

            The maximum degree of a graph G is denoted by Δ(G)

            so "max edges per node" can also be "max degree", and it's shorter, but not any more clear

            Or simply "neighbors" (that is mhnsw_neighbors), "max" doesn't add any value or clarity here.

            It could be something totally unrelated to its underlying meaning, like "quality factor" — emphasizing a choice between "fast and bad" and "slow and good".

            Or some middle-ground, like "[max] scale". Probably, less strange than "degree", if you scale up, index size (on disk and in memory) grows. And the search becomes slower and better.

            Note that any name that describes the underlying semantics — "degree", "neighbors", "edges_per_node" — is technically wrong, because HNSW uses 2M value for the ground layer, and that's the layer that has exponentially more nodes than others.

            serg Sergei Golubchik added a comment - - edited "min limit" is reasonably easy to explain. SET STATEMENT mhnws_min_limit=20 SELECT id FROM t1 ORDER BY vec_distance_euclidean(vec, @q) LIMIT 10 should return exactly the same result as SELECT * FROM ( SELECT id FROM t1 ORDER BY vec_distance_euclidean(vec, @q) LIMIT 20) t LIMIT 10 for simplicity let's ignore the fact that outer select result is unordered except that it avoids the unnecessary work of retrieving the extra 10 rows and creating an intermediate result set. That is "min limit" is, indeed, a minimal value of LIMIT that the search is performed with. number of edges per node is a degree : The maximum degree of a graph G is denoted by Δ( G ) so "max edges per node" can also be "max degree", and it's shorter, but not any more clear Or simply "neighbors" (that is mhnsw_neighbors ), "max" doesn't add any value or clarity here. It could be something totally unrelated to its underlying meaning, like "quality factor" — emphasizing a choice between "fast and bad" and "slow and good". Or some middle-ground, like "[max] scale". Probably, less strange than "degree", if you scale up, index size (on disk and in memory) grows. And the search becomes slower and better. Note that any name that describes the underlying semantics — "degree", "neighbors", "edges_per_node" — is technically wrong, because HNSW uses 2M value for the ground layer, and that's the layer that has exponentially more nodes than others.
            elenst Elena Stepanova added a comment - - edited

            While I still don't have any good suggestions, and I don't expect to come up with any bright ideas as I'm not good at it, here is my general consideration on the subject.

            I don't think "easy to explain" is a sufficient criterion. Any name is easy to explain, we can call it "mhnsw_num4" and say that in the implementation, it is the 4th numeric value in order of appearance, hence the name. I think the main factor at choosing a good variable name is not how precisely it reflects semantics of the underlying implementation, but how intuitively understandable it is for those who is not familiar with the implementation without anyone having to explain it.

            For example, mhnsw_cache_size is fairly intuitive (apart from the unfortunate "mhnsw" part itself of course): there is a cache used for it, and that's its size.
            For fine-tuning a database admin may still need to read documentation to find out when it is allocated, whether it's per connection or global, etc., but at least database admins have the general idea that caches are for, and that a bigger size of the cache (whatever it caches) is usually better but requires more memory, and so on.

            Comparing to it, mhnsw_min_limit is rather impossible to guess. If I were to try before I learned what it is, I would say that it probably means the minimal number you can use in LIMIT clause when mhnsw is at work, that is, with mhnsw_min_limit=10 you can do ORDER BY vec_distance_euclidean(vec, @q) LIMIT 10 (or LIMIT 11 etc.), but LIMIT 9 will cause an error.

            And as for mhnsw_max_edges_per_node, I wouldn't even venture to guess.

            However, a fair counter-argument against an effort to come up with something understandable here is that MariaDB has tons of incomprehensible variable names, so there is nothing special about these ones. That's something I wouldn't be able to argue with.

            elenst Elena Stepanova added a comment - - edited While I still don't have any good suggestions, and I don't expect to come up with any bright ideas as I'm not good at it, here is my general consideration on the subject. I don't think "easy to explain" is a sufficient criterion. Any name is easy to explain, we can call it "mhnsw_num4" and say that in the implementation, it is the 4th numeric value in order of appearance, hence the name. I think the main factor at choosing a good variable name is not how precisely it reflects semantics of the underlying implementation, but how intuitively understandable it is for those who is not familiar with the implementation without anyone having to explain it. For example, mhnsw_cache_size is fairly intuitive (apart from the unfortunate "mhnsw" part itself of course): there is a cache used for it, and that's its size. For fine-tuning a database admin may still need to read documentation to find out when it is allocated, whether it's per connection or global, etc., but at least database admins have the general idea that caches are for, and that a bigger size of the cache (whatever it caches) is usually better but requires more memory, and so on. Comparing to it, mhnsw_min_limit is rather impossible to guess. If I were to try before I learned what it is, I would say that it probably means the minimal number you can use in LIMIT clause when mhnsw is at work, that is, with mhnsw_min_limit=10 you can do ORDER BY vec_distance_euclidean(vec, @q) LIMIT 10 (or LIMIT 11 etc.), but LIMIT 9 will cause an error. And as for mhnsw_max_edges_per_node , I wouldn't even venture to guess. However, a fair counter-argument against an effort to come up with something understandable here is that MariaDB has tons of incomprehensible variable names, so there is nothing special about these ones. That's something I wouldn't be able to argue with.

            The only even remotely intuitive I can think of would be mhnsw_search_quality_level and mhnsw_index_quality_level. The first one affect the search, if it's larger the quality is better and the search is slower. The second affects the index itself, if it's larger the index (and thus search results) will be better, but index will be larger and slower (that is, both insert and search will be slower).

            And no, I don't think these are good names.

            serg Sergei Golubchik added a comment - The only even remotely intuitive I can think of would be mhnsw_search_quality_level and mhnsw_index_quality_level . The first one affect the search, if it's larger the quality is better and the search is slower. The second affects the index itself, if it's larger the index (and thus search results) will be better, but index will be larger and slower (that is, both insert and search will be slower). And no, I don't think these are good names.

            It would be great to have an opinion of somebody who didn't implement it and didn't test it, but is close enough to the target audience.

            I think these are better at least, although the term "quality" is very subjective. For somebody, the main "quality" of an approximate search will be the correctness, for others the performance. Maybe mhnsw_search_precision_level?
            For the index, "quality" may even be all right.

            elenst Elena Stepanova added a comment - It would be great to have an opinion of somebody who didn't implement it and didn't test it, but is close enough to the target audience. I think these are better at least, although the term "quality" is very subjective. For somebody, the main "quality" of an approximate search will be the correctness, for others the performance. Maybe mhnsw_search_precision_level ? For the index, "quality" may even be all right.

            also, perhaps mhnsw_cache_size should be max cache size ? Because it doesn't allocate all that memory at once, instead memory usage grows until it reaches that value.

            serg Sergei Golubchik added a comment - also, perhaps mhnsw_cache_size should be max cache size ? Because it doesn't allocate all that memory at once, instead memory usage grows until it reaches that value.

            If you think it's more consistent with other similar variables, sure.
            I thought when "max" is used for such purposes in MariaDB, it usually means that a query which hits it will actually fail complaining that it cannot be executed. But I don't actually have any statistics to support it, it was just a subjective impression.

            elenst Elena Stepanova added a comment - If you think it's more consistent with other similar variables, sure. I thought when "max" is used for such purposes in MariaDB, it usually means that a query which hits it will actually fail complaining that it cannot be executed. But I don't actually have any statistics to support it, it was just a subjective impression.

            I think these are better at least, although the term "quality" is very subjective. For somebody, the main "quality" of an approximate search will be the correctness, for others the performance. Maybe mhnsw_search_precision_level? For the index, "quality" may even be all right.

            mhnsw_search_precision_level is a bit too long to my taste, but, of course, it's not a deciding factor.

            Having "quality" in both highlights that they're complementary, both improve results when increased and improve speed when decreased, and one can increase one and compensate by decreasing the other. So I'd suggest to have the same suffix for both.

            search/index precision level? Or may be "accuracy" So

            • mhnsw_index_precision_level — mhnsw_search_precision_level
            • mhnsw_index_precision — mhnsw_search_precision
            • mhnsw_index_quality — mhnsw_search_quality
            • mhnsw_index_accuracy — mhnsw_search_accuracy
            serg Sergei Golubchik added a comment - I think these are better at least, although the term "quality" is very subjective. For somebody, the main "quality" of an approximate search will be the correctness, for others the performance. Maybe mhnsw_search_precision_level ? For the index, "quality" may even be all right. mhnsw_search_precision_level is a bit too long to my taste, but, of course, it's not a deciding factor. Having "quality" in both highlights that they're complementary, both improve results when increased and improve speed when decreased, and one can increase one and compensate by decreasing the other. So I'd suggest to have the same suffix for both. search/index precision level? Or may be "accuracy" So mhnsw_index_precision_level — mhnsw_search_precision_level mhnsw_index_precision — mhnsw_search_precision mhnsw_index_quality — mhnsw_search_quality mhnsw_index_accuracy — mhnsw_search_accuracy

            Right, we can lose "level", it doesn't mean anything anyway, and given the allowed ranges (e.g. no "level 1" for the index) it can even be confusing.
            From the above, "accuracy" sounds most universal to me, but I rarely represent the majority in such matters.

            elenst Elena Stepanova added a comment - Right, we can lose "level", it doesn't mean anything anyway, and given the allowed ranges (e.g. no "level 1" for the index) it can even be confusing. From the above, "accuracy" sounds most universal to me, but I rarely represent the majority in such matters.
            serg Sergei Golubchik added a comment - - edited

            Another consideration (from cvicentiu) that users mainly use vector stores through an AI framework, hardly anyone does it directly. Meaning, it's much less important whether there variables are intuitively understandable by end users, as whether they're intuitively understandable by people, writing vector store connectors for AI frameworks. And for that we should use names "same as everyone else". That is "ef" (or "ef_search") and "M". And may be simply "distance" without "_function" part for brevity.

            Thus, an alternative proposal is

            SET @@mhnsw_default_m=16;
            SET @@mhnsw_default_distance=euclidean;
            SET @@mhnsw_ef_search=30;
             
            CREATE TABLE t1 (
              v VECTOR(10),
              VECTOR INDEX (v) M=24 DISTANCE=COSINE
            )
            

            serg Sergei Golubchik added a comment - - edited Another consideration (from cvicentiu ) that users mainly use vector stores through an AI framework, hardly anyone does it directly. Meaning, it's much less important whether there variables are intuitively understandable by end users, as whether they're intuitively understandable by people, writing vector store connectors for AI frameworks. And for that we should use names "same as everyone else". That is "ef" (or "ef_search") and "M". And may be simply "distance" without "_function" part for brevity. Thus, an alternative proposal is SET @@mhnsw_default_m=16; SET @@mhnsw_default_distance=euclidean; SET @@mhnsw_ef_search=30;   CREATE TABLE t1 ( v VECTOR(10), VECTOR INDEX (v) M=24 DISTANCE=COSINE )

            People

              serg Sergei Golubchik
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.