Details

    Description

      Use a real-life dataset of about a million entries

      Benchmark goals:

      • Index creation time
      • Index update time
      • Index lookup for 1, 10, 100, 1000 entries for the same "query" vector repeated N times.
        • This tests speed of graph lookup with hopefully most data in cache.
      • Index lookup for a million different query vectors (top 1, 10, 100, 1000 results), no repetition.
        • This tests speed of graph lookup when data may not all fit in cache.
      • Compute average recall of the algorithm for such queries.

      Compare to papers using the same algorithm.
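The recall measurement listed in the goals above can be sketched in a few lines. This is a hypothetical helper, not part of any benchmark script: recall per query is the fraction of the true top-k neighbors that the index actually returned.

```python
def average_recall(ground_truth, retrieved):
    """Mean recall over all queries: for each query, the fraction of
    true top-k neighbor ids that appear in the returned id list."""
    total = 0.0
    for truth, found in zip(ground_truth, retrieved):
        total += len(set(truth) & set(found)) / len(truth)
    return total / len(ground_truth)

# Two queries, top-2: the first finds both true neighbors,
# the second finds one of two, so the average recall is 0.75.
print(average_recall([[1, 2], [3, 4]], [[2, 1], [3, 9]]))
```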

          Activity

            wenhug Hugo Wen added a comment -

            I'm looking into the vector benchmarking.

            wenhug Hugo Wen added a comment -

            I've drafted a PR for supporting vector ANN search benchmarking based on the bb-11.4-vec branch.

            This PR introduces scripts and a Dockerfile for executing the ann-benchmarks tool against MariaDB. The scripts support running both in GitLab CI and manually. Feel free to test the scripts by following the instructions outlined in the script help and description provided in the PR/commit message. Additionally, I've provided the outputs and execution results in the PR comment section.

            Please be aware that the current PR utilizes the "mariadb" module in ann-benchmarks, which I created in my forked repository on GitHub. Once an official version of MariaDB that supports vectors is available, I will

            • update ann-benchmarks "mariadb" module to use official MariaDB build and contribute the updated module to ann-benchmarks.
            • update the MariaDB scripts to utilize the official branch of ann-benchmarks.
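For orientation, ann-benchmarks drives each system through a small algorithm class exposing `fit` and `query` methods. The sketch below shows that shape only: the class name is made up, and a brute-force in-memory search stands in for the real SQL queries that the actual "mariadb" module in the PR issues against the server.

```python
import math

class MariaDBStandIn:
    """Shape of an ann-benchmarks algorithm module (fit/query),
    with brute-force search standing in for the SQL side."""

    def __init__(self, metric="euclidean"):
        self.metric = metric
        self.data = []

    def fit(self, X):
        # Real module: create the table with a vector column and load X.
        self.data = [list(v) for v in X]

    def query(self, q, k):
        # Real module: a SELECT ordered by vector distance, LIMIT k.
        dists = [(math.dist(q, v), i) for i, v in enumerate(self.data)]
        return [i for _, i in sorted(dists)[:k]]

algo = MariaDBStandIn()
algo.fit([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(algo.query([0.9, 0.1], k=2))  # -> [1, 0]
```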
            rdyas Robert Dyas added a comment -

            I'm trying to get an understanding of what scale this might be able to handle.
            400GB of text chunks at 2KB per embedding is 200mm vectors. Let's assume only 512 dimensions.
            Do you have any feel for what kind of hardware it will take to run this?
            It doesn't have to be fast initially (500ms is more than good enough) but it has to work.

            serg Sergei Golubchik added a comment - - edited

            The largest I've tried so far was the gist-960 dataset from ann-benchmarks: 1 million vectors, 960 dimensions. It fits well within 2GB on disk and may take a bit more than that (2.5GB or less?) in RAM.

            If "200 mm" means 200 million, that's, well, 200 times more, but your vectors are about half the size, so say 250GB should fit the whole graph. Otherwise it'll be a disk-bound load — the accessed part of the index is cached, but if the whole index doesn't fit in memory, the server will have to read from disk, which is slow. Pay attention to the @@mhnsw_cache_size variable.
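The 250GB ballpark can be reproduced by scaling the gist-960 measurement above; the 25% margin in the last line is my own rounding, not a measured number.

```python
# Measured: gist-960 (1M vectors, 960 dims) fits within ~2 GB on disk.
base_gb, base_vectors, base_dims = 2.0, 1_000_000, 960

# Target workload: 200M vectors, 512 dims.
vectors, dims = 200_000_000, 512

# Linear scaling in vector count and in dimensionality.
scaled = base_gb * (vectors / base_vectors) * (dims / base_dims)
print(round(scaled))          # ~213 GB from pure scaling
print(round(scaled * 1.25))   # ~267 GB with a rough 25% margin
```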

            Note, though, that as of now the INSERT performance is rather slow, so loading these 400GB will take a long time. It should be better in the release.

            rdyas Robert Dyas added a comment -

            Yes, by 200mm I mean 200 million.
            I am not familiar with how vector search is implemented, so I don't have a feel for speed if the index is disk-bound.
            Say a server has 32GB of RAM and a 250GB vector index sitting on an SSD. Can I reasonably expect a vector search for top 5 to work? Any rough guess of query time?
            And if I need to add 100,000 vectors each night in a batch, would I just drop the index, add the vectors, and re-index?


            serg Sergei Golubchik added a comment -

            About disk bound: most of the graph search algorithms assume the index is completely in memory, so disk-bound search will be slow. On the other hand, you said 500ms, which is very slow too, so it looks very possible. I can run some benchmarks like that later (after the current benchmarks finish).

            About adding 100,000 vectors each night in a batch — no, definitely not drop the index. Just add them. But 1) we need to implement support for bulk insert (this will happen before the release) and 2) your batch must be recognized as a bulk insert (I don't know yet how we'll do it, but there will be a way to tell the server "this is a bulk insert"; maybe LOCK TABLES will be needed as a hint). Then it'll be perfectly doable. I can insert 60,000 vectors of 784 dimensions in a few minutes.

            By the way, thanks for asking. It helps to understand what use cases are important, for example, that we need to support bulk insert into a non-empty index.
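Since the bulk-insert hint isn't decided yet (possibly LOCK TABLES, per the comment above), a nightly batch could at least be staged as one multi-row INSERT. The helper below is hypothetical: the table layout and the `VEC_FromText` text-to-vector conversion are assumptions to be checked against the actual server syntax.

```python
def build_batch_insert(table, vectors):
    """Build one multi-row INSERT statement for a vector batch.
    VEC_FromText('[x,y,...]') is the assumed conversion function;
    adjust table/column names to the real schema."""
    rows = ",\n".join(
        "(VEC_FromText('[%s]'))" % ",".join(str(x) for x in v)
        for v in vectors
    )
    return f"INSERT INTO {table} (v) VALUES\n{rows};"

sql = build_batch_insert("chunks", [[0.1, 0.2], [0.3, 0.4]])
print(sql)
```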


            People

              serg Sergei Golubchik
