MariaDB Server · MDEV-36100

Generate embeddings automatically on INSERT

    Description

      To simplify the vector-based pipeline, we could remove the need for a user to generate vector embeddings before storing them in the database.

      Instead, the database server should be able to do it automatically and transparently behind the scenes.

      This task is about implementing a hook and an API that allow adding custom code that is invoked on INSERT (or UPDATE) and converts the data into an embedding. This implies that the original data is stored in the database too.

      A possible SQL syntax could be based on the existing WITH PARSER clause.
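
      As an illustration only, such syntax might look like the sketch below. The parser name and the column layout are hypothetical; nothing like this exists today:

      ```sql
      -- Hypothetical sketch: the openai_embed parser plugin does not exist.
      -- The idea is that it would fill v from doc on every INSERT/UPDATE.
      CREATE TABLE t1 (
        doc TEXT,
        v VECTOR(1536) NOT NULL,
        VECTOR INDEX (v) WITH PARSER openai_embed
      );
      ```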

      One limitation of this approach: it leaves no place for chunking, as it strictly assumes one document = one embedding. Chunking can be done outside of the INSERT, with a stored procedure or a special LOAD DATA plugin (MDEV-28395).

      A limitation specific to the WITH PARSER syntax: it doesn't cache generated embeddings. To actually save embeddings in the database we could go with a simple UDF function plugin. Another limitation: it doesn't allow combining steps into a single data-processing pipeline, which can easily be done with functions, like

      INSERT INTO t1 (doc) VALUES (generate_embedding(pdf2text(wget('https://.....pdf'))));
      

      Functions seem more versatile; they can even do chunking (returning a new chunk on every call, which can be wrapped into an SQL WHILE loop or, once we have them, a table function). A function-based approach can be implemented in steps:

      1. add a function plugin to generate embeddings, and maybe a few more for various LLMs and for helper transformations (like pdf2text, OCR, or chunking)
        • how do we pass secret parameters, like an OpenAI key, into such a function?
      2. introduce the concept of an "expensive" function: if a stored generated column uses an "expensive" function, the server should avoid regenerating it whenever possible; ALTER/OPTIMIZE/etc. should not regenerate it.
      3. extend this to virtual indexed columns. If an index can find a value by row id (normally indexes cannot do this, but the mhnsw index, for example, can), then virtual indexed columns can reuse values and avoid regenerating them just as if they were stored; this would reduce storage requirements by 66%.
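
      Step 2 would apply to tables like the following sketch, where generate_embedding() is a hypothetical "expensive" UDF:

      ```sql
      -- Sketch: today ALTER TABLE ... FORCE would call generate_embedding()
      -- again for every row; with the proposed "expensive" flag the server
      -- would copy the already-stored value instead.
      CREATE TABLE t1 (
        doc TEXT,
        v VECTOR(1536) AS (generate_embedding(doc)) STORED
      );
      ```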

          Activity

            Y-jiji Tianji Yang added a comment (edited)

            While this may be a bit out of scope, I think we could change the syntax of `WITH PARSER` to allow further configuration, such as chunking, e.g. WITH PARSER(CHUNK=LINES).
            BTW, I'm here for GSoC:

            • email: tyang425@gatech.edu
            • github user id: Y-jiji

            serg Sergei Golubchik added a comment

            You cannot do chunking this way. When one does

            INSERT INTO t1 (doc) VALUES ('.... big document ...')
            

            one inserts one row. If the chunking is done outside of the server, the document will be split into many chunks and stored in many rows. You cannot achieve that with WITH PARSER(CHUNK=LINES): no WITH PARSER clause can turn a one-row insert into a multi-row insert.
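
            For reference, the "chunking outside of the INSERT" route from the description could look like the stored-procedure sketch below; chunk_count(), get_chunk() and generate_embedding() are hypothetical UDFs, and t1 is assumed to have chunk and v columns:

            ```sql
            DELIMITER //
            CREATE PROCEDURE insert_chunked(IN p_doc LONGTEXT)
            BEGIN
              DECLARE i INT DEFAULT 1;
              -- one INSERT per chunk: this is what turns one document into many rows
              WHILE i <= chunk_count(p_doc) DO
                INSERT INTO t1 (chunk, v)
                  VALUES (get_chunk(p_doc, i),
                          generate_embedding(get_chunk(p_doc, i)));
                SET i = i + 1;
              END WHILE;
            END //
            DELIMITER ;
            ```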
            Y-jiji Tianji Yang added a comment

            I see. My impression was that the vector generated from each chunk would link back to the original row. Now this limitation makes sense.

            serg Sergei Golubchik added a comment

            You're right, that is possible: split into chunks when indexing, generate embeddings per chunk, and have them all link to the original row.

            But I think it would defeat the purpose. Normally you want to find chunks and provide them as context for RAG. If you find the whole row every time, that is too much context and RAG won't work well; you only want to provide the most relevant chunks of the document, which means they have to be in separate rows.
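
            The separate-rows layout described above could look like this; table and column names are illustrative:

            ```sql
            -- Illustrative schema: each chunk is its own row, linked to its document.
            CREATE TABLE docs (
              id INT PRIMARY KEY AUTO_INCREMENT,
              body LONGTEXT
            );
            CREATE TABLE chunks (
              doc_id INT,                -- links back to docs.id
              chunk TEXT,                -- the text handed to the RAG prompt
              v VECTOR(1536) NOT NULL,   -- embedding of this chunk only
              VECTOR INDEX (v)
            );
            ```

            A nearest-neighbour search on chunks.v then returns individual chunks, not whole documents.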
            Y-jiji Tianji Yang added a comment (edited)

            • Regarding the secret parameters, it seems we could implement this using system variables (MYSQL_SYSVAR_STR). That way the plugin, used by any user, can access the variable without exposing it directly to the user, yet only the admin or privileged users can set and read it.
            • I'm not sure how <EXPRESSION> AS <COLUMN_NAME> STORED is handled during ALTER TABLE. I'm a bit lost trying to understand mysql_alter_table (1700 lines, intimidating). If a column is marked as STORED, will the copy algorithm of ALTER TABLE just copy it anyway? I need some help locating where the column gets recomputed; whom can I ask?
            • I will try to figure out how to implement the second goal, and think about how to reuse the values in the mhnsw index.
            serg Sergei Golubchik added a comment

            • Yes, it's doable. On the other hand, if several users run RAG-like applications on the same MariaDB instance, you would not want to give them all admin rights so that each could set their own API_KEY, and it isn't even possible to store three different values in one global sysvar at the same time. But if you make it a session variable (MYSQL_THDVAR_STR), any user can set it to their own API_KEY without interfering with others.
            • mysql_alter_table is too complex, and 99% of it is not important here. Try this: create a table with a stored generated column that uses a not very common function, like SIN(). Then, before the ALTER TABLE, set a breakpoint on Item_func_sin::val_real(), and you'll see where the value is recomputed.
            • The idea behind the second goal is: if the vcol expression is marked as "expensive", it is not recomputed in ALTER TABLE or anywhere else; the old computed value is copied into the new table as if it were a normal column, not a generated one.
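
            Spelled out, that debugging recipe looks like this (the table is a throwaway example):

            ```sql
            CREATE TABLE gdb_demo (x DOUBLE, s DOUBLE AS (SIN(x)) STORED);
            INSERT INTO gdb_demo (x) VALUES (1), (2);
            -- attach a debugger to mariadbd and set:
            --   break Item_func_sin::val_real
            -- then trigger a table rebuild:
            ALTER TABLE gdb_demo FORCE;   -- the breakpoint fires once per copied row
            ```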

            People

              Assignee: Unassigned
              Reporter: serg Sergei Golubchik

