MariaDB Server · MDEV-36100

Generate embeddings automatically on INSERT

    Description

      To simplify the vector-based pipeline, we could remove the need for a user to generate vector embeddings before storing them in the database.

      Instead, the database server should be able to do it automatically and transparently behind the scenes.

      This task is about implementing a hook and an API that allow adding custom code that is invoked on INSERT (or UPDATE) and converts the data into an embedding. This implies that the original data is stored in the database too.

      A possible SQL syntax could be based on the existing WITH PARSER clause.
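
      As an illustration only, such syntax might look like the sketch below. The parser name and the column layout are hypothetical; nothing like this exists today:

      ```sql
      -- Hypothetical sketch: the openai_embed parser plugin does not exist.
      -- The idea is that it would fill v from doc on every INSERT/UPDATE.
      CREATE TABLE t1 (
        doc TEXT,
        v VECTOR(1536) NOT NULL,
        VECTOR INDEX (v) WITH PARSER openai_embed
      );
      ```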

      One limitation of this approach: it leaves no place for chunking, as it strictly assumes one document = one embedding. Chunking can be done outside of the INSERT, with a stored procedure or a special LOAD DATA plugin (MDEV-28395).

      A limitation specific to the WITH PARSER syntax: it doesn't cache generated embeddings. To actually save embeddings in the database we could go with a simple UDF function plugin. Another limitation: it doesn't allow combining steps into a single data-processing pipeline, which can easily be done with functions, like

      INSERT INTO t1 (doc) VALUES (generate_embedding(pdf2text(wget('https://.....pdf'))));
      

      Functions seem more versatile; they can even do chunking (returning a new chunk on every call, which can be wrapped into an SQL WHILE loop or, once we have them, a table function). A function-based approach can be implemented in steps:

      1. add a function plugin to generate embeddings, and maybe a few more for various LLMs and for helper transformations (like pdf2text, OCR, or chunking)
        • how do we pass secret parameters, like an OpenAI key, into such a function?
      2. introduce the concept of an "expensive" function: if a stored generated column uses an "expensive" function, the server should avoid regenerating it whenever possible; ALTER/OPTIMIZE/etc. should not regenerate it.
      3. extend this to virtual indexed columns. If an index can find a value by row id (normally indexes cannot do this, but the mhnsw index, for example, can), then virtual indexed columns can reuse values and avoid regenerating them just as if they were stored; this would reduce storage requirements by 66%.
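
      Step 2 would apply to tables like the following sketch, where generate_embedding() is a hypothetical "expensive" UDF:

      ```sql
      -- Sketch: today ALTER TABLE ... FORCE would call generate_embedding()
      -- again for every row; with the proposed "expensive" flag the server
      -- would copy the already-stored value instead.
      CREATE TABLE t1 (
        doc TEXT,
        v VECTOR(1536) AS (generate_embedding(doc)) STORED
      );
      ```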

          Activity

            Y-jiji Tianji Yang added a comment (edited)

            While this may be a bit out of scope, I think we could change the syntax of `WITH PARSER` to allow further configuration, such as chunking, e.g. WITH PARSER(CHUNK=LINES).
            BTW, I'm here for GSoC:

            • email: tyang425@gatech.edu
            • github user id: Y-jiji

            serg Sergei Golubchik added a comment

            You cannot do chunking this way. When one does

            INSERT INTO t1 (doc) VALUES ('.... big document ...')
            

            one inserts one row. If the chunking is done outside of the server, the document will be split into many chunks and stored in many rows. You cannot achieve that with WITH PARSER(CHUNK=LINES): no WITH PARSER clause can turn a one-row insert into a multi-row insert.
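
            For reference, the "chunking outside of the INSERT" route from the description could look like the stored-procedure sketch below; chunk_count(), get_chunk() and generate_embedding() are hypothetical UDFs, and t1 is assumed to have chunk and v columns:

            ```sql
            DELIMITER //
            CREATE PROCEDURE insert_chunked(IN p_doc LONGTEXT)
            BEGIN
              DECLARE i INT DEFAULT 1;
              -- one INSERT per chunk: this is what turns one document into many rows
              WHILE i <= chunk_count(p_doc) DO
                INSERT INTO t1 (chunk, v)
                  VALUES (get_chunk(p_doc, i),
                          generate_embedding(get_chunk(p_doc, i)));
                SET i = i + 1;
              END WHILE;
            END //
            DELIMITER ;
            ```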
            Y-jiji Tianji Yang added a comment

            I see. My impression was that the vector generated from each chunk would link back to the original row. Now this limitation makes sense.

            serg Sergei Golubchik added a comment

            You're right, that is possible: split into chunks when indexing, generate embeddings per chunk, and have them all link to the original row.

            But I think it would defeat the purpose. Normally you want to find chunks and provide them as context for RAG. If you find the whole row every time, that is too much context and RAG won't work well; you only want to provide the most relevant chunks of the document, which means they have to be in separate rows.
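
            The separate-rows layout described above could look like this; table and column names are illustrative:

            ```sql
            -- Illustrative schema: each chunk is its own row, linked to its document.
            CREATE TABLE docs (
              id INT PRIMARY KEY AUTO_INCREMENT,
              body LONGTEXT
            );
            CREATE TABLE chunks (
              doc_id INT,                -- links back to docs.id
              chunk TEXT,                -- the text handed to the RAG prompt
              v VECTOR(1536) NOT NULL,   -- embedding of this chunk only
              VECTOR INDEX (v)
            );
            ```

            A nearest-neighbour search on chunks.v then returns individual chunks, not whole documents.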
            Y-jiji Tianji Yang added a comment (edited)

            • Regarding the secret parameters, it seems we could implement this using system variables (MYSQL_SYSVAR_STR). That way the plugin, used by any user, can access the variable without exposing it directly to the user, yet only the admin or privileged users can set and read it.
            • I'm not sure how <EXPRESSION> AS <COLUMN_NAME> STORED is handled during ALTER TABLE. I'm a bit lost trying to understand mysql_alter_table (1700 lines, intimidating). If a column is marked as STORED, will the copy algorithm of ALTER TABLE just copy it anyway? I need some help locating where the column gets recomputed; whom can I ask?
            • I will try to figure out how to implement the second goal, and think about how to reuse the values in the mhnsw index.
            serg Sergei Golubchik added a comment

            • Yes, it's doable. On the other hand, if several users run RAG-like applications on the same MariaDB instance, you would not want to give them all admin rights so that each could set their own API_KEY, and it isn't even possible to store three different values in one global sysvar at the same time. But if you make it a session variable (MYSQL_THDVAR_STR), any user can set it to their own API_KEY without interfering with others.
            • mysql_alter_table is too complex, and 99% of it is not important here. Try this: create a table with a stored generated column that uses a not very common function, like SIN(). Then, before the ALTER TABLE, set a breakpoint on Item_func_sin::val_real(), and you'll see where the value is recomputed.
            • The idea behind the second goal is: if the vcol expression is marked as "expensive", it is not recomputed in ALTER TABLE or anywhere else; the old computed value is copied into the new table as if it were a normal column, not a generated one.
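
            Spelled out, that debugging recipe looks like this (the table is a throwaway example):

            ```sql
            CREATE TABLE gdb_demo (x DOUBLE, s DOUBLE AS (SIN(x)) STORED);
            INSERT INTO gdb_demo (x) VALUES (1), (2);
            -- attach a debugger to mariadbd and set:
            --   break Item_func_sin::val_real
            -- then trigger a table rebuild:
            ALTER TABLE gdb_demo FORCE;   -- the breakpoint fires once per copied row
            ```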

            People

              Assignee: Unassigned
              Reporter: serg Sergei Golubchik

