MariaDB Server / MDEV-36100

Generate embeddings automatically on INSERT

Details

    Description

To simplify the vector-based pipeline, we could remove the need for a user to generate vector embeddings before storing them in the database.

Instead, the database server should be able to do it automatically and transparently, behind the scenes.

This task is about implementing a hook and an API that allow adding custom code, invoked on INSERT (or UPDATE), that converts the data into an embedding. This implies that the original data is stored in the database too.

      A possible SQL syntax could be based on the existing WITH PARSER clause.
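
A purely illustrative sketch of what that could look like, by analogy with FULLTEXT ... WITH PARSER. Here openai_embedder is an assumed plugin name and the clause placement is not a settled design:

      CREATE TABLE docs (
        id INT PRIMARY KEY,
        doc TEXT,
        -- hypothetical: an embedding plugin attached the way
        -- fulltext parser plugins are attached today
        VECTOR INDEX (doc) WITH PARSER openai_embedder
      );

      -- the user inserts plain text; the plugin would convert it
      -- into an embedding whenever the index is updated
      INSERT INTO docs VALUES (1, 'some document text');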

One limitation of this approach: it leaves no place for chunking, as it strictly assumes one document = one embedding. Chunking can be done outside of the INSERT, with a stored procedure or a special LOAD DATA plugin (MDEV-28395).

A limitation specific to the WITH PARSER syntax: it doesn't cache the generated embeddings. To actually save embeddings in the database we could go with a simple function plugin instead. Another limitation: it doesn't allow combining steps into a single data-processing pipeline, which can easily be done with functions, like

      INSERT INTO t1 (doc) VALUES (generate_embedding(pdf2text(wget('https://.....pdf'))));
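
A minimal sketch of how a function plugin together with a stored generated column could cache the embedding. The names generate_embedding and embed_udf.so are assumptions, and note that the server currently rejects UDFs in generated-column expressions, so lifting that restriction would be part of the work:

      -- hypothetical plugin: generate_embedding / embed_udf.so don't exist yet
      CREATE FUNCTION generate_embedding RETURNS STRING SONAME 'embed_udf.so';

      CREATE TABLE t1 (
        doc TEXT,
        -- computed once on INSERT/UPDATE and saved with the row
        emb VECTOR(1536) AS (generate_embedding(doc)) STORED,
        VECTOR INDEX (emb)
      );

      INSERT INTO t1 (doc) VALUES ('some document text');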
      

Functions seem to be more versatile: they can even do chunking, returning a new chunk on every call, which can be wrapped into an SQL WHILE loop (sketched after the list below) or into a table function, once we have them. The function-based approach can be implemented in steps:

1. add a function plugin to generate embeddings, and maybe a few more for various LLMs and for helper transformations (like pdf2text, OCR, or chunking)
  • how to pass secret parameters, like an OpenAI API key, into such a function?
2. introduce the concept of an "expensive" function: if a stored generated column uses an "expensive" function, the server should avoid re-generating it whenever possible; ALTER/OPTIMIZE/etc. should not regenerate it.
3. extend this to indexed virtual columns: if an index can find a value by row id (normally indexes cannot do that, but the mhnsw vector index, for example, can), then indexed virtual columns can reuse the value and avoid regenerating it, just as if the column were stored. This would reduce storage requirements by 66%.
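
A rough sketch of the chunking idea mentioned above: a function returning one chunk per call, driven by an SQL WHILE loop in a stored procedure. next_chunk() is hypothetical; it is assumed to return successive chunks of its argument and NULL when the document is exhausted:

      DELIMITER //
      -- next_chunk() is an assumed chunking function, not an existing one
      CREATE PROCEDURE insert_chunked(IN full_doc LONGTEXT)
      BEGIN
        DECLARE chunk TEXT;
        SET chunk = next_chunk(full_doc);
        WHILE chunk IS NOT NULL DO
          INSERT INTO t1 (doc) VALUES (chunk);  -- embedding column fills itself
          SET chunk = next_chunk(full_doc);
        END WHILE;
      END //
      DELIMITER ;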
