- Automated embedding generation: you can create a vectorizer for a specified table, which automatically generates embeddings for the data in that table and keeps them in sync with the source data.
- Automatic synchronization: a vectorizer creates triggers on the source table, ensuring that embeddings are automatically updated when the source data changes.
- Background processing: the process to create embeddings runs asynchronously in the background. This minimizes the impact on regular database operations such as INSERT, UPDATE, and DELETE.
- Scalability: a vectorizer processes data in batches and can run concurrently. This enables vectorizers to handle large datasets efficiently.
- Configurable embedding process: a vectorizer is highly configurable, allowing you to specify:
  - The embedding model and dimensions. For example, the nomic-embed-text model in Ollama.
  - Chunking strategies for text data.
  - Formatting templates for combining multiple fields.
  - Indexing options for efficient similarity searches.
  - Scheduling for background processing.
- Integration with multiple AI providers: a vectorizer supports multiple embedding providers, including OpenAI, Ollama, Voyage AI, and others through LiteLLM.
- Efficient storage and retrieval: embeddings are stored in a separate table with appropriate indexing, optimizing for vector similarity searches.
- View creation: a view is automatically created to join the original data with its embeddings, making it easy to query and use the embedded data.
- Fine-grained access control: you can specify the roles that have access to a vectorizer and its related objects.
- Monitoring and management: monitor the vectorizer’s queue, enable/disable scheduling, and manage the vectorizer lifecycle.
- Install or upgrade the database objects necessary for vectorizer.
- Create vectorizers: automate the process of creating embeddings for table data.
- Loading configuration: define the source of the data to embed. You can load data from a column in the source table, or from a file referenced in a column of the source table.
- Parsing configuration: for documents, define the way the data is parsed after it is loaded.
- Chunking configuration: define the way text data is split into smaller, manageable pieces before being processed for embeddings.
- Formatting configuration: configure the way data from the source table is formatted before it is sent for embedding.
- Embedding configuration: specify the LLM provider, model, and the parameters to be used when generating the embeddings.
- Indexing configuration: specify the way generated embeddings should be indexed for efficient similarity searches.
- Scheduling configuration: configure when and how often the vectorizer should run in order to process new or updated data.
- Processing configuration: specify the way the vectorizer should process data when generating embeddings.
- Enable and disable vectorizer schedules: temporarily pause or resume the automatic processing of embeddings, without having to delete or recreate the vectorizer configuration.
- Drop a vectorizer: remove a vectorizer that you created previously, and clean up the associated resources.
- View vectorizer status: monitoring tools in pgai that provide insights into the state and performance of vectorizers.
Install or upgrade the database objects necessary for vectorizer
You can install or upgrade the database objects necessary for vectorizer by running the pgai CLI install command; the database objects are created in the ai schema.
The version of the database objects corresponds to the version of the pgai python package you have installed. To upgrade, first upgrade the python package with pip install -U pgai and then run pgai.install(DB_URL) again.
Create vectorizers
You use the ai.create_vectorizer function in pgai to set up and configure an automated system
for generating and managing embeddings for a specific table in your database.
The purpose of ai.create_vectorizer is to:
- Automate the process of creating embeddings for table data.
- Set up necessary infrastructure such as tables, views, triggers, or columns for embedding management.
- Configure the embedding generation process according to user specifications.
- Integrate with AI providers for embedding creation.
- Set up scheduling for background processing of embeddings.
Example usage
By using ai.create_vectorizer, you can quickly set up a sophisticated
embedding system tailored to your specific needs, without having to manually
create and manage all the necessary database objects and processes.
Example 1: Table destination (default)
This approach creates a separate table to store embeddings and a view that joins it with the source table. The example (see the sketch after this list):
- Sets up a vectorizer named ‘website_blog_vectorizer’ for the website.blog table.
- Creates a separate table website.blog_embeddings_store to store embeddings.
- Creates a view website.blog_embeddings joining the source and embeddings.
- Loads the contents column.
- Uses the Ollama nomic-embed-text model to create 768-dimensional embeddings.
- Chunks the content into 128-character pieces with a 10-character overlap.
- Formats each chunk with a title and a published date.
- Grants necessary permissions to the roles bob and alice.
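The original SQL for this example is not reproduced here. The following is a minimal sketch of such a call, using only parameters documented on this page; the title and published columns referenced in the formatting template are assumed from the description above:

```sql
SELECT ai.create_vectorizer(
    'website.blog'::regclass,
    name => 'website_blog_vectorizer',
    destination => ai.destination_table('blog_embeddings'),
    loading => ai.loading_column('contents'),
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    chunking => ai.chunking_character_text_splitter(chunk_size => 128, chunk_overlap => 10),
    formatting => ai.formatting_python_template('title: $title published: $published $chunk'),
    grant_to => ai.grant_to('bob', 'alice')
);
```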
Example 2: Column destination
Column destination places the embedding in a separate column on the source table. It can only be used when the vectorizer does not perform chunking, because it requires a one-to-one relationship between the source data and the embedding. This is useful when you know the source text is short (as is common if chunking has already been done upstream in your data pipeline). The workflow is that your application inserts data into the table with a NULL in the embedding column. The vectorizer then reads the row, generates the embedding, and updates the row with the correct value in the embedding column. The example (see the sketch after this list):
- Sets up a vectorizer named ‘product_descriptions_vectorizer’ for the website.product_descriptions table.
- Adds a column called description_embedding directly to the source table.
- Loads the description column.
- Doesn’t chunk the content (required for column destination).
- Uses OpenAI’s embedding model to create 768-dimensional embeddings.
- Grants necessary permissions to the role marketing_team.
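The original SQL is not reproduced here. A minimal sketch, assuming the documented parameters; the OpenAI model name is illustrative because the description above does not name one:

```sql
SELECT ai.create_vectorizer(
    'website.product_descriptions'::regclass,
    name => 'product_descriptions_vectorizer',
    destination => ai.destination_column('description_embedding'),
    loading => ai.loading_column('description'),
    chunking => ai.chunking_none(),  -- required for column destination
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    grant_to => ai.grant_to('marketing_team')
);
```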
Parameters
ai.create_vectorizer takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| source | regclass | - | ✔ | The source table that embeddings are generated for. |
| name | text | Auto-generated | ✖ | A unique name for the vectorizer. If not provided, it's auto-generated based on the destination type: for table destination, [target_schema]_[target_table]; for column destination, [source_schema]_[source_table]_[embedding_column]. Must follow the snake_case pattern ^[a-z][a-z_0-9]*$ |
| destination | Destination configuration | ai.destination_table() | ✖ | Configure how the embeddings will be stored. Two options are available: ai.destination_table() (default) creates a separate table to store embeddings; ai.destination_column() adds an embedding column directly to the source table |
| embedding | Embedding configuration | - | ✔ | Set how to embed the data. |
| loading | Loading configuration | - | ✔ | Set the way to load the data from the source table, using functions like ai.loading_column(). |
| parsing | Parsing configuration | ai.parsing_auto() | ✖ | Set the way to parse the data, using functions like ai.parsing_auto(). |
| chunking | Chunking configuration | ai.chunking_recursive_character_text_splitter() | ✖ | Set the way to split text data, using functions like ai.chunking_character_text_splitter(). |
| indexing | Indexing configuration | ai.indexing_default() | ✖ | Specify how to index the embeddings. For example, ai.indexing_diskann() or ai.indexing_hnsw(). |
| formatting | Formatting configuration | ai.formatting_python_template() | ✖ | Define the data format before embedding, using ai.formatting_python_template(). |
| scheduling | Scheduling configuration | ai.scheduling_default() | ✖ | Set how often to run the vectorizer. For example, ai.scheduling_timescaledb(). |
| processing | Processing configuration | ai.processing_default() | ✖ | Configure the way to process the embeddings. |
| queue_schema | name | - | ✖ | Specify the schema where the work queue table is created. |
| queue_table | name | - | ✖ | Specify the name of the work queue table. |
| grant_to | Grant To configuration | ai.grant_to_default() | ✖ | Specify which users should be able to use objects created by the vectorizer. |
| enqueue_existing | bool | true | ✖ | Set to true if existing rows should be immediately queued for embedding. |
| if_not_exists | bool | false | ✖ | Set to true to avoid an error if the vectorizer already exists. |
Returns
The int id of the vectorizer that you created. You can also reference the vectorizer by its name in management functions.
Destination configuration
You use the destination configuration functions to define how and where the embeddings will be stored. There are two options available:
- ai.destination_table: creates a separate table to store embeddings (default behavior).
- ai.destination_column: adds an embedding column directly to the source table.
ai.destination_table
You use ai.destination_table to store embeddings in a separate table. This is the default behavior, where:
- A new table is created to store the embeddings
- A view is created that joins the source table with the embeddings
- Multiple chunks can be created per row (using chunking)
Example usage
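A minimal sketch of passing ai.destination_table to ai.create_vectorizer; the source table, column, model, and base name are illustrative:

```sql
-- The view is created as blog_embeddings, the embedding table as blog_embeddings_store.
SELECT ai.create_vectorizer(
    'public.blog'::regclass,
    destination => ai.destination_table('blog_embeddings'),
    loading => ai.loading_column('contents'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768)
);
```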
Parameters
ai.destination_table takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| destination | name | - | ✖ | The base name for the view and table. The view is named <destination>, the embedding table is named <destination>_store. |
| target_schema | name | Source table schema | ✖ | The schema where the embeddings table will be created. |
| target_table | name | <source_table>_embedding_store or <destination>_store | ✖ | The name of the table where embeddings will be stored. |
| view_schema | name | Source table schema | ✖ | The schema where the view will be created. |
| view_name | name | <source_table>_embedding or <destination> | ✖ | The name of the view that joins source and embeddings tables. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.destination_column
You use ai.destination_column to store embeddings directly in the source table as a new column. This approach can only be used when the vectorizer does not perform chunking because it requires a one-to-one relationship between the source data and the embedding. This is useful in cases where you know the source text is short (as is common if the chunking has already been done upstream in your data pipeline).
This approach:
- Adds a vector column directly to the source table
- Does not create a separate view
- Requires chunking to be set to ai.chunking_none() (no chunking)
- Stores exactly one embedding per row
Example usage
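A minimal sketch; the source table, column names, and model are illustrative:

```sql
SELECT ai.create_vectorizer(
    'public.product_descriptions'::regclass,
    destination => ai.destination_column('description_embedding'),
    loading => ai.loading_column('description'),
    chunking => ai.chunking_none(),  -- column destination requires no chunking
    embedding => ai.embedding_openai('text-embedding-3-small', 768)
);
```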
Parameters
ai.destination_column takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| embedding_column | name | - | ✔ | The name of the column to be added to the source table for storing embeddings. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Loading configuration
You use the loading configuration functions in pgai to define the way data is loaded from the source table.
The loading functions are:
ai.loading_column
You use ai.loading_column to load the data to embed directly from a column in the source table.
Example usage
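A minimal sketch; the returned configuration object is passed as the loading argument of ai.create_vectorizer, and the column name is illustrative:

```sql
-- Load the text to embed from the contents column of the source table.
SELECT ai.loading_column('contents');
```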
Parameters
ai.loading_column takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| column_name | text | - | ✔ | The name of the column containing the data to load. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.loading_uri
You use ai.loading_uri to load the data to embed from a file that is referenced in a column of the source table.
This file path is internally passed to smart_open, so it supports any protocol that smart_open supports, including:
- Local files
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- HTTP/HTTPS
- SFTP
- and many more
Environment configuration
You just need to ensure the vectorizer worker has the correct credentials to access the file, such as in environment variables. The sketch below includes an example for AWS S3.
Example usage
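A minimal sketch, assuming a documents table with a uri column that holds S3 paths; the credential variable names shown are the standard AWS environment variables used by smart_open/boto3, and the model is illustrative:

```sql
-- The vectorizer worker needs credentials for the object store, for example:
--   AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_DEFAULT_REGION=...
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    loading => ai.loading_uri('uri'),   -- for example 's3://my-bucket/report.pdf'
    parsing => ai.parsing_auto(),
    embedding => ai.embedding_openai('text-embedding-3-small', 768)
);
```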
Parameters
ai.loading_uri takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| column_name | text | - | ✔ | The name of the column containing the file path. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Parsing configuration
You use the parsing configuration functions in pgai to define how data is parsed after document loading. This is useful for non-textual formats such as PDF documents.
The parsing functions are:
- ai.parsing_auto: Automatically selects the appropriate parser based on file type.
- ai.parsing_none: Skips the parsing step. Only appropriate for textual data.
- ai.parsing_docling: Converts various formats to Markdown. A more powerful alternative to PyMuPDF. See Docling for supported formats.
- ai.parsing_pymupdf: See PyMuPDF for supported formats.
ai.parsing_auto
You use ai.parsing_auto to automatically select an appropriate parser based on detected file types.
Documents with unrecognizable formats won’t be processed and will generate an error in the ai.vectorizer_errors table.
The parser selection works by examining file extensions and content types:
- For PDF files, images, Office documents (DOCX, XLSX, etc.): Uses docling
- For EPUB and MOBI (e-book formats): Uses pymupdf
- For text formats (TXT, MD, etc.): No parser is used (content is read directly)
Example usage
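A minimal sketch; the returned configuration object is passed as the parsing argument of ai.create_vectorizer:

```sql
-- Let the vectorizer pick a parser based on the detected file type.
SELECT ai.parsing_auto();
```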
Parameters
ai.parsing_auto takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| None | - | - | - | - |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.parsing_none
You use ai.parsing_none to skip the parsing step. Only appropriate for textual data.
Example usage, for textual data.
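A minimal sketch; the returned configuration object is passed as the parsing argument of ai.create_vectorizer:

```sql
-- Skip parsing because the loaded data is already plain text.
SELECT ai.parsing_none();
```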
Parameters
ai.parsing_none takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| None | - | - | - | - |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.parsing_docling
You use ai.parsing_docling to parse the data provided by the loader using docling.
Docling is a more robust and thorough document parsing library that:
- Uses OCR capabilities to extract text from images
- Can parse complex documents with tables and multi-column layouts
- Supports Office formats (DOCX, XLSX, etc.)
- Preserves document structure better than other parsers
- Converts documents to markdown format
Example usage
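A minimal sketch; the returned configuration object is passed as the parsing argument of ai.create_vectorizer:

```sql
-- Parse loaded documents (for example PDF or DOCX) with docling.
SELECT ai.parsing_docling();
```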
Parameters
ai.parsing_docling takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| None | - | - | - | - |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.parsing_pymupdf
You use ai.parsing_pymupdf to parse the data provided by the loader using pymupdf.
PyMuPDF is a faster, simpler document parser that:
- Processes PDF documents with basic structure preservation
- Supports e-book formats like EPUB and MOBI
- Is generally faster than docling for simpler documents
- Works well for documents with straightforward layouts
Example usage
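A minimal sketch; the returned configuration object is passed as the parsing argument of ai.create_vectorizer:

```sql
-- Parse loaded documents (for example PDF or EPUB) with pymupdf.
SELECT ai.parsing_pymupdf();
```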
Parameters
ai.parsing_pymupdf takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| None | - | - | - | - |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Chunking configuration
You use the chunking configuration functions in pgai to define the way text data is split into smaller,
manageable pieces before being processed for embeddings. This is crucial because many embedding models have input size
limitations, and chunking allows for processing of larger text documents while maintaining context.
By using chunking functions, you can fine-tune how your text data is
prepared for embedding, ensuring that the chunks are appropriately sized and
maintain necessary context for their specific use case. This is particularly
important for maintaining the quality and relevance of the generated embeddings,
especially when dealing with long-form content or documents with specific
structural elements.
The chunking functions are:
- ai.chunking_character_text_splitter
- ai.chunking_recursive_character_text_splitter
The key difference between these functions is that chunking_recursive_character_text_splitter
allows for a more sophisticated splitting strategy, potentially preserving more
semantic meaning in the chunks.
ai.chunking_character_text_splitter
You use ai.chunking_character_text_splitter to:
- Split text into chunks based on a specified separator.
- Control the chunk size and the amount of overlap between chunks.
Example usage
- Split the content into chunks of 128 characters, with a 10-character overlap, using ‘\n;’ as the separator, as in the sketch below:
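A minimal sketch of the call described above; the returned configuration object is passed as the chunking argument of ai.create_vectorizer:

```sql
SELECT ai.chunking_character_text_splitter(
    chunk_size => 128,
    chunk_overlap => 10,
    separator => E'\n;'
);
```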
Parameters
ai.chunking_character_text_splitter takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| chunk_size | int | 800 | ✖ | The maximum number of characters in a chunk |
| chunk_overlap | int | 400 | ✖ | The number of characters to overlap between chunks |
| separator | text | E'\n\n' | ✖ | The string or character used to split the text |
| is_separator_regex | bool | false | ✖ | Set to true if separator is a regular expression. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.chunking_recursive_character_text_splitter
ai.chunking_recursive_character_text_splitter provides more fine-grained control over the chunking process.
You use it to recursively split text into chunks using multiple separators.
Example usage
- Recursively split content into chunks of 256 characters, with a 20-character overlap, first trying to split on ‘\n;’, then on spaces, as in the sketch below:
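A minimal sketch of the call described above; the returned configuration object is passed as the chunking argument of ai.create_vectorizer:

```sql
SELECT ai.chunking_recursive_character_text_splitter(
    chunk_size => 256,
    chunk_overlap => 20,
    separators => array[E'\n;', ' ']
);
```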
Parameters
ai.chunking_recursive_character_text_splitter takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| chunk_size | int | 800 | ✖ | The maximum number of characters per chunk |
| chunk_overlap | int | 400 | ✖ | The number of characters to overlap between chunks |
| separators | text[] | array[E'\n\n', E'\n', '.', '?', '!', ' ', ''] | ✖ | The strings or characters used to split the text, tried in order |
| is_separator_regex | bool | false | ✖ | Set to true if separator is a regular expression. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Embedding configuration
You use the embedding configuration functions to specify how embeddings are generated for your data. The embedding functions are:
- ai.embedding_litellm
- ai.embedding_openai
- ai.embedding_ollama
- ai.embedding_voyageai
ai.embedding_litellm
You call the ai.embedding_litellm function to use LiteLLM to generate embeddings for models from multiple providers.
The purpose of ai.embedding_litellm is to:
- Define the embedding model to use.
- Specify the dimensionality of the embeddings.
- Configure optional, provider-specific parameters.
- Set the name of the environment variable that holds the value of your API key.
Example usage
Use ai.embedding_litellm to create an embedding configuration object that is passed as an argument to ai.create_vectorizer:
- Set the required API key for your provider. The API key should be set as an environment variable which is available to either the Vectorizer worker, or the Postgres process.
- Create a vectorizer using LiteLLM to access the ‘microsoft/codebert-base’ embedding model on huggingface, as in the sketch below:
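The original sample is not reproduced here. A minimal sketch; the source table and column are illustrative, and the huggingface/ model prefix, the 768 dimensions, and the HUGGINGFACE_API_KEY variable name are assumptions:

```sql
SELECT ai.create_vectorizer(
    'public.blog'::regclass,
    loading => ai.loading_column('contents'),
    embedding => ai.embedding_litellm(
        'huggingface/microsoft/codebert-base',
        768,
        api_key_name => 'HUGGINGFACE_API_KEY'
    )
);
```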
Parameters
The function takes several parameters to customize the LiteLLM embedding configuration:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| model | text | - | ✔ | Specify the name of the embedding model to use. Refer to the LiteLLM embedding documentation for an overview of the available providers and models. |
| dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| api_key_name | text | - | ✖ | Set the name of the environment variable that contains the API key. This allows for flexible API key management without hardcoding keys in the database. |
| extra_options | jsonb | - | ✖ | Set provider-specific configuration options. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Provider-specific configuration examples
The following subsections show how to configure the vectorizer for all supported providers.
Cohere
The input_type parameter is required. By default, LiteLLM sets this to search_document. The input type can be provided via extra_options, i.e. extra_options => '{"input_type": "search_document"}'::jsonb.
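A sketch of a Cohere configuration; the model name, dimensions, and API key variable name are assumptions, while input_type is passed through extra_options as described above:

```sql
SELECT ai.embedding_litellm(
    'cohere/embed-english-v3.0',
    1024,
    api_key_name => 'COHERE_API_KEY',
    extra_options => '{"input_type": "search_document"}'::jsonb
);
```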
Mistral
Azure OpenAI
To set up a vectorizer with Azure OpenAI you require these values from the Azure AI Foundry console:
- deployment name
- base URL
- version
- API key
For example, a full Azure OpenAI embeddings endpoint looks like: https://your-resource-name.openai.azure.com/openai/deployments/your-deployment-name/embeddings?api-version=2023-05-15.
In this example, the base URL is: https://your-resource-name.openai.azure.com and the version is 2023-05-15.
Configure the vectorizer; note that the base URL and version are configured through extra_options:
Huggingface inference models
You can use Huggingface inference to obtain vector embeddings. Note that Huggingface has two categories of inference: “serverless inference” and “inference endpoints”. Serverless inference is free, but is limited to models under 10GB in size, and the model may not be immediately available to serve requests. Inference endpoints are a paid service and provide always-on APIs for production use-cases.
Note: We recommend using the wait_for_model parameter when using vectorizer with serverless inference to force the call to block until the model has been loaded. If you do not use wait_for_model, it’s likely that vectorization will never succeed.
AWS Bedrock
To set up a vectorizer with AWS Bedrock, you must ensure that the vectorizer is authenticated to make API calls to the AWS Bedrock endpoint. The vectorizer worker uses boto3 under the hood, so there are multiple ways to achieve this. The simplest method is to provide the AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, and AWS_REGION_NAME environment variables to the
vectorizer worker. Consult the boto3 credentials documentation for more
options.
Alternatively, you can use the api_key_name parameter to prompt the vectorizer worker to load the API key
from the database. When you do this, you may need to pass aws_access_key_id
and aws_region_name through the extra_options parameter:
Vertex AI
To set up a vectorizer with Vertex AI, you must ensure that the vectorizer can make API calls to the Vertex AI endpoint. The vectorizer worker uses GCP’s authentication under the hood, so there are multiple ways to achieve this. The simplest method is to provide the VERTEX_PROJECT and
VERTEX_CREDENTIALS environment variables to the vectorizer worker. These
correspond to the project id, and the path to a file containing credentials for
a service account. Consult the Authentication methods at Google for more
options.
Alternatively, you can use the api_key_name parameter to prompt the vectorizer worker to load the API key
from the database. When you do this, you may need to pass vertex_project and
vertex_location through the extra_options parameter.
Note: VERTEX_CREDENTIALS should contain the path to a file containing the API key; the vectorizer worker requires access to this file in order to load the credentials.
ai.embedding_openai
You call the ai.embedding_openai function to use an OpenAI model to generate embeddings.
The purpose of ai.embedding_openai is to:
- Define which OpenAI embedding model to use.
- Specify the dimensionality of the embeddings.
- Configure optional parameters like the user identifier for API calls.
- Set the name of the environment variable that holds the value of your OpenAI API key.
Example usage
Use ai.embedding_openai to create an embedding configuration object that is passed as an argument to ai.create_vectorizer:
- Set the value of your OpenAI API key. For example, in an environment variable or in a Docker configuration.
- Create a vectorizer with OpenAI as the embedding provider, as in the sketch below:
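The original sample is not reproduced here. A minimal sketch; the source table, column, and requested dimensions are illustrative:

```sql
SELECT ai.create_vectorizer(
    'public.blog'::regclass,
    loading => ai.loading_column('contents'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768)
);
```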
Parameters
The function takes several parameters to customize the OpenAI embedding configuration:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| model | text | - | ✔ | Specify the name of the OpenAI embedding model to use. For example, text-embedding-3-small. |
| dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| chat_user | text | - | ✖ | The identifier for the user making the API call. This can be useful for tracking API usage or for OpenAI’s monitoring purposes. |
| api_key_name | text | OPENAI_API_KEY | ✖ | Set the name of the environment variable that contains the OpenAI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the OpenAI API key. |
| base_url | text | - | ✖ | Set the base_url of the OpenAI API. Note: no default configured here to allow configuration of the vectorizer worker through OPENAI_BASE_URL env var. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.embedding_ollama
You use the ai.embedding_ollama function to use an Ollama model to generate embeddings.
The purpose of ai.embedding_ollama is to:
- Define which Ollama model to use.
- Specify the dimensionality of the embeddings.
- Configure how the Ollama API is accessed.
- Configure the model’s truncation behaviour, and keep alive.
- Configure optional, model-specific parameters, like the temperature.
Example usage
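This function creates an embedding configuration object that is passed as the embedding argument to ai.create_vectorizer. The original sample is not reproduced here; a minimal sketch, with an illustrative base URL:

```sql
SELECT ai.embedding_ollama(
    'nomic-embed-text',
    768,
    base_url => 'http://localhost:11434'
);
```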
Parameters
The function takes several parameters to customize the Ollama embedding configuration:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| model | text | - | ✔ | Specify the name of the Ollama model to use. For example, nomic-embed-text. Note: the model must already be available (pulled) in your Ollama server. |
| dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| base_url | text | - | ✖ | Set the base_url of the Ollama API. Note: no default configured here to allow configuration of the vectorizer worker through OLLAMA_HOST env var. |
| options | jsonb | - | ✖ | Configures additional model parameters listed in the documentation for the Modelfile, such as temperature, or num_ctx. |
| keep_alive | text | - | ✖ | Controls how long the model will stay loaded in memory following the request. Note: no default configured here to allow configuration at Ollama-level. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
ai.embedding_voyageai
You use the ai.embedding_voyageai function to use a Voyage AI model to generate embeddings.
The purpose of ai.embedding_voyageai is to:
- Define which Voyage AI model to use.
- Specify the dimensionality of the embeddings.
- Configure the model’s truncation behaviour, and api key name.
- Configure the input type.
Example usage
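This function creates an embedding configuration object that is passed as the embedding argument to ai.create_vectorizer. The original sample is not reproduced here; a minimal sketch, with an illustrative model name and dimension count:

```sql
SELECT ai.embedding_voyageai(
    'voyage-3-lite',
    512,
    input_type => 'document'
);
```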
Parameters
The function takes several parameters to customize the Voyage AI embedding configuration:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| model | text | - | ✔ | Specify the name of the Voyage AI model to use. |
| dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| input_type | text | ’document’ | ✖ | Type of the input text, null, ‘query’, or ‘document’. |
| api_key_name | text | VOYAGE_API_KEY | ✖ | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the Voyage AI API key. |
Returns
A JSON configuration object that you can use in ai.create_vectorizer.
Formatting configuration
You use the ai.formatting_python_template function in pgai to
configure the way data from the source table is formatted before it is sent
for embedding.
ai.formatting_python_template provides a flexible way to structure the input
for embedding models. This enables you to incorporate relevant metadata and additional
text. This can significantly enhance the quality and usefulness of the generated
embeddings, especially in scenarios where context from multiple fields is
important for understanding or searching the content.
The purpose of ai.formatting_python_template is to:
- Define a template for formatting the data before embedding.
- Allow the combination of multiple fields from the source table.
- Add consistent context or structure to the text being embedded.
- Customize the input for the embedding model to improve relevance and searchability.
The $chunk variable contains the chunked text.
Example usage
The sketch after this list illustrates each case:
- Default formatting: the default formatter uses the $chunk template, outputting the chunk text as-is.
- Add context from other columns: add the title and publication date to each chunk, providing more context for the embedding.
- Combine multiple fields: prepend author and category information to each chunk.
- Add consistent structure: add start and end markers to each chunk, which could be useful for certain types of embeddings or retrieval tasks.
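The original samples are not reproduced here. A minimal sketch covering the cases above; the title, published, author, and category columns are illustrative, and each call returns a configuration object for the formatting argument of ai.create_vectorizer:

```sql
-- 1. Default formatting: the chunk text as-is.
SELECT ai.formatting_python_template('$chunk');
-- 2. Add context from other columns.
SELECT ai.formatting_python_template('title: $title published: $published $chunk');
-- 3. Combine multiple fields.
SELECT ai.formatting_python_template('author: $author category: $category $chunk');
-- 4. Add consistent structure.
SELECT ai.formatting_python_template('BEGIN CHUNK $chunk END CHUNK');
```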
Parameters
ai.formatting_python_template takes the following parameter:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| template | string | $chunk | ✔ | A string using Python template strings with $-prefixed variables that defines how the data should be formatted. |
- The $chunk placeholder is required and represents the text chunk that will be embedded.
- Other placeholders can be used to reference columns from the source table.
- The template allows for adding static text or structuring the input in a specific way.
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
Indexing configuration
You use indexing configuration functions in pgai to specify the way generated embeddings should be indexed for efficient similarity searches. These functions enable you to choose and configure the indexing method that best suits your needs in terms of performance, accuracy, and resource usage. By providing these indexing options, pgai allows you to optimize your embedding storage and retrieval based on your specific use case and performance requirements. This flexibility is crucial for scaling AI-powered search and analysis capabilities within a PostgreSQL database.
Key points about indexing:
- The choice of indexing method depends on your dataset size, query performance requirements, and available resources.
- ai.indexing_none is better suited for small datasets, or when you want to perform index creation manually.
- ai.indexing_diskann is generally recommended for larger datasets that require an index.
- The min_rows parameter enables you to delay index creation until you have enough data to justify the overhead.
- These indexing methods are designed for approximate nearest neighbor search, which trades a small amount of accuracy for significant speed improvements in similarity searches.
The indexing functions are:
- ai.indexing_default: use the platform-specific default indexing configuration.
- ai.indexing_none: when you do not want indexes created automatically.
- ai.indexing_diskann: configure indexing using the DiskANN algorithm.
- ai.indexing_hnsw: configure indexing using the Hierarchical Navigable Small World (HNSW) algorithm.
ai.indexing_default
You use ai.indexing_default to use the platform-specific default value for indexing.
On Timescale Cloud, the default is ai.indexing_diskann(). On self-hosted, the default is ai.indexing_none().
A timescaledb background job is used for automatic index creation. Since timescaledb may not be installed
in a self-hosted environment, we default to ai.indexing_none().
Example usage
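A minimal sketch; the returned configuration object is passed as the indexing argument of ai.create_vectorizer:

```sql
-- Use the platform-specific default indexing behaviour.
SELECT ai.indexing_default();
```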
Parameters
This function takes no parameters.
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
ai.indexing_none
You use ai.indexing_none to specify that no special indexing should be used for the embeddings.
This is useful when you don’t need fast similarity searches or when you’re dealing with a small amount of data.
Example usage
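A minimal sketch; the returned configuration object is passed as the indexing argument of ai.create_vectorizer:

```sql
-- Do not create a vector index automatically.
SELECT ai.indexing_none();
```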
Parameters
This function takes no parameters.
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
ai.indexing_diskann
You use ai.indexing_diskann to configure indexing using the DiskANN algorithm, which is designed for high-performance
approximate nearest neighbor search on large-scale datasets. This is suitable for very large datasets that need to be
stored on disk.
Example usage
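A minimal sketch with illustrative values; the returned configuration object is passed as the indexing argument of ai.create_vectorizer:

```sql
-- Create a DiskANN index once at least 500,000 rows have been embedded.
SELECT ai.indexing_diskann(
    min_rows => 500000,
    storage_layout => 'memory_optimized'
);
```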
Parameters
ai.indexing_diskann takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| min_rows | int | 100000 | ✖ | The minimum number of rows before creating the index |
| storage_layout | text | - | ✖ | Set to either memory_optimized or plain |
| num_neighbors | int | - | ✖ | Advanced DiskANN parameter. |
| search_list_size | int | - | ✖ | Advanced DiskANN parameter. |
| max_alpha | float8 | - | ✖ | Advanced DiskANN parameter. |
| num_dimensions | int | - | ✖ | Advanced DiskANN parameter. |
| num_bits_per_dimension | int | - | ✖ | Advanced DiskANN parameter. |
| create_when_queue_empty | boolean | true | ✖ | Create the index only after all of the embeddings have been generated. |
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
ai.indexing_hnsw
You use ai.indexing_hnsw to configure indexing using the Hierarchical Navigable Small World (HNSW) algorithm,
which is known for fast and accurate approximate nearest neighbor search.
HNSW is suitable for in-memory datasets and scenarios where query speed is crucial.
Example usage
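A minimal sketch with illustrative values; the returned configuration object is passed as the indexing argument of ai.create_vectorizer:

```sql
SELECT ai.indexing_hnsw(
    min_rows => 100000,
    opclass => 'vector_cosine_ops'
);
```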
Parameters
ai.indexing_hnsw takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| min_rows | int | 100000 | ✖ | The minimum number of rows before creating the index |
| opclass | text | vector_cosine_ops | ✖ | The operator class for the index. Possible values are:vector_cosine_ops, vector_l1_ops, or vector_ip_ops |
| m | int | - | ✖ | Advanced HNSW parameters |
| ef_construction | int | - | ✖ | Advanced HNSW parameters |
| create_when_queue_empty | boolean | true | ✖ | Create the index only after all of the embeddings have been generated. |
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
Scheduling configuration
You use scheduling functions in pgai to configure when and how often the vectorizer should run to process new or updated data. These functions allow you to set up automated, periodic execution of the embedding generation process. These are advanced options and most users should use the default. By providing these scheduling options, pgai enables you to automate the process of keeping your embeddings up-to-date with minimal manual intervention. This is crucial for maintaining the relevance and accuracy of AI-powered search and analysis capabilities, especially in systems where data is frequently updated or added. The flexibility in scheduling also allows users to balance the freshness of embeddings against system resource usage and other operational considerations.
The available functions are:
- ai.scheduling_default: uses the platform-specific default scheduling configuration. On Timescale Cloud this is equivalent to ai.scheduling_timescaledb(). On self-hosted deployments, this is equivalent to ai.scheduling_none().
- ai.scheduling_none: when you want manual control over when the vectorizer runs. Use this when you’re using an external scheduling system, as is the case with self-hosted deployments.
- ai.scheduling_timescaledb: leverages TimescaleDB’s robust job scheduling system, which is designed for reliability and scalability. Use this when you’re using Timescale Cloud.
ai.scheduling_default
You use ai.scheduling_default to use the platform-specific default scheduling configuration.
On Timescale Cloud, the default is ai.scheduling_timescaledb(). On self-hosted, the default is ai.scheduling_none().
A timescaledb background job is used to periodically trigger a cloud vectorizer on Timescale Cloud.
This is not available in a self-hosted environment.
Example usage
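A minimal sketch; the returned configuration object is passed as the scheduling argument of ai.create_vectorizer:

```sql
-- Use the platform-specific default scheduling behaviour.
SELECT ai.scheduling_default();
```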
Parameters
This function takes no parameters.
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
ai.scheduling_none
You use ai.scheduling_none to:
- Specify that no automatic scheduling should be set up for the vectorizer.
- Manually control when the vectorizer runs or when you’re using an external scheduling system.
Example usage
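A minimal sketch; the returned configuration object is passed as the scheduling argument of ai.create_vectorizer:

```sql
-- No automatic schedule; you control when the vectorizer runs.
SELECT ai.scheduling_none();
```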
Parameters
This function takes no parameters.
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
ai.scheduling_timescaledb
You use ai.scheduling_timescaledb to:
- Configure automated scheduling using TimescaleDB’s job scheduling system.
- Allow periodic execution of the vectorizer to process new or updated data.
- Provide fine-grained control over when and how often the vectorizer runs.
Example usage
The sketch after this list illustrates each case:
- Basic usage (run every 5 minutes). This is the default.
- Custom interval (run every hour).
- Specific start time and timezone.
- Fixed schedule.
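The original samples are not reproduced here. A minimal sketch covering the cases above; the timestamp and timezone values are illustrative, and each call returns a configuration object for the scheduling argument of ai.create_vectorizer:

```sql
-- Basic usage: run every 5 minutes.
SELECT ai.scheduling_timescaledb(interval '5 minutes');
-- Custom interval: run every hour.
SELECT ai.scheduling_timescaledb(interval '1 hour');
-- Specific start time and timezone.
SELECT ai.scheduling_timescaledb(
    interval '30 minutes',
    initial_start => '2025-01-06 00:00:00'::timestamptz,
    timezone => 'America/Chicago'
);
-- Fixed schedule, for example the same wall-clock time every day.
SELECT ai.scheduling_timescaledb(
    interval '1 day',
    fixed_schedule => true,
    timezone => 'America/Chicago'
);
```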
Parameters
ai.scheduling_timescaledb takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| schedule_interval | interval | '5m' | ✖ | Set how frequently the vectorizer checks for new or updated data to process. |
| initial_start | timestamptz | - | ✖ | Delay the start of scheduling. This is useful for coordinating with other system processes or maintenance windows. |
| fixed_schedule | bool | - | ✖ | Set to true to use a fixed schedule such as every day at midnight. Set to false for a sliding window such as every 24 hours from the last run |
| timezone | text | - | ✖ | Set the timezone this schedule operates in. This ensures that schedules are interpreted correctly, especially important for fixed schedules or when coordinating with business hours. |
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
Processing configuration
You use the processing configuration functions in pgai to specify the way the vectorizer should process data when generating embeddings, such as the batch size and concurrency. These are advanced options and most users should use the default.
ai.processing_default
You use ai.processing_default to specify the concurrency and batch size for the vectorizer.
Example usage
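A minimal sketch with illustrative values; the returned configuration object is passed as the processing argument of ai.create_vectorizer:

```sql
-- Process 100 rows per batch with up to 5 concurrent tasks.
SELECT ai.processing_default(
    batch_size => 100,
    concurrency => 5
);
```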
Parameters
ai.processing_default takes the following parameters:
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| batch_size | int | Determined by the vectorizer | ✖ | The number of items to process in each batch. The optimal batch size depends on your data and cloud function configuration, larger batch sizes can improve efficiency but may increase memory usage. The default is 1 for vectorizers that use document loading (ai.loading_uri) and 50 otherwise. |
| concurrency | int | Determined by the vectorizer | ✖ | The number of concurrent processing tasks to run. The optimal concurrency depends on your cloud infrastructure and rate limits, higher concurrency can speed up processing but may increase costs and resource usage. |
Returns
A JSON configuration object that you can use as an argument for ai.create_vectorizer.
Grant To configuration
You use the grant to configuration function in pgai to specify which users should be able to use objects created by the vectorizer.
ai.grant_to
Grant permissions to a comma-separated list of users. Includes the users specified in the ai.grant_to_default setting.
Example usage
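A minimal sketch; the role names are illustrative, and the result is passed as the grant_to argument of ai.create_vectorizer:

```sql
-- Grant access to bob and alice, in addition to the roles in ai.grant_to_default.
SELECT ai.grant_to('bob', 'alice');
```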
Parameters
This function takes a comma-separated list of usernames to grant permissions to.
Returns
An array of name values that you can use as an argument for ai.create_vectorizer.
Enable and disable vectorizer schedules
You use ai.enable_vectorizer_schedule and ai.disable_vectorizer_schedule to control
the execution of scheduled vectorizer jobs. These functions
provide a way to temporarily pause or resume the automatic processing of embeddings, without
having to delete or recreate the vectorizer configuration.
These functions provide an important layer of operational control for managing
pgai vectorizers in production environments. They allow database administrators
and application developers to balance the need for up-to-date embeddings with
other system priorities and constraints, enhancing the overall flexibility and
manageability of pgai.
Key points about schedule enable and disable:
- These functions provide fine-grained control over individual vectorizer schedules without affecting other vectorizers, or the overall system configuration.
- Disabling a schedule does not delete the vectorizer or its configuration; it simply stops scheduling future executions of the job.
- These functions are particularly useful in scenarios such as:
- System maintenance windows where you want to reduce database load.
- Temporarily pausing processing during data migrations or large bulk updates.
- Debugging or troubleshooting issues related to the vectorizer.
- Implementing manual control over when embeddings are updated.
- When a schedule is disabled, new or updated data is not automatically processed. However, the data is still queued, and will be processed when the schedule is re-enabled, or when the vectorizer is run manually.
- After re-enabling a schedule, for a vectorizer configured with ai.scheduling_timescaledb, the next run is based on the original scheduling configuration. For example, if the vectorizer was set to run every hour, it will run at the next hour mark after being enabled.
- You can reference vectorizers either by their ID or their name.
- ai.enable_vectorizer_schedule: activate, reactivate or resume a scheduled job.
- ai.disable_vectorizer_schedule: deactivate or temporarily stop a scheduled job.
ai.enable_vectorizer_schedule
You use ai.enable_vectorizer_schedule to:
- Activate or reactivate the scheduled job for a specific vectorizer.
- Allow the vectorizer to resume automatic processing of new or updated data.
Example usage
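To resume the automatic scheduling for a vectorizer (a minimal sketch; the vectorizer name and ID are illustrative):

```sql
-- By name (recommended):
SELECT ai.enable_vectorizer_schedule('website_blog_vectorizer');
-- By ID:
SELECT ai.enable_vectorizer_schedule(1);
```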
Parameters
ai.enable_vectorizer_schedule can be called in two ways:
- With a vectorizer name (recommended for better readability)
- With a vectorizer ID
ai.enable_vectorizer_schedule(name text):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the vectorizer whose schedule you want to enable. |
ai.enable_vectorizer_schedule(vectorizer_id int):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| vectorizer_id | int | - | ✔ | The identifier of the vectorizer whose schedule you want to enable. |
Returns
ai.enable_vectorizer_schedule does not return a value.
ai.disable_vectorizer_schedule
You use ai.disable_vectorizer_schedule to:
- Deactivate the scheduled job for a specific vectorizer.
- Temporarily stop the automatic processing of new or updated data.
Example usage
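To stop the automatic scheduling for a vectorizer (a minimal sketch; the vectorizer name and ID are illustrative):

```sql
-- By name (recommended):
SELECT ai.disable_vectorizer_schedule('website_blog_vectorizer');
-- By ID:
SELECT ai.disable_vectorizer_schedule(1);
```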
Parameters
ai.disable_vectorizer_schedule can be called in two ways:
- With a vectorizer name (recommended for better readability)
- With a vectorizer ID
ai.disable_vectorizer_schedule(name text):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the vectorizer whose schedule you want to disable. |
ai.disable_vectorizer_schedule(vectorizer_id int):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| vectorizer_id | int | - | ✔ | The identifier of the vectorizer whose schedule you want to disable. |
Returns
ai.disable_vectorizer_schedule does not return a value.
Drop a vectorizer
ai.drop_vectorizer is a management tool that you use to remove a vectorizer that you
created previously, and clean up the associated
resources. Its primary purpose is to provide a controlled way to delete a
vectorizer when it’s no longer needed, or when you want to reconfigure it from
scratch.
You use ai.drop_vectorizer to:
- Remove a specific vectorizer configuration from the system.
- Clean up associated database objects and scheduled jobs.
- Safely undo the creation of a vectorizer.
ai.drop_vectorizer performs the following on the vectorizer to drop:
- Deletes the scheduled job associated with the vectorizer if one exists.
- Drops the trigger from the source table used to queue changes.
- Drops the trigger function that backed the source table trigger.
- Drops the queue table used to manage the updates to be processed.
- Deletes the vectorizer row from the ai.vectorizer table.
ai.drop_vectorizer does not:
- Drop the target table containing the embeddings.
- Drop the view joining the target and source tables.
These objects are preserved unless you pass the optional drop_all parameter, which is false by default. If you explicitly pass true, the function WILL drop the target table and view.
This design allows you to keep the generated embeddings and the convenient view
even after dropping the vectorizer. This is useful if you want to stop
automatic updates but still use the existing embeddings.
Example usage
Best practices are:
- Before dropping a vectorizer, ensure that you will not need the automatic embedding updates it provides.
- After dropping a vectorizer, you may want to manually clean up the target table and view if they’re no longer needed.
- You can reference vectorizers either by their ID or their name (recommended).
The sketch below shows how to:
- Remove a vectorizer by name (recommended).
- Remove a vectorizer by ID.
- Remove a vectorizer and drop the target table and view as well.
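The original samples are not reproduced here. A minimal sketch covering the cases above; the vectorizer name and ID are illustrative:

```sql
-- Remove a vectorizer by name (recommended):
SELECT ai.drop_vectorizer('website_blog_vectorizer');
-- Remove a vectorizer by ID:
SELECT ai.drop_vectorizer(1);
-- Remove a vectorizer and drop the target table and view as well:
SELECT ai.drop_vectorizer('website_blog_vectorizer', drop_all => true);
```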
Parameters
ai.drop_vectorizer can be called in two ways:
- With a vectorizer name (recommended for better readability)
- With a vectorizer ID
ai.drop_vectorizer(name text, drop_all bool):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the vectorizer you want to drop |
| drop_all | bool | false | ✖ | true to drop the target table and view as well |
ai.drop_vectorizer(vectorizer_id int, drop_all bool):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| vectorizer_id | int | - | ✔ | The identifier of the vectorizer you want to drop |
| drop_all | bool | false | ✖ | true to drop the target table and view as well |
Returns
ai.drop_vectorizer does not return a value, but it performs several cleanup operations.
View vectorizer status
The ai.vectorizer_status view and the ai.vectorizer_queue_pending function are monitoring tools in pgai that provide insights into the state and performance of vectorizers. These monitoring tools are crucial for maintaining the health and performance of your pgai-enhanced database. They allow you to proactively manage your vectorizers, ensure timely processing of embeddings, and quickly identify and address any issues that may arise in your AI-powered data pipelines. For effective monitoring, you use ai.vectorizer_status.
For example:
| id | source_table | target_table | view | pending_items |
|---|---|---|---|---|
| 1 | public.blog | public.blog_contents_embedding_store | public.blog_contents_embeddings | 1 |
The pending_items column indicates the number of items still awaiting embedding creation. The pending items count helps you to:
- Identify bottlenecks in processing.
- Determine if you need to adjust scheduling or processing configurations.
- Monitor the impact of large data imports or updates on your vectorizers.
- ai.vectorizer_status: view, monitor and display information about a vectorizer.
- ai.vectorizer_queue_pending: retrieve just the queue count for a vectorizer.
ai.vectorizer_status view
You use ai.vectorizer_status to:
- Get a high-level overview of all vectorizers in the system.
- Regularly monitor and check the health of the entire system.
- Display key information about each vectorizer’s configuration and current state.
- Use the pending_items column to get a quick indication of processing backlogs.
Example usage
- Retrieve all vectorizers that have items waiting to be processed.
- Monitor overall system health.
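A minimal sketch of both queries:

```sql
-- Vectorizers with items waiting to be processed:
SELECT * FROM ai.vectorizer_status WHERE pending_items > 0;
-- System health monitoring: an overview of every vectorizer:
SELECT * FROM ai.vectorizer_status;
```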
Returns
ai.vectorizer_status returns the following:
| Column name | Description |
|---|---|
| id | The unique identifier of this vectorizer |
| source_table | The fully qualified name of the source table |
| target_table | The fully qualified name of the table storing the embeddings |
| view | The fully qualified name of the view joining source and target tables |
| pending_items | The number of items waiting to be processed by the vectorizer |
ai.vectorizer_queue_pending function
ai.vectorizer_queue_pending enables you to retrieve the number of items in a vectorizer queue
when you need to focus on a particular vectorizer or troubleshoot issues.
You use vectorizer_queue_pending to:
- Retrieve the number of pending items for a specific vectorizer.
- Allow for more granular monitoring of individual vectorizer queues.
Example usage
Return the number of pending items for a vectorizer, as in the sketch below. The exact_count parameter defaults to false. When false, the count is limited: an exact count is returned if the queue has 10,000 or fewer items, and 9223372036854775807 (the max bigint value) is returned if there are more than 10,000 items. To get an exact count, regardless of queue size, set the optional parameter to true.
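A minimal sketch; the vectorizer name is illustrative:

```sql
-- Capped count (fast):
SELECT ai.vectorizer_queue_pending('website_blog_vectorizer');
-- Exact count, regardless of queue size:
SELECT ai.vectorizer_queue_pending('website_blog_vectorizer', exact_count => true);
```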
Parameters
ai.vectorizer_queue_pending can be called in two ways:
- With a vectorizer name (recommended for better readability)
- With a vectorizer ID
ai.vectorizer_queue_pending(name text, exact_count bool):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the vectorizer you want to check |
| exact_count | bool | false | ✖ | If true, return exact count. If false, capped at 10,000 |
ai.vectorizer_queue_pending(vectorizer_id int, exact_count bool):
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| vectorizer_id | int | - | ✔ | The identifier of the vectorizer you want to check |
| exact_count | bool | false | ✖ | If true, return exact count. If false, capped at 10,000 |