Build a Knowledge Base
Sema4.ai Knowledge Bases let you create a semantic layer over your enterprise data, enabling AI agents to access and reason over unstructured content such as documents, emails, and chat history. This guide walks you through building a Knowledge Base (KB) with the Sema4.ai SDK: you'll define the KB schema, connect it to storage, insert your data, and verify that it's ready for use with agents.
Prerequisite: Make sure you have the Sema4.ai SDK and Sema4.ai Data Server extensions installed in your VS Code or Cursor environment.
Start from the Knowledge Base template
You can scaffold a working KB setup with the "Create Action Package" command in VS Code or Cursor.
- Create a folder for your KB project and open it in your editor.
- Open the command palette (Cmd or Ctrl + Shift + P) and select Sema4.ai: Create Action Package.
- Select the Use workspace folder option to create the package in the current folder.
- Choose the Data Access/Knowledge Base - Give agents access to knowledge base template when prompted.
This will generate a project with:
- data_sources.py — define Postgres and pgvector connections
- models.py — request/response models for KB interaction
- data_actions.py — query and insert handlers with @query
- package.yaml — configuration of your action package
- scratchpad.sql — useful for creating and testing your KB
Create a vector storage target (e.g., pgvector)
Before you can define a Knowledge Base, you need a place to store its embedded content.
When you use the Knowledge Base template, your data_sources.py
already includes placeholders for defining your storage connection:
DataSourceSpec(
    name="my_kb_storage",  # Change this to your PGVector Database name
    engine="pgvector",
    description="Data source for storing knowledge base content embeddings",
)
You can configure this visually using the Sema4.ai SDK extension.

- In your action package view, locate the my_kb_storage entry under Data Sources.
- Click the ➕ icon next to it to open the “New Data Source” dialog.
- You will see PGVector preselected as the engine.
- Fill in your PostgreSQL connection info:
  - Host, Port
  - Database, User, Password
  - Optional: Schema and SSL Mode

This is where your embedded knowledge will be stored after ingestion. You can reuse the same pgvector store across KBs or create a dedicated one per use case.
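If you prefer working in scratchpad.sql, you can also create the pgvector data source with a SQL statement. The sketch below is only illustrative: the connection values are placeholders, and the exact statement and parameter names may vary with your Data Server version.

CREATE DATABASE my_kb_storage
WITH ENGINE = 'pgvector',
PARAMETERS = {
    "host": "your-postgres-host",
    "port": 5432,
    "database": "your_database",
    "user": "your_user",
    "password": "your_password"
};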
Define your Knowledge Base
Similar to how we configured the storage target, we now define the Knowledge Base itself. You can do this via the Sema4 SDK extension or directly in SQL.
- Locate your Knowledge Base entry under Data Sources.
- Click the ➕ icon next to it to open the “New Knowledge Base” dialog.
- Select your embedding model (currently OpenAI and Azure OpenAI are supported). This model converts text into vector embeddings; for example, openai/text-embedding-3-large is a good choice for general-purpose embeddings.
- Optionally, add a reranking model. This is useful for improving search results by reordering them based on relevance.
- Choose the storage target you created earlier.
- Map the following columns:
  - content_column — the main text content to be embedded
  - metadata_columns — additional fields to store with each record (e.g., title, topic)
  - id_column — a unique identifier for each record (e.g., doc_id)
- Click Add to finalize the Knowledge Base definition.
If you prefer SQL, you can define the Knowledge Base directly in scratchpad.sql:
CREATE KNOWLEDGE_BASE customer_support_kb
USING
    embedding_model = {
        provider: "openai",
        model_name: "text-embedding-3-large"
    },
    storage = my_pgvector.kb_embeddings,
    metadata_columns = ['title', 'topic'],
    content_columns = ['content'],
    id_column = 'doc_id';
Insert data into the KB
Once your KB is defined, insert content into it. This triggers automatic semantic indexing (embedding).
You can insert the data using SQL Insert statements in scratchpad.sql
or use the insert_records
action in data_actions.py
that is part of the template.
Ensure that your data matches the schema defined in the Knowledge Base, including the content and metadata columns.
# Build the INSERT statement from the KB's id, content, and metadata columns.
# `available_columns`, `data_source`, and `req` are provided by the surrounding
# insert_records action in the template.
columns = ["id", "content"] + available_columns
placeholders = ["$id", "$content"] + [f"${col}" for col in available_columns]
sql = f"""
    INSERT INTO sema4ai.{data_source.datasource_name} ({', '.join(columns)})
    VALUES ({', '.join(placeholders)})
"""

# Execute each record as a separate statement with parameters
for record in req.records:
    params = {
        "id": record.id,
        "content": record.content,
    }
    # Add metadata values to their respective columns
    if record.metadata:
        for column in available_columns:
            params[column] = record.metadata.get(column)
    get_connection().execute_sql(sql, params=params)
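If you prefer the scratchpad route, a hand-written insert in scratchpad.sql can follow the same pattern as the action above. The sketch below assumes the example customer_support_kb defined earlier; the row values are purely illustrative, and the exact table name you insert into may differ in your setup.

INSERT INTO sema4ai.customer_support_kb (id, content, title, topic)
VALUES (
    'doc-001',
    'To reset your password, open Settings and choose Reset password.',
    'Password reset',
    'account'
);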
Verify the KB and inspect data
After insertion:
- Use a test query to confirm embedding and retrieval
- Preview metadata and top chunks
Again, you can use the scratchpad.sql
file to run a test query or use the list_records
action in data_actions.py
to fetch and inspect the data.
SELECT id, relevance
FROM customer_support_kb
WHERE content = 'How do I reset my password?'
LIMIT 5;
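To preview metadata alongside the top chunks, you can broaden the select list and, if your Data Server supports filtering on metadata columns, narrow the search with them. The following is a sketch under those assumptions, reusing the title and topic columns from the example KB (the topic value is illustrative):

SELECT id, title, topic, content, relevance
FROM customer_support_kb
WHERE content = 'How do I reset my password?'
  AND topic = 'account'
LIMIT 5;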
Once data is inserted, it’s ready to be queried by your agent; no separate embedding step is required.