
What is Vector Search? A Complete Guide to Databricks Vector Search

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Databricks Vector Search is a managed vector database integrated with Unity Catalog. It converts unstructured data such as text and images into vectors (embeddings) and provides search based on semantic similarity. It plays a central role as the retrieval component of RAG (Retrieval-Augmented Generation) pipelines, and the GenAI Engineer exam frequently tests index type selection and query design.

Vector Search Overview

Traditional keyword search (BM25, etc.) depends on lexical matching between query and document, so it misses variations in wording such as "machine learning" vs "ML". Vector Search converts text into high-dimensional vectors and measures semantic closeness with cosine similarity or dot product. This lets it match synonyms and paraphrases, and multilingual content when the embedding model supports it.
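
To make the similarity computation concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embedding models emit hundreds or thousands of dimensions), and the helper names are hypothetical. A managed index uses an approximate nearest-neighbor structure rather than this brute-force scan.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

def top_k(query, corpus, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up toy "embeddings": the two ML phrases point in a similar direction.
corpus = {
    "machine learning": [0.9, 0.1, 0.0],
    "ML": [0.8, 0.2, 0.1],
    "cooking recipes": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
ranked = top_k(query, corpus, k=2)
print(ranked)
```

Both ML phrases rank above the unrelated document even though they share no keywords, which is the behavior lexical matching cannot provide.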

Key Features of Databricks Vector Search

  • Fully managed: Databricks automatically handles index build, scaling, and availability
  • Unity Catalog integration: Table-level access control and lineage tracking apply to Vector Search as well
  • Delta Lake integration: Provides an auto-syncing index (Delta Sync Index) sourced from a Delta Table
  • Foundation Model API integration: Embedding computation can stay entirely inside Databricks
  • Metadata filtering: Filtered search that combines vector similarity with metadata predicates

Index Type Comparison

Vector Search offers two index types that differ in how data is managed and which use cases they fit. The GenAI Engineer exam tests selecting the right index type for a given scenario.

Comparison | Delta Sync Index | Direct Vector Access Index
Data source | Delta Table (governed by Unity Catalog) | Inserted directly via REST API
Sync model | Detects and syncs source table changes automatically | Call upsert/delete APIs manually
Embedding computation | Auto-compute mode (specify a model) or precomputed column | Accepts precomputed vectors only
Update frequency | Continuous or Triggered | Only on API calls
Operational overhead | Low (auto-sync) | High (you must implement sync logic)
Best for | Internal-document RAG, knowledge base search | External system integration, real-time vector ingestion

In Delta Sync Index's auto-compute mode, you just point at a text column and Databricks computes the embeddings and stores them in the index. In precomputed-column mode, you compute the embeddings yourself, store them in a Delta Table column, and point the index at that column.
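
The two modes are mutually exclusive, which is easy to get wrong when assembling arguments by hand. The sketch below is a hypothetical helper (not part of any Databricks library) that returns only the embedding-related keyword arguments for one mode. The key names are the `create_delta_sync_index` keyword arguments of the `databricks-vectorsearch` Python client. The validation rules are assumptions made for illustration.

```python
def delta_sync_embedding_kwargs(*, source_column=None, model_endpoint=None,
                                vector_column=None, dimension=None):
    """Return the embedding keyword arguments for exactly one mode."""
    auto = source_column is not None or model_endpoint is not None
    precomputed = vector_column is not None or dimension is not None
    if auto == precomputed:
        # Either nothing was given, or arguments from both modes were mixed.
        raise ValueError("choose exactly one mode: auto-compute or precomputed")
    if auto:
        if source_column is None or model_endpoint is None:
            raise ValueError("auto-compute needs a text column and a model endpoint")
        return {
            "embedding_source_column": source_column,
            "embedding_model_endpoint_name": model_endpoint,
        }
    if vector_column is None or dimension is None:
        raise ValueError("precomputed mode needs a vector column and its dimension")
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return {
        "embedding_vector_column": vector_column,
        "embedding_dimension": dimension,
    }

auto_kwargs = delta_sync_embedding_kwargs(
    source_column="content", model_endpoint="databricks-gte-large-en"
)
precomputed_kwargs = delta_sync_embedding_kwargs(
    vector_column="embedding_vector", dimension=1024
)
print(auto_kwargs)
print(precomputed_kwargs)
```

The returned dictionary can be unpacked into the index-creation call with `**`, alongside the endpoint, index name, source table, and primary key.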

Creating an Endpoint

A Vector Search Endpoint is the compute resource that hosts indexes. You create the endpoint first, then attach indexes to it.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Create a Vector Search endpoint
vsc.create_endpoint(
    name="my_vs_endpoint",
    endpoint_type="STANDARD"
)

# Check endpoint status
endpoint = vsc.get_endpoint("my_vs_endpoint")
print(endpoint["endpoint_status"]["state"])  # "ONLINE" once provisioning has finished
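
A newly created endpoint is not usable immediately, so scripts usually wait for the state to change before creating an index. The following is a minimal polling sketch. `wait_until` is a hypothetical helper, the timeout and interval defaults are arbitrary, and the `"PROVISIONING"` values are illustrative placeholders fed by a stand-in function so the example runs without a workspace.

```python
import time

def wait_until(get_state, target="ONLINE", timeout_s=1200.0, poll_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns target; raise TimeoutError otherwise."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_s} seconds")
        sleep(poll_s)

# Stand-in for the workspace call; the intermediate state name is illustrative.
fake_states = iter(["PROVISIONING", "PROVISIONING", "ONLINE"])
final_state = wait_until(lambda: next(fake_states), sleep=lambda _: None)
print(final_state)
```

In a workspace, `get_state` would be something like `lambda: vsc.get_endpoint("my_vs_endpoint")["endpoint_status"]["state"]`, reusing the lookup shown above.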

Creating an Index

Delta Sync Index (auto embedding computation)

# Create a Delta Sync Index (auto-compute embeddings)
index = vsc.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",
    index_name="catalog.schema.doc_index",
    source_table_name="catalog.schema.documents",
    primary_key="doc_id",
    pipeline_type="TRIGGERED",          # "TRIGGERED" or "CONTINUOUS"
    embedding_source_column="content",  # Specify the text column
    embedding_model_endpoint_name="databricks-gte-large-en"
)

# Check index status
print(index.describe()["status"]["ready"])  # True once the initial sync has completed

Delta Sync Index (precomputed embeddings)

# Create an index from a Delta Table that already has a precomputed embedding column
index = vsc.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",
    index_name="catalog.schema.doc_index_precomputed",
    source_table_name="catalog.schema.documents_with_embeddings",
    primary_key="doc_id",
    pipeline_type="TRIGGERED",
    embedding_dimension=1024,
    embedding_vector_column="embedding_vector"
)

Direct Vector Access Index

# Create a Direct Vector Access Index
index = vsc.create_direct_access_index(
    endpoint_name="my_vs_endpoint",
    index_name="catalog.schema.realtime_index",
    primary_key="item_id",
    embedding_dimension=1024,
    embedding_vector_column="embedding",
    schema={
        "item_id": "string",
        "embedding": "array<float>",
        "category": "string",
        "title": "string"
    }
)

# Insert vectors directly
index.upsert([
    {"item_id": "001", "embedding": [0.1, 0.2, ...], "category": "tech", "title": "..."},
    {"item_id": "002", "embedding": [0.3, 0.1, ...], "category": "science", "title": "..."}
])
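
Because a Direct Vector Access Index accepts whatever you send it, it can help to validate rows locally before calling `upsert`. The sketch below is one possible pre-flight check, not something the client provides: `validate_rows` and `batches` are hypothetical helpers, the rules (no unknown columns, a primary key present, a finite vector of the declared dimension) are assumptions, and the example uses a toy dimension of 3 instead of 1024.

```python
import math

def validate_rows(rows, *, primary_key, vector_column, dimension, schema):
    """Raise ValueError on the first row that does not fit the declared schema."""
    for i, row in enumerate(rows):
        unknown = set(row) - set(schema)
        if unknown:
            raise ValueError(f"row {i}: unknown columns {sorted(unknown)}")
        if row.get(primary_key) is None:
            raise ValueError(f"row {i}: missing primary key {primary_key!r}")
        vec = row.get(vector_column)
        if not isinstance(vec, (list, tuple)) or len(vec) != dimension:
            raise ValueError(f"row {i}: expected a vector of length {dimension}")
        if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vec):
            raise ValueError(f"row {i}: vector contains a non-finite or non-numeric value")
    return len(rows)

def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

schema = {"item_id": "string", "embedding": "array<float>",
          "category": "string", "title": "string"}
rows = [
    {"item_id": "001", "embedding": [0.1, 0.2, 0.3], "category": "tech", "title": "a"},
    {"item_id": "002", "embedding": [0.3, 0.1, 0.5], "category": "science", "title": "b"},
    {"item_id": "003", "embedding": [0.0, 0.9, 0.4], "category": "tech", "title": "c"},
]
checked = validate_rows(rows, primary_key="item_id", vector_column="embedding",
                        dimension=3, schema=schema)
chunks = list(batches(rows, size=2))
print(checked, [len(chunk) for chunk in chunks])
```

Each chunk can then be passed to `index.upsert(chunk)`. The batch size is a tuning choice. This sketch does not assume any particular service limit.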

Similarity Search Queries

Run a query against the index to retrieve the top-K most similar documents. You can use a text query (auto-embedded) or a vector query (precomputed).

# Text-based similarity search (Delta Sync Index + auto embeddings)
results = index.similarity_search(
    query_text="How to manage Databricks clusters",
    columns=["doc_id", "content", "source_url"],
    num_results=5
)

# Process the results
for doc in results["result"]["data_array"]:
    print(f"Score: {doc[-1]:.4f} | ID: {doc[0]} | Content: {doc[1][:100]}...")

# Vector-based similarity search (Direct Vector Access Index)
results = index.similarity_search(
    query_vector=[0.1, 0.2, 0.3, ...],  # precomputed vector
    columns=["item_id", "title", "category"],
    num_results=10
)
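
Result rows come back as positional lists, which makes downstream code such as `doc[1]` fragile. A small adapter can pair each row with the requested column names. This sketch assumes, as the loop above does, that the similarity score is the last element of every row. `rows_to_dicts` is a hypothetical helper and the response below is fabricated sample data with the same shape.

```python
def rows_to_dicts(response, columns, score_key="score"):
    """Pair each positional result row with column names; score is assumed last."""
    names = list(columns) + [score_key]
    records = []
    for row in response.get("result", {}).get("data_array") or []:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} values per row, got {len(row)}")
        records.append(dict(zip(names, row)))
    return records

# Fabricated response for illustration only.
fake_response = {
    "result": {
        "data_array": [
            ["d1", "Cluster policies restrict ...", "https://example.com/a", 0.83],
            ["d2", "Autoscaling adjusts ...", "https://example.com/b", 0.79],
        ]
    }
}
records = rows_to_dicts(fake_response, ["doc_id", "content", "source_url"])
for record in records:
    print(f"{record['score']:.4f} {record['doc_id']} {record['source_url']}")
```

Named fields keep a RAG prompt builder stable when the `columns` list is later reordered or extended.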

Metadata Filtering

Adding metadata predicates to a vector similarity search lets you restrict results to a specific category or time range. In RAG pipelines, this is used for permission-based document filtering and category narrowing.

# Search with metadata filtering
results = index.similarity_search(
    query_text="about security policy",
    columns=["doc_id", "content", "department", "updated_at"],
    num_results=5,
    filters={"department": "engineering", "updated_at >=": "2026-01-01"}
)

# Multi-value filtering: a list value matches rows whose column equals any listed value
results = index.similarity_search(
    query_text="data pipeline design",
    columns=["doc_id", "content", "category"],
    num_results=5,
    filters={"category": ["architecture", "data-engineering"]}
)
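
When filters encode access rules, it is worth unit-testing them without a live index. The sketch below is a local emulator of the dictionary filter form, written under stated assumptions: a bare column name means equality, a list value means "any of these", and a key may carry a trailing `<`, `<=`, `>`, `>=`, or `NOT` operator. `matches` is a hypothetical helper. It approximates the service's behavior for testing purposes and is not its implementation.

```python
import operator

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(row, filters):
    """Return True if the row satisfies every predicate in the filter dict."""
    for key, expected in filters.items():
        column, _, op = key.partition(" ")
        op = op.strip()
        value = row.get(column)
        if op == "":
            # Bare column: equality, or membership when a list is given.
            allowed = expected if isinstance(expected, (list, tuple)) else [expected]
            ok = value in allowed
        elif op == "NOT":
            ok = value != expected
        elif op in _OPS:
            ok = value is not None and _OPS[op](value, expected)
        else:
            raise ValueError(f"unsupported operator {op!r} in {key!r}")
        if not ok:
            return False
    return True

rows = [
    {"doc_id": "a", "department": "engineering", "updated_at": "2026-02-10"},
    {"doc_id": "b", "department": "engineering", "updated_at": "2025-11-30"},
    {"doc_id": "c", "department": "sales", "updated_at": "2026-03-01"},
]
filters = {"department": "engineering", "updated_at >=": "2026-01-01"}
kept = [row["doc_id"] for row in rows if matches(row, filters)]
print(kept)
```

The date comparison works here only because ISO 8601 date strings sort lexicographically. A typed date column may behave differently in the real index.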

Embedding Model Comparison

Model | Provider | Dimensions | Foundation Model API support | Notes
GTE-large-en | Databricks | 1024 | Supported | Selectable for Delta Sync Index auto-compute mode
BGE-large-en | BAAI (open source) | 1024 | Available via Model Serving | Self-hostable and cost efficient
text-embedding-3-large | OpenAI | 3072 | Available via External Model | High accuracy, variable dimensions
Cohere embed-v3 | Cohere | 1024 | Available via External Model | Multilingual, optimized for search/classification

When using Delta Sync Index's auto embedding mode, specify a Foundation Model API model (such as GTE) via embedding_model_endpoint_name. To use an external model, create an endpoint with the External Model feature, precompute embeddings, store them in a Delta Table, and choose the precomputed-column mode.

GenAI Engineer Exam Focus Areas

  • Index type selection: Choosing between Delta Sync and Direct Vector Access based on the scenario
  • Embedding modes: When to use auto-compute mode vs precomputed-column mode
  • pipeline_type: Difference between CONTINUOUS and TRIGGERED (real-time vs cost)
  • Metadata filtering: Implementation patterns for security and category narrowing
  • Unity Catalog integration: Applying ACLs to Vector Search indexes

Sample Question

Vector Search / GenAI Engineer

Question 1

A team is building a RAG chatbot over internal documents already stored in a Delta Table. The documents are updated daily by an ETL pipeline, and the team wants embedding computation to stay entirely inside the Databricks platform. Which Vector Search configuration is most appropriate?

  A. Create a Direct Vector Access Index and upsert vectors via REST API at the end of the ETL pipeline
  B. Create a Delta Sync Index in auto embedding mode and set pipeline_type to TRIGGERED
  C. Create a Delta Sync Index in precomputed-column mode and call an external embedding API manually
  D. Create a Direct Vector Access Index, compute embeddings with the Foundation Model API, and batch-insert them

Correct answer: B

Because the source is a Delta Table updated daily, a Delta Sync Index is the right fit. Since the team wants embedding computation to stay inside Databricks, the auto embedding mode (specifying the text column via embedding_source_column and a Foundation Model API model via embedding_model_endpoint_name) is best. For daily updates, TRIGGERED is sufficient. Direct Vector Access Index (A, D) cannot auto-sync with a Delta Table and requires manual management. The precomputed-column mode (C) adds the overhead of managing an external API, which does not match the requirements.
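
The reasoning in this explanation can be written down as a small decision function. This is a study aid under simplifying assumptions, not an official selection rule: the `Requirements` fields and the `recommend` helper are hypothetical, and real scenarios may involve factors (cost limits, latency targets) that three booleans cannot capture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirements:
    source_is_delta_table: bool       # data already lives in a Delta Table
    wants_managed_embeddings: bool    # embeddings should be computed by Databricks
    needs_near_real_time_sync: bool   # index must reflect changes almost immediately

def recommend(req):
    """Map scenario requirements to an index type, embedding mode, and pipeline type."""
    if not req.source_is_delta_table:
        # No Delta source to sync from: vectors are pushed through the API.
        return {
            "index_type": "Direct Vector Access Index",
            "embedding_mode": "precomputed vectors",
            "pipeline_type": None,
        }
    return {
        "index_type": "Delta Sync Index",
        "embedding_mode": "auto-compute" if req.wants_managed_embeddings else "precomputed column",
        "pipeline_type": "CONTINUOUS" if req.needs_near_real_time_sync else "TRIGGERED",
    }

# The scenario from the sample question: Delta source, daily ETL, in-platform embeddings.
answer = recommend(Requirements(
    source_is_delta_table=True,
    wants_managed_embeddings=True,
    needs_near_real_time_sync=False,
))
print(answer)
```

Changing one flag at a time and re-running is a quick way to rehearse how each requirement shifts the answer.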

Frequently Asked Questions

When should I use Delta Sync Index vs Direct Vector Access Index?

Delta Sync Index automatically syncs with its source Delta Table, making it ideal for internal document search and RAG pipelines where data is updated on a regular cadence. Direct Vector Access Index inserts and updates vectors directly via REST API, so choose it when you need real-time vector ingestion from external systems or when you want to use embeddings already computed outside of Databricks.

What is the relationship between a Vector Search Endpoint and an Index?

A Vector Search Endpoint is a compute resource that hosts one or more Vector Search Indexes. Attaching multiple indexes to a single endpoint lets you share compute efficiently. Endpoints are created at the workspace level, while indexes are created against tables governed by Unity Catalog. Endpoint scaling is managed automatically by Databricks.

How is Vector Search priced?

Vector Search runs on serverless compute and is billed in DBUs (Databricks Units) based on endpoint uptime. Endpoints auto-scale based on index size and query volume, so cost scales with usage. Note that the automatic sync work for a Delta Sync Index incurs additional compute cost on top of the endpoint.
