RAG on Databricks: Vector Search + LLM (2026)

RAG (Retrieval-Augmented Generation) is an architecture that pairs an LLM's generation capability with retrieval over external knowledge, which reduces hallucinations and lets answers draw on up-to-date sources. On the Databricks GenAI Engineer exam, roughly 30-40% of the questions cover RAG, testing your understanding of architecture design, component selection, and evaluation methods.

Overall RAG Architecture

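A RAG pipeline consists of two stages: a retrieval phase and a generation phase. In the retrieval phase, the Embedding model converts the user's query into a vector and Vector Search returns the Top-K most similar document chunks. In the generation phase, those chunks are inserted into a prompt template together with the query, and the resulting prompt is sent to the LLM.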

┌─────────────────────────────────────────────────────────────┐
│                    RAG Pipeline                             │
│                                                             │
│  User Query                                                 │
│      │                                                      │
│      ▼                                                      │
│  ┌──────────────┐    ┌──────────────────┐                   │
│  │ Embedding    │───▶│ Vector Search    │                   │
│  │ Model        │    │ (similarity)     │                   │
│  └──────────────┘    └───────┬──────────┘                   │
│                              │ Top-K chunks                 │
│                              ▼                              │
│                     ┌──────────────────┐                    │
│                     │ Prompt Template  │                    │
│                     │ (Query+Context)  │                    │
│                     └───────┬──────────┘                    │
│                             │                               │
│                             ▼                               │
│                     ┌──────────────────┐                    │
│                     │ LLM (generation) │                    │
│                     └───────┬──────────┘                    │
│                             │                               │
│                             ▼                               │
│                        Response                             │
└─────────────────────────────────────────────────────────────┘

This pipeline design lets the LLM access information it did not see at training time (internal documents, fresh data, etc.) and produce grounded answers.
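The two phases in the diagram can be sketched as a single function that takes the retriever and the LLM as plain callables. This is a minimal illustration under stated assumptions, not a Databricks API: `answer_with_rag`, `keyword_retrieve`, `echo_llm`, and the `DOCS` list are hypothetical stand-ins so the flow runs without any external service.

```python
from typing import Callable, List


def answer_with_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run the retrieval phase, build the prompt, then run the generation phase."""
    chunks = retrieve(query, top_k)          # retrieval phase: Top-K chunks
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                  # generation phase


# Hypothetical in-memory corpus standing in for a Vector Search index.
DOCS = [
    "Cluster policies define constraints on cluster creation settings.",
    "Delta Sync Index keeps a vector index in sync with a Delta table.",
    "Unity Catalog governs access to data and AI assets.",
]


def keyword_retrieve(query: str, k: int) -> List[str]:
    # Toy retriever: rank documents by the number of shared lowercase words.
    words = set(query.lower().split())
    ranked = sorted(
        DOCS,
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def echo_llm(prompt: str) -> str:
    # Toy "LLM": returns the first context line instead of generating text.
    return prompt.splitlines()[1]


print(answer_with_rag("What are cluster policies?", keyword_retrieve, echo_llm, top_k=1))
```

In a real pipeline, `retrieve` would wrap a Vector Search query and `generate` would wrap a model serving call; keeping them as injected callables makes each phase testable on its own.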

Comparing Chunking Strategies

Before storing documents in Vector Search, you need to split them into appropriately sized chunks. Chunking strategy directly affects retrieval accuracy, so the right choice depends on the characteristics of the document.

Strategy	Split Criteria	Pros	Cons	When to Use
Fixed-size	Fixed token count (e.g., 512 tokens)	Simple to implement, fast to process	May cut sentences in the middle	Uniform documents (FAQs, logs)
Semantic	Semantic sentence boundaries	High semantic coherence	Additional Embedding model cost	Technical documents, papers
Recursive	Hierarchical split: paragraph → sentence → token	Preserves structure while controlling size	Requires parameter tuning	Markdown and structured HTML documents

The GenAI Engineer exam asks which chunking strategy to pick based on the type of document. For example, Recursive fits a document with heading structure (like an internal wiki), while Fixed-size suits a collection of short FAQs.
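To make the difference between Fixed-size and Recursive concrete, here is a simplified sketch of both. It assumes whitespace-separated words as a rough proxy for model tokens, and the function names `fixed_size_chunks` and `recursive_chunks` are illustrative. Production splitters also merge small neighbouring pieces, which this sketch does not do.

```python
from typing import List, Sequence


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of `size` whitespace tokens with optional overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def recursive_chunks(
    text: str,
    max_tokens: int,
    separators: Sequence[str] = ("\n\n", ". "),
) -> List[str]:
    """Split on the coarsest separator first and recurse into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_tokens:
        return [text]
    if not separators:
        # No structural boundary left: fall back to fixed-size splitting.
        return fixed_size_chunks(text, max_tokens)
    chunks: List[str] = []
    for piece in text.split(separators[0]):
        chunks.extend(recursive_chunks(piece, max_tokens, separators[1:]))
    return chunks


doc = "Intro paragraph here.\n\nSecond paragraph is longer. It has two sentences."
print(fixed_size_chunks(doc, size=5))
print(recursive_chunks(doc, max_tokens=5))
```

The fixed-size output cuts across the paragraph break, while the recursive output keeps the first paragraph whole and splits only the second at its sentence boundary.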

Choosing an Embedding Model

Embedding models convert text into a vector space and form the foundation of similarity search. On Databricks, you can use models served by the Foundation Model API or call external APIs.

Model	Dimensions	Japanese Support	Notes
BGE-large-en	1024	Limited	Open-source, self-hostable
Instructor	768	Limited	Task-instruction-aware Embedding
OpenAI text-embedding-3	256-3072 (variable)	Supported	High accuracy, usage-based API pricing
GTE-large-en (provided by Databricks)	1024	Limited (English-focused model)	Ready to use via the Foundation Model API

If you want to stay entirely within Databricks, the GTE model on the Foundation Model API is the easiest to integrate. You can also use the External Model feature to call external APIs such as OpenAI.
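Whichever model you choose, similarity search reduces to comparing the query vector with the stored document vectors, so both must come from the same model and have the same dimension. The sketch below shows cosine similarity and Top-K ranking on toy 3-dimensional vectors; the vectors and the helper names `cosine_similarity` and `top_k` are illustrative assumptions, not output from any real model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
print(top_k(query, docs, k=2))
```

A managed index performs this ranking with approximate nearest-neighbour search rather than a full scan, but the scoring idea is the same.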

Configuring Vector Search

Databricks Vector Search is a managed vector database that offers similarity search under Unity Catalog-integrated access control. The index type determines how the index is kept up to date, and therefore how much operational work the RAG pipeline requires.

Item	Delta Sync Index	Direct Vector Access Index
Data Source	Delta Table (auto-synced)	Direct writes via REST API
Update Mode	Auto-updates when the source table changes	Manually insert and update vectors
Embedding	Auto-computed (specify a model) or precomputed column	Precomputed vectors only
When to Use	RAG over internal documents (periodic updates)	Real-time integration with external systems
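As a rough sketch of what a Delta Sync Index with auto-computed embeddings looks like in code, the helper below assembles the arguments for `create_delta_sync_index` in the `databricks-vectorsearch` Python SDK. The keyword names reflect that SDK as understood at the time of writing and should be checked against the current documentation. Every endpoint, table, and column name here is a hypothetical placeholder, and the source Delta table is assumed to have Change Data Feed enabled.

```python
from typing import Any, Dict


def delta_sync_index_spec(
    endpoint_name: str,
    index_name: str,
    source_table_name: str,
    primary_key: str,
    text_column: str,
    embedding_endpoint: str,
) -> Dict[str, Any]:
    """Collect keyword arguments for a Delta Sync Index with managed embeddings."""
    return {
        "endpoint_name": endpoint_name,                # Vector Search endpoint
        "index_name": index_name,                      # Unity Catalog name of the index
        "source_table_name": source_table_name,        # Delta table to sync from
        "pipeline_type": "TRIGGERED",                  # sync on demand rather than continuously
        "primary_key": primary_key,
        "embedding_source_column": text_column,        # text column to embed automatically
        "embedding_model_endpoint_name": embedding_endpoint,
    }


# All names below are hypothetical placeholders.
spec = delta_sync_index_spec(
    endpoint_name="rag_endpoint",
    index_name="main.docs.wiki_chunks_index",
    source_table_name="main.docs.wiki_chunks",
    primary_key="chunk_id",
    text_column="chunk_text",
    embedding_endpoint="databricks-gte-large-en",
)
print(spec)

# On a Databricks workspace with `databricks-vectorsearch` installed:
# from databricks.vector_search.client import VectorSearchClient
# VectorSearchClient().create_delta_sync_index(**spec)
```

A Direct Vector Access Index would instead be created with an explicit schema and embedding dimension, and your own code would upsert the precomputed vectors.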

Prompt Engineering Templates

In RAG, the prompt you hand to the LLM determines answer quality. Passing the retrieved context and the user's query in a structured way helps suppress hallucinations.

# Example RAG prompt template

prompt_template = """
You are an assistant that answers based on internal documents.
Use ONLY the context below to answer the question.
If the context does not contain the answer, reply "Information not found."

## Context
{context}

## Question
{query}

## Answer
"""

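# Running it on the Databricks Foundation Model API
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# retrieved_context (text of the Top-K chunks from Vector Search) and
# user_query (the user's question) are assumed to be defined earlier.
response = client.predict(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    inputs={
        "messages": [
            {"role": "system", "content": "Answer based on the internal documents."},
            {"role": "user", "content": prompt_template.format(
                context=retrieved_context,
                query=user_query
            )}
        ],
        "max_tokens": 1024,
        "temperature": 0.1  # low temperature keeps the answer close to the context
    }
)

# Chat endpoints return an OpenAI-style payload; the answer text is in the first choice.
print(response["choices"][0]["message"]["content"])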

Setting temperature low (0.0-0.2) makes the output more deterministic, which helps answers stay faithful to the context. The exam expects you to know why a prompt template should include a constraint such as "do not use information outside the context".
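The `{context}` slot in the template has to be filled from the retrieved chunks, and the result must fit within the model's context window. One possible approach, sketched below, numbers the chunks in rank order and stops adding them once a character budget is reached. `build_context` and the budget value are illustrative assumptions; a real pipeline would usually budget in model tokens rather than characters.

```python
from typing import List


def build_context(chunks: List[str], max_chars: int) -> str:
    """Number chunks in rank order and stop before exceeding the character budget."""
    parts: List[str] = []
    used = 0
    for rank, chunk in enumerate(chunks, start=1):
        entry = f"[{rank}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # lower-ranked chunks are dropped first
        parts.append(entry)
        used += len(entry) + 1  # +1 for the newline that joins entries
    return "\n".join(parts)


print(build_context(["alpha", "beta", "gamma"], max_chars=20))
```

Numbering the chunks also gives the LLM a simple handle for referring to a specific passage, which helps when answers need to point back at their sources.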

RAG Evaluation Metrics and MLflow evaluate()

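To quantitatively evaluate RAG pipeline quality, use the following three metrics. Faithfulness and relevance are available as LLM-judged metrics in mlflow.metrics.genai and can be passed to MLflow evaluate(); retrieval precision is measured on the retrieval step itself.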

Metric	What It Measures	Judgment Criterion
Faithfulness	Whether the answer is faithful to the context	Whether each sentence of the answer is supported by the context
Answer Relevance	Whether the answer is relevant to the question	Whether the answer correctly captures the intent of the question
Context Precision	Retrieval precision	Whether the retrieved chunks contain ones relevant to the question
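Context Precision in the table above can be approximated offline when you have labelled which chunks are relevant to each question. The sketch below uses the simplest form, the fraction of retrieved chunks that are relevant. This is an assumption about how to operationalise the metric: rank-weighted variants also exist, and the chunk IDs are made up for illustration.

```python
from typing import Sequence, Set


def context_precision(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Fraction of retrieved chunks that are labelled relevant to the question."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


# Hypothetical chunk IDs: two of the four retrieved chunks are relevant.
print(context_precision(["c1", "c7", "c3", "c9"], {"c1", "c3"}))
```

A low score here points at the retrieval side (chunking, embedding model, or Top-K setting) rather than at the prompt or the LLM.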

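import mlflow
import pandas as pd

# Build a static evaluation dataset (one row per question).
# The trailing `...` in each list stands for further rows elided here.
eval_data = pd.DataFrame({
    "questions": ["What are Databricks cluster policies?", ...],
    "ground_truth": ["A feature that defines constraints on cluster creation settings.", ...],
    "retrieved_context": [retrieved_chunks_list, ...],
    "generated_answers": [rag_responses_list, ...]
})

# The answers are already generated, so point evaluate() at the columns
# instead of passing a model.
results = mlflow.evaluate(
    data=eval_data,
    predictions="generated_answers",
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
    extra_metrics=[
        mlflow.metrics.genai.faithfulness(),
        mlflow.metrics.genai.relevance(),
    ],
    # Map the judge metrics' expected "inputs" and "context" fields to our column names.
    evaluator_config={"col_mapping": {"inputs": "questions", "context": "retrieved_context"}},
)

print(results.metrics)
# Keys include 'faithfulness/v1/mean' and 'relevance/v1/mean'; judge scores are on a 1-5 scale.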

The mlflow.metrics.genai metrics use the LLM-as-a-Judge pattern: a separate LLM scores the quality of each answer on a 1-5 scale. The aggregated scores can then serve as a threshold-based quality gate for the pipeline.
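A threshold-based quality gate can be as small as a function that compares the aggregated metrics with minimum values and reports which ones fall short. The sketch below assumes a metrics dictionary shaped like `results.metrics`; the helper name `failed_gates`, the scores, and the thresholds are made-up values for illustration only.

```python
from typing import Dict, List


def failed_gates(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


# Illustrative numbers on the 1-5 judge scale, not real evaluation results.
metrics = {"faithfulness/v1/mean": 4.6, "relevance/v1/mean": 3.7}
thresholds = {"faithfulness/v1/mean": 4.0, "relevance/v1/mean": 4.0}

failures = failed_gates(metrics, thresholds)
print(failures or "all gates passed")
```

In a deployment job, a non-empty result would typically fail the run so that a regression in answer quality blocks promotion of the new pipeline version.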

RAG vs Fine-tuning Comparison

Dimension	RAG	Fine-tuning
Knowledge Updates	Reflected immediately by updating documents	Requires retraining
Citing Sources	Can cite source documents	Embedded in the model; cannot be cited
Cost	Vector Search plus LLM call at inference time	Training cost plus inference cost
Latency	Slightly slower due to the retrieval step	Fast (a single model call)
Hallucinations	Suppressed via context constraints	More likely on questions outside the training data
Example Use Cases	Internal Q&A, document search	Code generation, domain-specific style transfer

Key Topics on the GenAI Engineer Exam

Choosing a chunking strategy: picking Fixed / Semantic / Recursive based on document characteristics
Choosing an index type: criteria for Delta Sync Index vs Direct Vector Access Index
Understanding evaluation metrics: definitions and measurement methods for Faithfulness, Relevance, and Context Precision
RAG vs Fine-tuning: picking the right approach for the use case
Prompt design: building a prompt template that includes context constraints
MLflow evaluate(): evaluation methods for RAG pipeline quality

Because RAG accounts for roughly 30-40% of the GenAI Engineer exam, architectural understanding alone is not enough: you need practical judgment about which component to choose in a given scenario.

Sample Question

RAG / GenAI Engineer

Question 1

A company is building a Q&A chatbot powered by its internal knowledge base (thousands of pages on Confluence). The documents are updated weekly and every answer must include a link to the source page. Which approach best fits these requirements?

A. Fine-tune the LLM on all internal documents and retrain it periodically
B. Build a RAG pipeline, auto-sync weekly updates via a Delta Sync Index, and pull source URLs from the retrieved chunks' metadata
C. Stuff the entire document set directly into the LLM's context window
D. Pass document summaries to the LLM as a few-shot prompt every time

Answer: B

For a Q&A bot driven by weekly-updated documents, RAG is the best fit because document changes are reflected without retraining the model. With a Delta Sync Index, source-table changes are synced into Vector Search automatically, and chunk metadata (e.g., source URLs) is preserved so you can surface source links. Fine-tuning (A) makes citing sources hard and retraining is costly. Stuffing all documents directly (C) exceeds the context window for a knowledge base of thousands of pages. Few-shot summaries (D) cannot guarantee coverage or accuracy.
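The source-link requirement in this scenario can be met by carrying each chunk's URL through retrieval and appending the distinct URLs to the generated answer. The sketch below assumes each retrieved chunk is a dictionary with a `source_url` field. That field name, the `append_sources` helper, and the example URLs are hypothetical.

```python
from typing import Dict, List


def append_sources(answer: str, chunks: List[Dict[str, str]]) -> str:
    """Append the distinct source URLs of the retrieved chunks, in rank order."""
    urls: List[str] = []
    for chunk in chunks:
        url = chunk.get("source_url")
        if url and url not in urls:
            urls.append(url)  # keep the first occurrence only
    if not urls:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(f"- {u}" for u in urls)


# Hypothetical retrieved chunks; two come from the same page.
retrieved = [
    {"text": "Policies limit cluster settings.", "source_url": "https://example.com/wiki/policies"},
    {"text": "Policies are assigned per group.", "source_url": "https://example.com/wiki/policies"},
    {"text": "Admins manage policies.", "source_url": "https://example.com/wiki/admin"},
]
print(append_sources("Cluster policies constrain cluster creation.", retrieved))
```

Because the links come from retrieval metadata rather than from the LLM's output, they cannot be hallucinated by the model.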

Frequently Asked Questions

When should I use RAG vs Fine-tuning?

RAG is the right choice when you need access to fresh information or external knowledge such as internal documents. Fine-tuning fits when you want to permanently change the model's output format, tone, or domain vocabulary. The Databricks GenAI Engineer exam frequently tests this decision criterion. Hybrid setups that combine both approaches are also effective in production.

What components are required to build RAG on Databricks?

At minimum you need four components: (1) a pipeline that splits documents into chunks, (2) an Embedding model (Foundation Model API or external API), (3) a Vector Search Index (Delta Sync Index or Direct Vector Access Index), and (4) an LLM (Foundation Model API or External Model). It is also recommended to add a quality evaluation pipeline using MLflow evaluate().

What evaluation metrics are used for RAG?

There are three main metrics: Faithfulness (whether the generated answer is faithful to the context), Answer Relevance (whether the answer properly addresses the user's question), and Context Precision (whether the retrieved context contains chunks relevant to the question). On Databricks, Faithfulness and relevance can be scored automatically through MLflow evaluate() using the LLM-as-a-Judge pattern, while Context Precision is measured on the retrieval results.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
