Databricks Feature Store: Online & Offline Features (2026)

A Feature Store is a repository for centrally managing, sharing, and reusing the features that feed ML models. In Databricks it is delivered as "Feature Engineering in Unity Catalog", giving you a feature management platform integrated with Unity Catalog governance. Feature Store topics make up roughly 10-15% of the ML Associate exam, and questions on Feature Table creation, FeatureLookup, and Point-in-Time Join show up frequently.

Problems a Feature Store Solves

In ML projects, building features from data typically consumes the majority of development time. Without a Feature Store, the following problems frequently occur and degrade both model quality and development efficiency.

Problem	Without Feature Store	With Feature Store
Duplicate feature creation	Different teams compute the same features independently	Features created once are shared and reused across the org
Training-Serving Skew	Different logic is used at training time vs. inference time	Features are fetched from the same Feature Table, guaranteeing consistency
Data leakage	Risk of future data leaking in during time-series joins	Automatically prevented by Point-in-Time Join
Feature discoverability	Hard to know which features are available	Searchable and metadata-managed via Unity Catalog
Governance	Access control and audit logs are scattered	Unity Catalog ACL and lineage tracking apply uniformly

Creating a Feature Table

A Feature Table is a Delta Table with a primary key. It is managed in Unity Catalog, with the same access control and lineage tracking that apply to regular tables.

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Create a Feature Table (seed initial data from a DataFrame)
fe.create_table(
    name="catalog.schema.customer_features",
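    primary_keys=["customer_id", "timestamp"],  # Time-series tables in Unity Catalog list the timestamp column among the primary keys
    timestamp_keys=["timestamp"],       # For Point-in-Time Lookup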
    df=customer_features_df,
    description="Customer behavior features (for churn prediction)"
)

# Read the features back as a DataFrame
features_df = fe.read_table(name="catalog.schema.customer_features")

primary_keys is a required parameter that uniquely identifies each row. timestamp_keys is an optional parameter used for Point-in-Time Lookup; specify it when you are working with time-series features.
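
Because the primary key has to identify each row uniquely, it can be useful to check a candidate key for duplicates before creating the table. The following is a minimal sketch in plain Python, not part of the Databricks API; the helper find_duplicate_keys and the sample rows are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows (a list of dicts)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical daily snapshots of customer features
rows = [
    {"customer_id": "C001", "timestamp": "2026-03-14", "total_purchases": 5},
    {"customer_id": "C001", "timestamp": "2026-03-15", "total_purchases": 6},
    {"customer_id": "C002", "timestamp": "2026-03-15", "total_purchases": 1},
]

dup_by_customer = find_duplicate_keys(rows, ["customer_id"])
dup_by_customer_and_time = find_duplicate_keys(rows, ["customer_id", "timestamp"])
```

In this sample, customer_id alone repeats across daily snapshots, while the combination of customer_id and timestamp is unique.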

Writing Features (write_table)

Use fe.write_table() to append or update features in an existing Feature Table. The mode parameter controls the write behavior.

# Add or update features in an existing Feature Table
fe.write_table(
    name="catalog.schema.customer_features",
    df=new_features_df,
    mode="merge"  # Update rows whose primary key matches; insert new rows
)

# To overwrite all data
fe.write_table(
    name="catalog.schema.customer_features",
    df=full_features_df,
    mode="overwrite"
)

merge: rows whose primary key matches are updated (UPSERT); unmatched rows are inserted
overwrite: replaces the entire table contents (for example, a daily batch that recomputes every row)
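
The two modes can be pictured with a small in-memory model. This is only a sketch of the semantics described above, assuming whole-row replacement on a key match; apply_write and the sample data are hypothetical and do not call the Feature Store.

```python
def apply_write(table, new_rows, key, mode):
    """Return the table contents after a write, as a dict keyed by primary key."""
    if mode == "overwrite":
        result = {}              # discard every existing row
    elif mode == "merge":
        result = dict(table)     # keep existing rows, then upsert
    else:
        raise ValueError(f"unsupported mode: {mode}")
    for row in new_rows:
        result[row[key]] = row   # update on key match, insert otherwise
    return result


table = {
    "C001": {"customer_id": "C001", "total_purchases": 5},
    "C002": {"customer_id": "C002", "total_purchases": 1},
}
new_rows = [
    {"customer_id": "C002", "total_purchases": 2},
    {"customer_id": "C003", "total_purchases": 7},
]

merged = apply_write(table, new_rows, "customer_id", "merge")
overwritten = apply_write(table, new_rows, "customer_id", "overwrite")
```

After the merge, C001 is untouched, C002 is updated, and C003 is inserted. After the overwrite, only C002 and C003 remain.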

Feature Lookup

Feature Lookup is the mechanism for assembling a training dataset from Feature Tables. Provide a DataFrame containing your labels along with the Feature Table's primary key, and the join is performed automatically.

from databricks.feature_engineering import FeatureLookup

# Define lookups from multiple Feature Tables
feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.customer_features",
        feature_names=["total_purchases", "avg_session_time", "days_since_last_visit"],
        lookup_key="customer_id"
    ),
    FeatureLookup(
        table_name="catalog.schema.product_features",
        feature_names=["category", "price_tier"],
        lookup_key="product_id"
    )
]

# Build the training dataset
training_set = fe.create_training_set(
    df=label_df,          # DataFrame containing labels and primary keys
    feature_lookups=feature_lookups,
    label="churn",
    exclude_columns=["customer_id", "product_id"]  # Lookup keys to drop from model input
)

training_df = training_set.load_df()
print(training_df.columns)
# ['total_purchases', 'avg_session_time', 'days_since_last_visit',
#  'category', 'price_tier', 'churn']

Building the training set this way means that training and inference reference the same Feature Table, which prevents Training-Serving Skew (the mismatch between training-time and serving-time features).
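
Conceptually, the lookup behaves like a left join from the label rows to each Feature Table on its lookup key. The sketch below imitates that with plain dictionaries; build_training_rows and the sample tables are hypothetical, and filling None for an unmatched key is an assumption of this sketch rather than a statement about the library's internals.

```python
def build_training_rows(label_rows, lookups, exclude_columns=()):
    """Left-join each label row against feature tables by lookup key."""
    out = []
    for label_row in label_rows:
        row = dict(label_row)
        for table, lookup_key, feature_names in lookups:
            features = table.get(label_row[lookup_key], {})
            for name in feature_names:
                row[name] = features.get(name)  # None when the key has no match
        for col in exclude_columns:
            row.pop(col, None)
        out.append(row)
    return out


customer_features = {"C001": {"total_purchases": 5, "avg_session_time": 12.5}}
product_features = {"P100": {"category": "books", "price_tier": "low"}}

label_rows = [
    {"customer_id": "C001", "product_id": "P100", "churn": 1},
    {"customer_id": "C999", "product_id": "P100", "churn": 0},  # unknown customer
]
lookups = [
    (customer_features, "customer_id", ["total_purchases", "avg_session_time"]),
    (product_features, "product_id", ["category", "price_tier"]),
]

training_rows = build_training_rows(
    label_rows, lookups, exclude_columns=["customer_id", "product_id"]
)
```

Each output row carries the label plus the looked-up features, with the key columns removed.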

Point-in-Time Join

Point-in-Time Join is the mechanism that prevents data leakage when working with time-series features. It joins only the features that were actually available at the label timestamp of each training sample, keeping future information out of training.

# Configure Point-in-Time Lookup
feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.customer_features",
        feature_names=["total_purchases", "avg_session_time"],
        lookup_key="customer_id",
        timestamp_lookup_key="event_timestamp"  # Timestamp column on the label side
    )
]

# Label DataFrame (each row needs a timestamp column)
# customer_id | event_timestamp       | churn
# C001        | 2026-03-15 10:00:00   | 1
training_set = fe.create_training_set(
    df=label_df,
    feature_lookups=feature_lookups,
    label="churn"
)

In the example above, for a row whose label timestamp is 2026-03-15 10:00, only the most recent features in the Feature Table from before 2026-03-15 10:00 are joined. Features computed at 03-15 10:01 or later are not used. On the ML Associate exam, questions asking what Point-in-Time Join prevents (data leakage) are frequent.
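
The lookup rule can be sketched as an "as of" search over one customer's feature history. This is an illustration in plain Python, not the Feature Store implementation; latest_as_of and the sample history are hypothetical, and treating a feature row stamped exactly at the label time as eligible is an assumption of this sketch.

```python
from bisect import bisect_right
from datetime import datetime


def latest_as_of(history, as_of):
    """Return the value whose timestamp is the greatest one <= as_of, else None.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Feature history for customer C001
history = [
    (datetime(2026, 3, 14, 0, 0), {"total_purchases": 5}),
    (datetime(2026, 3, 15, 0, 0), {"total_purchases": 6}),
    (datetime(2026, 3, 15, 10, 1), {"total_purchases": 9}),  # after the label time
]

label_time = datetime(2026, 3, 15, 10, 0)
joined = latest_as_of(history, label_time)
too_early = latest_as_of(history, datetime(2026, 3, 13, 0, 0))
```

The row computed at 10:01 is skipped even though it is the newest in the table, which is exactly the leakage that a plain key-only join would let through.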

Integrating Models with the Feature Store

When you log a model with fe.log_model(), the reference to the Feature Table (lineage) is automatically recorded in MLflow. At inference time the model automatically fetches features from the Feature Table, so callers only need to pass the primary keys.

import mlflow

# Train and log a model with the Feature Store
with mlflow.start_run():
    model = train_model(training_df)

    fe.log_model(
        model=model,
        artifact_path="model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="catalog.schema.churn_model"
    )

# Inference: pass a DataFrame with only primary keys; features are auto-fetched from the Feature Table
predictions = fe.score_batch(
    model_uri="models:/catalog.schema.churn_model/1",
    df=inference_df  # customer_id column is sufficient
)

The difference between fe.log_model() and mlflow.sklearn.log_model() is a frequent exam topic. Only fe.log_model() records the Feature Store linkage on the model, which is what enables automatic feature retrieval at inference time.
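
One way to picture the difference: a model logged with the Feature Store linkage carries its lookup definition with it, so a caller supplies keys only. The class below is a hypothetical stand-in for that idea in plain Python; PackagedModel, predict_churn, and the sample table are illustrative names and do not reflect how MLflow or the Feature Store package models internally.

```python
class PackagedModel:
    """Bundles a predict function with the feature lookup used in training."""

    def __init__(self, predict_fn, feature_table, lookup_key, feature_names):
        self.predict_fn = predict_fn
        self.feature_table = feature_table
        self.lookup_key = lookup_key
        self.feature_names = feature_names

    def score(self, key_rows):
        """Fetch features for each key row, then predict."""
        results = []
        for row in key_rows:
            features = self.feature_table[row[self.lookup_key]]
            vector = [features[name] for name in self.feature_names]
            results.append(self.predict_fn(vector))
        return results


def predict_churn(vector):
    """Toy rule: predict churn when the last visit was over 30 days ago."""
    _total_purchases, days_since_last_visit = vector
    return 1 if days_since_last_visit > 30 else 0


feature_table = {
    "C001": {"total_purchases": 5, "days_since_last_visit": 45},
    "C002": {"total_purchases": 20, "days_since_last_visit": 2},
}

packaged = PackagedModel(
    predict_churn,
    feature_table,
    lookup_key="customer_id",
    feature_names=["total_purchases", "days_since_last_visit"],
)
predictions = packaged.score([{"customer_id": "C001"}, {"customer_id": "C002"}])
```

A model logged without the linkage would correspond to calling predict_churn directly, which leaves the caller responsible for assembling the feature vector.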

Online Feature Serving

Real-time inference (Model Serving) requires low-latency feature retrieval. Databricks Online Feature Serving makes Feature Table data directly available to real-time inference endpoints.

Aspect	Offline Store (Delta Table)	Online Store
Use case	Batch training and analytics	Real-time inference
Latency	Seconds to minutes	Milliseconds
Access pattern	Large row scans	Point lookups by primary key
Data sync	—	Auto-synced from the offline store

from databricks.feature_engineering.online_store_spec import (
    AmazonDynamoDBSpec
)

# Configure the online store (DynamoDB example)
online_store_spec = AmazonDynamoDBSpec(
    region="ap-northeast-1",
    table_name="customer_features_online"
)

# Publish the Feature Table to the online store
fe.publish_table(
    name="catalog.schema.customer_features",
    online_store=online_store_spec,
    mode="merge"
)
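
The access pattern in the table above can be sketched with a dictionary standing in for the online store. This is a plain-Python illustration under the assumption that the online side keeps only the latest row per primary key; publish_merge and the sample rows are hypothetical and do not touch DynamoDB or the Databricks API.

```python
def publish_merge(online_store, offline_rows, key, ts_col):
    """Upsert offline rows into the online store, keeping the latest row per key."""
    for row in offline_rows:
        current = online_store.get(row[key])
        if current is None or row[ts_col] >= current[ts_col]:
            online_store[row[key]] = row
    return online_store


# Offline rows: full history, suited to scans and training
offline_rows = [
    {"customer_id": "C001", "timestamp": "2026-03-14", "total_purchases": 5},
    {"customer_id": "C001", "timestamp": "2026-03-15", "total_purchases": 6},
    {"customer_id": "C002", "timestamp": "2026-03-15", "total_purchases": 1},
]

online_store = publish_merge({}, offline_rows, "customer_id", "timestamp")

# Real-time inference does a point lookup by primary key
latest_c001 = online_store["C001"]
```

The ISO-formatted date strings compare correctly as text, which keeps the sketch free of date parsing.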

Legacy Workspace vs. Unity Catalog Version

Aspect	Workspace Feature Store (legacy)	Feature Engineering in UC (current)
Scope	Limited to a single workspace	Entire Unity Catalog metastore
Cross-workspace sharing	Requires replication	Shareable natively
Access control	Workspace ACL	Unity Catalog ACL (GRANT/REVOKE)
Lineage	Limited	Auto-tracked by Unity Catalog Lineage
Client API	FeatureStoreClient	FeatureEngineeringClient
Databricks recommendation	Deprecated (migration recommended)	Recommended

Exam questions often ask which to use between FeatureStoreClient and FeatureEngineeringClient. For any new project, always choose the Unity Catalog version: FeatureEngineeringClient.

ML Associate Exam Focus Points

fe.create_table(): role and configuration of primary_keys and timestamp_keys
FeatureLookup: how to specify lookup_key and configure lookups across multiple tables
Point-in-Time Join: definition of data leakage and how timestamp_lookup_key prevents it
fe.log_model() vs mlflow.sklearn.log_model(): difference in whether Feature Store linkage is recorded
write_table modes: when to use merge vs. overwrite
Training-Serving Skew: how a Feature Store solves this problem

Sample Question

Feature Store / ML Associate

Question 1

A data scientist is building a churn prediction model that uses a customer's total spend over the past 30 days (daily_total_spend) as a feature. The feature is updated by a daily batch, and the label data contains the churn event timestamp for each customer. Which approach is the most appropriate for building the training dataset?

A. Create the Feature Table without timestamp_keys and join using only the lookup_key in FeatureLookup
B. Create the Feature Table with timestamp_keys and perform a Point-in-Time Join by specifying timestamp_lookup_key in FeatureLookup
C. Skip FeatureLookup and join the Feature Table with the label DataFrame directly using a Spark JOIN clause
D. Merge the label DataFrame into the Feature Table with fe.write_table() and use it as a single table for training

Correct answer: B

When joining daily-updated time-series features with label data that contains churn event timestamps, Point-in-Time Join is required. By setting timestamp_keys on the Feature Table and specifying the label's event timestamp column as timestamp_lookup_key in FeatureLookup, only the features that were actually available at each label row's event time are joined. A normal JOIN without timestamp_keys (A) causes data leakage because future daily_total_spend values are used. A Spark JOIN (C) loses Feature Store lineage management and automatic feature retrieval at inference. Merging labels into the Feature Table (D) is an inappropriate Feature Table design.

Frequently Asked Questions

What is the difference between a Feature Store and Feature Engineering?

Feature Engineering refers to the entire process of creating input data (features) for ML models. A Feature Store is a repository for centrally managing, sharing, and reusing those features. In Databricks this is delivered as Feature Engineering in Unity Catalog — an integrated capability that covers both creating features (Engineering) and managing them (Store).

What is the difference between the legacy Workspace Feature Store and Unity Catalog Feature Engineering?

The legacy Workspace Feature Store was scoped to a single workspace, so sharing features across workspaces required replication. With Unity Catalog Feature Engineering, Feature Tables are managed as Unity Catalog tables, giving you catalog- and schema-level access control, lineage tracking, and cross-workspace sharing natively. Databricks recommends the UC version for all new projects.

Where do Feature Store topics appear in Databricks exams?

On the ML Associate exam, roughly 10-15% of the questions are Feature Store related, focusing on Feature Table creation, FeatureLookup, and Point-in-Time Join fundamentals. The ML Professional exam covers more advanced design patterns: online/offline Feature Serving architecture, real-time feature pipelines, and strategies for preventing Training-Serving Skew.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
