What Is a Feature Store? Databricks Feature Management Guide

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

A Feature Store is a repository for centrally managing, sharing, and reusing the features that feed ML models. In Databricks it is delivered as "Feature Engineering in Unity Catalog", giving you a feature management platform integrated with Unity Catalog governance. Feature Store topics make up roughly 10-15% of the ML Associate exam, and questions on Feature Table creation, FeatureLookup, and Point-in-Time Join show up frequently.

Problems a Feature Store Solves

In ML projects, building features from data typically consumes the majority of development time. Without a Feature Store, the following problems frequently occur and degrade both model quality and development efficiency.

Problem | Without Feature Store | With Feature Store
Duplicate feature creation | Different teams compute the same features independently | Features created once are shared and reused across the org
Training-Serving Skew | Different logic is used at training time vs. inference time | Features are fetched from the same Feature Table, guaranteeing consistency
Data leakage | Risk of future data leaking in during time-series joins | Automatically prevented by Point-in-Time Join
Feature discoverability | Hard to know which features are available | Searchable and metadata-managed via Unity Catalog
Governance | Access control and audit logs are scattered | Unity Catalog ACL and lineage tracking apply uniformly

Creating a Feature Table

A Feature Table is a Delta Table with a primary key. It is managed in Unity Catalog, with the same access control and lineage tracking that apply to regular tables.

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Create a Feature Table (seed initial data from a DataFrame)
fe.create_table(
    name="catalog.schema.customer_features",
    primary_keys=["customer_id"],
    timestamp_keys=["timestamp"],       # For Point-in-Time Lookup
    df=customer_features_df,
    description="Customer behavior features (for churn prediction)"
)

# Read the features back as a DataFrame
features_df = fe.read_table(name="catalog.schema.customer_features")

primary_keys is a required parameter that uniquely identifies each row. timestamp_keys is an optional parameter used for Point-in-Time Lookup; specify it when you are working with time-series features.
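
Because the primary key has to identify each row uniquely, it can help to check the source data for duplicate keys before creating the table. The following is a minimal sketch in plain Python, not part of the Databricks API; the helper find_duplicate_keys and the sample rows are hypothetical. It also shows that for time-series rows, the entity key alone repeats, while the entity key plus the timestamp distinguishes each row.

```python
from collections import Counter


def find_duplicate_keys(rows, primary_keys):
    """Return the primary-key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[k] for k in primary_keys) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical time-series feature rows: C001 has one row per day.
rows = [
    {"customer_id": "C001", "timestamp": "2026-03-14", "total_purchases": 3},
    {"customer_id": "C001", "timestamp": "2026-03-15", "total_purchases": 4},
    {"customer_id": "C002", "timestamp": "2026-03-15", "total_purchases": 1},
]

# customer_id alone is not unique across days.
dupes_by_id = find_duplicate_keys(rows, ["customer_id"])
# customer_id together with timestamp is unique.
dupes_by_id_and_time = find_duplicate_keys(rows, ["customer_id", "timestamp"])

print(dupes_by_id)
print(dupes_by_id_and_time)
```

An empty result means the chosen columns are safe to use as a key for this data.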

Writing Features (write_table)

Use fe.write_table() to append or update features in an existing Feature Table. The mode parameter controls the write behavior.

# Add or update features in an existing Feature Table
fe.write_table(
    name="catalog.schema.customer_features",
    df=new_features_df,
    mode="merge"  # Update rows whose primary key matches; insert new rows
)

# To overwrite all data
fe.write_table(
    name="catalog.schema.customer_features",
    df=full_features_df,
    mode="overwrite"
)
  • merge: rows whose primary key matches are updated (UPSERT); unmatched rows are inserted
  • overwrite: replaces the entire table (used for daily batch updates)
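
The difference between the two modes can be illustrated without a cluster. The sketch below models a table as a plain dict keyed by primary key; write_rows and the sample data are hypothetical stand-ins that only mimic the semantics described above, not the actual Delta or Feature Engineering implementation.

```python
def write_rows(table, new_rows, key, mode="merge"):
    """Return a new table (dict keyed by primary key) after applying new_rows.

    merge:     keep existing rows, update matching keys, insert new keys.
    overwrite: discard existing rows and keep only new_rows.
    """
    if mode == "merge":
        result = dict(table)
    elif mode == "overwrite":
        result = {}
    else:
        raise ValueError(f"unsupported mode: {mode}")
    for row in new_rows:
        result[row[key]] = dict(row)
    return result


existing = {
    "C001": {"customer_id": "C001", "total_purchases": 3},
    "C002": {"customer_id": "C002", "total_purchases": 1},
}
incoming = [
    {"customer_id": "C002", "total_purchases": 2},  # matches an existing key
    {"customer_id": "C003", "total_purchases": 7},  # new key
]

merged = write_rows(existing, incoming, "customer_id", mode="merge")
overwritten = write_rows(existing, incoming, "customer_id", mode="overwrite")

print(sorted(merged))
print(sorted(overwritten))
```

With merge, C001 survives untouched and C002 is updated; with overwrite, C001 is gone because it was not in the incoming batch.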

Feature Lookup

Feature Lookup is the mechanism for assembling a training dataset from Feature Tables. Provide a DataFrame containing your labels along with the Feature Table's primary key, and the join is performed automatically.

from databricks.feature_engineering import FeatureLookup

# Define lookups from multiple Feature Tables
feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.customer_features",
        feature_names=["total_purchases", "avg_session_time", "days_since_last_visit"],
        lookup_key="customer_id"
    ),
    FeatureLookup(
        table_name="catalog.schema.product_features",
        feature_names=["category", "price_tier"],
        lookup_key="product_id"
    )
]

# Build the training dataset
training_set = fe.create_training_set(
    df=label_df,          # DataFrame containing labels and primary keys
    feature_lookups=feature_lookups,
    label="churn",
    exclude_columns=["customer_id", "product_id"]  # Key columns to drop from model input
)

training_df = training_set.load_df()
print(training_df.columns)
# ['total_purchases', 'avg_session_time', 'days_since_last_visit',
#  'category', 'price_tier', 'churn']

This guarantees that training and inference reference the same Feature Table, preventing Training-Serving Skew (the mismatch between training-time and serving-time features).
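
Conceptually, a feature lookup behaves like a keyed left join from the label rows into each Feature Table, followed by dropping the excluded columns. The sketch below shows that idea in plain Python. The function lookup_features, the in-memory tables, and the choice to fill unmatched keys with None are assumptions made for illustration, not the library's implementation.

```python
def lookup_features(label_rows, lookups, exclude_columns=()):
    """Join feature columns onto label rows by key, then drop excluded columns.

    lookups is a list of (feature_table, lookup_key, feature_names), where
    feature_table maps a key value to a dict of feature values.
    """
    out = []
    for row in label_rows:
        joined = dict(row)
        for table, lookup_key, feature_names in lookups:
            features = table.get(row[lookup_key], {})
            for name in feature_names:
                joined[name] = features.get(name)  # None when the key has no match
        out.append({k: v for k, v in joined.items() if k not in exclude_columns})
    return out


# Hypothetical in-memory stand-ins for two Feature Tables.
customer_features = {"C001": {"total_purchases": 12, "avg_session_time": 4.5}}
product_features = {"P10": {"category": "books", "price_tier": "low"}}

label_rows = [
    {"customer_id": "C001", "product_id": "P10", "churn": 1},
    {"customer_id": "C999", "product_id": "P10", "churn": 0},  # unknown customer
]

training_rows = lookup_features(
    label_rows,
    [
        (customer_features, "customer_id", ["total_purchases"]),
        (product_features, "product_id", ["category"]),
    ],
    exclude_columns=("customer_id", "product_id"),
)
print(training_rows)
```

Because the same lookup definition can be replayed at inference time, the feature logic does not have to be reimplemented on the serving side.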

Point-in-Time Join

Point-in-Time Join is the mechanism that prevents data leakage when working with time-series features. It joins only the features that were actually available at the label timestamp of each training sample, keeping future information out of training.

# Configure Point-in-Time Lookup
feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.customer_features",
        feature_names=["total_purchases", "avg_session_time"],
        lookup_key="customer_id",
        timestamp_lookup_key="event_timestamp"  # Timestamp column on the label side
    )
]

# Label DataFrame (each row needs a timestamp column)
# customer_id | event_timestamp       | churn
# C001        | 2026-03-15 10:00:00   | 1
training_set = fe.create_training_set(
    df=label_df,
    feature_lookups=feature_lookups,
    label="churn"
)

In the example above, for a row whose label timestamp is 2026-03-15 10:00, only the most recent features in the Feature Table from before 2026-03-15 10:00 are joined. Features computed at 03-15 10:01 or later are not used. On the ML Associate exam, questions asking what Point-in-Time Join prevents (data leakage) are frequent.
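
The as-of logic behind a Point-in-Time Join can be sketched with a binary search over each key's feature history. This is a simplified illustration in plain Python: as_of_lookup and the sample history are hypothetical, and treating a feature row stamped exactly at the label time as available is an assumption of this sketch rather than a statement about the library's boundary behavior.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(history, key, as_of):
    """Return the latest feature row for key with timestamp <= as_of, else None.

    history maps key -> list of (timestamp, features), sorted by timestamp.
    """
    rows = history.get(key, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)  # count of rows at or before as_of
    return rows[idx - 1][1] if idx > 0 else None


history = {
    "C001": [
        (datetime(2026, 3, 14, 0, 0), {"total_purchases": 3}),
        (datetime(2026, 3, 15, 0, 0), {"total_purchases": 4}),
        (datetime(2026, 3, 16, 0, 0), {"total_purchases": 9}),  # after the label
    ]
}
label_time = datetime(2026, 3, 15, 10, 0)

point_in_time = as_of_lookup(history, "C001", label_time)
naive_latest = history["C001"][-1][1]  # what a join on customer_id alone might pick

print(point_in_time)
print(naive_latest)
```

The naive join picks the 2026-03-16 value, which did not exist when the label was observed; the as-of lookup picks the 2026-03-15 value instead.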

Integrating Models with the Feature Store

When you log a model with fe.log_model(), the reference to the Feature Table (lineage) is automatically recorded in MLflow. At inference time the model automatically fetches features from the Feature Table, so callers only need to pass the primary keys.

import mlflow

# Train and log a model with the Feature Store
with mlflow.start_run():
    model = train_model(training_df)

    fe.log_model(
        model=model,
        artifact_path="model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="catalog.schema.churn_model"
    )

# Inference: pass a DataFrame with only primary keys; features are auto-fetched from the Feature Table
predictions = fe.score_batch(
    model_uri="models:/catalog.schema.churn_model/1",
    df=inference_df  # customer_id column is sufficient
)

The difference between fe.log_model() and mlflow.sklearn.log_model() is a frequent exam topic. Only fe.log_model() records the Feature Store linkage on the model, which is what enables automatic feature retrieval at inference time.
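
One way to picture what the recorded linkage buys you is a model object that carries its own lookup metadata, so callers supply keys only. The toy class below is purely illustrative: PackagedModel, the rule-based predict function, and the in-memory feature table are hypothetical and do not reflect how Databricks or MLflow store this information.

```python
class PackagedModel:
    """A model bundled with the feature lookup it was trained on (toy example)."""

    def __init__(self, predict_fn, feature_table, lookup_key, feature_names):
        self.predict_fn = predict_fn
        self.feature_table = feature_table
        self.lookup_key = lookup_key
        self.feature_names = feature_names

    def score(self, key_rows):
        """Fetch features for each key, then predict; callers pass keys only."""
        results = []
        for row in key_rows:
            features = self.feature_table[row[self.lookup_key]]
            inputs = [features[name] for name in self.feature_names]
            results.append(self.predict_fn(inputs))
        return results


# Hypothetical feature table and a trivial rule standing in for a trained model.
feature_table = {
    "C001": {"total_purchases": 12, "days_since_last_visit": 45},
    "C002": {"total_purchases": 30, "days_since_last_visit": 2},
}


def predict_churn(inputs):
    total_purchases, days_since_last_visit = inputs
    return 1 if days_since_last_visit > 30 else 0


packaged = PackagedModel(
    predict_churn,
    feature_table,
    lookup_key="customer_id",
    feature_names=["total_purchases", "days_since_last_visit"],
)
predictions = packaged.score([{"customer_id": "C001"}, {"customer_id": "C002"}])
print(predictions)
```

A model logged without this metadata would instead require the caller to assemble the full feature vector, which is where training-time and serving-time logic tends to drift apart.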

Online Feature Serving

Real-time inference (Model Serving) requires low-latency feature retrieval. Databricks Online Feature Serving makes Feature Table data directly available to real-time inference endpoints.

Aspect | Offline Store (Delta Table) | Online Store
Use case | Batch training and analytics | Real-time inference
Latency | Seconds to minutes | Milliseconds
Access pattern | Large row scans | Point lookups by primary key
Data sync | - | Auto-synced from the offline store

from databricks.feature_engineering.online_store_spec import (
    AmazonDynamoDBSpec
)

# Configure the online store (DynamoDB example)
online_store_spec = AmazonDynamoDBSpec(
    region="ap-northeast-1",
    table_name="customer_features_online"
)

# Publish the Feature Table to the online store
fe.publish_table(
    name="catalog.schema.customer_features",
    online_store=online_store_spec,
    mode="merge"
)
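
The access-pattern difference in the table above can be sketched as building a key-value index from offline rows. This plain-Python example is a simplification under stated assumptions: build_online_index is a hypothetical helper, and keeping only the most recent row per key is an assumption of the sketch, since what an actual online store retains depends on how the table is published.

```python
def build_online_index(offline_rows, key, timestamp):
    """Keep only the most recent row per key so lookups are single dict reads."""
    index = {}
    for row in offline_rows:
        current = index.get(row[key])
        if current is None or row[timestamp] > current[timestamp]:
            index[row[key]] = row
    return index


# Hypothetical offline rows; ISO-formatted date strings compare chronologically.
offline_rows = [
    {"customer_id": "C001", "timestamp": "2026-03-14", "total_purchases": 3},
    {"customer_id": "C001", "timestamp": "2026-03-15", "total_purchases": 4},
    {"customer_id": "C002", "timestamp": "2026-03-15", "total_purchases": 1},
]

online_index = build_online_index(offline_rows, "customer_id", "timestamp")

# Point lookup by primary key, as a serving endpoint would do.
print(online_index["C001"]["total_purchases"])
```

The offline rows keep the full history for training, while the index answers "what is the current value for this key" without scanning.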

Legacy Workspace vs. Unity Catalog Version

Aspect | Workspace Feature Store (legacy) | Feature Engineering in UC (current)
Scope | Limited to a single workspace | Entire Unity Catalog metastore
Cross-workspace sharing | Requires replication | Shareable natively
Access control | Workspace ACL | Unity Catalog ACL (GRANT/REVOKE)
Lineage | Limited | Auto-tracked by Unity Catalog Lineage
Client API | FeatureStoreClient | FeatureEngineeringClient
Databricks recommendation | Deprecated (migration recommended) | Recommended

Exam questions often ask which to use between FeatureStoreClient and FeatureEngineeringClient. For any new project, always choose the Unity Catalog version: FeatureEngineeringClient.

ML Associate Exam Focus Points

  • fe.create_table(): role and configuration of primary_keys and timestamp_keys
  • FeatureLookup: how to specify lookup_key and configure lookups across multiple tables
  • Point-in-Time Join: definition of data leakage and how timestamp_lookup_key prevents it
  • fe.log_model() vs mlflow.sklearn.log_model(): difference in whether Feature Store linkage is recorded
  • write_table modes: when to use merge vs. overwrite
  • Training-Serving Skew: how a Feature Store solves this problem

Sample Question

Feature Store / ML Associate

Question 1

A data scientist is building a churn prediction model that uses a customer's total spend over the past 30 days (daily_total_spend) as a feature. The feature is updated by a daily batch, and the label data contains the churn event timestamp for each customer. Which approach is the most appropriate for building the training dataset?

  A. Create the Feature Table without timestamp_keys and join using only the lookup_key in FeatureLookup
  B. Create the Feature Table with timestamp_keys and perform a Point-in-Time Join by specifying timestamp_lookup_key in FeatureLookup
  C. Skip FeatureLookup and join the Feature Table with the label DataFrame directly using a Spark JOIN clause
  D. Merge the label DataFrame into the Feature Table with fe.write_table() and use it as a single table for training

Correct answer: B

When joining daily-updated time-series features with label data that contains churn event timestamps, Point-in-Time Join is required. By setting timestamp_keys on the Feature Table and specifying the label's event timestamp column as timestamp_lookup_key in FeatureLookup, only the features that were actually available at each label row's event time are joined. A normal JOIN without timestamp_keys (A) causes data leakage because future daily_total_spend values are used. A Spark JOIN (C) loses Feature Store lineage management and automatic feature retrieval at inference. Merging labels into the Feature Table (D) is an inappropriate Feature Table design.

Frequently Asked Questions

What is the difference between a Feature Store and Feature Engineering?

Feature Engineering refers to the entire process of creating input data (features) for ML models. A Feature Store is a repository for centrally managing, sharing, and reusing those features. In Databricks this is delivered as Feature Engineering in Unity Catalog — an integrated capability that covers both creating features (Engineering) and managing them (Store).

What is the difference between the legacy Workspace Feature Store and Unity Catalog Feature Engineering?

The legacy Workspace Feature Store was scoped to a single workspace, so sharing features across workspaces required replication. With Unity Catalog Feature Engineering, Feature Tables are managed as Unity Catalog tables, giving you catalog- and schema-level access control, lineage tracking, and cross-workspace sharing natively. Databricks recommends the UC version for all new projects.

Where do Feature Store topics appear in Databricks exams?

On the ML Associate exam, roughly 10-15% of the questions are Feature Store related, focusing on Feature Table creation, FeatureLookup, and Point-in-Time Join fundamentals. The ML Professional exam covers more advanced design patterns: online/offline Feature Serving architecture, real-time feature pipelines, and strategies for preventing Training-Serving Skew.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
