A Feature Store is a repository for centrally managing, sharing, and reusing the features that feed ML models. In Databricks it is delivered as "Feature Engineering in Unity Catalog", giving you a feature management platform integrated with Unity Catalog governance. Feature Store topics make up roughly 10-15% of the ML Associate exam, and questions on Feature Table creation, FeatureLookup, and Point-in-Time Join show up frequently.
In ML projects, building features from data typically consumes the majority of development time. Without a Feature Store, the following problems frequently occur and degrade both model quality and development efficiency.
| Problem | Without Feature Store | With Feature Store |
|---|---|---|
| Duplicate feature creation | Different teams compute the same features independently | Features created once are shared and reused across the org |
| Training-Serving Skew | Different logic is used at training time vs. inference time | Features are fetched from the same Feature Table, guaranteeing consistency |
| Data leakage | Risk of future data leaking in during time-series joins | Automatically prevented by Point-in-Time Join |
| Feature discoverability | Hard to know which features are available | Searchable and metadata-managed via Unity Catalog |
| Governance | Access control and audit logs are scattered | Unity Catalog ACL and lineage tracking apply uniformly |
A Feature Table is a Delta Table with a primary key. It is managed in Unity Catalog, with the same access control and lineage tracking that apply to regular tables.
from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
# Create a Feature Table (seed initial data from a DataFrame)
fe.create_table(
name="catalog.schema.customer_features",
primary_keys=["customer_id"],
timestamp_keys=["timestamp"], # For Point-in-Time Lookup
df=customer_features_df,
description="Customer behavior features (for churn prediction)"
)
# Read the features back as a DataFrame
features_df = fe.read_table(name="catalog.schema.customer_features")
primary_keys is a required parameter that uniquely identifies each row. timestamp_keys is an optional parameter used for Point-in-Time Lookup; specify it when you are working with time-series features.
Use fe.write_table() to append or update features in an existing Feature Table. The mode parameter controls the write behavior.
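To make the difference between the two modes concrete, here is a rough sketch that uses plain Python dictionaries as a stand-in for the Delta table. It only mimics the semantics described above and is not how Databricks implements the write; the helper `apply_write` and the sample rows are hypothetical.

```python
def apply_write(table, new_rows, key, mode="merge"):
    """Return the table contents after a write, keyed by `key`.

    Hypothetical helper: it imitates upsert vs. replace semantics only.
    """
    if mode == "overwrite":
        # Discard everything that was in the table before.
        result = {}
    elif mode == "merge":
        # Start from a copy of the existing rows.
        result = {k: dict(v) for k, v in table.items()}
    else:
        raise ValueError(f"unsupported mode: {mode}")
    for row in new_rows:
        # Matching keys are updated, unseen keys are inserted.
        result[row[key]] = dict(row)
    return result


existing = {
    "C001": {"customer_id": "C001", "total_purchases": 3},
    "C002": {"customer_id": "C002", "total_purchases": 7},
}
incoming = [
    {"customer_id": "C002", "total_purchases": 8},  # existing key: update
    {"customer_id": "C003", "total_purchases": 1},  # new key: insert
]

merged = apply_write(existing, incoming, key="customer_id", mode="merge")
overwritten = apply_write(existing, incoming, key="customer_id", mode="overwrite")
print(sorted(merged))       # ['C001', 'C002', 'C003']
print(sorted(overwritten))  # ['C002', 'C003']
```

With "merge", C001 survives untouched; with "overwrite", only the rows in the incoming DataFrame remain. In the real API the same choice is made through the mode argument, as in the calls that follow.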
# Add or update features in an existing Feature Table
fe.write_table(
name="catalog.schema.customer_features",
df=new_features_df,
mode="merge" # Update rows whose primary key matches; insert new rows
)
# To overwrite all data
fe.write_table(
name="catalog.schema.customer_features",
df=full_features_df,
mode="overwrite"
)
Feature Lookup is the mechanism for assembling a training dataset from Feature Tables. Provide a DataFrame containing your labels along with the Feature Table's primary key, and the join is performed automatically.
from databricks.feature_engineering import FeatureLookup
# Define lookups from multiple Feature Tables
feature_lookups = [
FeatureLookup(
table_name="catalog.schema.customer_features",
feature_names=["total_purchases", "avg_session_time", "days_since_last_visit"],
lookup_key="customer_id"
),
FeatureLookup(
table_name="catalog.schema.product_features",
feature_names=["category", "price_tier"],
lookup_key="product_id"
)
]
# Build the training dataset
training_set = fe.create_training_set(
df=label_df, # DataFrame containing labels and primary keys
feature_lookups=feature_lookups,
label="churn",
exclude_columns=["customer_id", "product_id"] # Lookup keys to drop from model input
)
training_df = training_set.load_df()
print(training_df.columns)
# ['total_purchases', 'avg_session_time', 'days_since_last_visit',
# 'category', 'price_tier', 'churn']
This guarantees that training and inference reference the same Feature Table, preventing Training-Serving Skew (the mismatch between training-time and serving-time features).
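Conceptually, the lookup behaves like a key-based join between the label rows and each Feature Table. The sketch below imitates that idea in plain Python, assuming left-join behaviour (label rows with no matching features get None). The function `lookup_features`, the tuple layout of `lookups`, and the sample data are all hypothetical and are not part of the Databricks API.

```python
def lookup_features(label_rows, lookups, exclude_columns=()):
    """Join each label row with feature rows found by key.

    `lookups` is a list of (feature_table, lookup_key, feature_names)
    tuples, where feature_table maps a key value to a dict of features.
    Hypothetical helper for illustration only.
    """
    training_rows = []
    for label_row in label_rows:
        row = dict(label_row)
        for feature_table, lookup_key, feature_names in lookups:
            features = feature_table.get(label_row[lookup_key], {})
            for name in feature_names:
                # A missing key yields None, like a left join.
                row[name] = features.get(name)
        for column in exclude_columns:
            row.pop(column, None)
        training_rows.append(row)
    return training_rows


customer_features = {"C001": {"total_purchases": 12, "avg_session_time": 4.5}}
product_features = {"P9": {"category": "books", "price_tier": "low"}}
label_rows = [
    {"customer_id": "C001", "product_id": "P9", "churn": 1},
    {"customer_id": "C404", "product_id": "P9", "churn": 0},  # unknown customer
]

training_rows = lookup_features(
    label_rows,
    lookups=[
        (customer_features, "customer_id", ["total_purchases", "avg_session_time"]),
        (product_features, "product_id", ["category", "price_tier"]),
    ],
    exclude_columns=["customer_id", "product_id"],
)
for row in training_rows:
    print(row)
```

The point of the sketch is that the caller supplies only keys and labels; which features are fetched, and from where, is declared once in the lookup definitions.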
Point-in-Time Join is the mechanism that prevents data leakage when working with time-series features. It joins only the features that were actually available at the label timestamp of each training sample, keeping future information out of training.
# Configure Point-in-Time Lookup
feature_lookups = [
FeatureLookup(
table_name="catalog.schema.customer_features",
feature_names=["total_purchases", "avg_session_time"],
lookup_key="customer_id",
timestamp_lookup_key="event_timestamp" # Timestamp column on the label side
)
]
# Label DataFrame (each row needs a timestamp column)
# customer_id | event_timestamp | churn
# C001 | 2026-03-15 10:00:00 | 1
training_set = fe.create_training_set(
df=label_df,
feature_lookups=feature_lookups,
label="churn"
)
In the example above, for a row whose label timestamp is 2026-03-15 10:00, only the most recent feature values recorded in the Feature Table as of 2026-03-15 10:00 are joined. Features computed at 2026-03-15 10:01 or later are not used. On the ML Associate exam, questions asking what Point-in-Time Join prevents (data leakage) are frequent.
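The selection rule can be sketched in a few lines of plain Python. This is an illustration of the as-of idea only, not the Databricks implementation. It assumes that a feature row stamped exactly at the label time counts as available (a "less than or equal" comparison). The function `point_in_time_lookup` and the feature values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_history, key, as_of):
    """Return the latest feature row for `key` with timestamp <= as_of.

    `feature_history` maps a key to a list of (timestamp, features)
    tuples sorted by timestamp. Returns None if no row is old enough.
    """
    history = feature_history.get(key, [])
    timestamps = [ts for ts, _ in history]
    # Index of the first entry strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None
    return history[idx - 1][1]


feature_history = {
    "C001": [
        (datetime(2026, 3, 14, 0, 0), {"total_purchases": 10}),
        (datetime(2026, 3, 15, 9, 0), {"total_purchases": 11}),
        (datetime(2026, 3, 15, 10, 1), {"total_purchases": 12}),  # after the label
    ]
}

label_time = datetime(2026, 3, 15, 10, 0)
features = point_in_time_lookup(feature_history, "C001", label_time)
print(features)  # {'total_purchases': 11}
```

A plain key-only join would have picked up the 10:01 row (or whichever row happens to be latest), which is exactly the leakage that the timestamp lookup is meant to rule out.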
When you log a model with fe.log_model(), the reference to the Feature Table (lineage) is automatically recorded in MLflow. At inference time the model automatically fetches features from the Feature Table, so callers only need to pass the lookup keys (plus the timestamp lookup column when Point-in-Time lookups are used).
import mlflow
# Train and log a model with the Feature Store
with mlflow.start_run():
model = train_model(training_df)
fe.log_model(
model=model,
artifact_path="model",
flavor=mlflow.sklearn,
training_set=training_set,
registered_model_name="catalog.schema.churn_model"
)
# Inference: pass a DataFrame with only primary keys; features are auto-fetched from the Feature Table
predictions = fe.score_batch(
model_uri="models:/catalog.schema.churn_model/1",
df=inference_df # customer_id column is sufficient
)
The difference between fe.log_model() and mlflow.sklearn.log_model() is a frequent exam topic. Only fe.log_model() records the Feature Store linkage on the model, which is what enables automatic feature retrieval at inference time.
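One way to picture what that linkage buys you is a model object that carries its own feature lookup specification. The following is a toy sketch under that assumption, not a description of how MLflow or Databricks package models internally. The class `PackagedModel`, the rule-based `toy_churn_rule`, and the sample features are hypothetical.

```python
class PackagedModel:
    """Hypothetical wrapper: a model bundled with its feature lookup spec."""

    def __init__(self, predict_fn, feature_table, lookup_key, feature_names):
        self.predict_fn = predict_fn
        self.feature_table = feature_table
        self.lookup_key = lookup_key
        self.feature_names = feature_names

    def score_batch(self, rows):
        """Fetch features by key for each row, then predict."""
        predictions = []
        for row in rows:
            features = self.feature_table[row[self.lookup_key]]
            inputs = [features[name] for name in self.feature_names]
            predictions.append(self.predict_fn(inputs))
        return predictions


def toy_churn_rule(inputs):
    """Stand-in for a trained model: flag inactive, low-spend customers."""
    total_purchases, days_since_last_visit = inputs
    return 1 if total_purchases < 5 and days_since_last_visit > 30 else 0


customer_features = {
    "C001": {"total_purchases": 2, "days_since_last_visit": 45},
    "C002": {"total_purchases": 20, "days_since_last_visit": 3},
}

packaged = PackagedModel(
    toy_churn_rule,
    customer_features,
    lookup_key="customer_id",
    feature_names=["total_purchases", "days_since_last_visit"],
)
# Callers pass keys only; the wrapper resolves the features itself.
predictions = packaged.score_batch([{"customer_id": "C001"}, {"customer_id": "C002"}])
print(predictions)  # [1, 0]
```

A model logged without this metadata is the equivalent of calling `toy_churn_rule` directly: the caller would have to rebuild the feature vector, which reintroduces the risk of Training-Serving Skew.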
Real-time inference (Model Serving) requires low-latency feature retrieval. Databricks Online Feature Serving makes Feature Table data directly available to real-time inference endpoints.
| Aspect | Offline Store (Delta Table) | Online Store |
|---|---|---|
| Use case | Batch training and analytics | Real-time inference |
| Latency | Seconds to minutes | Milliseconds |
| Access pattern | Large row scans | Point lookups by primary key |
| Data sync | — | Auto-synced from the offline store |
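The access-pattern row of the table is the key difference. As a rough illustration only, the sketch below contrasts a row scan with a key-value point lookup using in-memory Python structures; it says nothing about real latencies or about how the online store is actually synced. The helpers `scan_lookup` and `publish` and the generated rows are hypothetical.

```python
def scan_lookup(offline_rows, key, value):
    """Offline-style access: scan rows until the key matches."""
    for row in offline_rows:
        if row[key] == value:
            return row
    return None


def publish(offline_rows, key):
    """Build a key-value view of the table, one entry per key.

    Stand-in for syncing to an online store; hypothetical helper.
    """
    return {row[key]: row for row in offline_rows}


offline_rows = [
    {"customer_id": f"C{i:03d}", "total_purchases": i % 10}
    for i in range(1000)
]
online_store = publish(offline_rows, key="customer_id")

# Both paths return the same row; the dict lookup avoids the scan.
from_scan = scan_lookup(offline_rows, "customer_id", "C742")
from_online = online_store["C742"]
print(from_online)  # {'customer_id': 'C742', 'total_purchases': 2}
```

Because the online copy is derived from the offline table, both paths return identical feature values; only the cost of reaching them differs.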
from databricks.feature_engineering.online_store_spec import (
AmazonDynamoDBSpec
)
# Configure the online store (DynamoDB example)
online_store_spec = AmazonDynamoDBSpec(
region="ap-northeast-1",
table_name="customer_features_online"
)
# Publish the Feature Table to the online store
fe.publish_table(
name="catalog.schema.customer_features",
online_store=online_store_spec,
mode="merge"
)
| Aspect | Workspace Feature Store (legacy) | Feature Engineering in UC (current) |
|---|---|---|
| Scope | Limited to a single workspace | Entire Unity Catalog metastore |
| Cross-workspace sharing | Requires replication | Shareable natively |
| Access control | Workspace ACL | Unity Catalog ACL (GRANT/REVOKE) |
| Lineage | Limited | Auto-tracked by Unity Catalog Lineage |
| Client API | FeatureStoreClient | FeatureEngineeringClient |
| Databricks recommendation | Deprecated (migration recommended) | Recommended |
Exam questions often ask which to use between FeatureStoreClient and FeatureEngineeringClient. For any new project, always choose the Unity Catalog version: FeatureEngineeringClient.
Feature Store / ML Associate
Question 1
A data scientist is building a churn prediction model that uses a customer's total spend over the past 30 days (daily_total_spend) as a feature. The feature is updated by a daily batch, and the label data contains the churn event timestamp for each customer. Which approach is the most appropriate for building the training dataset?
Correct answer: B
When joining daily-updated time-series features with label data that contains churn event timestamps, Point-in-Time Join is required. By setting timestamp_keys on the Feature Table and specifying the label's event timestamp column as timestamp_lookup_key in FeatureLookup, only the features that were actually available at each label row's event time are joined. A normal JOIN without timestamp_keys (A) causes data leakage because future daily_total_spend values are used. A Spark JOIN (C) loses Feature Store lineage management and automatic feature retrieval at inference. Merging labels into the Feature Table (D) is an inappropriate Feature Table design.
What is the difference between a Feature Store and Feature Engineering?
Feature Engineering refers to the entire process of creating input data (features) for ML models. A Feature Store is a repository for centrally managing, sharing, and reusing those features. In Databricks this is delivered as Feature Engineering in Unity Catalog, an integrated capability that covers both creating features (Engineering) and managing them (Store).
What is the difference between the legacy Workspace Feature Store and Unity Catalog Feature Engineering?
The legacy Workspace Feature Store was scoped to a single workspace, so sharing features across workspaces required replication. With Unity Catalog Feature Engineering, Feature Tables are managed as Unity Catalog tables, giving you catalog- and schema-level access control, lineage tracking, and cross-workspace sharing natively. Databricks recommends the UC version for all new projects.
Where do Feature Store topics appear in Databricks exams?
On the ML Associate exam, roughly 10-15% of the questions are Feature Store related, focusing on Feature Table creation, FeatureLookup, and Point-in-Time Join fundamentals. The ML Professional exam covers more advanced design patterns: online/offline Feature Serving architecture, real-time feature pipelines, and strategies for preventing Training-Serving Skew.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.