Databricks ML Associate: MLflow, AutoML, Feature Store Prep (2026)

Databricks Certified Machine Learning Associate (ML Associate) validates practical skills in MLflow experiment tracking, Spark ML pipeline construction, and AutoML-based modeling. Alongside Data Engineer Associate (DEA), it is one of the most popular Databricks certifications and the standard credential for kicking off a career as an ML engineer.

Exam Overview

Item	Details
Official Name	Databricks Certified Machine Learning Associate
Questions	45 (multiple choice)
Duration	90 minutes
Passing Score	70% (32 of 45 questions)
Fee	$200 (excluding tax)
Language	English and Japanese
Prerequisites	None (6+ months of hands-on experience recommended)
Validity	2 years
Delivery	Online-proctored (take from home)
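
The pass mark and the average time per question follow directly from the figures in the table. A minimal sketch of that arithmetic (the constant and variable names are illustrative only):

```python
import math

TOTAL_QUESTIONS = 45
DURATION_MINUTES = 90
PASSING_PERCENT = 70

# 70% of 45 is 31.5, so a pass needs the next whole question.
questions_to_pass = math.ceil(TOTAL_QUESTIONS * PASSING_PERCENT / 100)

# Average time available per question if none are skipped.
minutes_per_question = DURATION_MINUTES / TOTAL_QUESTIONS

print(questions_to_pass, minutes_per_question)
```

In practice, flagged questions eat into review time, so aiming for somewhat less than the average per question on the first pass leaves a buffer.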

The Japanese version is available, but translation quality varies. For questions that include code snippets, we recommend toggling the English original to confirm intent. The remaining time is always shown at the top of the screen, and you can flag questions to review later.

The 4 Exam Domains and Their Weights

Domain	Weight	Key Topics
Databricks ML	29% (~13 questions)	AutoML, Feature Store, MLflow Tracking
ML Workflows	29% (~13 questions)	Experiment tracking, Model Registry, deployment
Spark ML	22% (~10 questions)	Pipeline, Transformer, Estimator
Scaling ML Models	20% (~9 questions)	Distributed training, Pandas UDF, distributed inference

Databricks ML and ML Workflows together account for 58% of the exam (about 26 questions). Both domains are heavy on MLflow, so a solid grasp of MLflow Tracking, Model Registry, and Autologging pays off across most of those questions. Spark ML adds another 22%. The final 20%, Scaling ML Models, requires distribution-specific knowledge (Pandas UDF, spark-tensorflow-distributor, etc.), so learn it on top of a solid Spark foundation.
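
The approximate question counts in the domain table come from multiplying each weight by the 45 questions on the exam. A short sketch of that calculation, assuming simple rounding to the nearest whole question (the actual per-domain counts on a given exam form may differ):

```python
TOTAL_QUESTIONS = 45

# Domain weights in percent, copied from the domain table.
domain_weights = {
    "Databricks ML": 29,
    "ML Workflows": 29,
    "Spark ML": 22,
    "Scaling ML Models": 20,
}

# Approximate question count per domain, rounded to the nearest whole question.
approx_questions = {
    domain: round(TOTAL_QUESTIONS * weight / 100)
    for domain, weight in domain_weights.items()
}

for domain, count in approx_questions.items():
    print(f"{domain}: ~{count} questions")
```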

Domain 1: Databricks ML (29%)

AutoML

Databricks AutoML is an automated machine learning feature that supports three tasks: Classification, Regression, and Forecasting. You can run it from the UI or via API. Under the hood, it automates data preprocessing, feature engineering, hyperparameter tuning, and model selection.

# Run AutoML via the API
from databricks import automl

summary = automl.classify(
    dataset=train_df,           # Spark DataFrame or Pandas DataFrame
    target_col="churn",         # Target column name
    primary_metric="f1",        # Metric to optimize
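    timeout_minutes=30          # Maximum runtime
    # max_trials was accepted only by older ML runtimes (not supported on
    # Databricks Runtime 11.0 ML and above), so it is omitted here
)

# Retrieve the best trial, its model artifact, and its generated notebook
print(summary.best_trial)                  # Best trial (TrialInfo object)
print(summary.best_trial.model_path)       # Path to the model artifact
print(summary.best_trial.notebook_path)    # Path to the generated notebook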

What the exam tests is the meaning of the notebook AutoML generates as output. AutoML automatically produces an editable notebook with the source code for each trial, so data scientists can manually tweak preprocessing logic or hyperparameters. In other words, AutoML is designed as a "starting point" providing a baseline model, not as a "black box."

Feature Store

Databricks Feature Store is a centralized repository for managing ML features. Feature tables are registered as Delta tables in Unity Catalog, so training and serving share the same feature definitions.

Feature tables are stored on Delta Lake; Time Travel lets you reproduce point-in-time features.
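Training: FeatureStoreClient.create_training_set() joins the looked-up features to the label DataFrame (in Unity Catalog workspaces the equivalent client is FeatureEngineeringClient).
Inference: FeatureStoreClient.score_batch() automatically looks up the latest features for each key before scoring.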
Publishing to an online store (Cosmos DB, DynamoDB, etc.) enables real-time inference.

MLflow Tracking

In the Databricks ML domain, the exam tests basic MLflow Tracking operations (creating an Experiment, recording a Run, saving parameters/metrics/artifacts). On Databricks ML runtimes, autologging is enabled by default in interactive notebooks, which has the same effect as calling mlflow.autolog(): parameters, metrics, and models are logged automatically for major frameworks such as scikit-learn, XGBoost, LightGBM, PyTorch, and TensorFlow.

Domain 2: ML Workflows (29%)

Experiment Tracking Patterns

The ML Workflows domain tests end-to-end workflows centered on MLflow experiment tracking. Rather than just API calls, the exam asks "why you use a given API" — judgment is what's being tested.

import lightgbm as lgb
import mlflow
from mlflow.models import infer_signature

# Assumes train_data and val_data (lgb.Dataset objects) and X_train already exist
params = {"objective": "binary", "num_leaves": 31, "learning_rate": 0.05}

mlflow.set_experiment("/Experiments/fraud_detection")

with mlflow.start_run(run_name="lgbm_baseline") as run:
    # Log parameters: log_params takes a dict, log_param a single key/value
    mlflow.log_params(params)
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("data_version", "delta_v3")

    # Training
    model = lgb.train(params, train_data, valid_sets=[val_data])

    # Log metrics (placeholder values; compute them on validation data in practice)
    mlflow.log_metric("auc", 0.934)
    mlflow.log_metric("precision", 0.891)
    mlflow.log_metric("recall", 0.867)

    # Infer signature and log the model
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.lightgbm.log_model(model, "model", signature=signature)

    # Save supporting artifacts (the files must already exist on local disk)
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifact("confusion_matrix.html")

The exam tests the difference between log_param (single string/number), log_params (dict, bulk), log_metric (single number, with optional step), and log_metrics (dict, bulk). Another frequent point: log_artifact takes a file path, while log_model takes a framework-specific model object.
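
To make the calling shapes concrete without needing a tracking server, here is a plain-Python stand-in. ToyRun is a hypothetical class written for this illustration and is not part of MLflow; it only mimics how the four logging calls differ (one key versus a dict, and a per-step history for metrics).

```python
class ToyRun:
    """Plain-Python stand-in for a tracked run (NOT the MLflow API itself)."""

    def __init__(self):
        self.params = {}    # key -> single value, stored as a string
        self.metrics = {}   # key -> list of (step, value) pairs

    def log_param(self, key, value):
        # One key, one value.
        self.params[key] = str(value)

    def log_params(self, params):
        # Bulk version: a dict of key -> value.
        for key, value in params.items():
            self.log_param(key, value)

    def log_metric(self, key, value, step=0):
        # Metrics keep a history, so the same key can be logged once per step.
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_metrics(self, metrics, step=0):
        # Bulk version: every metric in the dict shares the same step.
        for key, value in metrics.items():
            self.log_metric(key, value, step=step)


run = ToyRun()
run.log_param("model_type", "LightGBM")
run.log_params({"num_leaves": 31, "learning_rate": 0.05})
for epoch, auc in enumerate([0.90, 0.92, 0.93]):
    run.log_metric("auc", auc, step=epoch)
run.log_metrics({"precision": 0.891, "recall": 0.867})

print(run.params)
print(run.metrics["auc"])
```

The point to carry into the exam is the shape of each call: parameters are single values per key, while a metric logged with step builds up a series that can be charted per epoch.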

Model Registry

The Model Registry handles model version and lifecycle management. The exam covers both the legacy Workspace Model Registry stage transitions and Unity Catalog Model Registry aliases.

Aspect	Legacy Workspace Registry	Unity Catalog Registry
Stage Management	None → Staging → Production → Archived	Aliases (e.g., champion / challenger; user-defined)
Scope	Single workspace	All workspaces attached to the same Unity Catalog metastore
Permission Model	Workspace-level ACLs	Unity Catalog 3-level permissions (catalog.schema.model)
Lineage	Limited	Automatic tracking from table → model → endpoint

# Register to the Unity Catalog Model Registry
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="ml_prod.fraud.lgbm_model"  # 3-level name: catalog.schema.model
)

# Set an alias
from mlflow import MlflowClient
client = MlflowClient()
client.set_registered_model_alias(
    name="ml_prod.fraud.lgbm_model",
    alias="champion",
    version=5
)
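
Once a version or alias exists, consumers refer to the model through a models:/ URI. The sketch below builds the two common forms; model_uri is a hypothetical helper named for this example, while the URI shapes themselves (name@alias and name/version) are the ones MLflow resolves.

```python
def model_uri(name, *, alias=None, version=None):
    """Build a models:/ URI for a registered model.

    Exactly one of alias or version must be given.
    """
    if (alias is None) == (version is None):
        raise ValueError("pass exactly one of alias= or version=")
    if alias is not None:
        # Alias form: follows whichever version the alias currently points to.
        return f"models:/{name}@{alias}"
    # Version form: pinned to one immutable model version.
    return f"models:/{name}/{version}"


champion_uri = model_uri("ml_prod.fraud.lgbm_model", alias="champion")
pinned_uri = model_uri("ml_prod.fraud.lgbm_model", version=5)

print(champion_uri)
print(pinned_uri)
```

Loading by alias means that promoting a new version only requires moving the alias; code that loads the champion URI does not change.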

Deployment Patterns

Real-time inference: create an endpoint with Model Serving and send inference requests over the REST API
Batch inference: mlflow.pyfunc.spark_udf() applied to a Spark DataFrame
Streaming inference: Structured Streaming + spark_udf() applied to real-time data
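
For the real-time pattern, a request to a serving endpoint is a JSON body posted to the endpoint's invocations URL. The sketch below only builds the URL and payload; the workspace host and endpoint name are made-up placeholders, and the feature column names are borrowed from this article's fraud example.

```python
import json

# Hypothetical workspace host and endpoint name, for illustration only.
WORKSPACE_HOST = "https://my-workspace.example.com"
ENDPOINT_NAME = "fraud-lgbm"

invocation_url = f"{WORKSPACE_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations"

# One JSON object per row to score, keyed by feature column name.
rows = [
    {"amount": 129.99, "merchant_category": "electronics", "hour_of_day": 23},
    {"amount": 4.50, "merchant_category": "coffee", "hour_of_day": 8},
]
payload = json.dumps({"dataframe_records": rows})

# An HTTP client would POST `payload` to `invocation_url` with a bearer token
# in the Authorization header; the network call itself is omitted here.
print(invocation_url)
print(payload)
```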

# Typical batch inference pattern
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/ml_prod.fraud.lgbm_model@champion"
)
predictions = (spark.table("silver.transactions")
    .withColumn("fraud_score", predict_udf("amount", "merchant_category", "hour_of_day"))
)

Domain 3: Spark ML (22%)

The focus is the Pipeline API in Spark MLlib (the pyspark.ml package). Distinguishing Transformer (transformation: takes data, returns data) from Estimator (estimation: takes data, returns a Model = Transformer) is the single most fundamental and most-tested point on the exam.

Concept	Role	Examples
Transformer	DataFrame → DataFrame (adds/transforms columns)	VectorAssembler, StringIndexerModel (a fitted StringIndexer), Tokenizer
Estimator	DataFrame → produces a Model (Transformer)	LogisticRegression, RandomForestClassifier, StringIndexer (unfitted)
Pipeline	Sequential chain of Transformers/Estimators	Bundles preprocessing → feature transform → training into one object
CrossValidator	Hyperparameter tuning via cross-validation	Grid search using ParamGridBuilder
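
The Estimator/Transformer contract in the table can be sketched in plain Python. The two Toy classes below are hypothetical stand-ins written for this illustration, not pyspark classes; they operate on lists of dicts instead of DataFrames, and index labels by descending frequency in the spirit of StringIndexer's default ordering.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Transformer: holds a fitted label -> index mapping and applies it."""

    def __init__(self, input_col, output_col, label_to_index):
        self.input_col = input_col
        self.output_col = output_col
        self.label_to_index = label_to_index

    def transform(self, rows):
        # Return new rows with the index column added; inputs are not mutated.
        return [
            {**row, self.output_col: self.label_to_index[row[self.input_col]]}
            for row in rows
        ]


class ToyStringIndexer:
    """Estimator: learns the mapping from data and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # In this toy, the most frequent label gets index 0.0 and ties are
        # broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return ToyStringIndexerModel(self.input_col, self.output_col, mapping)


train_rows = [{"category": "food"}, {"category": "travel"}, {"category": "food"}]
indexer = ToyStringIndexer(input_col="category", output_col="category_idx")
model = indexer.fit(train_rows)        # Estimator -> Model
indexed = model.transform(train_rows)  # Model (a Transformer) -> new rows

print(indexed)
```

Note that the unfitted indexer has no transform method at all: it has to see data through fit before it can transform anything, which is exactly the distinction exam questions probe.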

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Stage 1: Encode categorical variables
indexer = StringIndexer(inputCol="category", outputCol="category_idx")

# Stage 2: Build the feature vector
assembler = VectorAssembler(
    inputCols=["amount", "category_idx", "hour_of_day"],
    outputCol="features"
)

# Stage 3: Scaling
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Stage 4: Logistic regression
lr = LogisticRegression(featuresCol="scaled_features", labelCol="label")

# Build the Pipeline
pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])

# Hyperparameter tuning with CrossValidator
param_grid = (ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build())

evaluator = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC"
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4   # Parallelism level (tune for available cluster resources)
)

cv_model = cv.fit(train_df)
best_model = cv_model.bestModel

The parallelism parameter of CrossValidator is a popular exam target. The default is 1 (sequential), but with spare cluster resources you can shorten tuning time by increasing parallelism. Be aware, however, that each fold's training runs as a distributed Spark job — pushing parallelism too high can cause resource contention.
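
To see why parallelism matters, it helps to count how many model fits the example above triggers. The sketch below is plain arithmetic using the grid values and fold count from the code; the "waves" figure is only a lower bound on sequential rounds per fold, assuming up to parallelism candidates are fitted concurrently.

```python
import math
from itertools import product

reg_params = [0.01, 0.1, 1.0]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3
parallelism = 4

# Every combination of the two grids is one candidate parameter map.
param_maps = list(product(reg_params, elastic_net_params))

# Each candidate is fitted once per fold, then the winner is refit on the
# full training data to produce bestModel.
fits_during_search = len(param_maps) * num_folds
total_fits = fits_during_search + 1

# With `parallelism` candidates fitted concurrently, each fold needs at least
# this many sequential waves of fits.
waves_per_fold = math.ceil(len(param_maps) / parallelism)

print(len(param_maps), fits_during_search, total_fits, waves_per_fold)
```

Because the estimator here is the whole Pipeline, each of those fits reruns every stage, which is why grid size and fold count drive tuning time so strongly.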

Domain 4: Scaling ML Models (20%)

Pandas UDF (Vectorized UDF)

Pandas UDFs distribute single-node Pandas logic across a Spark cluster. Apache Arrow transfers the data in columnar batches, which Databricks reports can make them up to 100x faster than row-at-a-time Python UDFs, depending on the workload.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
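def predict_batch(features: pd.Series) -> pd.Series:
    """Run Pandas-based inference on each Arrow batch of rows."""
    # load_model() is a placeholder: load from a broadcast variable or MLflow,
    # and cache the result so the model is not reloaded for every batch
    model = load_model()
    return pd.Series(model.predict(features.tolist()))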

# Runs in parallel across each partition of the Spark DataFrame
predictions = df.withColumn("prediction", predict_batch("features"))

Distributed Training

For deep learning on large datasets, use the distributed training frameworks Databricks provides.

spark-tensorflow-distributor: Distribute TensorFlow/Keras model training across multiple nodes. Launch with MirroredStrategyRunner(num_slots=8, local_mode=False).run(train_fn).
torch.distributed (PyTorch): Use TorchDistributor. TorchDistributor(num_processes=4, local_mode=False).run(train_fn)
Horovod: Use HorovodRunner for data-parallel distributed training of TensorFlow/PyTorch models.

The exam asks which distributed framework matches which ML library, and what distinguishes local_mode=True from local_mode=False. local_mode=True runs the training processes on the driver node only (for debugging), while local_mode=False runs production distributed training across the worker nodes.

Pandas API on Spark

pyspark.pandas (formerly Koalas) lets you operate on Spark DataFrames with the Pandas API. Expect questions about scenarios where single-node Pandas code is migrated to distributed execution with minimal changes.

import pyspark.pandas as ps

# Operate via the Pandas API (internally distributed as a Spark DataFrame)
psdf = ps.read_delta("dbfs:/mnt/gold/features")
psdf["feature_ratio"] = psdf["feature_a"] / psdf["feature_b"]
result = psdf.groupby("segment").mean()

MLflow High-Frequency Patterns

Roughly 40% of the ML Associate exam touches MLflow, and memorizing the following API patterns accurately is the key to passing.

API	Arguments	Purpose
`mlflow.start_run()`	run_name, nested	Start a Run (use as a context manager)
`mlflow.log_param()`	key, value	Log a single hyperparameter
`mlflow.log_metric()`	key, value, step	Log a single metric (use step to track epochs)
`mlflow.log_artifact()`	local_path	Save a local file as an artifact
`mlflow.sklearn.log_model()`	model, artifact_path	Save a scikit-learn model (flavor-specific)
`mlflow.autolog()`	None required (optional keyword flags such as log_models, disable)	Automatically log parameters/metrics/models for supported frameworks
`mlflow.register_model()`	model_uri, name	Register to the Model Registry
`mlflow.pyfunc.spark_udf()`	spark, model_uri	Use as a Spark UDF for batch inference

AutoML Question Patterns

Theme	What is Tested
Supported Tasks	Three tasks: Classification, Regression, and Forecasting (time series). Clustering is not supported.
How to Run	UI (launched from the Experiments screen) and API (`automl.classify()`, etc.) — two options
Outputs	MLflow Experiment, a Run for each Trial, an editable source notebook, and the best model
Data Preprocessing	Missing-value imputation, one-hot encoding, and feature selection are performed automatically
Customization	Edit the generated notebook to tune preprocessing or the model and rerun as your own Run

Overlap and Differences vs. DEA (Data Engineer Associate)

Topic	DEA	ML Associate
Delta Lake basics	Tested in depth (MERGE, CDF, constraints)	Basic level, in the feature-table context
Spark DataFrame	Centered on ETL operations (filter, join, aggregate)	Feature engineering operations for ML
Unity Catalog	Table and view permission management	Model and Feature Store permission management
MLflow	Basic concepts only (low frequency)	Tracking/Registry/Serving tested in detail
Spark ML	Not tested	Pipeline/CrossValidator tested in detail
AutoML	Not tested	How to run, outputs, and customization are all tested
DLT (Delta Live Tables)	Tested in detail	Not tested
Workflows / Jobs	Tested in detail	Lightly tested in the ML pipeline orchestration context

Candidates who already passed DEA can carry over their Delta Lake, Spark DataFrame, and Unity Catalog fundamentals. On top of that, you need MLflow (about 40%), Spark ML (about 22%), AutoML and Feature Store (about 10%), and distributed training (about 15%). DEA knowledge transfers for roughly 13-15% of the exam, so even DEA-certified candidates should budget about 2 months of additional study.
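
The transfer estimate above is the remainder after adding up the ML-specific shares. A quick check of that arithmetic, using the approximate percentages quoted in the paragraph:

```python
# Approximate exam share of each ML-specific topic, in percent (from the text).
new_topic_share = {
    "MLflow": 40,
    "Spark ML": 22,
    "AutoML and Feature Store": 10,
    "Distributed training": 15,
}

# Share of the exam that is new material for a DEA-certified candidate.
new_material = sum(new_topic_share.values())

# Whatever is left is the part where DEA knowledge carries over.
transferable_from_dea = 100 - new_material

print(new_material, transferable_from_dea)
```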

Study Roadmap (2-3 Months)

Phase 1: Build the Foundation (Weeks 1-3)

Complete the official MLflow tutorial (free course on Databricks Academy).
On Community Edition, run the full flow: create an Experiment → record a Run → register to the Model Registry.
Review Transformer/Estimator/Evaluator concepts of Spark ML Pipeline in the official documentation.
Shore up basic operations in scikit-learn, Pandas, and NumPy if you are not confident.

Phase 2: Hands-on Practice (Weeks 4-7)

Run AutoML from both UI and API and learn the structure of the generated notebook.
Implement hyperparameter tuning with CrossValidator + ParamGridBuilder.
Register a feature table to the Feature Store and build training data with create_training_set().
Build a batch inference pipeline using Pandas UDFs.
Practice the end-to-end flow: set an alias in Unity Catalog Model Registry → deploy to Model Serving.

Phase 3: Exam Prep (Weeks 8-12)

Complete the official Databricks Practice Exam at least twice and identify your weak domains.
Memorize the MLflow API table (log_param/log_metric/log_artifact/log_model).
Sort out the difference between legacy Model Registry stage transitions and Unity Catalog aliases.
Review when to use each distributed training framework (spark-tensorflow-distributor's MirroredStrategyRunner / TorchDistributor / HorovodRunner).
Practice pacing 45 questions within 90 minutes on a full mock exam.

Try a Sample Question

Spark ML

Question 1

An ML engineer wants to build a Spark ML pipeline and perform hyperparameter tuning. Which combination correctly fills in the blanks in the code below?

pipeline = Pipeline(stages=[indexer, assembler, lr])
param_grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol='label')
cv = CrossValidator(estimator=___A___, estimatorParamMaps=___B___, evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)
best = cv_model.___C___

Option A: A = pipeline, B = param_grid, C = bestModel
Option B: A = lr, B = param_grid, C = bestModel
Option C: A = pipeline, B = param_grid, C = best_params
Option D: A = pipeline, B = [0.01, 0.1], C = bestModel

Correct answer: Option A

Pass the entire Pipeline as CrossValidator's estimator argument. Passing only lr skips the preprocessing stages (indexer, assembler), so lr never receives the assembled feature vector and fitting fails. estimatorParamMaps takes the list of parameter maps built by ParamGridBuilder, not a raw list of values. Retrieve the best model via the bestModel attribute (camelCase, even in Python). best_params resembles scikit-learn's best_params_ attribute and does not exist on a Spark ML CrossValidatorModel.

Frequently Asked Questions

Can I study for ML Associate and Data Engineer Associate at the same time?

About 30% of the scope (Delta Lake basics, Spark DataFrame operations, Unity Catalog permissions) overlaps, so passing DEA first and then moving on to ML Associate is the efficient route. However, the ML Associate-specific topics — MLflow Tracking, Model Registry, AutoML, and Spark ML Pipelines — must be studied separately, and they account for about 70% of the score. Focusing on one exam at a time (DEA → ML Associate) tends to produce a higher pass rate than studying both in parallel.

How much Python code appears on the ML Associate exam?

Roughly 15-20 of the 45 questions include Python code snippets. Frequent topics include the mlflow.start_run() context manager, choosing between mlflow.log_param/metric/artifact, building Spark ML Pipelines with Transformers and Estimators, and configuring CrossValidator. Code typically appears as fill-in-the-blank or spot-the-error — you don't need to actually write code, but you must memorize API argument names and return types precisely.

What is the difference between ML Professional and ML Associate?

Associate centers on individual workflows: logging experiments with MLflow, building pipelines with Spark ML, and rapidly producing baseline models with AutoML. Professional asks about team-level operational design: production ML pipeline architecture, distributed training tuning, A/B testing, model drift monitoring, and online/offline consistency in the Feature Store. Passing Associate is not a prerequisite, but Associate-level knowledge is assumed.



Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
