
Databricks ML Associate: Complete Guide to MLflow & AutoML

2026-03-26
Updated: 2026-03-27
NicheeLab Editorial Team

Databricks Certified Machine Learning Associate (ML Associate) validates practical skills in MLflow experiment tracking, Spark ML pipeline construction, and AutoML-based modeling. Alongside Data Engineer Associate (DEA), it is one of the most popular Databricks certifications and the standard credential for kicking off a career as an ML engineer.

Exam Overview

Item | Details
Official Name | Databricks Certified Machine Learning Associate
Questions | 45 (multiple choice)
Duration | 90 minutes
Passing Score | 70% (32 of 45 questions)
Fee | $200 (excluding tax)
Language | English and Japanese
Prerequisites | None (6+ months of hands-on experience recommended)
Validity | 2 years
Delivery | Online-proctored (take from home)

The Japanese version is available, but translation quality varies. For questions that include code snippets, we recommend toggling the English original to confirm intent. The remaining time is always shown at the top of the screen, and you can flag questions to review later.
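The pass mark and the pacing budget follow directly from the figures in the overview table. A quick check, assuming the 70% threshold is applied to the raw question count and rounded up:

```python
import math

QUESTIONS = 45
DURATION_MINUTES = 90
PASSING_RATIO = 0.70

# Smallest whole number of correct answers that reaches the passing ratio.
questions_to_pass = math.ceil(QUESTIONS * PASSING_RATIO)

# Average time budget per question, before reserving any time for review.
minutes_per_question = DURATION_MINUTES / QUESTIONS

print(questions_to_pass)     # 32
print(minutes_per_question)  # 2.0
```

In practice you will want to answer faster than two minutes per question so that flagged questions can be revisited.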

The 4 Exam Domains and Their Weights

Domain | Weight | Key Topics
Databricks ML | 29% (~13 questions) | AutoML, Feature Store, MLflow Tracking
ML Workflows | 29% (~13 questions) | Experiment tracking, Model Registry, deployment
Spark ML | 22% (~10 questions) | Pipeline, Transformer, Estimator
Scaling ML Models | 20% (~9 questions) | Distributed training, Pandas UDF, distributed inference

Databricks ML and ML Workflows together account for 58% of the exam. Both domains are heavy on MLflow, so a solid grasp of MLflow Tracking, Model Registry, and Autologging can cover about 25 questions (roughly 55% of the exam). Scaling ML Models (20%) requires distribution-specific knowledge (Pandas UDF, spark-tensorflow-distributor, etc.), so learn it on top of a solid Spark foundation.
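The approximate question counts per domain can be reproduced from the weights. This is a small sketch that assumes each weight is applied to 45 questions and rounded to the nearest whole question; the real exam may distribute questions slightly differently.

```python
TOTAL_QUESTIONS = 45

# Domain weights in percent, as listed in the domain table.
domain_weights = {
    "Databricks ML": 29,
    "ML Workflows": 29,
    "Spark ML": 22,
    "Scaling ML Models": 20,
}

# Approximate number of questions per domain, rounded to the nearest integer.
question_counts = {
    domain: round(TOTAL_QUESTIONS * weight / 100)
    for domain, weight in domain_weights.items()
}

for domain, count in question_counts.items():
    print(f"{domain}: ~{count} questions")

# The two MLflow-heavy domains combined.
mlflow_heavy = question_counts["Databricks ML"] + question_counts["ML Workflows"]
print(mlflow_heavy)  # 26
```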

Domain 1: Databricks ML (29%)

AutoML

Databricks AutoML is an automated machine learning feature that supports three tasks: Classification, Regression, and Forecasting. You can run it from the UI or via API. Under the hood, it automates data preprocessing, feature engineering, hyperparameter tuning, and model selection.

# Run AutoML via the API
from databricks import automl

summary = automl.classify(
    dataset=train_df,           # Spark DataFrame or Pandas DataFrame
    target_col="churn",         # Target column name
    primary_metric="f1",        # Metric to optimize
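    timeout_minutes=30          # Maximum runtime; AutoML stops starting new trials after this
)
# max_trials is omitted: it is only accepted on older ML runtimes
# (to our knowledge, not supported from Databricks Runtime 11.0 ML onward).

# Retrieve the best trial, its model, Run, and notebook
print(summary.best_trial)                # Best Trial object
print(summary.best_trial.model_path)     # Path to the model artifact
print(summary.best_trial.mlflow_run_id)  # MLflow Run ID of the best trial
print(summary.best_trial.notebook_path)  # Generated notebook for the best trial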

The exam focuses on the purpose of the notebooks AutoML generates. AutoML automatically produces an editable notebook with the source code for each trial, so data scientists can manually tweak preprocessing logic or hyperparameters. In other words, AutoML is designed as a "starting point" that provides a baseline model, not as a "black box."
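Conceptually, AutoML runs many trials and surfaces the one that scores best on the primary metric, together with the notebook that produced it. The sketch below is a toy illustration of that selection step in plain Python, not the Databricks implementation; the Trial class, the notebook paths, the scores, and the pick_best_trial helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical stand-in for one AutoML trial (not the databricks.automl type)."""
    model_description: str
    notebook_path: str
    f1: float


def pick_best_trial(trials: list[Trial]) -> Trial:
    """Return the trial with the highest primary metric (here: F1)."""
    return max(trials, key=lambda t: t.f1)


# Made-up trial results for illustration only.
trials = [
    Trial("LogisticRegression", "/trials/lr_notebook", 0.81),
    Trial("LightGBM", "/trials/lgbm_notebook", 0.88),
    Trial("RandomForest", "/trials/rf_notebook", 0.85),
]

best = pick_best_trial(trials)
print(best.model_description)  # LightGBM
print(best.notebook_path)      # the notebook you would open, edit, and rerun
```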

Feature Store

Databricks Feature Store is a centralized repository for managing ML features. Feature tables are registered as Delta Tables in Unity Catalog, sharing the same feature definitions between training and serving.

  • Feature tables are stored on Delta Lake; Time Travel lets you reproduce point-in-time features.
  • Training: FeatureStoreClient.create_training_set() joins the features.
  • Inference: FeatureStoreClient.score_batch() automatically looks up the latest features.
  • Publishing to an online store (Cosmos DB, DynamoDB, etc.) enables real-time inference.
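The idea behind point-in-time features can be illustrated without any Databricks APIs: a training row observed at time t may only see feature values recorded at or before t. The following is a conceptual sketch in plain Python, not how Feature Store or Delta Time Travel is implemented; the feature history and the feature_as_of helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (timestamp, value), sorted by time.
feature_history = [
    (1, 10.0),   # value valid from t=1
    (5, 12.5),   # value updated at t=5
    (9, 20.0),   # value updated at t=9
]


def feature_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no feature value existed yet
    return history[idx - 1][1]


# Labels observed at different times must only see past feature values.
print(feature_as_of(feature_history, 4))  # 10.0
print(feature_as_of(feature_history, 5))  # 12.5
print(feature_as_of(feature_history, 0))  # None
```

Looking up the latest value instead (20.0) for a label observed at t=4 would leak future information into training, which is the problem point-in-time lookups avoid.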

MLflow Tracking

In the Databricks ML domain, the exam tests basic MLflow Tracking operations (creating an Experiment, recording a Run, saving parameters/metrics/artifacts). On Databricks, mlflow.autolog() is enabled by default, and parameters, metrics, and models are automatically logged for major frameworks like scikit-learn, XGBoost, LightGBM, PyTorch, and TensorFlow.

Domain 2: ML Workflows (29%)

Experiment Tracking Patterns

The ML Workflows domain tests end-to-end workflows centered on MLflow experiment tracking. Rather than API calls alone, the exam asks why you would use a given API; judgment is what is being tested.

import lightgbm as lgb
import mlflow
from mlflow.models import infer_signature

mlflow.set_experiment("/Experiments/fraud_detection")

with mlflow.start_run(run_name="lgbm_baseline") as run:
    # Log parameters
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("num_leaves", 31)
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("data_version", "delta_v3")

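    # Training (params, train_data, val_data, and X_train are assumed to be defined earlier)
    model = lgb.train(params, train_data, valid_sets=[val_data])

    # Log metrics (hard-coded here for illustration; compute them from validation data in practice)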
    mlflow.log_metric("auc", 0.934)
    mlflow.log_metric("precision", 0.891)
    mlflow.log_metric("recall", 0.867)

    # Infer signature and log the model
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.lightgbm.log_model(model, "model", signature=signature)

    # Save supporting artifacts
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifact("confusion_matrix.html")

The exam tests the difference between log_param (single string/number), log_params (dict, bulk), log_metric (single number, with optional step), and log_metrics (dict, bulk). Another frequent point: log_artifact takes a file path, while log_model takes a framework-specific model object.
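One way to remember the distinction is that a parameter is a single value per key, while a metric is a history of values indexed by step. The toy class below models only those semantics in memory; it is not the MLflow API, and ToyRun is a hypothetical name. It treats parameters as write-once, reflecting that MLflow does not let you overwrite a logged parameter with a different value within the same run.

```python
class ToyRun:
    """Toy in-memory model of a run; NOT the MLflow API, only its logging semantics."""

    def __init__(self):
        self.params = {}    # key -> single value, written once
        self.metrics = {}   # key -> list of (step, value) history

    def log_param(self, key, value):
        # A param is a single value; changing it later is treated as an error.
        if key in self.params and self.params[key] != value:
            raise ValueError(f"param {key!r} already logged with a different value")
        self.params[key] = value

    def log_params(self, params):
        # Bulk variant: a dict of params logged in one call.
        for key, value in params.items():
            self.log_param(key, value)

    def log_metric(self, key, value, step=0):
        # A metric keeps its history, so repeated calls build a curve over steps.
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_metrics(self, metrics, step=0):
        # Bulk variant: a dict of metrics logged at the same step.
        for key, value in metrics.items():
            self.log_metric(key, value, step=step)


run = ToyRun()
run.log_params({"num_leaves": 31, "learning_rate": 0.05})
for epoch, loss in enumerate([0.9, 0.6, 0.4]):
    run.log_metric("loss", loss, step=epoch)

print(run.params)           # {'num_leaves': 31, 'learning_rate': 0.05}
print(run.metrics["loss"])  # [(0, 0.9), (1, 0.6), (2, 0.4)]
```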

Model Registry

The Model Registry handles model version and lifecycle management. The exam covers both the legacy Workspace Model Registry stage transitions and Unity Catalog Model Registry aliases.

Aspect | Legacy Workspace Registry | Unity Catalog Registry
Stage Management | None → Staging → Production → Archived | Aliases (e.g., champion / challenger; user-defined)
Scope | Single workspace | Shared across workspaces (same Unity Catalog metastore)
Permission Model | Workspace-level ACLs | Unity Catalog permissions on the 3-level name (catalog.schema.model)
Lineage | Limited | Automatic tracking from table → model → endpoint

# Register to the Unity Catalog Model Registry
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="ml_prod.fraud.lgbm_model"  # 3-level name: catalog.schema.model
)

# Set an alias
from mlflow import MlflowClient
client = MlflowClient()
client.set_registered_model_alias(
    name="ml_prod.fraud.lgbm_model",
    alias="champion",
    version=5
)
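The registration and alias calls above rely on three URI shapes that are easy to mix up. The helpers below only build the strings so the formats can be compared side by side; the helper names and the run ID are hypothetical, while the model name is the one used in the example above.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI for a model logged under a run, as used when registering it."""
    return f"runs:/{run_id}/{artifact_path}"


def version_model_uri(name: str, version: int) -> str:
    """URI pinned to one specific registered model version."""
    return f"models:/{name}/{version}"


def alias_model_uri(name: str, alias: str) -> str:
    """URI that follows whichever version currently holds the alias."""
    return f"models:/{name}@{alias}"


MODEL_NAME = "ml_prod.fraud.lgbm_model"  # 3-level name: catalog.schema.model

print(run_model_uri("abc123"))                   # runs:/abc123/model
print(version_model_uri(MODEL_NAME, 5))          # models:/ml_prod.fraud.lgbm_model/5
print(alias_model_uri(MODEL_NAME, "champion"))   # models:/ml_prod.fraud.lgbm_model@champion
```

Loading by alias means downstream jobs keep working when the alias is moved to a newer version, whereas a version URI always resolves to the same model.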

Deployment Patterns

  • Real-time inference: create an endpoint with Model Serving and infer over REST API
  • Batch inference: mlflow.pyfunc.spark_udf() applied to a Spark DataFrame
  • Streaming inference: Structured Streaming + spark_udf() applied to real-time data

# Typical batch inference pattern
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/ml_prod.fraud.lgbm_model@champion"
)
predictions = (spark.table("silver.transactions")
    .withColumn("fraud_score", predict_udf("amount", "merchant_category", "hour_of_day"))
)
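For the real-time path, a Model Serving endpoint is called over REST with a JSON body. The sketch below only builds the request and does not send it. The workspace host, endpoint name, token placeholder, and feature values are hypothetical, and dataframe_records is used on the assumption that it is one of the input formats your endpoint accepts.

```python
import json

# Hypothetical values: replace with your own workspace host and endpoint name.
WORKSPACE_HOST = "https://example.cloud.databricks.com"
ENDPOINT_NAME = "fraud-lgbm"

url = f"{WORKSPACE_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations"

# One record per row to score; column names match the batch example above.
payload = {
    "dataframe_records": [
        {"amount": 129.99, "merchant_category": "electronics", "hour_of_day": 23},
    ]
}

headers = {
    "Authorization": "Bearer <your-token>",  # placeholder; never hard-code real tokens
    "Content-Type": "application/json",
}

body = json.dumps(payload)
print(url)
print(body)
# Sending is left out on purpose; with an HTTP client it would be a POST of `body` to `url`.
```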

Domain 3: Spark ML (22%)

The focus is the Pipeline API in Spark MLlib (the pyspark.ml package). Distinguishing Transformer (transformation: takes data, returns data) from Estimator (estimation: takes data, returns a Model = Transformer) is the single most fundamental and most-tested point on the exam.
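The Transformer/Estimator contract can be sketched in plain Python. This is a toy model of the idea, not pyspark.ml: lists of dicts stand in for DataFrames, all class names are hypothetical, and the toy indexer orders labels alphabetically, whereas Spark's StringIndexer orders by frequency by default.

```python
class Transformer:
    """Takes rows, returns rows (a list of dicts stands in for a DataFrame)."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from rows and returns a fitted Model, which is itself a Transformer."""
    def fit(self, rows):
        raise NotImplementedError


class ToyIndexerModel(Transformer):
    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, rows):
        return [{**r, "category_idx": self.mapping[r["category"]]} for r in rows]


class ToyIndexer(Estimator):
    def fit(self, rows):
        # The mapping must be learned from data, which is why this is an Estimator.
        labels = sorted({r["category"] for r in rows})
        return ToyIndexerModel({label: i for i, label in enumerate(labels)})


class ToyPipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; Transformers are used as they are.
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)  # later stages see transformed data
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [{"category": "food"}, {"category": "travel"}, {"category": "food"}]
model = ToyPipeline([ToyIndexer()]).fit(train)
print(model.transform([{"category": "travel"}]))
# [{'category': 'travel', 'category_idx': 1}]
```

Note that fitting the pipeline returns an object that only transforms, which mirrors how a fitted Pipeline behaves as a single Transformer.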

Concept | Role | Examples
Transformer | DataFrame → DataFrame (adds/transforms columns) | VectorAssembler, StringIndexer (fitted), Tokenizer
Estimator | DataFrame → produces a Model (Transformer) | LogisticRegression, RandomForestClassifier, StringIndexer (unfitted)
Pipeline | Sequential chain of Transformers/Estimators | Bundles preprocessing → feature transform → training into one object
CrossValidator | Hyperparameter tuning via cross-validation | Grid search using ParamGridBuilder

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Stage 1: Encode categorical variables
indexer = StringIndexer(inputCol="category", outputCol="category_idx")

# Stage 2: Build the feature vector
assembler = VectorAssembler(
    inputCols=["amount", "category_idx", "hour_of_day"],
    outputCol="features"
)

# Stage 3: Scaling
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Stage 4: Logistic regression
lr = LogisticRegression(featuresCol="scaled_features", labelCol="label")

# Build the Pipeline
pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])

# Hyperparameter tuning with CrossValidator
param_grid = (ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build())

evaluator = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC"
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4   # Parallelism level (tune for available cluster resources)
)

cv_model = cv.fit(train_df)
best_model = cv_model.bestModel

The parallelism parameter of CrossValidator is a popular exam target. The default is 1 (sequential), but with spare cluster resources you can shorten tuning time by increasing parallelism. Be aware, however, that each model fit runs as a distributed Spark job, so pushing parallelism too high can cause resource contention.
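To see why tuning cost grows quickly, it helps to count how many times the pipeline is fitted for the grid in the example above. The arithmetic below assumes the usual behavior of one fit per parameter combination per fold, plus a final refit of the best combination on the full training data.

```python
from itertools import product

reg_params = [0.01, 0.1, 1.0]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every combination of grid values is one candidate parameter map.
param_maps = list(product(reg_params, elastic_net_params))

# Each candidate is trained once per fold, then the best one is refit on all data.
fits_during_cv = len(param_maps) * num_folds
total_fits = fits_during_cv + 1

print(len(param_maps))  # 9
print(fits_during_cv)   # 27
print(total_fits)       # 28
```

With parallelism=4, up to four of these fits are in flight at once, and each of them still competes for the same executors.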

Domain 4: Scaling ML Models (20%)

Pandas UDF (Vectorized UDF)

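Pandas UDFs (vectorized UDFs) distribute single-node Pandas processing across Spark. Apache Arrow handles data transfer between the JVM and Python, and operating on whole batches instead of single rows can make them dramatically faster than row-at-a-time Python UDFs (speedups in the 10x-100x range are commonly cited; actual gains depend on the workload).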

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def predict_batch(features: pd.Series) -> pd.Series:
    """Run Pandas-based inference on each batch of rows within a partition"""
    # load_model() is a placeholder: load from a broadcast variable or MLflow
    model = load_model()
    return pd.Series(model.predict(features.tolist()))

# Runs in parallel across each partition of the Spark DataFrame
predictions = df.withColumn("prediction", predict_batch("features"))
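Part of the speedup comes from how often Python code is invoked. The arithmetic below compares invocation counts for one partition, assuming a batch size of 10,000 records (the default for spark.sql.execution.arrow.maxRecordsPerBatch, to our knowledge); the partition size is a made-up example.

```python
import math

ROWS_IN_PARTITION = 25_000
# Assumption: 10,000 is the default Arrow batch size in Spark.
RECORDS_PER_BATCH = 10_000

# A row-at-a-time Python UDF is invoked once per row.
row_udf_calls = ROWS_IN_PARTITION

# A Series-to-Series pandas UDF is invoked once per Arrow batch.
pandas_udf_calls = math.ceil(ROWS_IN_PARTITION / RECORDS_PER_BATCH)

print(row_udf_calls)     # 25000
print(pandas_udf_calls)  # 3
```

The same count applies to anything done inside the function body, which is why loading a model inside a pandas UDF is far cheaper than inside a row-at-a-time UDF, though still worth caching.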

Distributed Training

For deep learning on large datasets, use the distributed training frameworks Databricks provides.

  • spark-tensorflow-distributor: Distribute TensorFlow/Keras model training across multiple nodes. Launch with MirroredStrategyRunner(num_slots=8, local_mode=False).run(train_fn).
  • torch.distributed (PyTorch): Use TorchDistributor. TorchDistributor(num_processes=4, local_mode=False).run(train_fn)
  • Horovod: Use HorovodRunner for data-parallel distributed training of TensorFlow/PyTorch models.

The exam asks which distributed framework matches which ML library, and what the difference is between local_mode=True and local_mode=False. local_mode=True runs multi-process on the driver node (for debugging), while local_mode=False performs production distributed execution across worker nodes.
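As a study aid, the framework-to-launcher mapping and the local_mode behavior described above can be written down as a small lookup. This is plain Python for memorization only and calls none of the real libraries; the describe and where_it_runs helpers are hypothetical.

```python
# Which launcher goes with which framework, per the list above.
LAUNCHER_BY_FRAMEWORK = {
    "tensorflow": "spark-tensorflow-distributor",
    "keras": "spark-tensorflow-distributor",
    "pytorch": "TorchDistributor",
}


def where_it_runs(local_mode: bool) -> str:
    """Summarize where training processes run for a given local_mode setting."""
    if local_mode:
        return "driver node only (multi-process, for debugging)"
    return "worker nodes (distributed execution)"


def describe(framework: str, local_mode: bool) -> str:
    """Combine the launcher name and the execution location into one line."""
    launcher = LAUNCHER_BY_FRAMEWORK[framework.lower()]
    return f"{framework}: {launcher}, runs on {where_it_runs(local_mode)}"


print(describe("PyTorch", local_mode=True))
print(describe("TensorFlow", local_mode=False))
```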

Pandas API on Spark

pyspark.pandas (formerly Koalas) lets you operate on Spark DataFrames with the Pandas API. Expect questions about scenarios where single-node Pandas code is migrated to distributed execution with minimal changes.

import pyspark.pandas as ps

# Operate via the Pandas API (internally distributed as a Spark DataFrame)
psdf = ps.read_delta("dbfs:/mnt/gold/features")
psdf["feature_ratio"] = psdf["feature_a"] / psdf["feature_b"]
result = psdf.groupby("segment").mean()

MLflow High-Frequency Patterns

Roughly 40% of the ML Associate exam touches MLflow, and memorizing the following API patterns accurately is the key to passing.

API | Arguments | Purpose
mlflow.start_run() | run_name, nested | Start a Run (use as a context manager)
mlflow.log_param() | key, value | Log a single hyperparameter
mlflow.log_metric() | key, value, step | Log a single metric (use step to track epochs)
mlflow.log_artifact() | local_path | Save a local file as an artifact
mlflow.sklearn.log_model() | model, artifact_path | Save a scikit-learn model (flavor-specific)
mlflow.autolog() | None required (optional keyword flags such as log_models, disable) | Automatically log parameters/metrics/models for supported frameworks
mlflow.register_model() | model_uri, name | Register to the Model Registry
mlflow.pyfunc.spark_udf() | spark, model_uri | Use as a Spark UDF for batch inference

AutoML Question Patterns

Theme | What is Tested
Supported Tasks | Three tasks: Classification, Regression, and Forecasting (time series). Clustering is not supported.
How to Run | Two options: UI (launched from the Experiments screen) or API (automl.classify(), etc.)
Outputs | MLflow Experiment, a Run for each Trial, an editable source notebook, and the best model
Data Preprocessing | Missing-value imputation, one-hot encoding, and feature selection are performed automatically
Customization | Edit the generated notebook to tune preprocessing or the model and rerun as your own Run

Overlap and Differences vs. DEA (Data Engineer Associate)

Topic | DEA | ML Associate
Delta Lake basics | Tested in depth (MERGE, CDF, constraints) | Basic level, in the feature-table context
Spark DataFrame | Centered on ETL operations (filter, join, aggregate) | Feature engineering operations for ML
Unity Catalog | Table and view permission management | Model and Feature Store permission management
MLflow | Basic concepts only (low frequency) | Tracking/Registry/Serving tested in detail
Spark ML | Not tested | Pipeline/CrossValidator tested in detail
AutoML | Not tested | How to run, outputs, and customization are all tested
DLT (Delta Live Tables) | Tested in detail | Not tested
Workflows / Jobs | Tested in detail | Lightly tested in the ML pipeline orchestration context

Candidates who already passed DEA can carry over their Delta Lake, Spark DataFrame, and Unity Catalog fundamentals. On top of that, you need MLflow (about 40%), Spark ML (about 22%), AutoML and Feature Store (about 10%), and distributed training (about 15%). DEA knowledge transfers for roughly 13-15% of the exam, so even DEA-certified candidates should budget about 2 months of additional study.
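The transfer estimate can be checked by summing the ML Associate-specific shares quoted above and taking the remainder. These are the article's approximate figures, not official weights.

```python
# Approximate shares of ML Associate-specific study topics, in percent (from the text).
ml_specific_shares = {
    "MLflow": 40,
    "Spark ML": 22,
    "AutoML and Feature Store": 10,
    "Distributed training": 15,
}

ml_specific_total = sum(ml_specific_shares.values())

# What is left over is the part where DEA knowledge carries across.
dea_transfer = 100 - ml_specific_total

print(ml_specific_total)  # 87
print(dea_transfer)       # 13
```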

Study Roadmap (2-3 Months)

Phase 1: Build the Foundation (Weeks 1-3)

  • Complete the official MLflow tutorial (free course on Databricks Academy).
  • On Community Edition, run the full flow: create an Experiment → record a Run → register to the Model Registry.
  • Review Transformer/Estimator/Evaluator concepts of Spark ML Pipeline in the official documentation.
  • Shore up basic operations in scikit-learn, Pandas, and NumPy if you are not confident.

Phase 2: Hands-on Practice (Weeks 4-7)

  • Run AutoML from both UI and API and learn the structure of the generated notebook.
  • Implement hyperparameter tuning with CrossValidator + ParamGridBuilder.
  • Register a feature table to the Feature Store and build training data with create_training_set().
  • Build a batch inference pipeline using Pandas UDFs.
  • Practice the end-to-end flow: set an alias in Unity Catalog Model Registry → deploy to Model Serving.

Phase 3: Exam Prep (Weeks 8-12)

  • Complete the official Databricks Practice Exam at least twice and identify your weak domains.
  • Memorize the MLflow API table (log_param/log_metric/log_artifact/log_model).
  • Sort out the difference between legacy Model Registry stage transitions and Unity Catalog aliases.
  • Review when to use each distributed training framework (spark-tensorflow-distributor / TorchDistributor / HorovodRunner).
  • Practice pacing 45 questions within 90 minutes on a full mock exam.

Try a Sample Question

Spark ML

Question 1

An ML engineer wants to build a Spark ML pipeline and perform hyperparameter tuning. Which combination correctly fills in the blanks in the code below?

pipeline = Pipeline(stages=[indexer, assembler, lr])
param_grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol='label')
cv = CrossValidator(estimator=___A___, estimatorParamMaps=___B___, evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)
best = cv_model.___C___

  1. A: pipeline, B: param_grid, C: bestModel
  2. A: lr, B: param_grid, C: bestModel
  3. A: pipeline, B: param_grid, C: best_params
  4. A: pipeline, B: [0.01, 0.1], C: bestModel

Correct answer: option 1 (A: pipeline, B: param_grid, C: bestModel)

Pass the entire Pipeline as CrossValidator's estimator argument. Passing only lr omits preprocessing (indexer, assembler), so training runs without the necessary data transformations and fails. estimatorParamMaps takes the parameter-grid list built by ParamGridBuilder, not a raw list of values. Retrieve the best model via the bestModel property (camelCase, even in Python). best_params resembles scikit-learn's best_params_ attribute and is not part of the Spark ML API.

Frequently Asked Questions

Can I study for ML Associate and Data Engineer Associate at the same time?

About 30% of the scope (Delta Lake basics, Spark DataFrame operations, Unity Catalog permissions) overlaps, so passing DEA first and then moving on to ML Associate is the efficient route. However, the ML Associate-specific topics (MLflow Tracking, Model Registry, AutoML, and Spark ML Pipelines) must be studied separately, and they account for about 70% of the score. Focusing on one exam at a time (DEA → ML Associate) tends to produce a higher pass rate than studying both in parallel.

How much Python code appears on the ML Associate exam?

Roughly 15-20 of the 45 questions include Python code snippets. Frequent topics include the mlflow.start_run() context manager, choosing between mlflow.log_param/metric/artifact, building Spark ML Pipelines with Transformers and Estimators, and configuring CrossValidator. Code typically appears in fill-in-the-blank or spot-the-error questions. You don't need to write code yourself, but you must memorize API argument names and return types precisely.

What is the difference between ML Professional and ML Associate?

Associate centers on individual workflows: logging experiments with MLflow, building pipelines with Spark ML, and rapidly producing baseline models with AutoML. Professional asks about team-level operational design: production ML pipeline architecture, distributed training tuning, A/B testing, model drift monitoring, and online/offline consistency in the Feature Store. Passing Associate is not a prerequisite, but Associate-level knowledge is assumed.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.