Databricks Certified Machine Learning Associate (ML Associate) validates practical skills in MLflow experiment tracking, Spark ML pipeline construction, and AutoML-based modeling. Alongside Data Engineer Associate (DEA), it is one of the most popular Databricks certifications and the standard credential for kicking off a career as an ML engineer.
| Item | Details |
|---|---|
| Official Name | Databricks Certified Machine Learning Associate |
| Questions | 45 (multiple choice) |
| Duration | 90 minutes |
| Passing Score | 70% (32 of 45 questions) |
| Fee | $200 (excluding tax) |
| Language | English and Japanese |
| Prerequisites | None (6+ months of hands-on experience recommended) |
| Validity | 2 years |
| Delivery | Online-proctored (take from home) |
The Japanese version is available, but translation quality varies. For questions that include code snippets, we recommend toggling the English original to confirm intent. The remaining time is always shown at the top of the screen, and you can flag questions to review later.
| Domain | Weight | Key Topics |
|---|---|---|
| Databricks ML | 29% (~13 questions) | AutoML, Feature Store, MLflow Tracking |
| ML Workflows | 29% (~13 questions) | Experiment tracking, Model Registry, deployment |
| Spark ML | 22% (~10 questions) | Pipeline, Transformer, Estimator |
| Scaling ML Models | 20% (~9 questions) | Distributed training, Pandas UDF, distributed inference |
Databricks ML and ML Workflows together account for 58% of the exam. Both domains are heavy on MLflow, so a solid grasp of MLflow Tracking, Model Registry, and Autologging can cover about 25 questions (roughly 55% of the exam). The Scaling ML Models domain (20%) requires distribution-specific knowledge (Pandas UDF, spark-tensorflow-distributor, etc.), so study it on top of a solid Spark foundation.
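The question counts and the passing threshold quoted above follow directly from the percentages. The sketch below redoes that arithmetic, assuming the 45-question total, the 70% pass rate, and the domain weights from the tables; the function names are illustrative only.

```python
import math

TOTAL_QUESTIONS = 45
PASS_RATE_PCT = 70

# Domain weights in percent, as listed in the domain table.
DOMAIN_WEIGHTS_PCT = {
    "Databricks ML": 29,
    "ML Workflows": 29,
    "Spark ML": 22,
    "Scaling ML Models": 20,
}


def questions_per_domain(weights_pct, total):
    """Approximate question count per domain by rounding weight * total."""
    return {name: round(pct * total / 100) for name, pct in weights_pct.items()}


def passing_questions(pass_rate_pct, total):
    """Smallest whole number of correct answers that reaches the pass rate."""
    return math.ceil(pass_rate_pct * total / 100)


counts = questions_per_domain(DOMAIN_WEIGHTS_PCT, TOTAL_QUESTIONS)
needed = passing_questions(PASS_RATE_PCT, TOTAL_QUESTIONS)
mlflow_heavy = counts["Databricks ML"] + counts["ML Workflows"]

print(counts)
print(f"correct answers needed: {needed}")
print(f"questions in the two MLflow-heavy domains: {mlflow_heavy}")
```

The two MLflow-heavy domains hold 26 questions in total, which is why mastering MLflow alone covers most, but not all, of the 32 correct answers needed.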
Databricks AutoML is an automated machine learning feature that supports three tasks: Classification, Regression, and Forecasting. You can run it from the UI or via API. Under the hood, it automates data preprocessing, feature engineering, hyperparameter tuning, and model selection.
# Run AutoML via the API
from databricks import automl
summary = automl.classify(
    dataset=train_df,        # Spark DataFrame or Pandas DataFrame
    target_col="churn",      # Target column name
    primary_metric="f1",     # Metric to optimize
    timeout_minutes=30,      # Maximum runtime
    max_trials=20            # Maximum number of trials (deprecated; not supported on Databricks Runtime 11.0 ML and above)
)
# Retrieve the best trial and the path to its model
print(summary.best_trial)             # Best Trial object
print(summary.best_trial.model_path)  # Path to the model artifact
The exam tests what the notebook that AutoML generates is for. AutoML automatically produces an editable notebook with the source code for each trial, so data scientists can manually tweak preprocessing logic or hyperparameters. In other words, AutoML is designed as a "starting point" that provides a baseline model, not as a "black box."
Databricks Feature Store is a centralized repository for managing ML features. Feature tables are registered as Delta Tables in Unity Catalog, sharing the same feature definitions between training and serving.
FeatureStoreClient.create_training_set() joins the features.
FeatureStoreClient.score_batch() automatically looks up the latest features.
In the Databricks ML domain, the exam tests basic MLflow Tracking operations (creating an Experiment, recording a Run, saving parameters/metrics/artifacts). On Databricks, mlflow.autolog() is enabled by default, and parameters, metrics, and models are automatically logged for major frameworks like scikit-learn, XGBoost, LightGBM, PyTorch, and TensorFlow.
The ML Workflows domain tests end-to-end workflows centered on MLflow experiment tracking. Rather than just API calls, the exam asks "why you use a given API" — judgment is what's being tested.
import lightgbm as lgb
import mlflow
from mlflow.models import infer_signature
# params, train_data, val_data, and X_train are assumed to be defined earlier
mlflow.set_experiment("/Experiments/fraud_detection")
with mlflow.start_run(run_name="lgbm_baseline") as run:
    # Log parameters
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("num_leaves", 31)
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("data_version", "delta_v3")
    # Training
    model = lgb.train(params, train_data, valid_sets=[val_data])
    # Log metrics (illustrative values)
    mlflow.log_metric("auc", 0.934)
    mlflow.log_metric("precision", 0.891)
    mlflow.log_metric("recall", 0.867)
    # Infer signature and log the model
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.lightgbm.log_model(model, "model", signature=signature)
    # Save supporting artifacts (the files must already exist locally)
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifact("confusion_matrix.html")
The exam tests the difference between log_param (single string/number), log_params (dict, bulk), log_metric (single number, with optional step), and log_metrics (dict, bulk). Another frequent point: log_artifact takes a file path, while log_model takes a framework-specific model object.
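The difference between parameters and metrics is easier to remember with a small model of their semantics. The sketch below is a pure-Python stand-in, not MLflow's implementation: the `ToyRun` class and its storage layout are hypothetical. It mirrors two behaviors of real MLflow runs: a parameter is stored as a string and cannot be changed to a different value once logged, while a metric keeps a history with one entry per call.

```python
class ToyRun:
    """Pure-Python stand-in for a tracking run; not MLflow's implementation."""

    def __init__(self):
        self.params = {}   # key -> string value, written once
        self.metrics = {}  # key -> list of (step, value) pairs

    def log_param(self, key, value):
        # Params are stored as strings and cannot be changed to a new value.
        value = str(value)
        if key in self.params and self.params[key] != value:
            raise ValueError(f"param {key!r} already logged as {self.params[key]!r}")
        self.params[key] = value

    def log_params(self, params):
        # Bulk form: one call, many keys.
        for key, value in params.items():
            self.log_param(key, value)

    def log_metric(self, key, value, step=0):
        # Metrics are numeric and keep a history, one entry per call.
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_metrics(self, metrics, step=0):
        # Bulk form: every metric in the dict shares the same step.
        for key, value in metrics.items():
            self.log_metric(key, value, step=step)


run = ToyRun()
run.log_params({"num_leaves": 31, "learning_rate": 0.05})
for epoch, auc in enumerate([0.90, 0.92, 0.93]):
    run.log_metric("auc", auc, step=epoch)

print(run.params)
print(run.metrics["auc"])
```

Logging `num_leaves` again with a different value raises an error in this sketch, whereas logging `auc` repeatedly simply extends its history, which is what makes `step` useful for tracking epochs.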
The Model Registry handles model version and lifecycle management. The exam covers both the legacy Workspace Model Registry stage transitions and Unity Catalog Model Registry aliases.
| Aspect | Legacy Workspace Registry | Unity Catalog Registry |
|---|---|---|
| Stage Management | None → Staging → Production → Archived | Aliases (e.g., champion / challenger; user-defined) |
| Scope | Single workspace | Account-level (shared across workspaces attached to the same metastore) |
| Permission Model | Workspace-level ACLs | Unity Catalog 3-level permissions (catalog.schema.model) |
| Lineage | Limited | Automatic tracking from table → model → endpoint |
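Models in either registry are referenced through URI strings, and the URI form changes with the registry style. The helpers below are a minimal sketch that only formats strings; the function names are hypothetical, while the `runs:/` and `models:/` schemes themselves are the ones MLflow uses.

```python
def runs_uri(run_id, artifact_path):
    """URI of a model logged under a run, before it is registered."""
    return f"runs:/{run_id}/{artifact_path}"


def models_uri_by_version(name, version_or_stage):
    """URI of a registered model by version number (or by legacy stage name)."""
    return f"models:/{name}/{version_or_stage}"


def models_uri_by_alias(name, alias):
    """URI of a registered model by alias."""
    return f"models:/{name}@{alias}"


def split_uc_name(name):
    """Split a Unity Catalog model name into catalog, schema, and model."""
    catalog, schema, model = name.split(".")
    return catalog, schema, model


uc_name = "ml_prod.fraud.lgbm_model"
print(runs_uri("abc123", "model"))
print(models_uri_by_version(uc_name, 5))
print(models_uri_by_alias(uc_name, "champion"))
print(models_uri_by_version("fraud_model", "Production"))  # legacy stage form
print(split_uc_name(uc_name))
```

The run ID `abc123` and the name `fraud_model` are placeholders. In the legacy Workspace Registry the segment after the model name can be a stage such as `Production`; with Unity Catalog the alias form with `@` takes over that role.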
# Register to the Unity Catalog Model Registry
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="ml_prod.fraud.lgbm_model"  # 3-level name: catalog.schema.model
)
# Set an alias
from mlflow import MlflowClient
client = MlflowClient()
client.set_registered_model_alias(
    name="ml_prod.fraud.lgbm_model",
    alias="champion",
    version=5
)
For batch inference, mlflow.pyfunc.spark_udf() is applied to a Spark DataFrame.
# Typical batch inference pattern
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/ml_prod.fraud.lgbm_model@champion"
)
predictions = (spark.table("silver.transactions")
    .withColumn("fraud_score", predict_udf("amount", "merchant_category", "hour_of_day"))
)
The Spark ML domain focuses on the Pipeline API in Spark MLlib (the pyspark.ml package). Distinguishing a Transformer (takes data, returns data) from an Estimator (takes data, returns a Model, which is itself a Transformer) is the most fundamental and most-tested point on the exam.
| Concept | Role | Examples |
|---|---|---|
| Transformer | DataFrame → DataFrame (adds/transforms columns) | VectorAssembler, StringIndexer (fitted), Tokenizer |
| Estimator | DataFrame → produces a Model (Transformer) | LogisticRegression, RandomForestClassifier, StringIndexer (unfitted) |
| Pipeline | Sequential chain of Transformers/Estimators | Bundles preprocessing → feature transform → training into one object |
| CrossValidator | Hyperparameter tuning via cross-validation | Grid search using ParamGridBuilder |
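The Transformer/Estimator split in the table can be illustrated without Spark. The classes below are hypothetical pure-Python stand-ins, not the pyspark.ml implementation: the unfitted indexer exposes only `fit()`, and the object it returns exposes only `transform()`. The label ordering imitates StringIndexer's default of most frequent label first.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Transformer: holds the learned mapping and only transforms rows."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Returns new rows with the index column added; the input is not mutated.
        return [
            {**row, self.output_col: float(self.mapping[row[self.input_col]])}
            for row in rows
        ]


class ToyStringIndexer:
    """Estimator: learns from data in fit() and returns a Transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: index for index, label in enumerate(ordered)}
        return ToyStringIndexerModel(self.input_col, self.output_col, mapping)


rows = [{"category": "grocery"}, {"category": "travel"}, {"category": "grocery"}]
indexer = ToyStringIndexer("category", "category_idx")  # Estimator (unfitted)
indexer_model = indexer.fit(rows)                        # Transformer (fitted)
indexed = indexer_model.transform(rows)
print(indexed)
```

A Pipeline applies the same idea to a whole chain: calling `fit()` on it fits each Estimator stage in order and returns a fitted model that is a single Transformer.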
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Stage 1: Encode categorical variables
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
# Stage 2: Build the feature vector
assembler = VectorAssembler(
    inputCols=["amount", "category_idx", "hour_of_day"],
    outputCol="features"
)
# Stage 3: Scaling
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
# Stage 4: Logistic regression
lr = LogisticRegression(featuresCol="scaled_features", labelCol="label")
# Build the Pipeline
pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
# Hyperparameter tuning with CrossValidator
param_grid = (ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build())
evaluator = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC"
)
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4  # Parallelism level (tune for available cluster resources)
)
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel
The parallelism parameter of CrossValidator is a popular exam target. The default is 1 (sequential), but with spare cluster resources you can shorten tuning time by increasing parallelism. Be aware, however, that each fold's training runs as a distributed Spark job, so pushing parallelism too high can cause resource contention.
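It helps to know how much work a grid like this implies before raising parallelism. The sketch below counts model fits for the 3 x 3 grid and 3 folds used above, assuming the usual cross-validation procedure: every parameter combination is trained once per fold, then the best combination is refit once on the full training set. The helper names are illustrative and no Spark is involved.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {param: [values]} into a list of param maps (a full grid)."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def cross_validation_fits(num_param_maps, num_folds):
    """Fits during tuning: one per (param map, fold), plus the final refit."""
    return num_param_maps * num_folds + 1


param_grid = build_param_grid({
    "regParam": [0.01, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
})
total_fits = cross_validation_fits(len(param_grid), num_folds=3)

print(f"param maps: {len(param_grid)}")
print(f"model fits: {total_fits}")
```

Because the estimator here is the whole Pipeline, each of those fits also refits the preprocessing stages, so grid size and fold count multiply the cost quickly.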
Pandas UDFs are a mechanism for distributing single-node Pandas processing across Spark. Apache Arrow handles data transfer in batches, which can deliver 10x-100x speedups over traditional row-at-a-time Python UDFs.
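One way to see where the speedup comes from is to count function invocations. The sketch below is pure Python with no Spark or Arrow involved: it applies the same function once per row and once per batch, assuming a single partition and a batch size of 10,000 rows (Spark's default for spark.sql.execution.arrow.maxRecordsPerBatch). The helper names are hypothetical.

```python
def run_row_at_a_time(values, fn):
    """Call fn once per row, the way a traditional Python UDF is invoked."""
    out, calls = [], 0
    for value in values:
        out.append(fn([value])[0])
        calls += 1
    return out, calls


def run_batched(values, fn, batch_rows):
    """Call fn once per batch of rows, the way a Pandas UDF is invoked."""
    out, calls = [], 0
    for start in range(0, len(values), batch_rows):
        out.extend(fn(values[start:start + batch_rows]))
        calls += 1
    return out, calls


def double_all(batch):
    """Stand-in for vectorized work such as model.predict on a batch."""
    return [2.0 * value for value in batch]


values = [float(i) for i in range(25_000)]
row_out, row_calls = run_row_at_a_time(values, double_all)
batch_out, batch_calls = run_batched(values, double_all, batch_rows=10_000)

print(f"row-at-a-time calls: {row_calls}")
print(f"batched calls: {batch_calls}")
```

Both paths produce the same result, but the per-call overhead (serialization, function dispatch, and in a real UDF any per-call setup) is paid 25,000 times in one case and 3 times in the other.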
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
@pandas_udf(DoubleType())
def predict_batch(features: pd.Series) -> pd.Series:
    """Run Pandas-based inference on each batch of rows."""
    model = load_model()  # Placeholder: load from a broadcast variable or MLflow
    return pd.Series(model.predict(features.tolist()))
# Runs in parallel across the partitions of the Spark DataFrame
predictions = df.withColumn("prediction", predict_batch("features"))
For deep learning on large datasets, use the distributed training frameworks Databricks provides.
TensorflowDistributor(num_processes=8, local_mode=False).run(train_fn)
TorchDistributor(num_processes=4, local_mode=False).run(train_fn)
The exam asks which distributed framework matches which ML platform, and what distinguishes local_mode=True from local_mode=False. local_mode=True runs multi-process on the driver node (for debugging), while local_mode=False performs production distributed execution across worker nodes.
pyspark.pandas (formerly Koalas) lets you operate on Spark DataFrames with the Pandas API. Expect questions about scenarios where single-node Pandas code is migrated to distributed execution with minimal changes.
import pyspark.pandas as ps
# Operate via the Pandas API (internally distributed as a Spark DataFrame)
psdf = ps.read_delta("dbfs:/mnt/gold/features")
psdf["feature_ratio"] = psdf["feature_a"] / psdf["feature_b"]
result = psdf.groupby("segment").mean()
Roughly 40% of the ML Associate exam touches MLflow, and memorizing the following API patterns accurately is the key to passing.
| API | Arguments | Purpose |
|---|---|---|
| mlflow.start_run() | run_name, nested | Start a Run (use as a context manager) |
| mlflow.log_param() | key, value | Log a single hyperparameter |
| mlflow.log_metric() | key, value, step | Log a single metric (use step to track epochs) |
| mlflow.log_artifact() | local_path | Save a local file as an artifact |
| mlflow.sklearn.log_model() | sk_model, artifact_path | Save a scikit-learn model (flavor-specific) |
| mlflow.autolog() | None required (optional keyword flags such as log_models and disable) | Automatically log parameters/metrics/models for supported frameworks |
| mlflow.register_model() | model_uri, name | Register to the Model Registry |
| mlflow.pyfunc.spark_udf() | spark, model_uri | Use as a Spark UDF for batch inference |
| Theme | What is Tested |
|---|---|
| Supported Tasks | Three tasks: Classification, Regression, and Forecasting (time series). Clustering is not supported. |
| How to Run | UI (launched from the Experiments screen) and API (automl.classify(), etc.) — two options |
| Outputs | MLflow Experiment, a Run for each Trial, an editable source notebook, and the best model |
| Data Preprocessing | Missing-value imputation, one-hot encoding, and feature selection are performed automatically |
| Customization | Edit the generated notebook to tune preprocessing or the model and rerun as your own Run |
| Topic | DEA | ML Associate |
|---|---|---|
| Delta Lake basics | Tested in depth (MERGE, CDF, constraints) | Basic level, in the feature-table context |
| Spark DataFrame | Centered on ETL operations (filter, join, aggregate) | Feature engineering operations for ML |
| Unity Catalog | Table and view permission management | Model and Feature Store permission management |
| MLflow | Basic concepts only (low frequency) | Tracking/Registry/Serving tested in detail |
| Spark ML | Not tested | Pipeline/CrossValidator tested in detail |
| AutoML | Not tested | How to run, outputs, and customization are all tested |
| DLT (Delta Live Tables) | Tested in detail | Not tested |
| Workflows / Jobs | Tested in detail | Lightly tested in the ML pipeline orchestration context |
Candidates who already passed DEA can carry over their Delta Lake, Spark DataFrame, and Unity Catalog fundamentals. On top of that, you need MLflow (about 40%), Spark ML (about 22%), AutoML and Feature Store (about 10%), and distributed training (about 15%). DEA knowledge transfers for roughly 13-15% of the exam, so even DEA-certified candidates should budget about 2 months of additional study.
Spark ML
Question 1
An ML engineer wants to build a Spark ML pipeline and perform hyperparameter tuning. Which combination correctly fills in the blanks in the code below?
pipeline = Pipeline(stages=[indexer, assembler, lr])
param_grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol='label')
cv = CrossValidator(estimator=___A___, estimatorParamMaps=___B___, evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)
best = cv_model.___C___
Correct answer: A
Pass the entire Pipeline to CrossValidator's estimator. Passing only lr omits preprocessing (indexer, assembler), so training runs without the necessary data transformations and fails. estimatorParamMaps takes the parameter-grid list built by ParamGridBuilder. Retrieve the best model via the bestModel property (camelCase in Python). best_params is a scikit-learn API and is not used in Spark ML.
Can I study for ML Associate and Data Engineer Associate at the same time?
About 30% of the scope (Delta Lake basics, Spark DataFrame operations, Unity Catalog permissions) overlaps, so passing DEA first and then moving on to ML Associate is the efficient route. However, the ML Associate-specific topics — MLflow Tracking, Model Registry, AutoML, and Spark ML Pipelines — must be studied separately, and they account for about 70% of the score. Focusing on one exam at a time (DEA → ML Associate) tends to produce a higher pass rate than studying both in parallel.
How much Python code appears on the ML Associate exam?
Roughly 15-20 of the 45 questions include Python code snippets. Frequent topics include the mlflow.start_run() context manager, choosing between mlflow.log_param/metric/artifact, building Spark ML Pipelines with Transformers and Estimators, and configuring CrossValidator. Code typically appears as fill-in-the-blank or spot-the-error — you don't need to actually write code, but you must memorize API argument names and return types precisely.
What is the difference between ML Professional and ML Associate?
Associate centers on individual workflows: logging experiments with MLflow, building pipelines with Spark ML, and rapidly producing baseline models with AutoML. Professional asks about team-level operational design: production ML pipeline architecture, distributed training tuning, A/B testing, model drift monitoring, and online/offline consistency in the Feature Store. Passing Associate is not a prerequisite, but Associate-level knowledge is assumed.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.