Databricks AutoML is a managed machine learning capability that automates data preprocessing, model selection, hyperparameter tuning, and evaluation. It supports 3 task types — classification, regression, and time-series forecasting — and runs the entire flow from searching for the best model to generating a reproducible notebook with one click (UI) or one line of code (API). On the ML Associate exam, 10-15% of questions cover AutoML, focusing on API usage, understanding the generated artifacts, and judging when to apply it.
AutoML automatically runs the 4 steps below. Understanding what each step does and what it produces is the foundation of exam prep.
┌─────────────────────────────────────────────────────────┐
│ AutoML Processing Flow │
│ │
│ Step 1: Data analysis & preprocessing │
│ ├─ Missing-value handling (median/mode imputation) │
│ ├─ Categorical variable encoding │
│ ├─ Numeric feature normalization │
│ └─ Automatic feature selection │
│ │ │
│ Step 2: Model selection │
│ ├─ Automatic candidate algorithm selection │
│ └─ LightGBM / XGBoost / sklearn / Prophet, etc. │
│ │ │
│ Step 3: Hyperparameter tuning │
│ ├─ Automated search via Hyperopt │
│ └─ Each combination recorded as an MLflow Run │
│ │ │
│ Step 4: Evaluation & ranking │
│ ├─ Evaluation on a held-out validation set │
│ └─ Automatic best-model selection │
└─────────────────────────────────────────────────────────┘

| Task | API method | Supported algorithms | Primary metrics |
|---|---|---|---|
| Classification | databricks.automl.classify() | LightGBM, XGBoost, sklearn (LogisticRegression, RandomForest, DecisionTree) | F1 score, accuracy, log_loss, precision, recall |
| Regression | databricks.automl.regress() | LightGBM, XGBoost, sklearn (LinearRegression, RandomForest, DecisionTree) | RMSE, MAE, R², MSE |
| Forecasting | databricks.automl.forecast() | Prophet, ARIMA | SMAPE, MSE, RMSE, MAE |
Classification and regression automatically try LightGBM, XGBoost, and sklearn algorithms. Forecasting is limited to Prophet and ARIMA — deep-learning-based methods (LSTM, etc.) are not supported by AutoML.
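The four-step flow described above can be illustrated with a deliberately tiny, pure-Python sketch. It is not how AutoML is implemented: the "model" is a single threshold with no fitting step, random search stands in for Hyperopt, and scoring is done on the same rows for brevity, where a real run would score held-out data. All function names here are hypothetical.

```python
import random
import statistics


def impute_median(values):
    """Step 1 (toy): fill missing entries (None) with the median of observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def accuracy_at(xs, ys, threshold):
    """Toy 'model': predict 1 when x >= threshold, then score accuracy."""
    predictions = [1 if x >= threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)


def run_search(xs, ys, n_trials=20, seed=0):
    """Steps 2-4 (toy): sample settings, record every trial, rank by metric."""
    rng = random.Random(seed)
    xs = impute_median(xs)
    low, high = min(xs), max(xs)
    trials = []
    for trial_id in range(n_trials):
        threshold = rng.uniform(low, high)  # random search stands in for Hyperopt
        trials.append({
            "trial_id": trial_id,
            "threshold": threshold,
            "accuracy": accuracy_at(xs, ys, threshold),
        })
    # Best trial first, mirroring the "evaluation & ranking" step
    return sorted(trials, key=lambda t: t["accuracy"], reverse=True)


xs = [1.0, 2.0, None, 4.0, 5.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1]
ranked = run_search(xs, ys)
best = ranked[0]
print(f"best trial {best['trial_id']}: "
      f"threshold={best['threshold']:.2f}, accuracy={best['accuracy']:.2f}")
```

The point to carry over is the shape of the loop: every sampled configuration becomes a recorded trial, and the "best model" is simply the top entry after ranking by the chosen metric.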
You can run AutoML from the Databricks workspace UI without writing any code.
After the run completes, you can compare each trial's metrics in the MLflow Experiment UI, then open the best model's notebook to review and edit its contents.
import databricks.automl

# Run a classification task
summary = databricks.automl.classify(
    dataset="catalog.schema.customer_data",  # Unity Catalog table
    target_col="churn",                      # target column
    primary_metric="f1",                     # metric to optimize
    timeout_minutes=30,                      # search-time upper bound
    max_trials=50,                           # maximum number of trials (deprecated in recent
                                             # Databricks Runtime ML versions; prefer timeout_minutes)
)

# Inspect the result
print(f"Best trial: {summary.best_trial}")
print(f"Best metrics: {summary.best_trial.metrics}")
print(f"MLflow run ID: {summary.best_trial.mlflow_run_id}")

# Load the best model
best_model = summary.best_trial.load_model()

# Regression example
reg_summary = databricks.automl.regress(
    dataset=sales_df,        # a Spark DataFrame also works as input
    target_col="revenue",
    primary_metric="rmse",
    timeout_minutes=60,
)

# Forecasting example
forecast_summary = databricks.automl.forecast(
    dataset="catalog.schema.daily_sales",
    target_col="sales_amount",
    time_col="date",            # time column (required)
    frequency="d",              # "d"=daily, "W"=weekly, "M"=monthly
    horizon=30,                 # forecast horizon, in units of frequency (here: days)
    identity_col=["store_id"],  # group key for multi-series forecasting
)

For dataset, you can pass a Unity Catalog table name (3-level namespace) or a Spark DataFrame. primary_metric is the metric used for model optimization; the default for classification is F1. The exam tests the parameter names and roles of each API method.
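As a study aid for the parameter names above, the following sketch checks that a set of keyword arguments contains the names each task needs before the call is made. The helper `check_automl_kwargs` and the `REQUIRED_PARAMS` table are hypothetical and derived only from the examples in this article; they are not part of the `databricks.automl` library and do not reproduce its own validation.

```python
# Required keyword arguments per task, taken from the examples in this article.
# This table is an illustrative assumption, not the library's own validation.
REQUIRED_PARAMS = {
    "classify": {"dataset", "target_col"},
    "regress": {"dataset", "target_col"},
    "forecast": {"dataset", "target_col", "time_col"},
}


def check_automl_kwargs(task, **kwargs):
    """Return kwargs unchanged if the required names for `task` are present."""
    if task not in REQUIRED_PARAMS:
        raise ValueError(f"unknown task: {task!r}")
    missing = sorted(REQUIRED_PARAMS[task] - set(kwargs))
    if missing:
        raise ValueError(f"{task}() is missing: {', '.join(missing)}")
    return kwargs


forecast_kwargs = check_automl_kwargs(
    "forecast",
    dataset="catalog.schema.daily_sales",
    target_col="sales_amount",
    time_col="date",
    frequency="d",
    horizon=30,
)
print(sorted(forecast_kwargs))
```

On a Databricks ML runtime, the checked dictionary could then be unpacked into the real call, for example `databricks.automl.forecast(**forecast_kwargs)`.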
AutoML's biggest differentiator is that it auto-generates a reproducible Python notebook for each trial. It is not a black box — the code is fully exposed, so data scientists can understand and edit it.
The recommended workflow is to use the generated notebook as a baseline and then customize it with domain knowledge — adding features, changing preprocessing, swapping algorithms, and so on. The exam asks questions like "which of the following is the most appropriate way to use the AutoML-generated notebook?"
Every AutoML trial is automatically recorded as a Run in an MLflow Experiment. You do not need to write MLflow logging code by hand.
| Recorded information | Details |
|---|---|
| Parameters | Hyperparameters of each trial (learning_rate, n_estimators, etc.) |
| Metrics | Validation results for evaluation metrics (F1, accuracy, RMSE, etc.) |
| Artifacts | Trained model, Feature Importance plot, generated notebook |
| Tags | Algorithm name, AutoML version, dataset information |
import mlflow

# After the AutoML run, get the best model from the MLflow Experiment
experiment_id = summary.experiment.experiment_id
best_run = mlflow.search_runs(
    experiment_ids=[experiment_id],
    order_by=["metrics.val_f1_score DESC"],  # highest validation F1 first
    max_results=1,
).iloc[0]

print(f"Best F1: {best_run['metrics.val_f1_score']:.4f}")
print(f"Algorithm: {best_run['tags.estimator_name']}")

# Register the best model under a three-level Unity Catalog name.
# Setting the registry URI is needed when the workspace default is the
# legacy Workspace Model Registry.
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri=f"runs:/{best_run.run_id}/model",
    name="catalog.schema.churn_classifier",
)

| Comparison | AutoML | Manual ML |
|---|---|---|
| Development speed | Get a baseline model in minutes to tens of minutes | Requires days to weeks of development |
| ML knowledge required | Usable with basic knowledge | Requires deep knowledge of algorithms and tuning |
| Customizability | Constrained by supported algorithms and preprocessing | Fully flexible |
| Supported algorithms | LightGBM, XGBoost, sklearn, Prophet | Any framework, including deep learning |
| Large-scale data | Runs on a single node (applies sampling) | Supports distributed training (Spark ML, Horovod) |
| Reproducibility | Fully reproducible via the generated notebook | Depends on developer discipline |
| Recommended use cases | Baseline construction, PoC, data exploration | Maximizing production-model accuracy, custom pipelines |
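The comparison table can be condensed into a small decision rule. The sketch below is one possible reading of the table, not official guidance; the function name, its parameters, and the goal labels are hypothetical.

```python
def recommend_approach(goal="baseline",
                       needs_deep_learning=False,
                       needs_distributed_training=False):
    """Map the comparison table's criteria to 'automl' or 'manual'."""
    # Per the table, AutoML does not cover deep learning or distributed training.
    if needs_deep_learning or needs_distributed_training:
        return "manual"
    # Baselines, PoCs, and data exploration are the recommended AutoML use cases.
    if goal in {"baseline", "poc", "exploration"}:
        return "automl"
    # Anything else (e.g. squeezing out production accuracy) defaults to manual.
    return "manual"


print(recommend_approach(goal="poc"))
print(recommend_approach(goal="baseline", needs_deep_learning=True))
```

In practice the two paths are often combined: start with AutoML for the baseline, then continue by hand from the generated notebook.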
Question 1
A data scientist built a customer churn prediction model with AutoML. Which combination correctly describes what is produced by an AutoML run?
Correct answer: B
Databricks AutoML automatically records every trial as a Run in an MLflow Experiment. Each Run contains hyperparameters, evaluation metrics (validation results), the trained model, and Feature Importance. In addition, a reproducible Python notebook is auto-generated for each trial, containing the data loading, preprocessing, model definition, training, and evaluation code. Options A (only the best model) and C (manual MLflow setup) are wrong, and D is wrong because metric comparison can be done directly in the MLflow UI.
How can I use the notebook generated by AutoML?
The notebook generated by AutoML contains code for data preprocessing, feature engineering, model training, and hyperparameter settings. The notebook is fully runnable and freely editable, so the recommended workflow is to use the AutoML result as a baseline and then customize it with domain knowledge (adding features, switching algorithms, adjusting preprocessing).
What dataset sizes does AutoML support?
AutoML runs on a single node, so it is optimized for datasets that fit in memory. As a rough guide, it handles up to a few million rows efficiently. For large datasets (billions of rows or more), you should sample or aggregate up front, or consider distributed training frameworks (Spark ML, Horovod, etc.). When the input table exceeds 100 GB, AutoML automatically applies sampling.
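If you choose to sample up front, the fraction follows directly from the row count and a row budget. This is a minimal sketch with a hypothetical helper; the budget of 2 million rows is only an illustrative stand-in for "a few million rows", not a documented AutoML limit.

```python
def sample_fraction(n_rows, max_rows):
    """Fraction of rows to keep so that roughly max_rows remain (capped at 1.0)."""
    if n_rows <= 0 or max_rows <= 0:
        raise ValueError("n_rows and max_rows must be positive")
    return min(1.0, max_rows / n_rows)


frac = sample_fraction(n_rows=10_000_000, max_rows=2_000_000)
print(frac)  # 0.2

# With a Spark DataFrame `df` (assumed to exist), the fraction could be applied as:
#   sampled_df = df.sample(fraction=frac, seed=42)
```

Fixing the seed keeps the sample reproducible, which matters if you later compare AutoML runs against a hand-tuned model trained on the same rows.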
How are AutoML results recorded in MLflow?
Every AutoML trial (model candidate) is automatically recorded as a Run in an MLflow Experiment. Each Run includes the hyperparameters, evaluation metrics (accuracy, F1, RMSE, etc.), the trained model, and a link to the generated notebook. You can pick the best model using MLflow UI's metrics comparison, then register it in the Model Registry and proceed to production deployment.
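When picking the best run programmatically, the sort direction depends on the metric: F1 and accuracy are maximized, while RMSE and log loss are minimized. The sketch below shows that logic on plain dictionaries standing in for MLflow runs, so it can be run anywhere. The helper names are hypothetical, `val_f1_score` is the key used earlier in this article, and `val_rmse` is an assumed key for illustration.

```python
def order_by(metric_key, higher_is_better):
    """Build an order_by string in the 'metrics.<name> ASC|DESC' form used by mlflow.search_runs."""
    direction = "DESC" if higher_is_better else "ASC"
    return f"metrics.{metric_key} {direction}"


def pick_best(runs, metric_key, higher_is_better):
    """Return the run with the best value of metric_key, skipping runs that lack it."""
    scored = [r for r in runs if metric_key in r["metrics"]]
    if not scored:
        raise ValueError(f"no run logged {metric_key!r}")
    chooser = max if higher_is_better else min
    return chooser(scored, key=lambda r: r["metrics"][metric_key])


# Stand-in records; real runs would come from the MLflow Experiment.
runs = [
    {"run_id": "a", "metrics": {"val_f1_score": 0.81, "val_rmse": 4.2}},
    {"run_id": "b", "metrics": {"val_f1_score": 0.86, "val_rmse": 5.0}},
    {"run_id": "c", "metrics": {}},  # e.g. a trial that failed before logging
]

print(order_by("val_f1_score", higher_is_better=True))
print(pick_best(runs, "val_f1_score", higher_is_better=True)["run_id"])
print(pick_best(runs, "val_rmse", higher_is_better=False)["run_id"])
```

Getting the direction wrong silently selects the worst model, so it is worth encoding it next to the metric name instead of hard-coding `DESC` everywhere.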
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.