Databricks AutoML Complete Guide: Automated ML Usage and Exam Prep

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Databricks AutoML is a managed machine learning capability that automates data preprocessing, model selection, hyperparameter tuning, and evaluation. It supports 3 task types — classification, regression, and time-series forecasting — and runs the entire flow from searching for the best model to generating a reproducible notebook with one click (UI) or one line of code (API). On the ML Associate exam, 10-15% of questions cover AutoML, focusing on API usage, understanding the generated artifacts, and judging when to apply it.

AutoML Processing Flow

AutoML automatically runs the 4 steps below. Understanding what each step does and what it produces is the foundation of exam prep.

┌─────────────────────────────────────────────────────────┐
│                AutoML Processing Flow                    │
│                                                         │
│  Step 1: Data analysis & preprocessing                  │
│  ├─ Missing-value handling (median/mode/imputation)     │
│  ├─ Categorical variable encoding                       │
│  ├─ Numeric feature normalization                       │
│  └─ Automatic feature selection                         │
│          │                                              │
│  Step 2: Model selection                                │
│  ├─ Automatic candidate algorithm selection             │
│  └─ LightGBM / XGBoost / sklearn / Prophet, etc.        │
│          │                                              │
│  Step 3: Hyperparameter tuning                          │
│  ├─ Automated search via Hyperopt                       │
│  └─ Each combination recorded as an MLflow Run          │
│          │                                              │
│  Step 4: Evaluation & ranking                           │
│  ├─ Evaluation via cross-validation                     │
│  └─ Automatic best-model selection                      │
└─────────────────────────────────────────────────────────┘
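
Steps 3 and 4 can be pictured as a loop that scores every candidate configuration, records each attempt, and ranks the results. The sketch below is a toy illustration in plain Python, not AutoML's implementation: `run_trials`, `accuracy`, and the threshold "model" are hypothetical names invented for this example, and the real search is driven by Hyperopt with each trial logged to MLflow.

```python
def run_trials(candidates, evaluate):
    """Score each candidate and return trial records sorted best-first."""
    trials = []
    for trial_id, params in enumerate(candidates):
        # Each record plays the role of one logged run: parameters plus a metric.
        trials.append({"trial_id": trial_id, "params": params, "metric": evaluate(params)})
    return sorted(trials, key=lambda t: t["metric"], reverse=True)


# Toy validation data: the label is 1 when x > 5.
validation = [(x, int(x > 5)) for x in range(12)]


def accuracy(params):
    """Accuracy of a one-parameter threshold classifier on the toy data."""
    hits = sum(int(x > params["threshold"]) == y for x, y in validation)
    return hits / len(validation)


ranked = run_trials([{"threshold": t} for t in (2, 5, 8)], accuracy)
best = ranked[0]
print(best["params"], best["metric"])
```

The point to take away is the shape of the output: one record per trial, ordered by the chosen metric, with the first entry treated as the best model.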

Supported Tasks and Evaluation Metrics

Task | API method | Supported algorithms | Primary metrics
Classification | databricks.automl.classify() | LightGBM, XGBoost, sklearn (LogisticRegression, RandomForest, DecisionTree) | F1 score, accuracy, log_loss, precision, recall
Regression | databricks.automl.regress() | LightGBM, XGBoost, sklearn (LinearRegression, RandomForest, DecisionTree) | RMSE, MAE, R², MSE
Forecasting | databricks.automl.forecast() | Prophet, ARIMA | SMAPE, MSE, RMSE, MAE

Classification and regression automatically try LightGBM, XGBoost, and sklearn algorithms. Forecasting is limited to Prophet and ARIMA — deep-learning-based methods (LSTM, etc.) are not supported by AutoML.
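
To make the metric names in the table concrete, here is a minimal sketch of three of them, one per task type, written from their textbook definitions. This is not AutoML's own code: the function names are chosen for this example, the F1 shown is the binary case only, and SMAPE is returned as a fraction (some tools report it multiplied by 100).

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # without true positives, F1 is conventionally reported as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, as a fraction between 0 and 2."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        terms.append(0.0 if denom == 0 else 2 * abs(t - p) / denom)
    return sum(terms) / len(terms)


print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
print(rmse([0, 0], [2, 2]))
print(smape([100, 100], [50, 100]))
```

Note the direction of each metric: F1 is maximized, while RMSE and SMAPE are minimized. That matters when you sort trials by metric.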

UI Execution Steps

You can run AutoML from the Databricks workspace UI without writing any code.

  1. Experiments page — select "Create AutoML Experiment"
  2. Dataset: choose a Unity Catalog table or Delta Table
  3. Prediction target: specify the target column (label)
  4. Problem type: choose Classification / Regression / Forecasting (auto-detection is also available)
  5. Advanced Configuration: set evaluation metric, excluded columns, timeout, and trial count
  6. Click Start to run

After the run completes, you can compare each trial's metrics in the MLflow Experiment UI, then open the best model's notebook to review and edit its contents.

API Execution (Python Code)

import databricks.automl

# Run a classification task
summary = databricks.automl.classify(
    dataset="catalog.schema.customer_data",   # Unity Catalog table
    target_col="churn",                       # target column
    primary_metric="f1",                      # metric to optimize
    timeout_minutes=30                        # search-time upper bound
    # Note: max_trials was deprecated in Databricks Runtime 10.3 ML and is not
    # supported from 11.0 ML onward, so bound the search with timeout_minutes.
)

# Inspect the result
print(f"Best trial: {summary.best_trial}")
print(f"Best metric: {summary.best_trial.metrics}")
print(f"MLflow run ID: {summary.best_trial.mlflow_run_id}")

# Load the best model
best_model = summary.best_trial.load_model()

# Regression example
reg_summary = databricks.automl.regress(
    dataset=sales_df,              # a Spark DataFrame also works as input
    target_col="revenue",
    primary_metric="rmse",
    timeout_minutes=60
)

# Forecasting example
forecast_summary = databricks.automl.forecast(
    dataset="catalog.schema.daily_sales",
    target_col="sales_amount",
    time_col="date",                          # time column (required)
    frequency="d",                            # "d"=daily, "W"=weekly, "M"=monthly
    horizon=30,                               # forecast horizon in units of frequency (here: 30 days)
    identity_col=["store_id"]                 # group key for multi-series forecasting
)

For dataset, you can pass either a Unity Catalog table name (3-level namespace) or a Spark DataFrame. primary_metric is the metric used for model optimization; the default for classification is F1. The exam tests the parameter names and roles of each API method.
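
One way to keep these parameter roles straight is to assemble the arguments in a small helper before calling AutoML. The sketch below is an illustration only: `is_three_level_name`, `build_classify_kwargs`, and `run_classification` are hypothetical helpers, not part of the Databricks API, and the name check simply encodes the 3-level namespace rule described above. The AutoML import is deferred so the helper code can be read and run outside a Databricks ML runtime.

```python
def is_three_level_name(name):
    """True for names shaped like catalog.schema.table with no empty part."""
    parts = name.split(".")
    return len(parts) == 3 and all(parts)


def build_classify_kwargs(dataset, target_col, primary_metric="f1", timeout_minutes=None):
    """Collect keyword arguments for classify(); a DataFrame is passed through unchanged."""
    if isinstance(dataset, str) and not is_three_level_name(dataset):
        raise ValueError(f"expected catalog.schema.table, got {dataset!r}")
    kwargs = {
        "dataset": dataset,
        "target_col": target_col,
        "primary_metric": primary_metric,
    }
    if timeout_minutes is not None:
        kwargs["timeout_minutes"] = timeout_minutes
    return kwargs


def run_classification(**kwargs):
    """Call AutoML; only works where databricks.automl is installed."""
    from databricks import automl  # deferred: available on Databricks ML runtimes
    return automl.classify(**kwargs)


kwargs = build_classify_kwargs("catalog.schema.customer_data", "churn", timeout_minutes=30)
print(kwargs)
```

On a Databricks cluster, `run_classification(**kwargs)` would then start the experiment with exactly these settings.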

Structure and Use of the Generated Notebook

AutoML's biggest differentiator is that it auto-generates a reproducible Python notebook for each trial. It is not a black box — the code is fully exposed, so data scientists can understand and edit it.

What the generated notebook contains

  • Data loading: loads the input table and splits it into train/test
  • Preprocessing pipeline: sklearn Pipeline for imputation, encoding, and normalization
  • Model definition: the selected algorithm and its hyperparameters
  • Training and evaluation: cross-validation and metric computation
  • MLflow logging: logs parameters, metrics, and the model
  • Feature Importance: feature importance visualization via SHAP values

The recommended workflow is to use the generated notebook as a baseline and then customize it with domain knowledge — adding features, changing preprocessing, swapping algorithms, and so on. The exam asks questions like "which of the following is the most appropriate way to use the AutoML-generated notebook?"

MLflow Experiment Integration

Every AutoML trial is automatically recorded as a Run in an MLflow Experiment. You do not need to write MLflow logging code by hand.

Recorded information | Details
Parameters | Hyperparameters of each trial (learning_rate, n_estimators, etc.)
Metrics | Cross-validation results for evaluation metrics (F1, accuracy, RMSE, etc.)
Artifacts | Trained model, Feature Importance plot, generated notebook
Tags | Algorithm name, AutoML version, dataset information

import mlflow

# After the AutoML run, get the best model from the MLflow Experiment
experiment_id = summary.experiment.experiment_id
best_run = mlflow.search_runs(
    experiment_ids=[experiment_id],
    order_by=["metrics.val_f1_score DESC"],
    max_results=1
).iloc[0]

print(f"Best F1: {best_run['metrics.val_f1_score']:.4f}")
print(f"Algorithm: {best_run['tags.estimator_name']}")

# Register the best model in the Model Registry
mlflow.register_model(
    model_uri=f"runs:/{best_run.run_id}/model",
    name="catalog.schema.churn_classifier"
)
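
The query above sorts in descending order because a higher F1 is better. For error metrics such as RMSE the direction flips. The sketch below shows that choice in plain Python; `order_by_clause` and `pick_best` are hypothetical helpers, the run dictionaries are made-up stand-ins for MLflow results, and the key `val_rmse` is an assumed metric name that you should check against the Experiment UI.

```python
def order_by_clause(metric_key, higher_is_better):
    """Build an order_by string in the style used by mlflow.search_runs."""
    direction = "DESC" if higher_is_better else "ASC"
    return f"metrics.{metric_key} {direction}"


def pick_best(runs, metric_key, higher_is_better):
    """Pick the best run from a list of dicts shaped like {'run_id', 'metrics'}."""
    chooser = max if higher_is_better else min
    return chooser(runs, key=lambda run: run["metrics"][metric_key])


# Made-up runs for illustration; real values come from the MLflow Experiment.
toy_runs = [
    {"run_id": "a", "metrics": {"val_rmse": 12.4}},
    {"run_id": "b", "metrics": {"val_rmse": 9.8}},
    {"run_id": "c", "metrics": {"val_rmse": 15.1}},
]

print(order_by_clause("val_f1_score", higher_is_better=True))
print(order_by_clause("val_rmse", higher_is_better=False))
print(pick_best(toy_runs, "val_rmse", higher_is_better=False)["run_id"])
```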

AutoML vs. Manual ML Comparison

Comparison | AutoML | Manual ML
Development speed | Get a baseline model in minutes to tens of minutes | Requires days to weeks of development
ML knowledge required | Usable with basic knowledge | Requires deep knowledge of algorithms and tuning
Customizability | Constrained by supported algorithms and preprocessing | Fully flexible
Supported algorithms | LightGBM, XGBoost, sklearn, Prophet, ARIMA | Any framework, including deep learning
Large-scale data | Runs on a single node (applies sampling) | Supports distributed training (Spark ML, Horovod)
Reproducibility | Fully reproducible via the generated notebook | Depends on developer discipline
Recommended use cases | Baseline construction, PoC, data exploration | Maximizing production-model accuracy, custom pipelines

Limitations and Caveats

  • Single-node execution: Each model is trained on a single node; AutoML does not support distributed training. Large datasets must be sampled, either automatically by AutoML or by you in advance.
  • No deep learning: CNN, RNN, and Transformer-based models are not part of the search space.
  • No unstructured data: Images, text, and audio cannot be passed directly (you must extract features up front).
  • No custom metrics: Only built-in metrics are supported; user-defined metrics cannot be specified.
  • Forecasting constraints: Only Prophet and ARIMA are available; deep-learning sequence models (N-BEATS, etc.) are not supported.
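
If you want to control the sample yourself before handing a large dataset to AutoML, a stratified sample keeps the class balance of the target column intact. The sketch below shows the idea in plain Python on a list of dictionaries; `stratified_sample` is a hypothetical helper written for this example, not a Databricks function.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=42):
    """Sample the same fraction from every label group, at least one row each."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    sample = []
    for group in groups.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 900 non-churners and 100 churners.
rows = [{"id": i, "churn": 0} for i in range(900)]
rows += [{"id": 900 + i, "churn": 1} for i in range(100)]

sample = stratified_sample(rows, "churn", fraction=0.1)
churners = sum(row["churn"] for row in sample)
print(len(sample), churners)
```

On a Spark DataFrame, `DataFrame.sampleBy` offers a comparable per-class sampling, although the resulting fractions are approximate rather than exact.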

Key Points Tested on the ML Associate Exam

  • Choosing between API methods: parameters and roles of classify() / regress() / forecast()
  • Using the generated notebook: the recommended workflow of customizing it as a baseline
  • Automatic MLflow logging: how every trial is recorded as a Run
  • When to apply: telling apart scenarios where AutoML fits from those that require manual ML
  • Understanding the limitations: single-node, supported algorithms, and data-size constraints
  • Forecast-specific parameters: the role of time_col, frequency, horizon, and identity_col
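
For the forecast-specific parameters, it helps to remember that horizon counts periods of frequency, starting after the last observed value of time_col. The sketch below makes that arithmetic explicit with the standard library. `forecast_dates` and `STEP_DAYS` are hypothetical names for this example, and only the fixed-length "d" and "W" frequencies are covered, since months vary in length.

```python
from datetime import date, timedelta

# Fixed-length frequencies only; monthly steps would need calendar-aware logic.
STEP_DAYS = {"d": 1, "W": 7}


def forecast_dates(last_observed, frequency, horizon):
    """Return the horizon future timestamps that follow the last observed date."""
    step = timedelta(days=STEP_DAYS[frequency])
    return [last_observed + step * i for i in range(1, horizon + 1)]


daily = forecast_dates(date(2026, 3, 1), "d", 30)
weekly = forecast_dates(date(2026, 3, 1), "W", 2)
print(daily[0], daily[-1], len(daily))
print(weekly)
```

With identity_col set, the same horizon applies to each series separately, so every store in the earlier example would receive its own 30 forecast points.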

Sample Question

AutoML / ML Associate

Question 1

A data scientist built a customer churn prediction model with AutoML. Which combination correctly describes what is produced by an AutoML run?

  A. Only the best model. No information about other trials is recorded.
  B. An MLflow Experiment with Runs for every trial (parameters, metrics, models), plus a reproducible notebook for each trial.
  C. The best model and a table of hyperparameters. Notebook generation and MLflow logging must be configured manually.
  D. Only the model files for every trial. Metric comparison has to be done in Databricks SQL.

Correct answer: B

Databricks AutoML automatically records every trial as a Run in an MLflow Experiment. Each Run contains hyperparameters, evaluation metrics (cross-validation results), the trained model, and Feature Importance. In addition, a reproducible Python notebook is auto-generated for each trial, containing the data loading, preprocessing, model definition, training, and evaluation code. Options A (only the best model) and C (manual MLflow setup) are wrong, and D is wrong because metric comparison can be done directly in the MLflow UI.

Frequently Asked Questions

How can I use the notebook generated by AutoML?

The notebook generated by AutoML contains code for data preprocessing, feature engineering, model training, and hyperparameter settings. The notebook is fully runnable and freely editable, so the recommended workflow is to use the AutoML result as a baseline and then customize it with domain knowledge (adding features, switching algorithms, adjusting preprocessing).

What dataset sizes does AutoML support?

AutoML runs on a single node, so it is best suited to datasets that fit in memory. As a rough guide, it handles up to a few million rows efficiently. For much larger datasets (for example, billions of rows), you should sample or aggregate up front, or consider distributed training frameworks (Spark ML, Horovod, etc.). AutoML also applies sampling automatically when it estimates that the dataset is too large for the available memory.

How are AutoML results recorded in MLflow?

Every AutoML trial (model candidate) is automatically recorded as a Run in an MLflow Experiment. Each Run includes the hyperparameters, evaluation metrics (accuracy, F1, RMSE, etc.), the trained model, and a link to the generated notebook. You can pick the best model by comparing metrics in the MLflow UI, register it in the Model Registry, and then proceed to production deployment.

Author

NicheeLab Editorial Team

The NicheeLab editorial team focuses on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

