Hyperparameter Tuning: Complete Hyperopt and Optuna Guide

2026-03-26
Updated: 2026-03-27
NicheeLab Editorial Team

Hyperparameter tuning is the process of optimizing parameters that aren't learned during training, such as learning rate, tree depth, and regularization strength. On Databricks, Hyperopt (distributed search via SparkTrials) and Optuna are the main tools, and every trial can be recorded in MLflow Tracking. The ML Associate exam frequently tests the basic Hyperopt API, while ML Professional often asks when to use SparkTrials vs Trials.

Comparing Search Strategies

The algorithms used to find the best combination from a hyperparameter search space fall into three broad categories.

| Strategy | How it works | Pros | Cons | Representative tools |
|---|---|---|---|---|
| Grid Search | Exhaustively tries every specified combination | Highly reproducible; effective when there are few parameters | Search space grows exponentially (curse of dimensionality) | scikit-learn GridSearchCV |
| Random Search | Samples randomly from the search space | Tends to find good solutions with fewer trials than Grid Search | Allocates resources equally to unimportant parameters | scikit-learn RandomizedSearchCV |
| Bayesian Optimization | Builds a probabilistic model (such as TPE) from past trials and predicts the next point to try | Converges to the optimum with fewer trials; handles high-dimensional spaces well | Has sequential dependencies, so pure parallelization requires careful design | Hyperopt (TPE), Optuna (TPE) |
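To make the "grows exponentially" point concrete, here is a minimal sketch that counts the combinations Grid Search would have to evaluate and contrasts that with Random Search's fixed budget. The search space, the function names, and the trial count are hypothetical values chosen for illustration only.

```python
import itertools
import random

# Hypothetical search space: 4 hyperparameters with 5 candidate values each.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 3e-2, 1e-1],
    "max_depth": [3, 5, 7, 9, 11],
    "n_estimators": [50, 100, 200, 300, 500],
    "subsample": [0.5, 0.6, 0.7, 0.8, 1.0],
}


def grid_size(space):
    """Number of combinations Grid Search must evaluate."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


def random_candidates(space, n_trials, seed=0):
    """Draw n_trials random combinations, as Random Search would."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


# Enumerating the grid explicitly gives the same count as grid_size().
all_combinations = list(itertools.product(*grid.values()))

print(grid_size(grid))                   # 5 ** 4 combinations
print(len(random_candidates(grid, 20)))  # the budget is fixed by n_trials
```

Adding a fifth parameter with five values would multiply the grid by five again, while the Random Search budget stays whatever you set it to.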

For both practical work and the exam, the most important strategy on Databricks is Bayesian Optimization (TPE: Tree-structured Parzen Estimator). Hyperopt and Optuna both default to TPE, and they can find strong solutions in high-dimensional spaces with roughly 50-200 trials.

Hyperopt Basics

Hyperopt is the Bayesian optimization library built into Databricks. You can run tuning simply by passing an objective function, a search space, an algorithm, and a maximum number of trials to fmin().

Defining the Search Space

| Function | Use case | Example |
|---|---|---|
| hp.choice(label, options) | Categorical values (discrete choices) | hp.choice("algo", ["rf", "xgb", "lgb"]) |
| hp.uniform(label, low, high) | Uniform distribution (continuous values) | hp.uniform("dropout", 0.1, 0.5) |
| hp.loguniform(label, low, high) | Log-uniform distribution (parameters that span orders of magnitude, e.g. learning rate); low and high are given in log space | hp.loguniform("lr", np.log(1e-5), np.log(1e-1)) |
| hp.quniform(label, low, high, q) | Quantized uniform distribution (integer-valued parameters; the sampled value is a float, so cast it with int()) | hp.quniform("max_depth", 3, 15, 1) |
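The following sketch imitates what each of these distributions samples, using only the Python standard library. It is not Hyperopt's implementation; the helper names are hypothetical stand-ins that follow the documented definitions (for example, loguniform as exp(uniform(low, high))), and they are here only to show why the bounds of hp.loguniform are passed as logarithms and why hp.quniform values need an int() cast.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def uniform(low, high):
    """Like hp.uniform: a float drawn uniformly between low and high."""
    return rng.uniform(low, high)


def loguniform(low, high):
    """Like hp.loguniform: exp(uniform(low, high)); bounds are in log space."""
    return math.exp(rng.uniform(low, high))


def quniform(low, high, q):
    """Like hp.quniform: round(uniform(low, high) / q) * q, kept as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


def choice(options):
    """Like hp.choice: pick one of the listed options."""
    return rng.choice(options)


lr = loguniform(math.log(1e-5), math.log(1e-1))  # spans four orders of magnitude
depth = quniform(3, 15, 1)                       # whole number, but a float
algo = choice(["rf", "xgb", "lgb"])

print(lr, int(depth), algo)
```

One related detail worth checking in your own runs: fmin reports an hp.choice result as an index into the options list, and hyperopt.space_eval maps that index back to the actual value.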

Hyperopt + MLflow Integration Example

from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
import mlflow
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define the search space
search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 3, 15, 1),
    "min_samples_split": hp.quniform("min_samples_split", 2, 20, 1),
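    # RandomForestClassifier has no learning_rate parameter; ccp_alpha is a
    # log-scale parameter it does accept, so hp.loguniform still applies.
    "ccp_alpha": hp.loguniform("ccp_alpha", np.log(1e-4), np.log(1e-1)),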
}

# Objective function (returns the value to minimize)
def objective(params):
    params["n_estimators"] = int(params["n_estimators"])
    params["max_depth"] = int(params["max_depth"])
    params["min_samples_split"] = int(params["min_samples_split"])

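    # X_train and y_train are assumed to be defined earlier in the notebook;
    # scoring="f1" assumes a binary classification target.
    clf = RandomForestClassifier(**params, random_state=42)
    score = cross_val_score(clf, X_train, y_train, cv=3, scoring="f1").mean()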

    # Log metrics to MLflow
    mlflow.log_metrics({"f1_cv": score})

    # Return STATUS_OK and loss (loss is minimized)
    return {"loss": -score, "status": STATUS_OK}

# Run distributed with SparkTrials
spark_trials = SparkTrials(parallelism=8)

with mlflow.start_run(run_name="hyperopt_rf_tuning"):
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,       # TPE (Bayesian Optimization)
        max_evals=100,          # up to 100 trials
        trials=spark_trials,    # run in parallel on Spark executors
    )

The objective function must return a dict in the form {"loss": value, "status": STATUS_OK}. Because loss is the value fmin minimizes, return the negated score (-score) when you want to maximize a metric such as accuracy or F1.
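The return-value convention can be sketched as a small wrapper that turns a "higher is better" score into the dict fmin expects. The helper name as_hyperopt_result and the extra "score" and "error" keys are hypothetical; the two status strings are defined locally and are assumed to match the values of hyperopt.STATUS_OK and hyperopt.STATUS_FAIL so that the sketch runs without Hyperopt installed.

```python
# Assumed to mirror hyperopt.STATUS_OK and hyperopt.STATUS_FAIL.
STATUS_OK = "ok"
STATUS_FAIL = "fail"


def as_hyperopt_result(evaluate, params):
    """Run evaluate(params) -> score (higher is better) and wrap it for fmin."""
    try:
        score = evaluate(params)
    except Exception as exc:
        # A failed trial reports its status instead of a loss.
        return {"status": STATUS_FAIL, "error": str(exc)}
    # fmin minimizes "loss", so a score to maximize is negated.
    return {"loss": -score, "status": STATUS_OK, "score": score}


result = as_hyperopt_result(lambda params: 0.8, {"max_depth": 5})
print(result)
```

Reporting a failure status rather than raising lets the search continue past a bad parameter combination instead of aborting the whole run.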

SparkTrials vs Trials

The Trials class choice determines Hyperopt's execution mode. Whether you tap into the cluster's resources makes a huge difference in performance.

| Item | Trials | SparkTrials |
|---|---|---|
| Execution location | Driver node (single machine) | Spark executors (entire cluster) |
| Parallelism | Sequential execution only | Controlled by the parallelism parameter |
| Suitable models | Single-machine ML such as scikit-learn | Single-machine ML such as scikit-learn (each executor runs independently) |
| MLflow integration | Manual logging required | Each trial is automatically logged as a nested run |
| Recommended parallelism | N/A (sequential) | Match the cluster's worker count, or use the square root of max_evals |
| Caveats | Even 100 trials run sequentially on a single driver | Too-high parallelism erodes TPE's sequential-optimization advantage |

SparkTrials' parallelism involves a tradeoff. Higher values increase raw parallelism, but TPE uses past results to choose the next point. With parallelism too high, you end up choosing the next point while many in-flight trials haven't returned yet, which approaches random search. In practice, roughly the square root of max_evals is the recommended setting.
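The tradeoff can be put into numbers with a small sketch. Both helper names are hypothetical, and capping the square-root rule at the worker count is an assumption made here for illustration, not a Databricks API. The second function counts roughly how many batches of completed results TPE gets to learn from.

```python
import math


def suggest_parallelism(max_evals, num_workers):
    """Rule of thumb: about sqrt(max_evals), capped by the worker count."""
    return max(1, min(num_workers, round(math.sqrt(max_evals))))


def feedback_rounds(max_evals, parallelism):
    """Roughly how many batches of finished trials TPE can learn from."""
    return math.ceil(max_evals / parallelism)


parallelism = suggest_parallelism(max_evals=100, num_workers=8)
print(parallelism)                         # worker count is the binding limit
print(feedback_rounds(100, parallelism))   # many rounds of feedback for TPE
print(feedback_rounds(100, 100))           # one round: no feedback at all
```

With parallelism equal to max_evals there is a single batch, so every point is chosen before any result is available, which is the random-search behavior described above.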

Optuna Basics

Optuna is a Bayesian optimization framework developed by Japan-based Preferred Networks. Unlike Hyperopt, it supports pruning (early termination) out of the box, letting you abort unpromising trials mid-flight to cut compute costs.

Key Optuna APIs

| API | Role |
|---|---|
| optuna.create_study(direction) | Create an optimization study ("minimize" or "maximize") |
| study.optimize(objective, n_trials) | Run the specified number of optimization trials |
| trial.suggest_int(name, low, high) | Search an integer parameter |
| trial.suggest_float(name, low, high, log) | Search a float parameter (log=True for log scale) |
| trial.suggest_categorical(name, choices) | Search categorical values |
| trial.report(value, step) | Report intermediate values (used for pruning decisions) |
| trial.should_prune() | Pruning check (True means terminate early) |
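To show the kind of decision that trial.report() and trial.should_prune() feed into, here is a simplified median-rule sketch in plain Python. It is not Optuna's implementation: the function name, the history structure, and the n_startup_trials default are assumptions for illustration, and the real pruners handle more cases (warm-up steps, best-so-far values, and so on).

```python
from statistics import median


def should_prune_sketch(history, current_value, step,
                        direction="maximize", n_startup_trials=5):
    """Simplified median rule for early termination.

    history: one list of intermediate values per earlier trial,
             indexed by step.
    Returns True when the current trial is worse than the median of the
    earlier trials at the same step.
    """
    values_at_step = [vals[step] for vals in history if len(vals) > step]
    if len(values_at_step) < n_startup_trials:
        return False  # too few earlier trials to judge against
    mid = median(values_at_step)
    if direction == "maximize":
        return current_value < mid
    return current_value > mid


# Five earlier trials that each reported a score at steps 0, 1, and 2.
history = [[0.50, 0.60, 0.70] for _ in range(5)]

print(should_prune_sketch(history, current_value=0.40, step=1))  # below the median
print(should_prune_sketch(history, current_value=0.65, step=1))  # above the median
```

In an Optuna objective the same idea appears as a loop that calls trial.report(value, step) after each training step and raises optuna.TrialPruned() when trial.should_prune() returns True.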

Optuna + MLflow Integration Example

import optuna
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
    }

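    # X_train and y_train are assumed to be defined earlier in the notebook;
    # scoring="f1" assumes a binary classification target.
    clf = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(clf, X_train, y_train, cv=3, scoring="f1").mean()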

    # Log each trial to MLflow
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        mlflow.log_metric("f1_cv", score)

    return score

with mlflow.start_run(run_name="optuna_gbm_tuning"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    # Record the best parameters
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_f1", study.best_value)

Hyperopt vs Optuna

| Item | Hyperopt | Optuna |
|---|---|---|
| Search algorithms | TPE, Random Search, Adaptive TPE | TPE, CMA-ES, Random Search, Grid Search, GP |
| Distributed support on Databricks | Native support via SparkTrials | Can be parallelized with Joblib etc., but no Spark integration |
| Pruning (early termination) | Not supported out of the box | MedianPruner, HyperbandPruner, and others built in |
| Objective return value | A dict containing loss (to minimize) and STATUS_OK | A scalar value (direction selects maximize/minimize) |
| MLflow integration | Automatic logging when using SparkTrials | Manual mlflow.start_run(nested=True) |
| Relation to AutoML | Engine behind Databricks AutoML | Not used by AutoML |
| Visualization | Use the MLflow UI | Visualize the search process with optuna.visualization |
| Exam importance | Frequently tested on ML Associate and ML Professional | Rarely tested directly, but useful for conceptual understanding |

Relationship with AutoML

Databricks AutoML automatically performs preprocessing, feature engineering, model selection, and hyperparameter tuning once you hand it data. Internally it uses Hyperopt's TPE algorithm to search hyperparameters for each model, and every trial is automatically logged to an MLflow experiment.

  • Every notebook AutoML generates contains tuning code that uses Hyperopt
  • You can edit the generated notebooks to expand the search space or add custom preprocessing
  • A common production pattern is to use AutoML for a baseline and then fine-tune with Hyperopt
  • Because AutoML results are logged to an MLflow experiment, you can directly compare them with manual tuning results

MLflow Tracking Integration

When you use Hyperopt with SparkTrials, each trial is automatically logged as a child run nested under the parent run. That lets you compare parameters and metrics across all trials in the MLflow UI.

  • Parent run: created with mlflow.start_run(); records metadata for the entire tuning job
  • Child runs: SparkTrials automatically logs each trial as a nested run (parameters, loss, status)
  • Compare feature: plot metrics across child runs to analyze the impact of each parameter
  • Register the best run's model in the Model Registry to move seamlessly to deployment

Best Practices for Distributed Tuning

  • Setting parallelism: use the square root of max_evals as a rule of thumb; for 100 trials, parallelism of around 10 works well. Too high a value hurts TPE's search efficiency
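  • Early stopping: pass early_stop_fn to fmin (for example hyperopt.early_stop.no_progress_loss, which stops when the loss stops improving), or set fmin's loss_threshold to stop the search once a target loss is reached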
  • Search-space types: use hp.loguniform for learning rate (parameters that span orders of magnitude) and hp.quniform for tree depth (integer parameters)
  • Cluster sizing: SparkTrials runs each trial as a single Spark task on one executor, so one trial occupies one task slot; for GPU models, ensure one GPU per worker
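  • Data size and caching: for large training datasets, reduce data-transfer costs by broadcasting the data with sc.broadcast() (that is, spark.sparkContext.broadcast()) or by writing it to DBFS in advance and loading it inside the objective function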
  • Ensuring reproducibility: fix seeds with np.random.seed() and the rstate parameter, but note that distributed execution has non-deterministic ordering
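As an illustration of the early-stopping item above, here is a minimal sketch of a callback that stops once a target loss has been reached. It assumes the callback shape used by hyperopt.early_stop.no_progress_loss, namely fn(trials, *state) returning (stop, new_state); check this against your installed Hyperopt version. The factory name stop_at_target_loss is hypothetical, and the SimpleNamespace object is only a stand-in for a real Trials object so the sketch runs without Hyperopt.

```python
from types import SimpleNamespace


def stop_at_target_loss(target_loss):
    """Build a callback in the assumed early_stop_fn shape:
    fn(trials, *state) -> (stop, new_state)."""
    def stop_fn(trials, *state):
        # Unfinished trials have no "loss" entry yet, so skip them.
        losses = [t["result"].get("loss") for t in trials.trials]
        finished = [loss for loss in losses if loss is not None]
        best = min(finished) if finished else None
        stop = best is not None and best <= target_loss
        return stop, []  # no state is carried between calls
    return stop_fn


# Stand-in for a Hyperopt Trials object, used only to exercise the callback.
fake_trials = SimpleNamespace(trials=[
    {"result": {"loss": -0.81, "status": "ok"}},
    {"result": {"loss": -0.92, "status": "ok"}},
    {"result": {"status": "new"}},  # still running
])

# Loss is the negated F1 score, so a target F1 of 0.90 is a target loss of -0.90.
stop, _ = stop_at_target_loss(-0.90)(fake_trials)
print(stop)
```

In a real run the callback would be passed as fmin(..., early_stop_fn=stop_at_target_loss(-0.90)).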

Exam Focus Points

| Exam | Scope | Key points |
|---|---|---|
| ML Associate | Hyperopt basics | Meaning and usage of fmin, hp.choice, hp.loguniform, and STATUS_OK |
| ML Associate | Differences between search strategies | Characteristics of Grid Search vs Random Search vs Bayesian Optimization |
| ML Professional | SparkTrials vs Trials | Difference between distributed and single-machine execution; setting parallelism |
| ML Professional | MLflow integration | Automatic nested-run logging with SparkTrials and how to compare results |
| ML Professional | Relationship with AutoML | The fact that AutoML uses Hyperopt internally; using generated notebooks |

Check Your Understanding

ML Professional

Question 1

An ML engineer wants to run 100 hyperparameter tuning trials on a scikit-learn random forest on an 8-worker Databricks cluster. Which approach best maximizes cluster resource usage while keeping TPE's search efficiency intact?

  1. Pass a Trials object to fmin() and run all 100 trials sequentially on the driver node
  2. Pass SparkTrials(parallelism=8) to fmin() and run trials in parallel on each executor. The square root of max_evals (about 10) is the recommended parallelism, but matching the worker count at 8 is acceptable
  3. Pass SparkTrials(parallelism=100) to fmin() and run all 100 trials simultaneously
  4. Use Optuna's create_study and distribute across Spark executors with Joblib parallelism

Correct answer: 2

SparkTrials distributes trials across Spark executors. parallelism=8 matches the worker count and is close to the square root of max_evals (100), about 10, which makes it a reasonable choice. Option 1 doesn't use the cluster's resources. Option 3, with parallelism=100, effectively wipes out TPE's advantage of using past results to pick the next point, making it equivalent to random search. Optuna in option 4 lacks SparkTrials integration on Databricks and is not the best fit.

Frequently Asked Questions

Should I use Hyperopt or Optuna?

If integration with the Databricks ecosystem matters most, Hyperopt is the first choice. Its strengths are cluster-wide distributed tuning via SparkTrials, being the engine behind AutoML, and automatic integration with MLflow Tracking. On the other hand, Optuna fits better when you need pruning (early termination) for efficiency, search algorithms beyond TPE (such as CMA-ES), or when you want to run the same code across other clouds and on-prem. For exam prep, Hyperopt is the priority.

What is the difference between SparkTrials and Trials?

Trials is a class that runs trials sequentially on a single machine, using only the resources of one driver node. SparkTrials distributes trials across Spark executors and runs them in parallel across the cluster. For example, an 8-worker cluster can run up to 8 trials concurrently, drastically reducing the time to complete 100 trials. This distinction comes up often on the ML Professional exam.

How does hyperparameter tuning relate to AutoML?

Databricks AutoML uses Hyperopt internally to search hyperparameters. AutoML is a higher-level layer that automates preprocessing, feature engineering, model selection, and tuning, and every trial is automatically logged to MLflow. The notebooks AutoML generates contain Hyperopt code, which you can customize to build your own tuning pipeline.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

