Running Databricks in production requires automated CI/CD deployment of notebooks, jobs, and pipelines. Since 2024, Databricks has positioned Databricks Asset Bundles (DABs) as the recommended deployment method, and the Data Engineer Professional exam also covers DABs configuration, GitHub Actions integration, and environment separation. This article walks through how to build a Databricks CI/CD pipeline with DABs at the center.
A Databricks CI/CD pipeline is typically structured around the following flow:

    ┌──────────┐    PR/Merge     ┌─────────────┐   databricks   ┌──────────────┐
    │ Git Repo │ ──────────────► │ CI/CD Tool  │ ──── CLI ────► │  Databricks  │
    │ (GitHub) │                 │ (GH Actions)│     bundle     │  Workspace   │
    │          │ ◄────────────── │             │     deploy     │              │
    │ - src/   │  Test results   │ - validate  │                │ - dev        │
    │ - tests/ │                 │ - pytest    │                │ - staging    │
    │ - bundle │                 │ - deploy    │                │ - prod       │
    └──────────┘                 └─────────────┘                └──────────────┘

A pull request or merge in the Git repository triggers the CI/CD tool, which validates the bundle, runs the tests, and reports the results back. databricks bundle deploy then ships the changes to each environment.

DABs lets you define Databricks resources (jobs, DLT pipelines, ML models, etc.) declaratively in a databricks.yml file and deploy them with a single CLI command. You define a target per environment and switch between dev, staging, and prod by overriding variables.
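To make the override mechanism concrete, here is a minimal Python sketch of how a target's variable overrides could be layered over the defaults and substituted into ${var.*} and ${bundle.target} references. It is an illustrative model only, not the Databricks CLI's actual implementation; resolve_variables and render are hypothetical helper names, and the values are placeholders.

```python
import re

# Defaults declared at the top level of the bundle (placeholder values).
DEFAULTS = {"catalog": "dev_catalog", "cluster_id": "0123-dev-cluster"}

# Per-target overrides, mirroring a targets section.
TARGETS = {
    "dev": {},
    "staging": {"catalog": "staging_catalog", "cluster_id": "0456-staging-cluster"},
    "prod": {"catalog": "prod_catalog", "cluster_id": "0789-prod-cluster"},
}

# Matches ${var.<name>} and ${bundle.<name>} references.
_REF = re.compile(r"\$\{(var|bundle)\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_variables(target: str) -> dict[str, str]:
    """Layer the target's overrides on top of the defaults."""
    if target not in TARGETS:
        raise KeyError(f"unknown target: {target}")
    return {**DEFAULTS, **TARGETS[target]}


def render(template: str, target: str) -> str:
    """Substitute ${var.<name>} and ${bundle.target} in a string."""
    variables = resolve_variables(target)

    def replace(match: re.Match) -> str:
        scope, name = match.group(1), match.group(2)
        if scope == "bundle":
            # This sketch only models the bundle.target reference.
            if name != "target":
                raise KeyError(f"unsupported reference: bundle.{name}")
            return target
        return variables[name]

    return _REF.sub(replace, template)


if __name__ == "__main__":
    print(render("daily-sales-etl-${bundle.target}", "prod"))
    print(render("${var.catalog}.sales", "staging"))
```

The same template string yields a different resource name or catalog per target, which is why one databricks.yml can serve every environment.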
A databricks.yml for a sales ETL bundle might look like this:

    bundle:
      name: sales-etl-pipeline

    workspace:
      host: https://adb-xxxx.azuredatabricks.net

    variables:
      catalog:
        default: dev_catalog
      warehouse_id:
        default: abc123def456
      # Declared here so that each target can override it below.
      cluster_id:
        description: ID of the existing cluster that runs the job tasks

    resources:
      jobs:
        daily_etl:
          name: "daily-sales-etl-${bundle.target}"
          schedule:
            quartz_cron_expression: "0 0 6 * * ?"
            timezone_id: "Asia/Tokyo"
          tasks:
            - task_key: ingest
              notebook_task:
                notebook_path: ./src/01_ingest.py
              existing_cluster_id: ${var.cluster_id}
            - task_key: transform
              depends_on:
                - task_key: ingest
              notebook_task:
                notebook_path: ./src/02_transform.py
              existing_cluster_id: ${var.cluster_id}
            - task_key: publish
              depends_on:
                - task_key: transform
              notebook_task:
                notebook_path: ./src/03_publish.py
              existing_cluster_id: ${var.cluster_id}

      pipelines:
        dlt_pipeline:
          name: "sales-dlt-${bundle.target}"
          # Catalog and target schema are separate fields.
          catalog: ${var.catalog}
          target: sales
          libraries:
            - notebook:
                path: ./src/dlt_definitions.py

    targets:
      dev:
        mode: development
        default: true
        variables:
          catalog: dev_catalog
          cluster_id: "0123-dev-cluster"

      staging:
        mode: development
        variables:
          catalog: staging_catalog
          cluster_id: "0456-staging-cluster"

      prod:
        mode: production
        variables:
          catalog: prod_catalog
          cluster_id: "0789-prod-cluster"
        run_as:
          service_principal_name: "sp-production-deployer"

The core bundle commands are:

| Command | Purpose |
|---|---|
| databricks bundle init | Initialize a project from a template |
| databricks bundle validate | Validate databricks.yml syntax and references |
| databricks bundle deploy -t dev | Deploy resources to the specified target |
| databricks bundle run -t dev daily_etl | Trigger a deployed job immediately |
| databricks bundle destroy -t dev | Delete the deployed resources |
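These commands can also be chained from a script. The sketch below builds the argument lists for a validate, deploy, run sequence and executes them only when execute is called. It assumes the databricks CLI is on PATH and already authenticated; bundle_argv, deploy_sequence, and execute are hypothetical helper names, not part of any Databricks library.

```python
import subprocess
from typing import Optional


def bundle_argv(action: str, target: str, resource: Optional[str] = None) -> list[str]:
    """Build the argv for: databricks bundle <action> -t <target> [resource]."""
    allowed = {"validate", "deploy", "run", "destroy"}
    if action not in allowed:
        raise ValueError(f"unsupported bundle action: {action}")
    argv = ["databricks", "bundle", action, "-t", target]
    if resource is not None:
        if action != "run":
            raise ValueError("a resource key is only used with 'run'")
        argv.append(resource)
    return argv


def deploy_sequence(target: str, job_key: Optional[str] = None) -> list[list[str]]:
    """Validate, then deploy, then optionally trigger one job."""
    steps = [bundle_argv("validate", target), bundle_argv("deploy", target)]
    if job_key is not None:
        steps.append(bundle_argv("run", target, job_key))
    return steps


def execute(steps: list[list[str]]) -> None:
    """Run each step in order; check=True stops at the first failing command."""
    for argv in steps:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for argv in deploy_sequence("dev", "daily_etl"):
        print(" ".join(argv))
```

Keeping argument construction separate from execution means the sequence can be unit-tested without a workspace.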
Here is an example GitHub Actions workflow that drives a DABs-based CI/CD pipeline:

    # .github/workflows/databricks-cicd.yml
    name: Databricks CI/CD

    on:
      push:
        branches: [main]
      pull_request:
        branches: [main]

    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

    jobs:
      validate-and-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Setup Python
            uses: actions/setup-python@v5
            with:
              python-version: "3.11"

          # The bundle commands need the new Databricks CLI; the legacy
          # databricks-cli package on PyPI does not include them.
          - name: Install Databricks CLI
            uses: databricks/setup-cli@main

          - name: Install dependencies
            run: |
              pip install pytest databricks-connect==15.4.*
              pip install -r requirements.txt

          - name: Validate bundle
            run: databricks bundle validate -t staging

          - name: Run unit tests
            run: pytest tests/unit/ -v --junitxml=test-results.xml

          - name: Run integration tests (staging)
            if: github.event_name == 'pull_request'
            run: |
              databricks bundle deploy -t staging
              databricks bundle run -t staging integration_test_job
              # --auto-approve skips the interactive confirmation prompt in CI.
              databricks bundle destroy -t staging --auto-approve

      deploy-production:
        needs: validate-and-test
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        runs-on: ubuntu-latest
        environment: production
        steps:
          - uses: actions/checkout@v4

          - name: Install Databricks CLI
            uses: databricks/setup-cli@main

          - name: Deploy to production
            run: databricks bundle deploy -t prod

Unit tests for the transformation logic run under pytest:

    # tests/unit/test_transform.py
    from pyspark.sql import SparkSession

    from src.transforms import clean_sales_data


    def test_clean_sales_data(spark: SparkSession):
        """Verify that rows containing null values are dropped."""
        input_df = spark.createDataFrame([
            (1, "ProductA", 100.0),
            (2, None, 200.0),
            (3, "ProductC", None),
        ], ["id", "product_name", "amount"])

        result = clean_sales_data(input_df)

        # Only the fully populated row should survive.
        assert result.count() == 1
        assert result.first()["product_name"] == "ProductA"

The spark fixture is created once per test session in conftest.py:

    # conftest.py
    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope="session")
    def spark():
        # Local two-core session shared by all unit tests.
        return SparkSession.builder \
            .master("local[2]") \
            .appName("unit-tests") \
            .getOrCreate()

Notebook-level tests can be driven from a runner notebook that calls each test notebook in turn:

    # Test runner notebook
    test_notebooks = [
        "/Repos/main/project/tests/test_ingest",
        "/Repos/main/project/tests/test_transform",
    ]

    results = []
    for nb in test_notebooks:
        try:
            # dbutils.notebook.run raises if the child notebook fails or times out.
            result = dbutils.notebook.run(nb, timeout_seconds=600)
            results.append({"notebook": nb, "status": "PASS", "result": result})
        except Exception as e:
            results.append({"notebook": nb, "status": "FAIL", "error": str(e)})

    # Fail the runner, and the job that wraps it, if any notebook failed.
    failed = [r for r in results if r["status"] == "FAIL"]
    if failed:
        raise Exception(f"{len(failed)} test(s) failed: {failed}")

Environment separation across the three targets looks like this:

| Item | dev | staging | prod |
|---|---|---|---|
| Catalog | dev_catalog | staging_catalog | prod_catalog |
| Deployment trigger | Manual / branch push | Auto on pull request | Auto after merging to main |
| DABs mode | development | development | production |
| Run-as identity | Individual developer | Service principal | Service principal |
| Data | Sample data | Subset of production data | Production data |
| Schedule | Manual runs only | Test schedule | Production schedule |
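The deployment triggers can be read as a small decision function. The sketch below mirrors the two if: conditions in the GitHub Actions workflow shown earlier: pull requests get a staging integration run, and pushes to main get a production deploy. plan_deployments is a hypothetical name; the event and ref strings are the values GitHub exposes as github.event_name and github.ref.

```python
def plan_deployments(event_name: str, ref: str) -> list[str]:
    """Return the bundle targets a CI run would deploy to."""
    targets = []
    # Mirrors: if: github.event_name == 'pull_request'
    if event_name == "pull_request":
        targets.append("staging")
    # Mirrors: if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    if event_name == "push" and ref == "refs/heads/main":
        targets.append("prod")
    return targets


if __name__ == "__main__":
    print(plan_deployments("pull_request", "refs/pull/42/merge"))
    print(plan_deployments("push", "refs/heads/main"))
    print(plan_deployments("push", "refs/heads/feature-x"))
```

A push to any branch other than main deploys nowhere, which keeps production changes behind the merge.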
Setting mode: production in DABs deploys resources without a name prefix and leaves job schedules active; combined with run_as, jobs execute as the specified service principal. With mode: development, resources get a [dev your-username] prefix and schedules and triggers are paused.
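The effect of the two modes on a deployed job can be summarized in a few lines of Python. This is a deliberately simplified model of the behavior described above, not the CLI's implementation (the real modes apply further presets); DeployedJob and apply_mode are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployedJob:
    name: str
    schedule_paused: bool


def apply_mode(job_name: str, mode: str, user_short_name: str) -> DeployedJob:
    """Simplified model of how a target's mode changes a deployed job."""
    if mode == "development":
        # Development mode: prefix the name and pause the schedule.
        return DeployedJob(f"[dev {user_short_name}] {job_name}", True)
    if mode == "production":
        # Production mode: name is used as written, schedule stays active.
        return DeployedJob(job_name, False)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(apply_mode("daily-sales-etl-dev", "development", "alice"))
    print(apply_mode("daily-sales-etl-prod", "production", "alice"))
```

The prefix is what lets several developers deploy the same bundle to a shared workspace without overwriting each other's jobs.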
Databricks Repos lets you clone Git repositories, switch branches, pull, and commit — all from within the workspace. Developers edit notebooks in Repos, push changes to Git, and open pull requests from there.
| Item | Databricks Asset Bundles | Terraform (databricks provider) |
|---|---|---|
| What it manages | Jobs, DLT pipelines, notebooks | Workspaces, clusters, IAM, jobs — everything |
| Definition language | YAML (databricks.yml) | HCL (.tf) |
| State management | Managed by the workspace | terraform.tfstate (S3, GCS, etc.) |
| Environment switching | Defined in the targets section | Workspaces / var files |
| Learning curve | Low (YAML + Databricks CLI) | High (HCL + Terraform concepts) |
| Infrastructure management | Not supported (Databricks resources only) | Supported (VPC, IAM, storage, etc.) |
| Recommended scenario | Data-engineer-led workload management | Platform-team-led infrastructure + workload management |
CI/CD - Databricks Asset Bundles
Question 1
A data engineering team is building a CI/CD pipeline with Databricks Asset Bundles. For the production (prod) deployment, they want jobs to run as a service principal and they want resource names to have no prefix. Which databricks.yml configuration is correct?
Correct answer: B
Setting mode: production in the DABs targets section removes the resource-name prefix and enables job schedules. Specifying run_as additionally causes jobs to execute under that service principal's permissions. mode: development prefixes resource names with the user name and disables schedules, so it is not appropriate for production. mode is specified inside the targets section, not the workspace section.
Should I use Databricks Asset Bundles (DABs) or Terraform?
DABs is purpose-built for deploying Databricks resources (jobs, pipelines, notebooks). You define resources declaratively in databricks.yml and ship them per environment with databricks bundle deploy. Terraform, on the other hand, is a general-purpose IaC tool for managing entire cloud infrastructures (VPC, IAM, storage, and more), not just Databricks. If you only need to manage Databricks jobs and notebooks, DABs is simpler and has a lower learning curve. If you need end-to-end management that includes infrastructure, Terraform is the better fit. A common pattern is to use both: Terraform for infrastructure and DABs for workloads.
How do Databricks Repos (Git integration) and DABs relate?
Databricks Repos lets you clone Git repositories inside the workspace and sync notebooks and Python files at the branch level. DABs, in contrast, is the mechanism for deploying resources from a CI/CD pipeline via the databricks CLI. Repos shines for interactive development flows (branch switching, pull request reviews), while DABs is the right tool for automated deployment pipelines. The recommended pattern is to edit notebooks on a branch via Repos during development, then deploy to production with DABs after the branch is merged.
How do I run notebook tests in a CI/CD pipeline?
Three approaches work well. (1) Run the target notebook with dbutils.notebook.run() and validate the result via its return value or widget parameters. (2) Extract the notebook's logic into Python modules and run pytest on GitHub Actions (using a mocked SparkSession or Databricks Connect). (3) Use the Databricks Asset Bundles validation feature (databricks bundle validate) to check the YAML syntax, then run the job in a staging environment for an end-to-end test.
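Approach (1) is easier to maintain when the aggregation logic is kept separate from dbutils. The sketch below takes the notebook-running function as a parameter, so on a cluster you could pass a wrapper around dbutils.notebook.run, while off-cluster tests pass a stub. It assumes, as a convention of this example only, that each test notebook ends with dbutils.notebook.exit(json.dumps({...})) containing a "failed" count; run_suite, assert_all_passed, and fake_run are hypothetical names.

```python
import json
from typing import Callable


def run_suite(notebooks: list[str], run: Callable[[str], str]) -> dict[str, str]:
    """Run each notebook and map its path to PASS, FAIL, or an ERROR message."""
    outcome = {}
    for path in notebooks:
        try:
            payload = json.loads(run(path))
            outcome[path] = "PASS" if payload.get("failed", 0) == 0 else "FAIL"
        except Exception as exc:  # notebook raised, timed out, or returned non-JSON
            outcome[path] = f"ERROR: {exc}"
    return outcome


def assert_all_passed(outcome: dict[str, str]) -> None:
    """Raise if any notebook did not report PASS."""
    bad = {path: status for path, status in outcome.items() if status != "PASS"}
    if bad:
        raise AssertionError(f"{len(bad)} notebook test(s) did not pass: {bad}")


def fake_run(path: str) -> str:
    """Stub standing in for dbutils.notebook.run in off-cluster tests."""
    if path.endswith("test_ingest"):
        return json.dumps({"failed": 0})
    return json.dumps({"failed": 2})


if __name__ == "__main__":
    print(run_suite(["/tests/test_ingest", "/tests/test_transform"], fake_run))
```

On a cluster, the call would look like run_suite(paths, lambda p: dbutils.notebook.run(p, 600)), followed by assert_all_passed to fail the job when any notebook reports failures.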
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.