Databricks CI/CD Pipelines: Bundles, Repos, GitHub Actions (2026)

Running Databricks in production requires automated CI/CD deployment of notebooks, jobs, and pipelines. Since 2024, Databricks has positioned Databricks Asset Bundles (DABs) as the recommended deployment method, and the Data Engineer Professional exam also covers DABs configuration, GitHub Actions integration, and environment separation. This article walks through how to build a Databricks CI/CD pipeline with DABs at the center.

End-to-End CI/CD Overview

A Databricks CI/CD pipeline is typically structured around the following flow:

┌──────────┐    PR/Merge     ┌─────────────┐   databricks   ┌──────────────┐
│ Git Repo │ ─────────────► │ CI/CD Tool  │ ─── CLI ────► │  Databricks  │
│ (GitHub) │                │ (GH Actions)│   bundle       │  Workspace   │
│          │ ◄───────────── │             │   deploy       │              │
│ - src/   │  test results  │ - validate  │               │ - dev        │
│ - tests/ │                │ - pytest    │               │ - staging    │
│ - bundle │                │ - deploy    │               │ - prod       │
└──────────┘                └─────────────┘               └──────────────┘

Development phase: edit and test notebooks on a branch using Databricks Repos.
CI phase: GitHub Actions runs automated checks (lint, pytest, bundle validate) on every pull request and on every push to main.
CD phase: once tests pass, databricks bundle deploy ships the changes to each environment.

Databricks Asset Bundles (DABs)

DABs lets you define Databricks resources (jobs, DLT pipelines, ML models, etc.) declaratively in a databricks.yml file and deploy them with a single CLI command. You define a target per environment and switch between dev, staging, and prod by overriding variables.
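
To make the override idea concrete, here is a minimal Python sketch of how target-level variables could take precedence over top-level defaults when a string such as ${var.catalog} is resolved. It is a simplified model for illustration, not the Databricks CLI's actual implementation; the BUNDLE dictionary and the resolve function are hypothetical names.

```python
import re

# Hypothetical, simplified model of a bundle: top-level variable defaults
# plus per-target overrides. Not the real databricks.yml schema.
BUNDLE = {
    "variables": {"catalog": "dev_catalog"},
    "targets": {
        "dev": {"variables": {}},
        "prod": {"variables": {"catalog": "prod_catalog"}},
    },
}

# Matches ${var.<name>} or ${bundle.target}.
_REF = re.compile(r"\$\{(var\.(\w+)|bundle\.target)\}")


def resolve(template: str, bundle: dict, target: str) -> str:
    """Substitute ${var.<name>} and ${bundle.target} for the chosen target."""
    # Target-level values win over top-level defaults.
    values = {**bundle["variables"], **bundle["targets"][target]["variables"]}

    def _replace(match: re.Match) -> str:
        if match.group(1) == "bundle.target":
            return target
        name = match.group(2)
        if name not in values:
            raise KeyError(f"variable '{name}' is not declared")
        return values[name]

    return _REF.sub(_replace, template)


print(resolve("daily-sales-etl-${bundle.target}", BUNDLE, "prod"))
print(resolve("${var.catalog}.sales", BUNDLE, "dev"))
```

The same template yields a different value per target, which is the property that lets one databricks.yml serve dev, staging, and prod.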

Structure of databricks.yml

bundle:
  name: sales-etl-pipeline

workspace:
  host: https://adb-xxxx.azuredatabricks.net

variables:
  catalog:
    default: dev_catalog
  warehouse_id:
    default: abc123def456
  cluster_id:
    description: ID of the existing cluster that runs the job tasks (set per target)

resources:
  jobs:
    daily_etl:
      name: "daily-sales-etl-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "Asia/Tokyo"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/01_ingest.py
          existing_cluster_id: ${var.cluster_id}
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./src/02_transform.py
          existing_cluster_id: ${var.cluster_id}
        - task_key: publish
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./src/03_publish.py
          existing_cluster_id: ${var.cluster_id}

  pipelines:
    dlt_pipeline:
      name: "sales-dlt-${bundle.target}"
      catalog: ${var.catalog}
      target: sales
      libraries:
        - notebook:
            path: ./src/dlt_definitions.py

targets:
  dev:
    mode: development
    default: true
    variables:
      catalog: dev_catalog
      cluster_id: "0123-dev-cluster"

  staging:
    mode: development
    variables:
      catalog: staging_catalog
      cluster_id: "0456-staging-cluster"

  prod:
    mode: production
    variables:
      catalog: prod_catalog
      cluster_id: "0789-prod-cluster"
    run_as:
      service_principal_name: "sp-production-deployer"

Key DABs commands

Command	Purpose
`databricks bundle init`	Initialize a project from a template
`databricks bundle validate`	Validate databricks.yml syntax and references
`databricks bundle deploy -t dev`	Deploy resources to the specified target
`databricks bundle run -t dev daily_etl`	Trigger a deployed job immediately
`databricks bundle destroy -t dev`	Delete the deployed resources
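
When these commands are driven from a script rather than typed by hand, a thin wrapper keeps the argument order consistent. The following is a minimal sketch that assumes the databricks CLI is on PATH; bundle_command and run_bundle are hypothetical helper names, and only the subcommands listed in the table are accepted.

```python
import subprocess
from typing import List, Optional

# Subcommands taken from the table above.
BUNDLE_ACTIONS = {"init", "validate", "deploy", "run", "destroy"}


def bundle_command(action: str, target: Optional[str] = None, *extra: str) -> List[str]:
    """Build the argument list for a `databricks bundle <action>` call."""
    if action not in BUNDLE_ACTIONS:
        raise ValueError(f"unsupported bundle action: {action}")
    cmd = ["databricks", "bundle", action]
    if target is not None:
        cmd += ["-t", target]
    cmd += list(extra)
    return cmd


def run_bundle(action: str, target: Optional[str] = None, *extra: str) -> None:
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(bundle_command(action, target, *extra), check=True)


print(bundle_command("run", "dev", "daily_etl"))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with resource keys or target names.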

GitHub Actions Integration

Here is an example GitHub Actions workflow that drives a DABs-based CI/CD pipeline:

# .github/workflows/databricks-cicd.yml
name: Databricks CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  validate-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

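      # The legacy databricks-cli PyPI package has no bundle commands,
      # so install the current Databricks CLI instead.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Install dependencies
        run: |
          pip install pytest databricks-connect==15.4.*
          pip install -r requirements.txt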

      - name: Validate bundle
        run: databricks bundle validate -t staging

      - name: Run unit tests
        run: pytest tests/unit/ -v --junitxml=test-results.xml

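      # integration_test_job must be defined under resources.jobs in databricks.yml
      - name: Run integration tests (staging)
        if: github.event_name == 'pull_request'
        run: |
          databricks bundle deploy -t staging
          databricks bundle run -t staging integration_test_job

      # Clean up even if the tests fail; --auto-approve skips the interactive prompt
      - name: Tear down staging resources
        if: always() && github.event_name == 'pull_request'
        run: databricks bundle destroy -t staging --auto-approve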

  deploy-production:
    needs: validate-and-test
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # bundle commands need the current Databricks CLI, not the legacy PyPI package
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Deploy to production
        run: databricks bundle deploy -t prod
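
The if conditions in this workflow decide which stages run for a given event. As a rough illustration, the gating can be modelled in a few lines of Python. This is a sketch of the logic only, not something GitHub Actions executes; planned_stages and the stage labels are hypothetical names.

```python
from typing import List


def planned_stages(event_name: str, ref: str) -> List[str]:
    """Return the stages the workflow above would run for a GitHub event."""
    # validate-and-test runs on both pull_request and push events.
    stages = ["validate", "unit-tests"]

    # Integration tests deploy to staging only for pull requests.
    if event_name == "pull_request":
        stages += ["deploy-staging", "integration-tests", "destroy-staging"]

    # Production deploys only on a push to main, after the tests pass.
    if event_name == "push" and ref == "refs/heads/main":
        stages.append("deploy-prod")

    return stages


# For pull_request events, github.ref looks like refs/pull/<number>/merge.
print(planned_stages("pull_request", "refs/pull/42/merge"))
print(planned_stages("push", "refs/heads/main"))
```

A pull request never reaches production, and a push to main skips the staging round trip because the change was already exercised on the pull request.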

Test Automation

Unit tests (pytest)

# tests/unit/test_transform.py
from pyspark.sql import SparkSession
from src.transforms import clean_sales_data

def test_clean_sales_data(spark: SparkSession):
    """Verify that rows containing null values are dropped."""
    input_df = spark.createDataFrame([
        (1, "ProductA", 100.0),
        (2, None, 200.0),
        (3, "ProductC", None),
    ], ["id", "product_name", "amount"])

    result = clean_sales_data(input_df)

    assert result.count() == 1
    assert result.first()["product_name"] == "ProductA"


# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder \
        .master("local[2]") \
        .appName("unit-tests") \
        .getOrCreate()
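
The test imports clean_sales_data from src/transforms.py, which the article does not show. One possible implementation that would satisfy the test is sketched below; the REQUIRED_COLUMNS constant and the choice of DataFrame.dropna are assumptions, not the original module.

```python
# src/transforms.py (hypothetical sketch)
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only, so the module loads without PySpark installed.
    from pyspark.sql import DataFrame

# Assumption: a row is valid only when both of these columns are non-null.
REQUIRED_COLUMNS = ["product_name", "amount"]


def clean_sales_data(df: DataFrame) -> DataFrame:
    """Drop rows that have a null in any required column."""
    return df.dropna(how="any", subset=REQUIRED_COLUMNS)
```

Keeping the logic in a plain function that takes and returns a DataFrame is what makes it testable with pytest outside a notebook.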

Notebook tests (dbutils.notebook.run)

# Test runner notebook
test_notebooks = [
    "/Repos/main/project/tests/test_ingest",
    "/Repos/main/project/tests/test_transform",
]

results = []
for nb in test_notebooks:
    try:
        result = dbutils.notebook.run(nb, timeout_seconds=600)
        results.append({"notebook": nb, "status": "PASS", "result": result})
    except Exception as e:
        results.append({"notebook": nb, "status": "FAIL", "error": str(e)})

failed = [r for r in results if r["status"] == "FAIL"]
if failed:
    raise Exception(f"{len(failed)} test(s) failed: {failed}")
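
The runner above depends on dbutils, which exists only inside Databricks. A variation that takes the run function as a parameter can be exercised locally with a fake. This is a sketch under that assumption; run_notebook_tests and fake_run are hypothetical names.

```python
from typing import Callable, Dict, List


def run_notebook_tests(
    run: Callable[[str, int], str],
    notebooks: List[str],
    timeout_seconds: int = 600,
) -> List[Dict[str, str]]:
    """Run each notebook through `run` and record PASS or FAIL per notebook."""
    results = []
    for nb in notebooks:
        try:
            output = run(nb, timeout_seconds)
            results.append({"notebook": nb, "status": "PASS", "result": output})
        except Exception as exc:  # a failing notebook surfaces as an exception
            results.append({"notebook": nb, "status": "FAIL", "error": str(exc)})
    return results


def fake_run(path: str, timeout_seconds: int) -> str:
    """Local stand-in for dbutils.notebook.run; fails for one notebook."""
    if path.endswith("test_transform"):
        raise RuntimeError("assertion failed")
    return "ok"


results = run_notebook_tests(fake_run, ["tests/test_ingest", "tests/test_transform"])
print([r["status"] for r in results])
```

Inside a Databricks notebook the call would be run_notebook_tests(dbutils.notebook.run, test_notebooks).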

Environment Separation (dev / staging / prod)

Item	dev	staging	prod
Catalog	dev_catalog	staging_catalog	prod_catalog
Deployment trigger	Manual / branch push	Auto on pull request	Auto after merging to main
DABs mode	development	development	production
Run-as identity	Individual developer	Service principal	Service principal
Data	Sample data	Subset of production data	Production data
Schedule	Paused (manual runs only)	Paused (CI-triggered runs)	Production schedule

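In DABs, mode: production deploys resources under their configured names with no prefix and leaves schedules active. Running jobs as a service principal is a separate setting, run_as, which the prod target above specifies. With mode: development, resource names get a [dev your-username] prefix and job schedules and triggers are paused.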
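
The effect of the two modes on a job can be pictured with a small model. The sketch below only mirrors the behaviour described in this section (name prefix and paused schedule); the Job class and apply_mode function are hypothetical and do not reflect how the CLI is implemented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Job:
    name: str
    schedule_paused: bool = False


def apply_mode(job: Job, mode: str, user: str) -> Job:
    """Return the job as it would appear after deployment in the given mode."""
    if mode == "development":
        # Prefix the name with the deploying user and pause the schedule.
        return replace(job, name=f"[dev {user}] {job.name}", schedule_paused=True)
    if mode == "production":
        # Name and schedule are left exactly as configured.
        return job
    raise ValueError(f"unknown mode: {mode}")


dev_job = apply_mode(Job("daily-sales-etl-dev"), "development", "alice")
prod_job = apply_mode(Job("daily-sales-etl-prod"), "production", "alice")
print(dev_job.name, dev_job.schedule_paused)
print(prod_job.name, prod_job.schedule_paused)
```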

Databricks Repos (Git Integration)

Databricks Repos (called Git folders in current Databricks documentation) lets you clone Git repositories, switch branches, pull, and commit, all from within the workspace. Developers edit notebooks in Repos, push changes to Git, and open pull requests from there.

Supported Git providers: GitHub, Azure DevOps, GitLab, Bitbucket
Supported file types: native support for .py, .sql, .r, .scala, and .ipynb
Limits: 10 GB repository size limit, with a cap on the number of file changes per commit

Terraform vs DABs Comparison

Item	Databricks Asset Bundles	Terraform (databricks provider)
What it manages	Jobs, DLT pipelines, notebooks	Workspaces, clusters, IAM, jobs — everything
Definition language	YAML (databricks.yml)	HCL (.tf)
State management	Managed by the workspace	terraform.tfstate (S3, GCS, etc.)
Environment switching	Defined in the targets section	Workspaces / var files
Learning curve	Low (YAML + Databricks CLI)	High (HCL + Terraform concepts)
Infrastructure management	Not supported (Databricks resources only)	Supported (VPC, IAM, storage, etc.)
Recommended scenario	Data-engineer-led workload management	Platform-team-led infrastructure + workload management

Sample Question

CI/CD - Databricks Asset Bundles

Question 1

A data engineering team is building a CI/CD pipeline with Databricks Asset Bundles. For the production (prod) deployment, they want jobs to run as a service principal and they want resource names to have no prefix. Which databricks.yml configuration is correct?

A. Set mode: development in the targets section and specify the service principal in run_as
B. Set mode: production in the targets section and specify the service principal in run_as
C. Set mode: production in the targets section and grant the service principal CAN_MANAGE via permissions
D. Set mode: production in the workspace section and specify the service principal in run_as under the bundle section

Correct answer: B

Setting mode: production in the DABs targets section removes the resource-name prefix and enables job schedules. Specifying run_as additionally causes jobs to execute under that service principal's permissions. mode: development prefixes resource names with the user name and disables schedules, so it is not appropriate for production. mode is specified inside the targets section, not the workspace section.
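
The rule behind this answer can also be written as a small check over a parsed databricks.yml. The sketch below works on a plain dictionary and assumes run_as sits under the prod target, as in this article's example; check_prod_target is a hypothetical helper, not a Databricks CLI feature.

```python
from typing import List


def check_prod_target(config: dict) -> List[str]:
    """Return a list of problems found in the prod target of a parsed bundle config."""
    problems = []

    # mode is a per-target setting, so it must not appear under workspace.
    if "mode" in config.get("workspace", {}):
        problems.append("mode belongs under targets.<name>, not workspace")

    prod = config.get("targets", {}).get("prod", {})
    if prod.get("mode") != "production":
        problems.append("targets.prod.mode should be 'production'")
    if "service_principal_name" not in prod.get("run_as", {}):
        problems.append("targets.prod.run_as.service_principal_name is missing")

    return problems


option_b = {
    "targets": {
        "prod": {
            "mode": "production",
            "run_as": {"service_principal_name": "sp-production-deployer"},
        }
    }
}
print(check_prod_target(option_b))
```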

Frequently Asked Questions

Should I use Databricks Asset Bundles (DABs) or Terraform?

DABs is purpose-built for deploying Databricks resources (jobs, pipelines, notebooks). You define resources declaratively in databricks.yml and ship them per environment with databricks bundle deploy. Terraform, on the other hand, is a general-purpose IaC tool for managing entire cloud infrastructures (VPC, IAM, storage, and more), not just Databricks. If you only need to manage Databricks jobs and notebooks, DABs is simpler and has a lower learning curve. If you need end-to-end management that includes infrastructure, Terraform is the better fit. A common pattern is to use both: Terraform for infrastructure and DABs for workloads.

How do Databricks Repos (Git integration) and DABs relate?

Databricks Repos lets you clone Git repositories inside the workspace and sync notebooks and Python files at the branch level. DABs, in contrast, is the mechanism for deploying resources from a CI/CD pipeline via the databricks CLI. Repos shines for interactive development flows (branch switching, pull request reviews), while DABs is the right tool for automated deployment pipelines. The recommended pattern is to edit notebooks on a branch via Repos during development, then deploy to production with DABs after the branch is merged.

How do I run notebook tests in a CI/CD pipeline?

Three approaches work well. (1) Run the target notebook with dbutils.notebook.run(), passing inputs as widget parameters, and check the value the notebook returns through dbutils.notebook.exit(). (2) Extract the notebook's logic into Python modules and run pytest on GitHub Actions (using a local or mocked SparkSession, or Databricks Connect). (3) Use databricks bundle validate to check the bundle configuration, then run the job in a staging environment for an end-to-end test.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
