Databricks CI/CD Pipeline Guide: DABs + GitHub Actions

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Running Databricks in production requires automated CI/CD deployment of notebooks, jobs, and pipelines. Since 2024, Databricks has positioned Databricks Asset Bundles (DABs) as the recommended deployment method, and the Data Engineer Professional exam also covers DABs configuration, GitHub Actions integration, and environment separation. This article walks through how to build a Databricks CI/CD pipeline with DABs at the center.

End-to-End CI/CD Overview

A Databricks CI/CD pipeline is typically structured around the following flow:

┌──────────┐    PR/Merge     ┌─────────────┐   databricks   ┌──────────────┐
│ Git Repo │ ─────────────► │ CI/CD Tool  │ ─── CLI ────► │  Databricks  │
│ (GitHub) │                │ (GH Actions)│   bundle       │  Workspace   │
│          │ ◄───────────── │             │   deploy       │              │
│ - src/   │  test results  │ - validate  │               │ - dev        │
│ - tests/ │                │ - pytest    │               │ - staging    │
│ - bundle │                │ - deploy    │               │ - prod       │
└──────────┘                └─────────────┘               └──────────────┘
  • Development phase: edit and test notebooks on a branch using Databricks Repos.
  • CI phase: GitHub Actions runs automated checks (lint, pytest, bundle validate) on each pull request and on each push to main.
  • CD phase: once tests pass, databricks bundle deploy ships the changes to each environment.
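
As a rough illustration of this flow, the sketch below models which phases run for a given Git event. The function name `plan_phases` and the phase labels are hypothetical, and the trigger rules (checks on every event, a staging deployment for pull requests, a production deployment only for pushes to main) are an assumption modeled on a typical GitHub Actions setup rather than anything Databricks prescribes.

```python
def plan_phases(event_name: str, ref: str) -> list[str]:
    """Return the ordered pipeline phases for a Git event (illustrative model only)."""
    # CI always runs: validate the bundle, then run the test suite.
    phases = ["bundle validate", "pytest"]
    if event_name == "pull_request":
        # Assumption: pull requests get a temporary staging deployment.
        phases.append("bundle deploy -t staging")
    elif event_name == "push" and ref == "refs/heads/main":
        # Assumption: only pushes to main reach production.
        phases.append("bundle deploy -t prod")
    return phases


print(plan_phases("pull_request", "refs/pull/1/merge"))
print(plan_phases("push", "refs/heads/main"))
```

Keeping the rule in one small function makes the promotion policy easy to review; in a real pipeline the same conditions live in the workflow's `if:` expressions.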

Databricks Asset Bundles (DABs)

DABs lets you define Databricks resources (jobs, DLT pipelines, ML models, etc.) declaratively in a databricks.yml file and deploy them with a single CLI command. You define a target per environment and switch between dev, staging, and prod by overriding variables.

Structure of databricks.yml

bundle:
  name: sales-etl-pipeline

workspace:
  host: https://adb-xxxx.azuredatabricks.net

variables:
  catalog:
    default: dev_catalog
  warehouse_id:
    default: abc123def456
  # Every variable referenced as ${var.<name>} must be declared here,
  # even when its value is only set per target.
  cluster_id:
    description: ID of the existing cluster that runs the job tasks

resources:
  jobs:
    daily_etl:
      name: "daily-sales-etl-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "Asia/Tokyo"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/01_ingest.py
          existing_cluster_id: ${var.cluster_id}
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./src/02_transform.py
          existing_cluster_id: ${var.cluster_id}
        - task_key: publish
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./src/03_publish.py
          existing_cluster_id: ${var.cluster_id}

  pipelines:
    dlt_pipeline:
      name: "sales-dlt-${bundle.target}"
      # The catalog and the target schema are separate fields,
      # not a single dotted "catalog.schema" value.
      catalog: ${var.catalog}
      target: sales
      libraries:
        - notebook:
            path: ./src/dlt_definitions.py

targets:
  dev:
    mode: development
    default: true
    variables:
      catalog: dev_catalog
      cluster_id: "0123-dev-cluster"

  staging:
    mode: development
    variables:
      catalog: staging_catalog
      cluster_id: "0456-staging-cluster"

  prod:
    mode: production
    variables:
      catalog: prod_catalog
      cluster_id: "0789-prod-cluster"
    run_as:
      # Placeholder: this field expects the service principal's application ID.
      service_principal_name: "sp-production-deployer"
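
The way target-level values override variable defaults can be pictured with a small model. The sketch below is a simplified stand-in, not the Databricks CLI's actual resolution logic: the `resolve` function is hypothetical, and it only handles `${var.NAME}` and `${bundle.target}` references.

```python
import re


def resolve(value: str, target: str, defaults: dict, overrides: dict) -> str:
    """Substitute ${var.NAME} and ${bundle.target} in a string (simplified model)."""
    variables = {**defaults, **overrides}  # target-level values win over defaults

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key == "bundle.target":
            return target
        if key.startswith("var."):
            # Raises KeyError if the variable was never declared or set.
            return str(variables[key[len("var."):]])
        return match.group(0)  # leave any other reference untouched

    return re.sub(r"\$\{([^}]+)\}", substitute, value)


defaults = {"catalog": "dev_catalog"}
prod_overrides = {"catalog": "prod_catalog", "cluster_id": "0789-prod-cluster"}

print(resolve("daily-sales-etl-${bundle.target}", "prod", defaults, prod_overrides))
print(resolve("${var.catalog}", "prod", defaults, prod_overrides))
print(resolve("${var.catalog}", "dev", defaults, {}))
```

The same job definition therefore yields different names, catalogs, and clusters depending only on which target is selected with `-t`.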

Key DABs commands

Command                                 Purpose
--------------------------------------  ---------------------------------------------
databricks bundle init                  Initialize a project from a template
databricks bundle validate              Validate databricks.yml syntax and references
databricks bundle deploy -t dev         Deploy resources to the specified target
databricks bundle run -t dev daily_etl  Trigger a deployed job immediately
databricks bundle destroy -t dev        Delete the deployed resources
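
When these commands are driven from a script rather than typed by hand, it can help to build the argument list in one place. The following is a minimal sketch under that assumption; `bundle_command` and `run_bundle` are hypothetical helper names, and the dry-run default only prints the command instead of invoking the CLI.

```python
import shlex
import subprocess
from typing import Optional


def bundle_command(action: str, target: Optional[str] = None, *args: str) -> list[str]:
    """Build the argv for `databricks bundle <action> [-t <target>] [args...]`."""
    cmd = ["databricks", "bundle", action]
    if target:
        cmd += ["-t", target]
    cmd += list(args)
    return cmd


def run_bundle(action: str, target: Optional[str] = None, *args: str, dry_run: bool = True) -> int:
    """Print the command and, unless dry_run is set, execute it with the installed CLI."""
    cmd = bundle_command(action, target, *args)
    print("$", shlex.join(cmd))
    if dry_run:
        return 0
    # check=True raises CalledProcessError on a non-zero exit code.
    return subprocess.run(cmd, check=True).returncode


run_bundle("validate")
run_bundle("deploy", "dev")
run_bundle("run", "dev", "daily_etl")
```

Passing the arguments as a list avoids shell quoting problems when a job key or target name contains unusual characters.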

GitHub Actions Integration

Here is an example GitHub Actions workflow that drives a DABs-based CI/CD pipeline:

# .github/workflows/databricks-cicd.yml
name: Databricks CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  validate-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

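      # The bundle commands require the newer standalone Databricks CLI
      # (versions 0.205 and above); the legacy "databricks-cli" pip package
      # does not provide them.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Install dependencies
        run: |
          # conftest.py starts a local SparkSession, which needs plain PySpark.
          # databricks-connect conflicts with the pyspark package, so install it
          # instead only if the tests connect to a remote cluster.
          pip install pytest pyspark==3.5.*
          pip install -r requirements.txt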

      - name: Validate bundle
        run: databricks bundle validate -t staging

      - name: Run unit tests
        run: pytest tests/unit/ -v --junitxml=test-results.xml

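      # integration_test_job must be defined under resources.jobs in databricks.yml.
      - name: Run integration tests (staging)
        if: github.event_name == 'pull_request'
        run: |
          databricks bundle deploy -t staging
          databricks bundle run -t staging integration_test_job
          # --auto-approve skips the confirmation prompt, which cannot be answered in CI.
          databricks bundle destroy -t staging --auto-approve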

  deploy-production:
    needs: validate-and-test
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # Use the standalone Databricks CLI; the legacy pip package lacks the bundle commands.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Deploy to production
        run: databricks bundle deploy -t prod
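
One weakness of a plain deploy, run, destroy sequence is that a failing test job leaves the staging resources behind, because the shell stops before the destroy command. A possible way around this, sketched here as an assumption rather than as part of the workflow above, is a small driver script that always cleans up. The names `integration_test` and `Runner` are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise on a non-zero exit code."""
    subprocess.run(list(cmd), check=True)


def integration_test(target: str, job_key: str, run: Runner = _run) -> None:
    """Deploy the bundle, run one job, and destroy the deployment even if the job fails."""
    base = ["databricks", "bundle"]
    run([*base, "deploy", "-t", target])
    try:
        run([*base, "run", "-t", target, job_key])
    finally:
        # --auto-approve skips the interactive confirmation prompt.
        run([*base, "destroy", "-t", target, "--auto-approve"])


# In CI this would be called as:
#   integration_test("staging", "integration_test_job")
```

The `run` parameter exists only so the sequence can be exercised with a fake runner in a unit test; the original exception from a failed job still propagates and fails the CI step.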

Test Automation

Unit tests (pytest)

# tests/unit/test_transform.py
from pyspark.sql import SparkSession
from src.transforms import clean_sales_data

def test_clean_sales_data(spark: SparkSession):
    """Verify that rows containing null values are excluded."""
    input_df = spark.createDataFrame([
        (1, "ProductA", 100.0),
        (2, None, 200.0),
        (3, "ProductC", None),
    ], ["id", "product_name", "amount"])

    result = clean_sales_data(input_df)

    assert result.count() == 1
    assert result.first()["product_name"] == "ProductA"


# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder \
        .master("local[2]") \
        .appName("unit-tests") \
        .getOrCreate()
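
The test imports `clean_sales_data` from `src.transforms`, which is not shown in this article. A minimal sketch that would satisfy the assertions, assuming the function simply drops rows with a null `product_name` or `amount` (the real implementation may well do more), could look like this. The constant `REQUIRED_COLUMNS` is a hypothetical name.

```python
# src/transforms.py (hypothetical contents)
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only, so the module loads without starting Spark.
    from pyspark.sql import DataFrame

# Assumption: a sales row is usable only when both of these columns are present.
REQUIRED_COLUMNS = ["product_name", "amount"]


def clean_sales_data(df: DataFrame) -> DataFrame:
    """Drop every row that has a null in any of the required columns."""
    return df.dropna(subset=REQUIRED_COLUMNS)
```

Keeping the transformation in a plain module like this, rather than inside a notebook cell, is what makes it importable from pytest.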

Notebook tests (dbutils.notebook.run)

# Test runner notebook
test_notebooks = [
    "/Repos/main/project/tests/test_ingest",
    "/Repos/main/project/tests/test_transform",
]

results = []
for nb in test_notebooks:
    try:
        result = dbutils.notebook.run(nb, timeout_seconds=600)
        results.append({"notebook": nb, "status": "PASS", "result": result})
    except Exception as e:
        results.append({"notebook": nb, "status": "FAIL", "error": str(e)})

failed = [r for r in results if r["status"] == "FAIL"]
if failed:
    raise Exception(f"{len(failed)} test(s) failed: {failed}")
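
Because `dbutils` exists only inside a Databricks workspace, the runner above cannot be exercised locally. One option, shown here as a sketch with hypothetical function names, is to pass the notebook-running callable in as a parameter so the collection logic can be tested with a fake.

```python
from typing import Callable, Iterable


def run_notebook_tests(
    notebooks: Iterable[str],
    run: Callable[[str, int], str],
    timeout_seconds: int = 600,
) -> list[dict]:
    """Run each notebook through `run` and record PASS or FAIL per notebook."""
    results = []
    for nb in notebooks:
        try:
            result = run(nb, timeout_seconds)
            results.append({"notebook": nb, "status": "PASS", "result": result})
        except Exception as exc:  # any failure inside the notebook counts as FAIL
            results.append({"notebook": nb, "status": "FAIL", "error": str(exc)})
    return results


def failed_tests(results: list[dict]) -> list[dict]:
    """Return only the entries whose status is FAIL."""
    return [r for r in results if r["status"] == "FAIL"]


# Inside a workspace notebook, the real runner would be passed in:
#   results = run_notebook_tests(test_notebooks, dbutils.notebook.run)
```

The behavior is the same as the inline loop; only the dependency on `dbutils` has moved to the call site.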

Environment Separation (dev / staging / prod)

Item                dev                   staging                    prod
------------------  --------------------  -------------------------  --------------------------
Catalog             dev_catalog           staging_catalog            prod_catalog
Deployment trigger  Manual / branch push  Auto on pull request       Auto after merging to main
DABs mode           development           development                production
Run-as identity     Individual developer  Service principal          Service principal
Data                Sample data           Subset of production data  Production data
Schedule            Manual runs only      Test schedule              Production schedule

Setting mode: production in DABs deploys resources without a name prefix and leaves job schedules active; when run_as is set, jobs execute as the service principal it names. With mode: development, resources get a [dev your-username] prefix and job schedules and triggers are paused.
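
The naming and scheduling difference between the two modes can be summarized in a few lines. This is a simplified model for illustration, not the CLI's implementation; the function names are hypothetical, and `short_name` stands in for the deploying user's short user name.

```python
def deployed_name(name: str, mode: str, short_name: str) -> str:
    """Return the resource name as it would appear after deployment (simplified)."""
    if mode == "development":
        # Development mode prefixes names so that developers do not collide.
        return f"[dev {short_name}] {name}"
    return name  # production mode deploys the name unchanged


def schedule_paused(mode: str) -> bool:
    """Development deployments pause schedules; production leaves them active."""
    return mode == "development"


print(deployed_name("daily-sales-etl-dev", "development", "alice"))
print(deployed_name("daily-sales-etl-prod", "production", "alice"))
```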

Databricks Repos (Git Integration)

Databricks Repos (labeled Git folders in the current Databricks UI) lets you clone Git repositories, switch branches, pull, and commit, all from within the workspace. Developers edit notebooks in Repos, push changes to Git, and then open pull requests in the Git provider.

  • Supported Git providers: GitHub, Azure DevOps, GitLab, Bitbucket
  • Supported file types: native support for .py, .sql, .r, .scala, and .ipynb
  • Limits: 10 GB repository size limit, with a cap on the number of file changes per commit

Terraform vs DABs Comparison

Item                       Databricks Asset Bundles                   Terraform (databricks provider)
-------------------------  -----------------------------------------  ------------------------------------------------------
What it manages            Jobs, DLT pipelines, notebooks             Workspaces, clusters, IAM, jobs (everything)
Definition language        YAML (databricks.yml)                      HCL (.tf)
State management           Managed by the workspace                   terraform.tfstate (S3, GCS, etc.)
Environment switching      Defined in the targets section             Workspaces / var files
Learning curve             Low (YAML + Databricks CLI)                High (HCL + Terraform concepts)
Infrastructure management  Not supported (Databricks resources only)  Supported (VPC, IAM, storage, etc.)
Recommended scenario       Data-engineer-led workload management      Platform-team-led infrastructure + workload management

Sample Question

CI/CD - Databricks Asset Bundles

Question 1

A data engineering team is building a CI/CD pipeline with Databricks Asset Bundles. For the production (prod) deployment, they want jobs to run as a service principal and they want resource names to have no prefix. Which databricks.yml configuration is correct?

  A. Set mode: development in the targets section and specify the service principal in run_as
  B. Set mode: production in the targets section and specify the service principal in run_as
  C. Set mode: production in the targets section and grant the service principal CAN_MANAGE via permissions
  D. Set mode: production in the workspace section and specify the service principal in run_as under the bundle section

Correct answer: B

Setting mode: production in the DABs targets section removes the resource-name prefix and enables job schedules. Specifying run_as additionally causes jobs to execute under that service principal's permissions. mode: development prefixes resource names with the user name and disables schedules, so it is not appropriate for production. mode is specified inside the targets section, not the workspace section.

Frequently Asked Questions

Should I use Databricks Asset Bundles (DABs) or Terraform?

DABs is purpose-built for deploying Databricks resources (jobs, pipelines, notebooks). You define resources declaratively in databricks.yml and ship them per environment with databricks bundle deploy. Terraform, on the other hand, is a general-purpose IaC tool for managing entire cloud infrastructures (VPC, IAM, storage, and more), not just Databricks. If you only need to manage Databricks jobs and notebooks, DABs is simpler and has a lower learning curve. If you need end-to-end management that includes infrastructure, Terraform is the better fit. A common pattern is to use both: Terraform for infrastructure and DABs for workloads.

How do Databricks Repos (Git integration) and DABs relate?

Databricks Repos lets you clone Git repositories inside the workspace and sync notebooks and Python files at the branch level. DABs, in contrast, is the mechanism for deploying resources from a CI/CD pipeline via the databricks CLI. Repos shines for interactive development flows (branch switching, pull request reviews), while DABs is the right tool for automated deployment pipelines. The recommended pattern is to edit notebooks on a branch via Repos during development, then deploy to production with DABs after the branch is merged.

How do I run notebook tests in a CI/CD pipeline?

Three approaches work well. (1) Run the target notebook with dbutils.notebook.run() and validate the result via its return value or widget parameters. (2) Extract the notebook's logic into Python modules and run pytest on GitHub Actions (using a mocked SparkSession or Databricks Connect). (3) Use the Databricks Asset Bundles validation feature (databricks bundle validate) to check the YAML syntax, then run the job in a staging environment for an end-to-end test.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.