Databricks

Medallion Architecture Explained: Bronze/Silver/Gold Design Pattern

2026-03-21
更新: 2026-03-27
NicheeLab Editorial Team

Medallion Architecture is the recommended data design pattern for the Databricks Lakehouse Platform. It organizes data into three layers — Bronze (raw), Silver (cleansed), and Gold (business-optimized) — progressively improving data quality, reliability, and usability. It is the most frequently tested architecture topic on the Data Engineer Associate exam.

The Big Picture

Raw data arriving from data sources (applications, IoT, external APIs, files) is first stored as-is in the Bronze layer, then progressively cleansed, joined, and aggregated to be promoted to Silver and Gold. Every layer is managed as a Delta Table.

Data SourcesApps / IoT / External APIs / FilesBronze LayerStore raw data as-is (Raw / Landing)Auto Loader / COPY INTOSilver LayerCleansing, joins, type conversion (Cleansed)MERGE / DLT / StreamingGold LayerAggregation, BI optimization, features (Curated)Batch aggregation / Materialized ViewsBI / ML / API / ReportingDashboards, training, API servingData Sources -> Bronze -> Silver -> Gold -> BI/ML/API

Bronze Layer (Raw Data Storage)

The role of the Bronze layer is to "store raw data arriving from data sources without transformation". Data quality is not enforced; missing values, duplicates, and type mismatches are all allowed. The goal is to keep a complete record of "what was actually received".

  • Store data in its original form (no transformation)
  • Add metadata (ingest timestamp, source file name, etc.)
  • Schema depends on the source (inferred or explicitly defined)
  • Ingest tools: Auto Loader, COPY INTO, Kafka Connector
  • Data quality: low (NULLs, duplicates, and type mismatches all allowed)
# Ingest into Bronze layer (Auto Loader)
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoint/bronze/schema")
  .load("/data/landing/events/")
  .withColumn("_ingested_at", current_timestamp())
  .withColumn("_source_file", input_file_name())
  .writeStream
  .option("checkpointLocation", "/checkpoint/bronze/events")
  .trigger(availableNow=True)
  .toTable("bronze.raw_events")
)

Silver Layer (Cleansing and Joins)

The Silver layer takes the raw data from Bronze and refines it through "general-purpose cleansing that does not depend on any business domain". This is where you handle NULLs, deduplicate, convert types, and join tables (e.g., events + a user master).

  • Deduplication
  • Type conversion (STRING -> TIMESTAMP, DECIMAL conversion, etc.)
  • NULL handling (filter or default value)
  • Table joins (Fact + Dimension)
  • Data quality checks (DLT Expectations / CHECK constraints)
  • Tools: MERGE INTO, Delta Live Tables, Structured Streaming + foreachBatch
# Transform into Silver layer (MERGE)
MERGE INTO silver.customers t
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM bronze.raw_customers
  WHERE _ingested_at > current_timestamp() - INTERVAL 1 HOUR
  QUALIFY rn = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Gold Layer (Business Optimization)

The Gold layer takes Silver data and "optimizes it for a specific business use case". Summary tables for BI dashboards, feature tables for ML, and wide tables for reporting all live here.

  • Aggregate tables (daily/monthly sales summaries, etc.)
  • Wide tables (denormalized, pre-joined tables)
  • ML feature tables (for Feature Store)
  • KPI tables (directly powering dashboards)
-- Gold layer aggregate table
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
SELECT
  date_trunc('day', order_date) AS sale_date,
  region,
  product_category,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount
FROM silver.orders o
JOIN silver.customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3;

Layer-by-Layer Comparison

DimensionBronzeSilverGold
PurposeComplete record of raw dataGeneral-purpose cleansingBusiness optimization
Data qualityLow (as-is)Medium-High (quality-checked)High (business rules applied)
Primary consumersData engineersData engineers / AnalystsAnalysts / BI / ML
Update frequencyHigh (synced with source)Medium (scheduled batch or streaming)Low-Medium (often daily batch)
Ingest toolsAuto Loader / COPY INTOMERGE / DLT / StreamingBatch SQL / Materialized Views

Relationship with Delta Live Tables (DLT)

Delta Live Tables (DLT) is a Databricks feature for implementing Medallion Architecture declaratively. You define the tables for each layer with the @dlt.table decorator, and DLT automatically manages dependencies between layers. Data quality checks (Expectations) are built in, so quality assurance in the Silver layer becomes very concise.

import dlt
from pyspark.sql.functions import *

@dlt.table(comment="Bronze: raw events")
def bronze_events():
    return (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/data/landing/")
    )

@dlt.table(comment="Silver: cleaned events")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_events():
    return dlt.read_stream("bronze_events").filter("event_type IS NOT NULL")

What the Exam Asks About

  • The role and differences between Bronze / Silver / Gold layers
  • Expected data quality at each layer
  • Why Auto Loader is the recommended ingest tool for the Bronze layer
  • Primary transformations in the Silver layer (deduplication, type conversion, joins)
  • That the Gold layer is the one optimized for a specific use case
  • The relationship between DLT and Medallion

Check Your Understanding

Data Engineer Associate

問題 1

You are designing how to ingest e-commerce order data into the Lakehouse. Order events (JSON) land in S3 and ultimately need to feed a daily sales summary on a BI dashboard. Which is the correct data flow under Medallion Architecture?

  1. Ingest raw JSON into a Bronze Delta table with Auto Loader -> cleanse and join with the customer master in the Silver layer -> aggregate the daily sales summary in the Gold layer
  2. INSERT the JSON files directly into a summary table in the Gold layer
  3. Ingest only into the Silver layer and skip the Bronze and Gold layers
  4. Use Auto Loader to ingest JSON in the Gold layer and aggregate in the Silver layer

正解: A

Under Medallion Architecture, the correct flow is to first store raw data as-is in Bronze, cleanse and join in Silver, then optimize for the business (aggregate) in Gold. Skipping Bronze means you cannot recover the raw data, and using Auto Loader in the Gold layer inverts the role of the layers.

Frequently Asked Questions

Does Medallion Architecture always require three Bronze/Silver/Gold layers?

No. Three layers is the most common pattern, but you can collapse it to two (Raw + Curated) or extend it to four or more depending on your requirements. What matters is the design principle of separating the layer that stores raw data as-is from the layer that applies business logic. The Databricks official documentation also explains that Bronze/Silver/Gold is a recommended pattern, not a strict rule.

The line between Silver and Gold feels fuzzy. How do I tell them apart?

The Silver layer handles general-purpose cleansing and joins that are not tied to any specific business domain: NULL removal, type conversion, deduplication, and table joins. The Gold layer handles aggregations and transformations optimized for a specific business requirement or use case: BI summary tables, ML feature tables, wide tables for reporting. Remember Silver as general-purpose and Gold as use-case-specific.

How is Medallion Architecture tested on the exam?

It is the most frequent architecture question on Data Engineer Associate. You'll be tested on the role of each layer (Bronze = raw data storage, Silver = cleansing and joins, Gold = aggregation and BI optimization), the technologies used in each (Bronze = Auto Loader, Silver = MERGE/DLT, Gold = batch aggregation/MVs), and which layer enforces data quality (Bronze = low quality OK, Silver = quality-checked, Gold = business rules applied).

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Databricks

Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)

Complete guide to all 7 Databricks certifications — Data Eng...

Databricks

Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)

Every Databricks certification ranked by difficulty, with st...

Databricks

Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)

How to pass Databricks certifications efficiently. Official ...

Databricks

Databricks Data Engineer Associate: Complete Guide (2026)

Domain-by-domain breakdown of the Databricks Certified Data ...

Databricks

Databricks Data Engineer Professional: Complete Guide (2026)

Tactics for the Databricks Certified Data Engineer Professio...

Browse all Databricks articles (110)
© 2026 NicheeLab All rights reserved.