Medallion Architecture: Bronze, Silver, Gold Layers (2026)

Medallion Architecture is the recommended data design pattern for the Databricks Lakehouse Platform. It organizes data into three layers — Bronze (raw), Silver (cleansed), and Gold (business-optimized) — progressively improving data quality, reliability, and usability. It is the most frequently tested architecture topic on the Data Engineer Associate exam.

The Big Picture

Raw data arriving from data sources (applications, IoT, external APIs, files) is first stored as-is in the Bronze layer, then progressively cleansed, joined, and aggregated to be promoted to Silver and Gold. Every layer is managed as a Delta Table.

Bronze Layer (Raw Data Storage)

The role of the Bronze layer is to "store raw data arriving from data sources without transformation". Data quality is not enforced; missing values, duplicates, and type mismatches are all allowed. The goal is to keep a complete record of "what was actually received".

Store data in its original form (no transformation)
Add metadata (ingest timestamp, source file name, etc.)
Schema depends on the source (inferred or explicitly defined)
Ingest tools: Auto Loader, COPY INTO, Kafka Connector
Data quality: low (NULLs, duplicates, and type mismatches all allowed)

# Ingest into Bronze layer (Auto Loader)
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoint/bronze/schema")
  .load("/data/landing/events/")
  .withColumn("_ingested_at", current_timestamp())
  .withColumn("_source_file", input_file_name())
  .writeStream
  .option("checkpointLocation", "/checkpoint/bronze/events")
  .trigger(availableNow=True)
  .toTable("bronze.raw_events")
)

Silver Layer (Cleansing and Joins)

The Silver layer takes the raw data from Bronze and refines it through "general-purpose cleansing that does not depend on any business domain". This is where you handle NULLs, deduplicate, convert types, and join tables (e.g., events + a user master).

Deduplication
Type conversion (STRING -> TIMESTAMP, DECIMAL conversion, etc.)
NULL handling (filter or default value)
Table joins (Fact + Dimension)
Data quality checks (DLT Expectations / CHECK constraints)
Tools: MERGE INTO, Delta Live Tables, Structured Streaming + foreachBatch

# Transform into Silver layer (MERGE)
MERGE INTO silver.customers t
USING (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM bronze.raw_customers
  WHERE _ingested_at > current_timestamp() - INTERVAL 1 HOUR
  QUALIFY rn = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Gold Layer (Business Optimization)

The Gold layer takes Silver data and "optimizes it for a specific business use case". Summary tables for BI dashboards, feature tables for ML, and wide tables for reporting all live here.

Aggregate tables (daily/monthly sales summaries, etc.)
Wide tables (denormalized, pre-joined tables)
ML feature tables (for Feature Store)
KPI tables (directly powering dashboards)

-- Gold layer aggregate table
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
SELECT
  date_trunc('day', order_date) AS sale_date,
  region,
  product_category,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount
FROM silver.orders o
JOIN silver.customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3;

Layer-by-Layer Comparison

Dimension	Bronze	Silver	Gold
Purpose	Complete record of raw data	General-purpose cleansing	Business optimization
Data quality	Low (as-is)	Medium-High (quality-checked)	High (business rules applied)
Primary consumers	Data engineers	Data engineers / Analysts	Analysts / BI / ML
Update frequency	High (synced with source)	Medium (scheduled batch or streaming)	Low-Medium (often daily batch)
Ingest tools	Auto Loader / COPY INTO	MERGE / DLT / Streaming	Batch SQL / Materialized Views

Relationship with Delta Live Tables (DLT)

Delta Live Tables (DLT) is a Databricks feature for implementing Medallion Architecture declaratively. You define the tables for each layer with the @dlt.table decorator, and DLT automatically manages dependencies between layers. Data quality checks (Expectations) are built in, so quality assurance in the Silver layer becomes very concise.

import dlt
from pyspark.sql.functions import *

@dlt.table(comment="Bronze: raw events")
def bronze_events():
    return (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/data/landing/")
    )

@dlt.table(comment="Silver: cleaned events")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_events():
    return dlt.read_stream("bronze_events").filter("event_type IS NOT NULL")

What the Exam Asks About

The role and differences between Bronze / Silver / Gold layers
Expected data quality at each layer
Why Auto Loader is the recommended ingest tool for the Bronze layer
Primary transformations in the Silver layer (deduplication, type conversion, joins)
That the Gold layer is the one optimized for a specific use case
The relationship between DLT and Medallion

Check Your Understanding

Data Engineer Associate

問題 1

You are designing how to ingest e-commerce order data into the Lakehouse. Order events (JSON) land in S3 and ultimately need to feed a daily sales summary on a BI dashboard. Which is the correct data flow under Medallion Architecture?

Ingest raw JSON into a Bronze Delta table with Auto Loader -> cleanse and join with the customer master in the Silver layer -> aggregate the daily sales summary in the Gold layer
INSERT the JSON files directly into a summary table in the Gold layer
Ingest only into the Silver layer and skip the Bronze and Gold layers
Use Auto Loader to ingest JSON in the Gold layer and aggregate in the Silver layer

正解: A

Under Medallion Architecture, the correct flow is to first store raw data as-is in Bronze, cleanse and join in Silver, then optimize for the business (aggregate) in Gold. Skipping Bronze means you cannot recover the raw data, and using Auto Loader in the Gold layer inverts the role of the layers.

Frequently Asked Questions

Does Medallion Architecture always require three Bronze/Silver/Gold layers?

No. Three layers is the most common pattern, but you can collapse it to two (Raw + Curated) or extend it to four or more depending on your requirements. What matters is the design principle of separating the layer that stores raw data as-is from the layer that applies business logic. The Databricks official documentation also explains that Bronze/Silver/Gold is a recommended pattern, not a strict rule.

The line between Silver and Gold feels fuzzy. How do I tell them apart?

The Silver layer handles general-purpose cleansing and joins that are not tied to any specific business domain: NULL removal, type conversion, deduplication, and table joins. The Gold layer handles aggregations and transformations optimized for a specific business requirement or use case: BI summary tables, ML feature tables, wide tables for reporting. Remember Silver as general-purpose and Gold as use-case-specific.

How is Medallion Architecture tested on the exam?

It is the most frequent architecture question on Data Engineer Associate. You'll be tested on the role of each layer (Bronze = raw data storage, Silver = cleansing and joins, Gold = aggregation and BI optimization), the technologies used in each (Bronze = Auto Loader, Silver = MERGE/DLT, Gold = batch aggregation/MVs), and which layer enforces data quality (Bronze = low quality OK, Silver = quality-checked, Gold = business rules applied).

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Medallion Architecture Explained: Bronze/Silver/Gold Design Pattern