Medallion Architecture is the recommended data design pattern for the Databricks Lakehouse Platform. It organizes data into three layers — Bronze (raw), Silver (cleansed), and Gold (business-optimized) — progressively improving data quality, reliability, and usability. It is the most frequently tested architecture topic on the Data Engineer Associate exam.
Raw data arriving from data sources (applications, IoT, external APIs, files) is first stored as-is in the Bronze layer, then progressively cleansed, joined, and aggregated to be promoted to Silver and Gold. Every layer is managed as a Delta Table.
The role of the Bronze layer is to "store raw data arriving from data sources without transformation". Data quality is not enforced; missing values, duplicates, and type mismatches are all allowed. The goal is to keep a complete record of "what was actually received".
# Ingest into Bronze layer (Auto Loader)
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoint/bronze/schema")
.load("/data/landing/events/")
.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
.writeStream
.option("checkpointLocation", "/checkpoint/bronze/events")
.trigger(availableNow=True)
.toTable("bronze.raw_events")
)The Silver layer takes the raw data from Bronze and refines it through "general-purpose cleansing that does not depend on any business domain". This is where you handle NULLs, deduplicate, convert types, and join tables (e.g., events + a user master).
# Transform into Silver layer (MERGE)
MERGE INTO silver.customers t
USING (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM bronze.raw_customers
WHERE _ingested_at > current_timestamp() - INTERVAL 1 HOUR
QUALIFY rn = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;The Gold layer takes Silver data and "optimizes it for a specific business use case". Summary tables for BI dashboards, feature tables for ML, and wide tables for reporting all live here.
-- Gold layer aggregate table
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
SELECT
date_trunc('day', order_date) AS sale_date,
region,
product_category,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM silver.orders o
JOIN silver.customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3;| Dimension | Bronze | Silver | Gold |
|---|---|---|---|
| Purpose | Complete record of raw data | General-purpose cleansing | Business optimization |
| Data quality | Low (as-is) | Medium-High (quality-checked) | High (business rules applied) |
| Primary consumers | Data engineers | Data engineers / Analysts | Analysts / BI / ML |
| Update frequency | High (synced with source) | Medium (scheduled batch or streaming) | Low-Medium (often daily batch) |
| Ingest tools | Auto Loader / COPY INTO | MERGE / DLT / Streaming | Batch SQL / Materialized Views |
Delta Live Tables (DLT) is a Databricks feature for implementing Medallion Architecture declaratively. You define the tables for each layer with the @dlt.table decorator, and DLT automatically manages dependencies between layers. Data quality checks (Expectations) are built in, so quality assurance in the Silver layer becomes very concise.
import dlt
from pyspark.sql.functions import *
@dlt.table(comment="Bronze: raw events")
def bronze_events():
return (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("/data/landing/")
)
@dlt.table(comment="Silver: cleaned events")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_events():
return dlt.read_stream("bronze_events").filter("event_type IS NOT NULL")Data Engineer Associate
問題 1
You are designing how to ingest e-commerce order data into the Lakehouse. Order events (JSON) land in S3 and ultimately need to feed a daily sales summary on a BI dashboard. Which is the correct data flow under Medallion Architecture?
正解: A
Under Medallion Architecture, the correct flow is to first store raw data as-is in Bronze, cleanse and join in Silver, then optimize for the business (aggregate) in Gold. Skipping Bronze means you cannot recover the raw data, and using Auto Loader in the Gold layer inverts the role of the layers.
Does Medallion Architecture always require three Bronze/Silver/Gold layers?
No. Three layers is the most common pattern, but you can collapse it to two (Raw + Curated) or extend it to four or more depending on your requirements. What matters is the design principle of separating the layer that stores raw data as-is from the layer that applies business logic. The Databricks official documentation also explains that Bronze/Silver/Gold is a recommended pattern, not a strict rule.
The line between Silver and Gold feels fuzzy. How do I tell them apart?
The Silver layer handles general-purpose cleansing and joins that are not tied to any specific business domain: NULL removal, type conversion, deduplication, and table joins. The Gold layer handles aggregations and transformations optimized for a specific business requirement or use case: BI summary tables, ML feature tables, wide tables for reporting. Remember Silver as general-purpose and Gold as use-case-specific.
How is Medallion Architecture tested on the exam?
It is the most frequent architecture question on Data Engineer Associate. You'll be tested on the role of each layer (Bronze = raw data storage, Silver = cleansing and joins, Gold = aggregation and BI optimization), the technologies used in each (Bronze = Auto Loader, Silver = MERGE/DLT, Gold = batch aggregation/MVs), and which layer enforces data quality (Bronze = low quality OK, Silver = quality-checked, Gold = business rules applied).
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Databricks Certifications: All 7 Exams, Difficulty & Study Plan (2026)
Complete guide to all 7 Databricks certifications — Data Eng...
Databricks Exam Difficulty Ranking: All 7 Certs Compared (2026)
Every Databricks certification ranked by difficulty, with st...
Databricks Study Guide: Fastest Pass Route & Time Estimates (2026)
How to pass Databricks certifications efficiently. Official ...
Databricks Data Engineer Associate: Complete Guide (2026)
Domain-by-domain breakdown of the Databricks Certified Data ...
Databricks Data Engineer Professional: Complete Guide (2026)
Tactics for the Databricks Certified Data Engineer Professio...