When referencing raw data in dbt, the standard practice is to use source() instead of writing the table name directly. This is not just a stylistic difference — it affects environment portability, documentation, lineage, tests, and freshness management.
This article walks through the basics of dbt's Sources definition and source() function, the benefits of adopting them, best practices for defining sources, tests and freshness checks, operational patterns, and the points most likely to appear on the certification exam.
source() is a Jinja function that references external tables (such as raw data) that are not managed by dbt. It is called with two arguments — source('source_name', 'table_name') — both of which must match the names declared in the YAML sources definition (use identifier to map a logical name to a different physical table name).
ref() expresses dependencies between models that dbt builds, while source() expresses dependencies on external data. Both feed dbt's lineage graph and are resolved to the correct database, schema, and table name at compile time.
The big picture of lineage (Sources → Staging → Marts)
Defining Sources and referencing them through source() gives you, all at once, environment-independent resolution, automatic lineage and documentation, source-level tests and freshness checks, and an identifier setting that absorbs naming differences in the physical tables. These are the evaluation points that come up most often in real work and on the exam.
Freshness in particular is configured on Sources: dbt compares the most recent ingestion timestamp against thresholds you set, so ingestion lag is detected early. Treating well-defined Sources as the starting point of observability pays off over the medium and long term.
| Approach | Main purpose | Portability | Lineage / Documentation |
|---|---|---|---|
| source() | Reference external tables | High (resolved via YAML) | Included (as Sources) |
| ref() | Reference dbt models | High (resolved from the target and model config) | Included (as models) |
| Hard-coded schema | One-off references | Low (environment-bound) | Not included |
Sources are declared in a YAML properties file such as schema.yml (version: 2). sources[].name is the logical name used as the first argument of source(), and tables[].name is the logical name used as the second argument. When the physical table name differs, specify identifier. database and schema are often set per target, for example through target variables, env_var(), or var().
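As a rough sketch of setting database and schema per target, the values can be driven from an environment variable or from the active target. The file path, the variable name DBT_RAW_DATABASE, the fallback raw_dev, the schema raw_sandbox, and the target name prod are hypothetical placeholders, not values from this article.

```yaml
# models/staging/_sources.yml (hypothetical path)
version: 2

sources:
  - name: raw
    # env_var() accepts an optional default that is used when the variable is unset
    database: "{{ env_var('DBT_RAW_DATABASE', 'raw_dev') }}"
    # choose the schema from the name of the active target
    schema: "{{ 'raw' if target.name == 'prod' else 'raw_sandbox' }}"
    tables:
      - name: orders_raw
        identifier: ext_orders
```

Models keep calling source('raw', 'orders_raw') unchanged; only the YAML decides where that name points in each environment.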
As a best practice, define a separate source per landing zone (such as raw), and record the description, owner, and read-permission scope in a docs block or meta. When ingestion lag from external systems is expected, set warn_after / error_after wider than the normal lag to avoid false positives, and use the freshness filter to limit the check to recent rows (for example, the current day).
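One way to lay this out, as a minimal sketch: one source per landing zone, each carrying its own description and meta. The source names, schemas, table names, and meta keys below are hypothetical, and depending on the dbt version meta may need to be nested under a config block.

```yaml
version: 2

sources:
  - name: raw_app                # hypothetical: landing zone for the application database
    schema: raw_app
    description: Tables replicated from the application database.
    meta:
      owner: data-platform       # hypothetical owner label
      access: read-only          # hypothetical note on permission scope
    tables:
      - name: orders_raw

  - name: raw_events             # hypothetical: landing zone for event streams
    schema: raw_events
    description: Event data loaded by an external pipeline.
    meta:
      owner: analytics-eng
      access: read-only
    tables:
      - name: page_views
```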
| Setting | Role | Notes |
|---|---|---|
| name / tables.name | Logical name used in source() / docs | Pair with identifier when it differs from the physical name |
| identifier | Specifies the physical table name | Absorbs case and symbol differences |
| freshness | Freshness thresholds | Narrow the time range with filter to reduce execution cost |
Example Sources definition in YAML and model reference
# schema.yml (version: 2)
version: 2

sources:
  - name: raw                              # logical name (first arg to source())
    database: "{{ target.database }}"      # resolved per environment; quoted so the YAML parses
    schema: raw
    description: Raw landing zone (externally managed)
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
      filter: "_ingested_at >= dateadd('day', -1, current_timestamp())"
    tables:
      - name: orders_raw                   # logical name (second arg to source())
        identifier: ext_orders             # physical table name
        loaded_at_field: _ingested_at
        columns:
          - name: order_id
            tests:
              - not_null
-- models/stg_orders.sql
-- No trailing semicolon: dbt wraps this query in its own DDL statement.
with src as (
    select *
    from {{ source('raw', 'orders_raw') }}
)

select
    cast(order_id as string) as order_id,
    customer_id,
    order_ts,
    _ingested_at
from src

Generic tests on source columns such as not_null and unique are declared in YAML the same way as model column tests. Running dbt test validates source columns as well, surfacing anomalies early.
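To make the source-column tests concrete, here is a small sketch using dbt's four built-in generic tests. The status column, its accepted values, and the customers_raw table are hypothetical additions for illustration. Newer dbt releases also accept data_tests: in place of tests:.

```yaml
version: 2

sources:
  - name: raw
    tables:
      - name: orders_raw
        columns:
          - name: order_id
            tests:
              - not_null
              - unique
          - name: status                                       # hypothetical column
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']    # hypothetical values
          - name: customer_id
            tests:
              - relationships:
                  to: source('raw', 'customers_raw')           # hypothetical second source table
                  field: customer_id
      - name: customers_raw
```

To run only the tests attached to this source, a selector such as dbt test --select source:raw can be used.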
Freshness is evaluated with the dbt source freshness command. It measures the gap between the most recent loaded_at_field value and the current time, and reports a warning or an error once warn_after or error_after is exceeded. Tune it to match the data: for example, use filter to skip long historical ranges and scan only the current day's rows.
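The threshold behaviour can be sketched as follows, with source-level defaults, a per-table override, and a table that opts out. The tables daily_snapshot_raw and country_codes_raw and their thresholds are hypothetical, and depending on the dbt version these keys may instead be expected under a config block.

```yaml
version: 2

sources:
  - name: raw
    loaded_at_field: _ingested_at        # default timestamp column for every table
    freshness:                           # default thresholds for every table
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
    tables:
      - name: orders_raw
        # inherits the defaults:
        #   newest _ingested_at is 1 hour old  -> pass
        #   newest _ingested_at is 3 hours old -> warn
        #   newest _ingested_at is 7 hours old -> error
      - name: daily_snapshot_raw         # hypothetical table loaded once a day
        freshness:
          warn_after: {count: 26, period: hour}
          error_after: {count: 48, period: hour}
      - name: country_codes_raw          # hypothetical static reference table
        freshness: null                  # skip freshness checks for this table
```

A single table can be checked with a selector such as dbt source freshness --select source:raw.orders_raw.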
| Setting / Command | Function | Notes |
|---|---|---|
| columns[].tests | Quality tests for source columns | not_null, unique, and so on |
| loaded_at_field | Reference column for freshness | Specify the column that holds the ingestion timestamp |
| dbt source freshness | Run freshness evaluation | Evaluates warn_after / error_after |
The foundation of stable operations is a three-layer pattern: "declare external data with Sources," "shape it in the stg layer," and "connect downstream layers with ref()." Resolve differences between environments and adapters (Snowflake / Databricks / BigQuery, etc.) within Sources and profiles.yml, and keep models abstracted behind source()/ref().
Avoid hard-coded schemas and fragile dependencies that assume physical table names won't change. With identifier, naming differences are absorbed and downstream impact is minimized.
| Pattern | Strengths | Risks / Notes |
|---|---|---|
| source() + stg shaping | High observability and maintainability | Upfront definition cost pays off long term |
| Hard-coded references | Quick short-term experiments | Breaks during environment migration, audits, or changes |
| Use of identifier | Localizes the impact of naming changes | Watch out for conflating logical and physical names |
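As an illustration of how identifier localizes a rename, suppose the upstream team renames the physical table. In this sketch (the new name ext_orders_v2 is hypothetical), only one YAML line changes and every model that calls source('raw', 'orders_raw') is left untouched.

```yaml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders_raw            # logical name: unchanged, so source('raw', 'orders_raw') keeps working
        identifier: ext_orders_v2   # hypothetical new physical name (previously ext_orders)
```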
The exam frequently tests when to use source() versus ref() and the YAML definition of Sources (the role of name vs. identifier, loaded_at_field, freshness). Questions on environment portability and lineage often mark hard-coded schemas as incorrect.
For code-reading questions, expect to be asked whether source()'s two arguments refer to YAML names, what warn_after / error_after of freshness mean, and the scope of the filter.
| Keyword | What it tests | What to remember |
|---|---|---|
| source() | Representing external references | The two arguments (source name, table name) are YAML names |
| freshness | Freshness monitoring | warn_after / error_after / filter / loaded_at_field |
| identifier | Absorbs physical-name differences | Bridges logical and physical names |
Analytics Engineer
Question 1
You want to reference billing_transactions in the raw dataset on BigQuery from a dbt model. The table is not managed by dbt. Which is the correct reference?
Correct answer: A
The correct approach is to define the external table as a Source and reference it via source(). This enables environment portability, lineage, tests, and freshness checks. ref() is for dbt-managed models. Hard-coded schemas or variable concatenation forfeit the lineage and freshness benefits.
Can I mix source() and ref() in the same model?
Yes. Use source() for the ingestion of external data and ref() for downstream internal dependencies. A common pattern is to call source() in stg models, then switch to ref() for marts/dim/fact models.
Can freshness checks be used against views or external tables?
Yes, as long as loaded_at_field can be resolved and the adapter can perform timestamp comparisons. When the query is expensive, use filter to narrow the time range and reduce load.
Can the same configuration be shared across different DWHs like Snowflake and Databricks?
Yes. Switch database/schema via profiles.yml or target variables, and always reference data through source()/ref() to keep models portable. Absorb case sensitivity and naming differences with identifier and quoting settings.
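A minimal sketch of the quoting settings mentioned in this answer, assuming one table whose physical name needs quoting. The name Ext-Orders is hypothetical, and the right defaults depend on how each warehouse folds unquoted identifiers.

```yaml
version: 2

sources:
  - name: raw
    schema: raw
    quoting:
      database: false
      schema: false
      identifier: false            # source-wide default: leave names unquoted
    tables:
      - name: orders_raw
        identifier: "Ext-Orders"   # hypothetical mixed-case name containing a symbol
        quoting:
          identifier: true         # quote only this table's physical name
```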
NicheeLab Editorial Team
The NicheeLab editorial team focuses on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.