dbt source() Function: Upstream Table Reference (2026)

When referencing raw data in dbt, the standard practice is to use source() instead of writing the table name directly. This is not just a stylistic difference — it affects environment portability, documentation, lineage, tests, and freshness management.

This article walks through the basics of dbt's Sources definition and source() function, the benefits of adopting them, best practices for defining sources, tests and freshness checks, operational patterns, and the points most likely to appear on the certification exam.

source() Basics and Role

source() is a Jinja function that references external tables (such as raw data) that are not managed by dbt. It is called with two arguments — source('source_name', 'table_name') — both of which must match the names declared in the YAML sources definition (use identifier to map a logical name to a different physical table name).

ref() expresses dependencies between models that dbt builds, while source() expresses dependencies on external data. Both feed dbt's lineage graph and are resolved to the correct database, schema, and table name at compile time.

When to use: models that reference raw tables already in the DWH (e.g., raw logs, SaaS sync tables)
Avoid: hard-coding external tables with a schema prefix (fragile across environments and absent from lineage)
Use identifier to absorb differences between YAML name and the physical table name (keeps lineage readable)

Figure: The big picture of lineage (Sources → Staging → Marts)

Benefits of Sources (Portability, Lineage, Tests, Freshness)

Defining Sources and referencing them through source() gives you four things at once: environment-independent resolution, automatic lineage and documentation, source-level tests and freshness checks, and identifier, which absorbs differences between logical and physical table names. These are the points that come up most often in real work and on the exam.

Freshness in particular is unique to Sources: you can monitor the most recent ingestion timestamp and detect lag early. Defining Sources properly as the starting point of observability is the most cost-effective choice over the medium and long term.

Portability: absorb database/schema differences in YAML; models always reference raw tables via source()
Lineage and docs: Sources and their dependencies show up in dbt docs
Tests: attach generic tests like not_null and unique to source columns
Freshness: control monitoring scope with warn_after / error_after and filter

Approach	Main purpose	Portability	Lineage / Documentation
source()	Reference external tables	High (resolved via YAML)	Included (as Sources)
ref()	Reference dbt models	High (resolved per target by dbt)	Included (as models)
Hard-coded schema	One-off references	Low (environment-bound)	Not included

Definition Approach and Best Practices

Sources are declared in a YAML properties file such as schema.yml (version: 2). sources[].name is the logical source name, used as the first argument of source(), and tables[].name is the logical table name, used as the second argument. When the physical table name differs, specify identifier. database and schema are often set per target, for example with target, var(), or env_var() expressions in the YAML.

As a best practice, separate sources per landing zone (such as raw), and document the description, owner, and read-permission scope in description and meta. When ingestion lag from external systems is expected, set warn_after / error_after wider than that lag to avoid false positives, and narrow the freshness filter (for example, to the current day) so the check scans fewer rows.

name is the logical name; identifier is the physical table name (don't conflate them)
Document the purpose in description (or a docs block) and the owner and SLA under meta
Absorb environment differences for database/schema via target variables or profiles.yml
Project out unnecessary columns in the downstream stg model to keep query load low

Setting	Role	Notes
name / tables.name	Logical name used in source() / docs	Pair with identifier when it differs from the physical name
identifier	Specifies the physical table name	Absorbs case and symbol differences
freshness	Freshness thresholds	Narrow the time range with filter to reduce execution cost

Example Sources definition in YAML and model reference

# schema.yml (version: 2)
version: 2
sources:
  - name: raw          # logical name (first arg to source())
    database: "{{ target.database }}"   # resolved per environment; Jinja in YAML must be quoted
    schema: raw
    description: Raw landing zone (externally managed)
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
      filter: "_ingested_at >= dateadd('day', -1, current_timestamp())"
    tables:
      - name: orders_raw      # logical name (second arg to source())
        identifier: ext_orders # physical table name
        loaded_at_field: _ingested_at
        columns:
          - name: order_id
            tests:
              - not_null

-- models/stg_orders.sql
with src as (
  select *
  from {{ source('raw', 'orders_raw') }}
)
select
  cast(order_id as string) as order_id,
  customer_id,
  order_ts,
  _ingested_at
from src  -- no trailing semicolon: dbt wraps the model SQL in its own statement
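
The definition above hard-codes schema: raw. As a minimal sketch of the per-target override and meta documentation described earlier, the fragment below picks the database from target.name and reads the schema from a project variable. The names raw_prod, raw_dev, and raw_schema, the owner value, and the sla key are hypothetical placeholders, not dbt built-ins; meta accepts arbitrary keys.

```yaml
version: 2
sources:
  - name: raw
    # Hypothetical database names; adjust to your warehouse.
    database: "{{ 'raw_prod' if target.name == 'prod' else 'raw_dev' }}"
    # Falls back to 'raw' when the raw_schema var is not set.
    schema: "{{ var('raw_schema', 'raw') }}"
    meta:
      owner: data-platform-team   # free-form metadata (hypothetical value)
      sla: "loaded hourly"        # hypothetical key; meta is not validated by dbt
    tables:
      - name: orders_raw
        identifier: ext_orders
        description: Orders as delivered by the upstream system
```

With a definition like this, the same source('raw', 'orders_raw') call compiles to a different database per target, and the model SQL stays unchanged.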

Tests and Freshness Checks (dbt test / dbt source freshness)

Generic tests on source columns such as not_null and unique are declared in YAML the same way as model column tests. Running dbt test validates source columns as well, surfacing anomalies early.

Freshness is evaluated with the dbt source freshness command. It measures the gap between the latest loaded_at_field value and the current time, and reports a warn or error status when that gap exceeds warn_after or error_after. Tune it to match the data: for example, use filter to skip long historical ranges and scan only recent rows, such as the current day's data.

dbt test: enforces column quality (early detection of null contamination, etc.)
dbt source freshness: monitors ingestion lag (detects SLA breaches)
loaded_at_field should ideally be a timestamp type; cast strings appropriately when needed
When run on a schedule, persist the results artifact (target/sources.json) so freshness status can be reviewed over time and surfaced alongside the project documentation

Setting / Command	Function	Notes
columns[].tests	Quality tests for source columns	not_null, unique, and so on
loaded_at_field	Reference column for freshness	Specify the column that holds the ingestion timestamp
dbt source freshness	Run freshness evaluation	Evaluates warn_after / error_after
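
As a sketch of how these settings combine, the fragment below keeps a source-level freshness default, tightens it for one table, and switches it off for another by setting freshness to null. The table names events_raw and country_codes are hypothetical. Depending on your dbt version, tests may be spelled data_tests and the freshness settings may belong under a config block, so check the documentation for the version you run.

```yaml
version: 2
sources:
  - name: raw
    schema: raw
    loaded_at_field: _ingested_at        # default for every table below
    freshness:                           # source-level default
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
    tables:
      - name: orders_raw
        columns:
          - name: order_id
            tests:
              - not_null
              - unique
      - name: events_raw                 # hypothetical high-volume table
        freshness:                       # table-level override
          warn_after: {count: 30, period: minute}
          error_after: {count: 1, period: hour}
      - name: country_codes              # hypothetical static lookup table
        freshness: null                  # skip freshness checks for this table
```

To check a single table, a selector such as dbt source freshness --select "source:raw.events_raw" can be used; dbt test --select "source:raw" runs the tests attached to the source.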

Operational Patterns and Anti-Patterns

The foundation of stable operations is a three-layer pattern: "declare external data with Sources," "shape it in the stg layer," and "connect downstream layers with ref()." Resolve differences between environments and adapters (Snowflake / Databricks / BigQuery, etc.) within Sources and profiles.yml, and keep models abstracted behind source()/ref().

Avoid hard-coded schemas and fragile dependencies that assume physical table names won't change. With identifier, naming differences are absorbed and downstream impact is minimized.

Recommended: three layers raw → stg → marts; source() externally, ref() internally
Recommended: use identifier to absorb physical-name drift; keep name as a business logical name
Discouraged: hard-coded schemas; embedding environment-specific DB names in models
Discouraged: giving up freshness monitoring because loaded_at_field isn't set

Pattern	Strengths	Risks / Notes
source() + stg shaping	High observability and maintainability	Upfront definition cost pays off long term
Hard-coded references	Quick short-term experiments	Breaks during environment migration, audits, or changes
Use of identifier	Localizes the impact of naming changes	Watch out for conflating logical and physical names

Key Points for the Analytics Engineer Exam

The exam frequently tests when to use source() versus ref() and the YAML definition of Sources (the role of name vs. identifier, loaded_at_field, freshness). Questions on environment portability and lineage often mark hard-coded schemas as incorrect.

For code-reading questions, expect to be asked whether source()'s two arguments refer to YAML names, what warn_after / error_after of freshness mean, and the scope of the filter.

External data uses source(); internal models use ref()
Don't confuse name and identifier (source() references name)
Freshness is evaluated based on loaded_at_field
Hard-coded references don't appear in lineage and tend to be wrong answers

Keyword	What it tests	What to remember
source()	Representing external references	The two arguments (source name, table name) are YAML names
freshness	Freshness monitoring	warn_after / error_after / filter / loaded_at_field
identifier	Absorbs physical-name differences	Bridges logical and physical names

Check Your Understanding

Analytics Engineer

Question 1

You want to reference billing_transactions in the raw dataset on BigQuery from a dbt model. The table is not managed by dbt. Which is the correct reference?

A. select * from {{ source('raw', 'billing_transactions') }}
B. select * from {{ ref('billing_transactions') }}
C. select * from raw.billing_transactions
D. select * from {{ var('raw_schema') }}.billing_transactions

Correct answer: A

A is correct: external tables should be defined as Sources and referenced via source(). This enables environment portability, lineage, tests, and freshness checks. ref() is for dbt-managed models. Hard-coded schemas or variable concatenation forfeit the lineage and freshness benefits.
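
For answer A to compile, the project needs a matching Sources declaration. A minimal sketch follows; it relies on schema defaulting to the source name when omitted, so raw resolves to the raw dataset in the target's default project.

```yaml
version: 2
sources:
  - name: raw                        # schema (BigQuery dataset) defaults to this name
    tables:
      - name: billing_transactions   # no identifier needed: logical and physical names match
```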

Frequently Asked Questions

Can I mix source() and ref() in the same model?

Yes. Use source() for the ingestion of external data and ref() for downstream internal dependencies. A common pattern is to call source() in stg models, then switch to ref() for marts/dim/fact models.

Can freshness checks be used against views or external tables?

Yes, as long as loaded_at_field can be resolved and the adapter can perform timestamp comparisons. When the query is expensive, use filter to narrow the time range and reduce load.

Can the same configuration be shared across different DWHs like Snowflake and Databricks?

Yes. Switch database/schema via profiles.yml or target variables, and always reference data through source()/ref() to keep models portable. Absorb case sensitivity and naming differences with identifier and quoting settings.
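
One way to handle the naming differences mentioned above is to combine identifier with the quoting setting. In this sketch the physical name Order-Items is a hypothetical table that needs quoting because of its mixed case and hyphen.

```yaml
version: 2
sources:
  - name: raw
    schema: raw
    tables:
      - name: order_items            # logical name used in source('raw', 'order_items')
        identifier: "Order-Items"    # hypothetical physical table name
        quoting:
          identifier: true           # quote only the table identifier when rendering
```

Models keep calling source('raw', 'order_items'), so a later rename of the physical table only touches this YAML.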

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
