Using dbt's source() Function Correctly: Benefits of Sources and Exam Tips

2026-04-19
NicheeLab Editorial Team

When referencing raw data in dbt, the standard practice is to use source() instead of writing the table name directly. This is not just a stylistic difference — it affects environment portability, documentation, lineage, tests, and freshness management.

This article walks through the basics of dbt's Sources definition and source() function, the benefits of adopting them, best practices for defining sources, tests and freshness checks, operational patterns, and the points most likely to appear on the certification exam.

source() Basics and Role

source() is a Jinja function that references external tables (such as raw data) that are not managed by dbt. It is called with two arguments — source('source_name', 'table_name') — both of which must match the names declared in the YAML sources definition (use identifier to map a logical name to a different physical table name).

ref() expresses dependencies between models that dbt builds, while source() expresses dependencies on external data. Both feed dbt's lineage graph and are resolved to the correct database, schema, and table name at compile time.

  • When to use: models that reference raw tables already in the DWH (e.g., raw logs, SaaS sync tables)
  • Avoid: hard-coding external tables with a schema prefix (fragile across environments and absent from lineage)
  • Naming: use identifier to absorb differences between the YAML name and the physical table name (keeps lineage readable)
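
As a rough illustration of how the two arguments map onto YAML, the sketch below declares a hypothetical source named crm. The source, table, and physical names are invented for this example. When schema is omitted, dbt uses the source name as the schema.

```yaml
# models/staging/crm/_crm__sources.yml (hypothetical path and names)
version: 2

sources:
  - name: crm                     # first argument: source('crm', ...)
    # schema is omitted, so dbt falls back to the source name ("crm")
    tables:
      - name: accounts            # source('crm', 'accounts') -> <target database>.crm.accounts
      - name: contacts            # source('crm', 'contacts') -> <target database>.crm.CONTACTS_V2
        identifier: CONTACTS_V2   # physical name differs from the logical name
```

A model would then select from {{ source('crm', 'contacts') }}. If the physical table is renamed later, only identifier changes and the model stays untouched.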

The big picture of lineage (Sources → Staging → Marts)

[Diagram: raw (source) table orders_raw (id...) → stg_orders (model), via source('raw','orders_raw') → dim_orders, via ref('stg_orders')]

Benefits of Sources (Portability, Lineage, Tests, Freshness)

Defining Sources and referencing them through source() gives you, all at once, environment-independent resolution, automatic lineage and documentation, source-level tests and freshness checks, and an identifier that absorbs naming differences. These are the points that come up most often in real work and on the exam.

Freshness in particular is unique to Sources: you can monitor the most recent ingestion timestamp and detect lag early. Defining Sources properly as the starting point of observability is the most cost-effective choice over the medium and long term.

  • Portability: absorb database/schema differences in YAML; models always reference via source()
  • Lineage and docs: Sources and their dependencies show up in dbt docs
  • Tests: attach generic tests like not_null and unique to source columns
  • Freshness: set thresholds with warn_after / error_after and limit the rows scanned with filter
Approach | Main purpose | Portability | Lineage / Documentation
source() | Reference external tables | High (resolved via YAML) | Included (as Sources)
ref() | Reference dbt models | High (resolved by adapter) | Included (as models)
Hard-coded schema | One-off references | Low (environment-bound) | Not included
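
One way to keep environment differences in YAML is sketched below. It assumes a hypothetical project variable raw_schema and a hypothetical environment variable DBT_RAW_DATABASE; neither name comes from this article. Both var() and env_var() accept a default as their second argument.

```yaml
# Hypothetical sketch: resolve the raw location per environment
version: 2

sources:
  - name: raw
    # Falls back to the target's database when DBT_RAW_DATABASE is unset
    database: "{{ env_var('DBT_RAW_DATABASE', target.database) }}"
    # Override at run time with: dbt run --vars '{raw_schema: raw_dev}'
    schema: "{{ var('raw_schema', 'raw') }}"
    tables:
      - name: orders_raw
```

Models keep calling {{ source('raw', 'orders_raw') }} unchanged in every environment; only the YAML resolution differs.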

Definition Approach and Best Practices

Sources are declared in a YAML properties file such as schema.yml (version: 2). sources[].name is the logical name passed as the first argument of source(), and tables[].name is the logical name passed as the second. When the physical table name differs, specify identifier. database and schema are often set per target with Jinja such as target, var(), or env_var().

As a best practice, separate sources per landing zone (such as raw), and document the description, owner, and read-permission scope in a docs block or meta. When ingestion lag from external systems is expected, set warn_after / error_after to match the agreed SLA to avoid false alarms, and use the freshness filter to restrict the check to recent data (for example, the last day).

  • name is the logical name; identifier is the physical table name (don't conflate them)
  • Document description, owner, and SLA under docs/meta
  • Absorb environment differences for database/schema via target variables or profiles.yml
  • Project out unnecessary columns in the downstream stg model to keep query load low
Setting | Role | Notes
name / tables.name | Logical name used in source() / docs | Pair with identifier when it differs from the physical name
identifier | Specifies the physical table name | Absorbs case and symbol differences
freshness | Freshness thresholds | Narrow the time range with filter to reduce execution cost
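
The documentation advice above could look like the following sketch. The meta keys (owner, sla_hours, access) and the docs block name raw_orders are illustrative choices, since meta accepts arbitrary key-value pairs. Depending on your dbt version, meta may be expected under a config: block instead.

```yaml
version: 2

sources:
  - name: raw
    schema: raw
    description: Raw landing zone (externally managed)
    meta:
      owner: data-platform-team   # hypothetical owner
      sla_hours: 6                # hypothetical SLA
      access: read-only           # hypothetical permission note
    tables:
      - name: orders_raw
        identifier: ext_orders
        # Expects a docs block named raw_orders in a .md file in the project
        description: "{{ doc('raw_orders') }}"
```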

Example Sources definition in YAML and model reference

# schema.yml (version: 2)
version: 2
sources:
  - name: raw          # logical name (first arg to source())
    database: "{{ target.database }}"   # resolved per environment (quoted so YAML can parse the Jinja)
    schema: raw
    description: Raw landing zone (externally managed)
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
      filter: "_ingested_at >= dateadd('day', -1, current_timestamp())"
    tables:
      - name: orders_raw      # logical name (second arg to source())
        identifier: ext_orders # physical table name
        loaded_at_field: _ingested_at
        columns:
          - name: order_id
            tests:
              - not_null

-- models/stg_orders.sql
with src as (
  select *
  from {{ source('raw', 'orders_raw') }}
)
select
  cast(order_id as string) as order_id,
  customer_id,
  order_ts,
  _ingested_at
from src  -- no trailing semicolon: dbt wraps the model's SELECT in its own statement

Tests and Freshness Checks (dbt test / dbt source freshness)

Generic tests on source columns such as not_null and unique are declared in YAML the same way as model column tests. Running dbt test validates source columns as well, surfacing anomalies early.

Freshness is evaluated with the dbt source freshness command. It measures the gap between the most recent loaded_at_field value and the current time, and reports a warn or error status once warn_after / error_after is exceeded. Tune it to match the data: for example, use filter to restrict the check to recent rows (such as the current day) instead of scanning long historical ranges.

  • dbt test: enforces column quality (early detection of null contamination, etc.)
  • dbt source freshness: monitors ingestion lag (detects SLA breaches)
  • loaded_at_field should ideally be a timestamp type; cast strings appropriately when needed
  • When run on a schedule, persist the results as artifacts (the command writes sources.json to the target directory) and surface them alongside dbt docs
Setting / Command | Function | Notes
columns[].tests | Quality tests for source columns | not_null, unique, and so on
loaded_at_field | Reference column for freshness | Specify the column that holds the ingestion timestamp
dbt source freshness | Run freshness evaluation | Evaluates warn_after / error_after
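
As a hedged sketch of the tuning described above, a table can override the source-level thresholds, use a SQL expression as loaded_at_field when the column is stored as a string, or opt out with freshness: null. The table names events_raw and country_codes and the column loaded_at_text are hypothetical.

```yaml
version: 2

sources:
  - name: raw
    schema: raw
    loaded_at_field: _ingested_at        # default for every table below
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
    tables:
      - name: orders_raw                 # inherits the source-level settings
      - name: events_raw                 # hypothetical table with a looser SLA
        loaded_at_field: "cast(loaded_at_text as timestamp)"  # string column cast in place
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: country_codes              # hypothetical static lookup
        freshness: null                  # excluded from freshness checks
```

With these settings, if the newest _ingested_at in orders_raw is 3 hours old, the check reports warn; at 7 hours it reports error. The run can be limited with node selection, for example dbt source freshness --select source:raw.orders_raw.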

Operational Patterns and Anti-Patterns

The foundation of stable operations is a three-layer pattern: "declare external data with Sources," "shape it in the stg layer," and "connect downstream layers with ref()." Resolve differences between environments and adapters (Snowflake / Databricks / BigQuery, etc.) within Sources and profiles.yml, and keep models abstracted behind source()/ref().

Avoid hard-coded schemas and fragile dependencies that assume physical table names won't change. With identifier, naming differences are absorbed and downstream impact is minimized.

  • Recommended: three layers raw → stg → marts; source() externally, ref() internally
  • Recommended: use identifier to absorb physical-name drift; keep name as a business logical name
  • Discouraged: hard-coded schemas; embedding environment-specific DB names in models
  • Discouraged: giving up freshness monitoring because loaded_at_field isn't set
Pattern | Strengths | Risks / Notes
source() + stg shaping | High observability and maintainability | Upfront definition cost pays off long term
Hard-coded references | Quick short-term experiments | Breaks during environment migration, audits, or changes
Use of identifier | Localizes the impact of naming changes | Watch out for conflating logical and physical names

Key Points for the Analytics Engineer Exam

The exam frequently tests when to use source() versus ref() and the YAML definition of Sources (the role of name vs. identifier, loaded_at_field, freshness). Questions on environment portability and lineage often mark hard-coded schemas as incorrect.

For code-reading questions, expect to be asked whether source()'s two arguments refer to YAML names, what warn_after / error_after of freshness mean, and the scope of the filter.

  • External data uses source(); internal models use ref()
  • Don't confuse name and identifier (source() references name)
  • Freshness is evaluated based on loaded_at_field
  • Hard-coded references don't appear in lineage and tend to be wrong answers
Keyword | What it tests | What to remember
source() | Representing external references | The two arguments (source name, table name) are YAML names
freshness | Freshness monitoring | warn_after / error_after / filter / loaded_at_field
identifier | Absorbs physical-name differences | Bridges logical and physical names

Check Your Understanding

Analytics Engineer

Question 1

You want to reference billing_transactions in the raw dataset on BigQuery from a dbt model. The table is not managed by dbt. Which is the correct reference?

  1. select * from {{ source('raw', 'billing_transactions') }}
  2. select * from {{ ref('billing_transactions') }}
  3. select * from raw.billing_transactions
  4. select * from {{ var('raw_schema') }}.billing_transactions

Correct answer: 1

External tables should be defined as Sources and referenced via source(), so option 1 is correct. This enables environment portability, lineage, tests, and freshness. ref() is for dbt-managed models. Hard-coded schemas or variable concatenation forfeit the lineage and freshness benefits.
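
For the source() reference in the question to compile, the source has to be declared somewhere in the project. A minimal sketch, assuming the BigQuery dataset is literally named raw and lives in the target's default project:

```yaml
version: 2

sources:
  - name: raw                        # schema (BigQuery dataset) defaults to the source name
    tables:
      - name: billing_transactions   # matches the second argument of source()
```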

Frequently Asked Questions

Can I mix source() and ref() in the same model?

Yes. Use source() for the ingestion of external data and ref() for downstream internal dependencies. A common pattern is to call source() in stg models, then switch to ref() for marts/dim/fact models.

Can freshness checks be used against views or external tables?

Yes, as long as loaded_at_field can be resolved and the adapter can perform timestamp comparisons. When the query is expensive, use filter to narrow the time range and reduce load.

Can the same configuration be shared across different DWHs like Snowflake and Databricks?

Yes. Switch database/schema via profiles.yml or target variables, and always reference data through source()/ref() to keep models portable. Absorb case sensitivity and naming differences with identifier and quoting settings.
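
A sketch of the quoting settings mentioned above, assuming a hypothetical physical table named Orders-Export whose mixed case and hyphen require quoting on a warehouse such as Snowflake:

```yaml
version: 2

sources:
  - name: raw
    schema: raw
    quoting:
      database: false
      schema: false
      identifier: false              # default for tables in this source
    tables:
      - name: orders_export          # logical name used in source()
        identifier: "Orders-Export"  # hypothetical physical name
        quoting:
          identifier: true           # quote only this table's identifier
```

Models still call {{ source('raw', 'orders_export') }}; the quoting decision stays in YAML and can differ per warehouse.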
