dbt

Guarding Required Columns with the dbt not_null Test: Practice & Analytics Engineer Prep

2026-04-19
NicheeLab Editorial Team

not_null is dbt's built-in generic test that guarantees a column contains zero NULL values.

Applying it at both the source and model layers to catch missing values early is important both in practice and on the exam.

not_null Basics and What the Exam Tests Most

not_null is a generic test that verifies a target column has zero NULLs. dbt runs every test as a SELECT that returns failing rows: 0 rows means pass, 1+ rows means fail. It is a stable feature of the official documentation and applies to models, sources, and seeds.

On the Analytics Engineer exam, expect questions on where to place not_null (both sources and staging/intermediate/fact layers), test configuration (severity, store_failures, where), and how to choose between related tests (unique, relationships, accepted_values).

  • Pass condition: failing rows = 0
  • Applies to: models, sources, seeds
  • Common pairings: unique (primary key), relationships (foreign-key style integrity)
  • Operations: roll out incrementally — severity=error for critical columns, warn for observation columns

Where not_null sits in the data quality chain

Sourcesnot_null on raw colsStagingnot_null on cleaned keysCore/Fact Modelsnot_null on business keysBI/Export(consumers)CI: dbt test (failure = deploy block)

YAML Implementation Patterns (models, sources, seeds)

The minimal setup is to list not_null under a column's tests section. Use severity=error for high-importance columns and warn while you are still in observation mode, rolling CI out gradually. The where clause can exclude logically deleted rows. Enabling store_failures persists failing rows to an audit table, making root-cause analysis easier (the storage schema and naming depend on the adapter).

Define not_null at the source layer too so that upstream NULLs are caught early. From staging onward, the standard practice is to lock down primary keys, foreign keys, and business keys with not_null.

  • models: target business keys, primary keys, and foreign keys
  • sources: monitor upstream quality on columns that must always have a value
  • seeds: apply to required columns of reference master data
  • where: exclude rows with conditions like is_deleted=false
  • store_failures: persist failing rows to shorten root-cause investigation

YAML example (applying not_null on models and sources)

version: 2

models:
  - name: fct_orders
    description: Final fact table for orders
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null:  # critical columns use error
              severity: error
              store_failures: true
      - name: customer_id
        description: Foreign key to dim_customers
        tests:
          - not_null:
              severity: error
              where: "is_deleted = false"  # exclude logically deleted rows
          - relationships:
              to: ref('dim_customers')
              field: customer_id

sources:
  - name: app
    schema: raw_app
    tables:
      - name: orders_raw
        columns:
          - name: order_id
            tests:
              - not_null  # catch upstream NULLs early
          - name: created_at
            tests:
              - not_null: { severity: warn }  # monitor while rolling out gradually

Running and Tuning (severity, store_failures, selective execution)

dbt test can be scoped to specific nodes. Specifying a model name and running only the tests attached to it is the most reliable workflow. In CI, the easiest policy is to block on critical columns with severity=error and just log warn-level issues during observation.

store_failures keeps failing rows for investigation. To avoid long-term accumulation, use a scheduled cleanup policy or a dedicated schema.

  • Basic run: dbt test -s fct_orders
  • By tag: tag critical tests and run dbt test -s tag:critical
  • All generic tests: dbt test -s test_type:generic (selectors vary by adapter/version — check the official docs)
  • CI: failure = deploy block (severity=error); warn only logs

Common Pitfalls and How to Avoid Them

Empty strings are not NULLs: in most warehouses (Snowflake, BigQuery, Redshift, etc.) '' and NULL are different. not_null does not catch empty strings, so when upstream may send them, normalize first in staging with NULLIF(TRIM(col), '') and only then apply not_null.

JOINs can introduce NULLs: a LEFT JOIN produces NULL on the right side for unmatched rows. When a column that must satisfy not_null ends up NULL after a join, reconsider the join strategy (switch to INNER JOIN, pre-filter, or apply COALESCE).

Casts and derived columns: failed CASTs or conditional branches can quietly produce NULLs. Either default the value with COALESCE immediately after derivation, or reinforce quality by pairing with accepted_values to constrain the expected range.

  • Empty strings: normalize with NULLIF(TRIM(col), '') before applying not_null
  • Joins: switch to INNER JOIN, pre-align data, or COALESCE
  • Derived columns: COALESCE / default values + supplemental tests
  • Large tables: narrow scope with where (combine with operational policy)

Designing and Governing Required Columns

not_null at the source layer is for early detection, while not_null in staging and core layers enforces the data contract. In particular, monitor primary keys, foreign keys, and business keys continuously by combining not_null with unique and relationships.

In CI/CD, make a not_null failure on a critical column a deploy-blocking condition. Persist failing rows to an audit schema via store_failures, and document the SLA from detection to recovery.

  • Define required columns per layer and stack tests on top of each other
  • Block on critical columns with severity=error
  • Persist failing rows and alert on them to shorten MTTR
  • Documentation: spell out which columns are required and why

Choosing Between not_null and Related Tests (Comparison Table)

not_null guarantees existence, but full integrity comes from combining it with other tests. Knowing how it compares to the main alternatives makes the exam's multiple-choice questions much easier.

  • Primary key: not_null + unique
  • Foreign-key style integrity: not_null + relationships
  • Code values: not_null + accepted_values
TestPurposeTypical failure exampleOperational tip
not_nullEliminate NULLs (guarantee existence)Join misses turn rows into NULL; empty strings not normalized to NULLNormalize NULLs in staging, then apply not_null
uniqueEliminate duplicates (uniqueness)The same ID appears on multiple rowsPair with not_null to guarantee a primary key
relationshipsReferential integrity (foreign key)An ID that does not exist in the parent tableApply not_null on both parent and child to detect orphans
accepted_valuesAdherence to a code taxonomyUnexpected category valuesIf you want to treat NULL separately, mind the order in which tests are combined

Check Your Understanding

Analytics Engineer

問題 1

Which configuration best satisfies all of the following requirements at once? Requirements: 1) fct_orders.order_id is required, so a CI failure must block the deploy; 2) rows where is_deleted=true must be excluded from validation; 3) failing rows must be persisted for investigation.

  1. Define not_null on order_id with severity=error, where="is_deleted = false", and store_failures=true
  2. Define not_null on order_id with severity=warn, where="is_deleted = true", and store_failures=false
  3. Define only unique on order_id and skip tests in CI
  4. Define accepted_values on order_id and exclude NULL from the allowed values

正解: A

Blocking in CI requires severity=error, the exclusion condition is is_deleted=false, and persisting failing rows requires store_failures=true. not_null is the best fit for these requirements.

Frequently Asked Questions

Does not_null detect empty strings ('')?

In most data warehouses, '' is not NULL. not_null only detects NULLs. To guard against empty strings, normalize them in staging with NULLIF(TRIM(col), '').

not_null runs slowly on large tables. Can I speed it up?

Practical options include narrowing scope with where, filtering on partition columns, and running only critical columns first (selected via tags). Run full-table validation on a periodic batch schedule for safety.

Should I define not_null on sources or on models?

Both. Use it on sources to catch upstream NULLs early, and re-verify required columns (primary keys, foreign keys, business keys) on models to enforce the contract twice.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.