dbt Seeds: Loading CSVs into the Warehouse (2026)

Seeds is the dbt feature that loads CSVs bundled with your project into the target data warehouse as tables. It lets you ship small to medium reference data under version control, insulated from changes in external systems.

This article walks through where to place files, how to configure them, type definitions, operational best practices, and when to choose Seeds versus other dbt features — calling out the points most often tested on the exam.

Seeds Basics: Building Stable Reference Tables from CSV

Seeds takes CSVs placed in your project's seeds directory and creates or updates them as physical tables in the warehouse. Unlike models, you don't write SQL — the CSV contents become the row data directly. The table name is normally derived from the filename (without the extension), and other models reference it via ref('...').

Typical use cases are small, static, shared reference data: country code dictionaries, category normalization tables, threshold and grouping definitions. The dbt seed command deploys them reproducibly, so they behave reliably in CI as well.

Location: under your project's seeds/ directory (older projects may still use data/)
Output: a physical table (not a view)
Reference: use ref('seed_name') from other models for JOINs and filters
CSVs require a header row — column names become the table's column names
Unless you specify types, columns are usually created as text. In practice, declaring types explicitly via column_types is strongly recommended

Where Seeds fits in the pipeline

Quick commands

dbt seed
# 特定のファイルだけ
# dbt seed --select path:seeds/countries.csv
# 他モデルから参照（SQL）
# select * from {{ ref('countries') }};

Folder Structure and Basic dbt_project.yml Settings

Place CSVs under your project's seeds/ directory. Subdirectories double as namespace organization, and the table name comes from the filename (without the extension). Schema, types, and identifier quoting are configured in the seeds section of dbt_project.yml.

Configuration precedence follows the standard resource rules — deeper scopes (project → package → directory → individual file) override shallower ones. In practice, decide three things up front for stability: carve out a dedicated schema for seeds, lock identifier quoting to match your environment, and always declare types explicitly.

Schema is set with +schema (e.g., reference). Switch by environment by driving schema rules off target.name
Column-name quoting (+quote_columns) changes how case and reserved words are handled per warehouse — pin it to your organization's standard
Declare column types explicitly with +column_types (effectively required when any non-text columns are involved)
CSVs require a header row. The default delimiter is a comma — adjust via delimiter if needed

Example dbt_project.yml (Seeds settings)

name: my_project
profile: my_profile

seeds:
  +schema: reference
  +quote_columns: false
  my_project:
    geo:  # seeds/geo/ 配下
      +column_types:
        country_code: varchar(2)
        region: varchar(32)
        is_eu: boolean
    lookup:
      +schema: reference
      # 必要に応じて区切り文字を変更
      # +delimiter: ";"

Type Definitions and Load Behavior: column_types and Identifier Handling

On most adapters, columns without a declared type land as text (STRING/VARCHAR). To make calculations and joins reliable, declare numeric, date, and boolean columns explicitly via column_types. Type names follow the adapter (Snowflake, BigQuery, Databricks, etc.).

Column-name quoting (quote_columns) drives how case and reserved words are handled. On Snowflake, enabling quoting preserves the case of column names, while disabling it upper-cases them. Align with your organization's naming convention (all lowercase, underscores, etc.) and stay consistent between Seeds and models.

Re-running dbt seed recreates or replaces the target table (per the adapter's implementation) and overwrites it with the CSV contents. There is no incremental application.

Recommended: declare every non-text column explicitly via column_types
For datetimes and time zones, verify your warehouse's types and comparison semantics in advance
Strip whitespace, NULLs, and duplicates from join keys on the Seeds side (a quality-assurance point often tested on exams)

Minimal column_types example

seeds:
  my_project:
    business:
      +column_types:
        id: bigint
        threshold: numeric(10,2)
        valid_from: date
        active: boolean

Operations: Update Flow, Environment Separation, and CI in Practice

With Seeds, CSV diffs are the data diffs. Keep them small for easy review and avoid stray columns or blank rows. Oversized Seeds inflate load time and warehouse maintenance overhead — as a rule of thumb, once you exceed tens of thousands to a hundred-plus thousand rows, consider external loading or staging through a regular model.

In CI/CD, run dbt seed early and follow with build for the dependent models. Narrow the scope with selectors to speed things up. If grants are required, apply rules to seeds as well.

Limit Seeds to small, stable reference data
Strictly separate schemas by environment (only reviewed CSVs reach prod)
Selective execution: seed only what's changed
Rebuild dependents with dbt build --select +ref('seed_name')

Commands you'll use often

# すべての Seeds
 dbt seed
# 変更のあった Seeds だけ（例: ディレクトリ指定）
 dbt seed --select path:seeds/geo/**
# Seeds 実行後に依存モデルをビルド
 dbt build --select +ref:countries
# タグで束ねて実行
 dbt seed --select config:tags:static

Testing and Model References: ref and Data Quality

Models reference Seeds via ref('seed_name'). Common patterns are JOINing them as dictionary tables or using them as accepted-value lists with IN. You can define data tests — not_null, unique, accepted_values, relationships — against Seeds just like any other model.

For exam prep, questions that ask you to distinguish Seeds (static reference data), Sources (variable source data), and Snapshots (history retention) come up often.

Lock down JOIN keys with both unique and not_null tests
Manage accepted-value lists via accepted_values (with Seeds as the single source of truth)
Always use ref inside models — never hard-code fully qualified names

Reference and test example

-- models/fct_orders.sql
select
  o.order_id,
  o.country_code,
  c.region
from {{ ref('stg_orders') }} as o
left join {{ ref('countries') }} as c
  on o.country_code = c.country_code;

# models/schema.yml（抜粋）
version: 2
seeds:
  - name: countries
    columns:
      - name: country_code
        tests: [not_null, unique]
      - name: region
        tests:
          - accepted_values:
              values: ["EMEA", "AMER", "APAC"]

Choosing Between Seeds, Sources, Snapshots, and External Loading

Seeds is the right choice when you need to deliver small, static, project-bundled data quickly and reliably. For variable data from external systems, cases requiring history retention, or large data volumes, other features are the standard answer.

Static, small, reference-use: Seeds
Continuously ingested from external DBs or object storage: Sources + models
Row-level versioning or Slowly Changing Dimensions: Snapshots
Hundreds of thousands of rows or more: ingest via external loading or ELT pipelines, then model

Use case / dimension	Seeds	Sources	Snapshots
Primary purpose	Create static reference tables from CSV	Ingest and reference raw data from external systems	Retain row-level history of a model (change tracking)
Data volume	Small to medium (a few to tens of thousands of rows)	Medium to large (assumes continuous ingestion)	Depends on the model (grows with history)
Update method	Replace/recreate via dbt seed	Load externally, then reference via source definitions	Apply diffs by running dbt snapshot
Type management	Declare explicitly via column_types (recommended)	Depends on the upstream schema	Depends on the snapshot strategy's key/column definitions
Typical exam focus	Choose Seeds for static dictionaries/accepted values	External table connections and source freshness	When to apply history retention and how to design keys

Check Your Understanding

Analytics Engineer

問題 1

You want to manage a country-code-to-region mapping in CSV and reference it identically on every deploy. Which dbt feature fits best?

A. Seeds
B. Sources
C. Snapshots
D. Macros

正解: A

Seeds creates DWH tables from CSVs bundled with the project — the perfect fit for static dictionary or reference data. Sources reference data from external systems, Snapshots retain row history, and Macros are for reusing logic. None of those match this use case.

Frequently Asked Questions

Can I use formats other than CSV (JSON, Parquet, etc.) for Seeds?

No. Seeds are CSV-based (you can adjust the delimiter). For other formats, the standard approach is to ingest them through external loads or staging, then treat them as regular models.

Are large Seeds (hundreds of thousands of rows or more) a problem?

Yes. Load time, transactions, and compile time all grow, dragging down your CI/CD. The exact threshold depends on your environment, but once you exceed tens of thousands to a hundred-plus thousand rows, consider ELT ingestion or external tables plus modeling instead.

What happens to dbt seed if a table with the same name already exists?

Depending on the adapter implementation, dbt replaces or recreates the table and overwrites it with the CSV contents. To avoid collisions, set up a dedicated schema and, if needed, separate namespaces via alias or schema configuration.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

How to Use dbt Seeds: Building Static Tables from CSV for Production and Exams