Databricks Glossary: 100 Essential Terms for Certification Exams (2026)

Databricks certification exams frequently test specialized vocabulary around Delta Lake, Unity Catalog, MLflow, and Spark. A precise grasp of these terms is the foundation for passing. This article organizes 100 carefully selected exam-critical terms into 7 category tables. Use it as your study dictionary.

Delta Lake Terms (20)

Delta Lake is Databricks' storage layer and the single most heavily tested topic across all exams. Lock down the vocabulary around transaction management, data quality, and performance optimization.

Term	Name (Abbreviation / Alias)	Definition
Delta Lake	Delta Lake	Open-source storage layer that adds ACID transactions, schema management, and Time Travel on top of Parquet files. The default table format on Databricks.
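Transaction Log	Transaction Log (_delta_log)	The core of Delta Lake; records every commit to a table as an ordered JSON file. By default a Parquet checkpoint is written every 10 commits so readers do not have to replay the whole log.
Time Travel	Time Travel	Feature that lets you query or restore past table states using VERSION AS OF or TIMESTAMP AS OF. Log history is kept for 30 days by default, but a version is only queryable while its data files still exist, and VACUUM removes unreferenced files older than 7 days by default.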
OPTIMIZE	OPTIMIZE	Command that compacts many small files into larger ones. Often combined with ZORDER BY to improve query performance.
Z-ORDER	Z-ORDER	Data layout optimization that co-locates rows with similar column values into the same files, improving file-skipping rates for filter queries.
Liquid Clustering	Liquid Clustering	Successor to partitioning and Z-ORDER. Clustering keys are declared with a CLUSTER BY clause and can be changed later without rewriting existing data. Clustering at write time is best-effort; OPTIMIZE (run manually or by Predictive Optimization) clusters the data incrementally.
VACUUM	VACUUM	Command that physically deletes data files the table no longer references. By default it removes unreferenced files older than 7 days (168 hours); versions that depend on those files can no longer be reached with Time Travel.
Schema Evolution	Schema Evolution	Feature that automatically adds new columns to the table schema when the mergeSchema option is set to true on a write. Existing rows receive null for the new columns.
Schema Enforcement	Schema Enforcement	Feature that rejects writes that don't match the table schema. Enabled by default in Delta Lake.
MERGE INTO	MERGE INTO	Upsert operation that matches source and target on a join condition, then updates/deletes matched rows and inserts unmatched ones. Also used to implement SCDs.
Change Data Feed	Change Data Feed (CDF)	Feature that exposes INSERT/UPDATE/DELETE operations on a table as a stream of change records. Used to build incremental processing pipelines.
Medallion Architecture	Medallion Architecture	Design pattern that progressively improves data quality across three layers: Bronze (raw), Silver (cleansed), and Gold (aggregated and analytics-ready).
Delta Live Tables	Delta Live Tables (DLT)	Declarative pipeline definition framework. Transformations are defined with the @dlt.table decorator, while dependency resolution and quality checks are automated.
Expectations	Expectations	DLT data-quality constraints. Three variants: @dlt.expect (warn), @dlt.expect_or_drop (drop bad rows), and @dlt.expect_or_fail (fail the pipeline).
Photon	Photon	Vectorized query engine implemented in C++ and compatible with the Spark APIs. Speedups vary by workload; treat figures such as "up to 12x" as best-case vendor numbers rather than a guarantee.
Delta Clone	Delta Clone	Feature for cloning a table. Two variants: SHALLOW CLONE (metadata only) and DEEP CLONE (full copy of metadata and data).
Delta Constraints	Delta Constraints	Guarantees data quality via CHECK constraints (reject rows that don't satisfy a predicate) and NOT NULL constraints (reject null values).
Materialized View	Materialized View	A view whose query results are physically persisted. Reads return the precomputed result instead of recomputing on access, so they are fast.
Streaming Table	Streaming Table	An append-only table defined in DLT. Processed incrementally via spark.readStream.
Predictive Optimization	Predictive Optimization	Feature that learns a table's usage patterns and automatically runs OPTIMIZE, VACUUM, and statistics collection at the optimal time.
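
To make the Time Travel, OPTIMIZE, and VACUUM rows above more concrete, here is a minimal sketch that only builds the corresponding SQL statements as strings. The helper function names, the table name main.sales.orders, and the column customer_id are hypothetical and chosen for illustration; nothing is executed against a real table.

```python
def time_travel_query(table, version):
    """Build a SELECT that reads a past version of a Delta table."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def optimize_statement(table, zorder_cols=()):
    """Build an OPTIMIZE statement, optionally with ZORDER BY columns."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


def vacuum_statement(table, retain_hours=168):
    """Build a VACUUM statement; 168 hours is the 7-day default threshold."""
    return f"VACUUM {table} RETAIN {int(retain_hours)} HOURS"


table = "main.sales.orders"  # hypothetical table name
statements = [
    time_travel_query(table, 3),
    optimize_statement(table, ["customer_id"]),  # hypothetical column
    vacuum_statement(table),
]
for stmt in statements:
    print(stmt)
```

In a Databricks notebook each string could be passed to spark.sql(stmt); that call is left out so the sketch runs anywhere.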

Unity Catalog Terms (15)

Unity Catalog is Databricks' unified governance layer and shows up frequently on the DEA and DEP exams. Understand the three-level namespace, access control, and data lineage concepts.

Term	Name (Abbreviation / Alias)	Definition
Unity Catalog	Unity Catalog	Unified governance solution that centrally manages access control, auditing, lineage, and discovery across data and AI assets.
Metastore	Metastore	The top-level container in Unity Catalog. One metastore is created per region; it manages the catalog → schema → table hierarchy.
Catalog	Catalog	Top level of the three-level namespace (catalog.schema.table). Commonly used to separate production from development environments.
Schema	Schema (Database)	Second level of the three-level namespace. Logically groups tables, views, and functions. Synonymous with "database" in SQL.
External Location	External Location	A cloud storage path under Unity Catalog management. Access to S3 or ADLS is mediated through Unity Catalog.
Storage Credential	Storage Credential	The IAM role or service principal used to access cloud storage. Bound to an external location.
Managed Table	Managed Table	Table whose data and metadata are both managed by Databricks. Running DROP TABLE deletes the underlying data as well.
External Table	External Table	Table whose metadata is managed by Unity Catalog but whose data lives in external storage. DROP TABLE leaves the data intact.
Data Lineage	Data Lineage	Feature that automatically tracks and visualizes data origin and transformation history. Records inter-table dependencies for impact analysis.
GRANT / REVOKE	GRANT / REVOKE	GRANT SELECT ON TABLE grants read access to a table. Privileges are inherited down the catalog → schema → table hierarchy.
Dynamic View	Dynamic View	View that uses current_user() or is_member() to filter rows and columns per user, enabling row- and column-level access control.
Volume	Volume	Non-tabular file storage governed by Unity Catalog. Holds images, PDFs, CSVs, and so on. Comes in managed and external variants.
Row Filter / Column Mask	Row Filter / Column Mask	Access control where Row Filter limits visible rows and Column Mask masks column values (e.g., replacing part of an email with ****).
Information Schema	Information Schema	System schema that exposes metadata such as tables, columns, and privileges within a catalog so it can be queried via SQL.
Delta Sharing	Delta Sharing	Open protocol for securely sharing data across organizations without physical copies. Accessible even from non-Databricks environments.
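
The inheritance rule in the GRANT / REVOKE row can be pictured with a toy model in plain Python. This is not the Unity Catalog API: the grants set, the principal name analysts, and the object names are all made up, and real Unity Catalog additionally requires privileges such as USE CATALOG and USE SCHEMA on the parent objects.

```python
def ancestors(full_name):
    """'main.sales.orders' -> ['main.sales.orders', 'main.sales', 'main']."""
    parts = full_name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def has_privilege(grants, principal, privilege, securable):
    """True if the privilege was granted on the object or any of its parents."""
    return any(
        (principal, privilege, name) in grants for name in ancestors(securable)
    )


# Hypothetical grant: SELECT on the schema main.sales for the group analysts.
grants = {("analysts", "SELECT", "main.sales")}

in_schema = has_privilege(grants, "analysts", "SELECT", "main.sales.orders")
other_schema = has_privilege(grants, "analysts", "SELECT", "main.hr.salaries")
print(in_schema, other_schema)
```

A grant on the schema covers every table inside it, while a table in a different schema stays inaccessible.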

Spark / PySpark Terms (15)

Apache Spark is the execution engine behind Databricks. It is mandatory knowledge for the Spark Developer exam, and the DEA and MLA exams also test the fundamentals.

Term	Name (Abbreviation / Alias)	Definition
Apache Spark	Apache Spark	Distributed processing engine for large-scale data. A unified, in-memory framework that supports both batch and streaming workloads.
SparkSession	SparkSession	The entry point of a Spark application. Available automatically as the spark variable in Databricks. Used to create DataFrames, run SQL, and manage configuration.
DataFrame	DataFrame	Distributed dataset composed of named columns. PySpark's primary data structure, supporting API operations such as select, filter, groupBy, and join.
Transformation	Transformation	Lazily evaluated operation (select, filter, groupBy, join, etc.). Doesn't execute until an Action is called. Categorized as Narrow or Wide.
Action	Action	Operation that triggers computation (show, count, collect, write). Calling an Action causes all upstream Transformations to execute.
Partition	Partition	Unit of data partitioning. Spark parallelizes processing at the partition level. Use repartition() and coalesce() to change the partition count.
Shuffle	Shuffle	Redistributes data across workers. A common bottleneck triggered by Wide Transformations such as groupBy, join, and repartition.
Catalyst Optimizer	Catalyst Optimizer	Spark SQL's query optimizer. Works in four phases: analysis (resolving names and types), logical optimization, physical planning, and code generation.
Adaptive Query Execution	Adaptive Query Execution (AQE)	Spark 3.0+ optimization that uses runtime statistics to dynamically switch join strategies, coalesce partitions, and split skewed partitions.
Broadcast Join	Broadcast Join	Join strategy that copies a small table to every Executor so the join can be performed without a shuffle. Can be requested explicitly via the broadcast() hint.
Cache (persist)	Cache (persist)	Feature that caches a DataFrame in memory or on disk for reuse. For DataFrames, cache() uses the default storage level (memory first, spilling to disk); persist() lets you choose the storage level explicitly.
Spark SQL	Spark SQL	Module for manipulating Spark data using SQL syntax. spark.sql() runs a query and returns the result as a DataFrame.
Window Functions	Window Functions	Functions that perform ranking, moving aggregates, and cumulative calculations within partitions. ROW_NUMBER, RANK, LAG, and LEAD are common examples.
UDF	UDF (User Defined Function)	User-defined function. Plain Python UDFs carry high serialization overhead, so Pandas UDFs (vectorized UDFs) are recommended where possible.
Pandas API on Spark	Pandas API on Spark	Compatibility layer (import via pyspark.pandas) that runs essentially unmodified Pandas code in a distributed fashion. A different approach than Pandas UDFs.
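
The Transformation and Action rows describe lazy evaluation. The following plain-Python analogy (not Spark's API; the LazyPipeline class is invented for this sketch) records transformations as a plan and only runs them when collect() is called.

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: stores steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: returns a new plan, executes nothing.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def select(self, fn):
        # Transformation: returns a new plan, executes nothing.
        return LazyPipeline(self._rows, self._steps + (("map", fn),))

    def collect(self):
        # Action: runs every recorded step in order.
        out = list(self._rows)
        for kind, fn in self._steps:
            if kind == "filter":
                out = [row for row in out if fn(row)]
            else:
                out = [fn(row) for row in out]
        return out


calls = []


def is_even(x):
    calls.append(x)  # records when the predicate actually runs
    return x % 2 == 0


plan = LazyPipeline(range(6)).filter(is_even).select(lambda x: x * 10)
calls_before_action = list(calls)  # still empty: nothing has executed
result = plan.collect()
print(calls_before_action, result)
```

The predicate is never called while the plan is being built; it runs only once collect() triggers execution, which mirrors how Spark defers work until an Action.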

ML / MLflow Terms (15)

Machine learning and MLflow vocabulary that appears frequently on the ML Associate and ML Professional exams.

Term	Name (Abbreviation / Alias)	Definition
MLflow	MLflow	Open-source platform for managing the ML lifecycle. Classically described as four components: Tracking, Projects, Models, and Model Registry.
Experiment	Experiment	Logical container that groups related Runs. Typically one Experiment per project, used to compare different approaches.
Run	Run	A record of a single training execution. Captures parameters, metrics, artifacts, and tags. Started with mlflow.start_run().
Model Registry	Model Registry	Versioned registry for trained models. The Unity Catalog-integrated version uses Champion/Challenger aliases.
Autolog	Autolog	Automatic logging enabled via mlflow.autolog(). Supports scikit-learn, TensorFlow, PyTorch, and other libraries. On Databricks it is enabled by default for interactive notebooks on supported ML runtimes.
Feature Store	Feature Store	Repository for managing and sharing ML features. Managed as Feature Tables under Unity Catalog. Supports both offline and online access.
AutoML	AutoML	Automatically performs preprocessing, feature engineering, model selection, and tuning given just the input data. Results are logged to MLflow.
Hyperopt	Hyperopt	Library for efficient hyperparameter search using Bayesian optimization (TPE). Can run distributed via SparkTrials.
Spark MLlib	Spark MLlib	Spark's distributed ML library. Builds ML workflows around the Pipeline, Transformer, and Estimator abstractions.
Pipeline (ML Pipeline)	Pipeline (ML Pipeline)	Workflow that chains Transformers and Estimators together. pipeline.fit() fits the stages in order and returns a single fitted model, so the same preprocessing is applied at training and inference time.
Model Signature	Model Signature	Defines the input/output schema of an MLflow model. Can be inferred automatically with infer_signature(). Used by Model Serving for input validation.
Model Flavor	Model Flavor	MLflow's model storage format. Examples include sklearn, tensorflow, pytorch, and pyfunc. pyfunc is the generic flavor.
Lakehouse Monitoring	Lakehouse Monitoring	Feature that automatically detects drift in table statistics and degradation in ML model prediction performance, then sends alerts.
Model Serving	Model Serving	Deploys registered models as REST API endpoints. Scales serverlessly and supports A/B testing.
Pandas UDF	Pandas UDF	Vectorized UDF built on Apache Arrow. Has lower serialization overhead than a regular UDF and runs faster.
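
The Champion/Challenger aliases mentioned in the Model Registry row can be sketched with an in-memory toy. ToyRegistry and the model name main.ml.churn_model are hypothetical and are not part of MLflow; only the models:/<name>@<alias> URI shape reflects how aliased models are referenced.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry that supports aliases."""

    def __init__(self):
        self.latest = {}   # model name -> highest registered version
        self.aliases = {}  # (model name, alias) -> version

    def register(self, name):
        self.latest[name] = self.latest.get(name, 0) + 1
        return self.latest[name]

    def set_alias(self, name, alias, version):
        if not 1 <= version <= self.latest.get(name, 0):
            raise ValueError(f"{name} has no version {version}")
        self.aliases[(name, alias)] = version

    def resolve(self, name, alias):
        return self.aliases[(name, alias)]


def model_uri(name, alias):
    """URI that consumers would load, independent of the version number."""
    return f"models:/{name}@{alias}"


registry = ToyRegistry()
name = "main.ml.churn_model"  # hypothetical three-level model name
v1 = registry.register(name)
v2 = registry.register(name)
registry.set_alias(name, "champion", v1)
registry.set_alias(name, "challenger", v2)

before = registry.resolve(name, "champion")
registry.set_alias(name, "champion", v2)  # promote the challenger
after = registry.resolve(name, "champion")
print(model_uri(name, "champion"), before, "->", after)
```

Because consumers reference the alias rather than a version number, promoting a new model only requires moving the alias.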

Compute / Cluster Terms (10)

Vocabulary for Databricks compute resources. Cluster types, configuration, and cost management appear on every exam.

Term	Name (Abbreviation / Alias)	Definition
Cluster	Cluster	Unit of Spark compute. Composed of a driver node plus worker nodes. Supports autoscaling and auto-termination.
Driver Node	Driver Node	Node that hosts the SparkSession and plans/coordinates jobs. The result of collect() is held in the driver's memory.
Worker Node	Worker Node	Node that runs Executors (the execution processes), parallelizing data processing at the task level. Cluster throughput scales with the worker count.
All-purpose Cluster	All-purpose Cluster	Cluster for interactive notebook development. Can be shared by multiple users. Billed as DBU plus underlying infrastructure cost.
Job Cluster	Job Cluster	Cluster created automatically when a Workflows job runs and deleted automatically when the job ends. Billed at a lower DBU rate than an all-purpose cluster; the exact saving depends on cloud, plan, and region.
SQL Warehouse	SQL Warehouse	Compute resource for Databricks SQL. Comes in three tiers: Serverless, Pro, and Classic. Serverless has the fastest startup and is recommended.
Serverless Compute	Serverless Compute	Compute model where Databricks fully manages the infrastructure. Startup, scaling, and patching are all automated.
Cluster Policy	Cluster Policy	Rules an administrator uses to restrict and standardize cluster configuration. Controls instance type, max worker count, and runtime version.
Shared Cluster	Shared Cluster	Cluster that can be used concurrently by multiple users. Some features (such as dbutils.credentials) are restricted.
Single User Cluster	Single User Cluster	Cluster dedicated to a single user. All features are available. Unity Catalog Table ACLs work on both shared and single-user clusters.
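
As an illustration of the Cluster Policy row, the toy checker below validates a cluster configuration against a few rules. The rule shapes (fixed, allowlist, range) are loosely modeled on the policy definition JSON and should be treated as assumptions; check_cluster is a hypothetical helper, since real enforcement happens inside Databricks, and the instance types are example values.

```python
def check_cluster(policy, config):
    """Return a list of violations of `policy` found in `config`."""
    violations = []
    for attr, rule in policy.items():
        value = config.get(attr)
        kind = rule["type"]
        if kind == "fixed" and value != rule["value"]:
            violations.append(f"{attr} must be {rule['value']}")
        elif kind == "allowlist" and value not in rule["values"]:
            violations.append(f"{attr} must be one of {rule['values']}")
        elif kind == "range" and (value is None or value > rule["maxValue"]):
            violations.append(f"{attr} must be <= {rule['maxValue']}")
    return violations


# Hypothetical policy: restrict instance type, worker count, auto-termination.
policy = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}

ok_config = {
    "node_type_id": "i3.xlarge",
    "autoscale.max_workers": 4,
    "autotermination_minutes": 30,
}
too_big = {**ok_config, "autoscale.max_workers": 20}

ok_violations = check_cluster(policy, ok_config)
big_violations = check_cluster(policy, too_big)
print(ok_violations, big_violations)
```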

ETL / Pipeline Terms (15)

Pipeline construction and operations vocabulary that appears frequently on the DEA and DEP exams.

Term	Name (Abbreviation / Alias)	Definition
Workflows	Databricks Workflows	Orchestration service that defines tasks as a DAG (directed acyclic graph) and handles scheduling, dependency management, and error handling.
Auto Loader	Auto Loader (cloudFiles)	Structured Streaming source that automatically detects and incrementally processes new files in cloud storage. Supports schema inference and evolution.
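Structured Streaming	Structured Streaming	Spark engine for stream processing using the same DataFrame API as batch. Reads and writes via readStream/writeStream. Provides end-to-end exactly-once guarantees when combined with checkpointing, a replayable source, and an idempotent sink such as Delta Lake.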
ETL / ELT	ETL / ELT	ETL transforms before loading; ELT loads first and transforms after. Databricks, as a data lakehouse, recommends the ELT pattern.
Checkpoint	Checkpoint	Mechanism that records Structured Streaming progress. Required for fault recovery and exactly-once guarantees. Specified via checkpointLocation.
Trigger	Trigger	Controls when a streaming query fires. Options include processingTime (periodic micro-batches) and availableNow (process all available data, then stop; recommended for scheduled incremental jobs).
Slowly Changing Dimension	SCD (Slowly Changing Dimension)	Pattern for managing change history in dimension tables. Type 1 overwrites; Type 2 retains history. Implemented in DLT via APPLY CHANGES INTO.
COPY INTO	COPY INTO	Idempotent SQL statement that loads data from cloud storage into Delta Lake. Re-loading the same file produces no duplicates.
Data Skew	Data Skew	Condition where data concentrates on specific key values, causing uneven work across partitions. Mitigated by AQE skew splitting or salting keys.
Idempotency	Idempotency	Property that running the same operation any number of times yields the same result. Necessary to prevent duplicate data on pipeline re-runs.
Multi-task Job	Multi-task Job	Workflows configuration that runs multiple tasks with dependencies. Supports inter-task parameter passing (task values), conditional branching, and retries.
Asset Bundle	Databricks Asset Bundle (DAB)	CI/CD tool that packages code, configuration, and resources via YAML. Automates deployments across environments.
Repos	Repos (Git Integration)	Feature for operating on Git repositories directly inside Databricks; now called Git folders in the workspace UI. Supports GitHub, GitLab, and Bitbucket. Used for notebook version control.
Secret	Secret	Sensitive values such as API keys and passwords stored securely in a Secret Scope. Retrieved with dbutils.secrets.get(scope, key).
dbutils	dbutils	Suite of notebook utilities providing file operations (fs), secrets, widgets, and notebook control (notebook).
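
The COPY INTO and Idempotency rows both rest on one idea: remember what has already been loaded. The sketch below imitates that behavior in plain Python. The copy_into function, the file names, and the rows are invented for illustration and do not reflect how Databricks implements file tracking.

```python
def copy_into(target, loaded_files, incoming):
    """Append rows from files not seen before; return how many rows were added.

    target: list of rows (the toy table)
    loaded_files: set of file names that were already ingested
    incoming: dict mapping file name -> list of rows
    """
    added = 0
    for name in sorted(incoming):
        if name in loaded_files:
            continue  # already ingested: skipping is what makes re-runs safe
        target.extend(incoming[name])
        loaded_files.add(name)
        added += len(incoming[name])
    return added


table, seen = [], set()
batch = {
    "2026-01-01.csv": [{"id": 1}, {"id": 2}],
    "2026-01-02.csv": [{"id": 3}],
}

first_run = copy_into(table, seen, batch)
second_run = copy_into(table, seen, batch)  # same files again
print(first_run, second_run, len(table))
```

Running the load a second time with the same files adds nothing, so a retried pipeline run does not duplicate data.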

GenAI Terms (10)

Generative AI vocabulary tested on the GenAI Engineer Associate exam and the ML Professional exam.

Term	Name (Abbreviation / Alias)	Definition
Retrieval-Augmented Generation	RAG (Retrieval-Augmented Generation)	Technique that retrieves information from an external knowledge base to augment an LLM's responses. Reduces hallucinations and brings in up-to-date information.
Vector Search	Vector Search	Databricks' managed vector database that vectorizes text and performs similarity search. Used as the retriever in RAG.
Embedding	Embedding	Process of converting text, images, and similar inputs into fixed-length numeric vectors. Semantically similar data maps to nearby vectors.
Foundation Model API	Foundation Model API	Calls external LLMs (GPT-4, Claude, etc.) and open-source models (Llama, DBRX, etc.) through a unified API. Accessed via Model Serving.
Prompt Engineering	Prompt Engineering	Discipline of optimizing prompts to elicit the desired output from an LLM. Techniques include Zero-shot, Few-shot, and Chain-of-Thought prompting.
Fine-tuning	Fine-tuning	Technique of further training a pretrained LLM for a specific domain. Parameter-efficient methods such as LoRA and QLoRA are in scope for the exam.
DBRX	DBRX	Open-source LLM developed by Databricks. Uses a Mixture of Experts (MoE) architecture to combine high performance with efficient inference.
LLM Chain	LLM Chain	Pattern that chains multiple LLM calls and tool invocations together. Implemented with frameworks like LangChain and logged/traced via MLflow.
Guardrails	Guardrails	Safety mechanism that controls and filters LLM output. Harmful-content blocking and output-format constraints can be built into Model Serving.
MLflow Tracing	MLflow Tracing	Feature that traces and visualizes the execution flow of an LLM application. Records per-step latency and inputs/outputs in a RAG pipeline for debugging.
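
The Embedding, Vector Search, and RAG rows fit together as "embed, then find the nearest vectors". Here is a minimal sketch of the retrieval step using cosine similarity. The three-dimensional vectors are hand-written placeholders rather than output from a real embedding model, and top_k is a hypothetical helper, not the Databricks Vector Search API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the k document names whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]


# Placeholder "embeddings" for three glossary entries.
index = {
    "vacuum": [1.0, 0.0, 0.0],
    "optimize": [0.0, 1.0, 0.0],
    "time travel": [0.7, 0.0, 0.7],
}

query = [0.9, 0.1, 0.0]  # placeholder embedding of a user question
hits = top_k(query, index, 2)
print(hits)
```

In a RAG application the text behind the retrieved entries would be placed into the prompt before calling the LLM.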

Check Your Understanding

Question 1

Which statement about the Delta Lake VACUUM command is correct?

A. It reduces the size of the transaction log and improves metadata read performance
B. It physically deletes obsolete data files older than 7 days (168 hours) by default, after which Time Travel into those versions is no longer possible
C. It recomputes table statistics and improves the accuracy of Catalyst Optimizer optimizations
D. It compacts small files into larger ones to optimize read performance

Correct answer: B

VACUUM physically deletes data files that a Delta Lake table no longer references. By default, files older than 7 days (168 hours) are deleted, and after VACUUM you can no longer Time Travel (VERSION AS OF / TIMESTAMP AS OF) into versions that depend on those files. Option A is wrong: VACUUM does not manage the transaction log. Log reads are kept fast by the checkpoint mechanism (a Parquet file written every 10 commits by default), and old log entries are cleaned up automatically according to the log retention setting. Option C describes ANALYZE TABLE ... COMPUTE STATISTICS. Option D describes OPTIMIZE (file compaction). The distinction between VACUUM and OPTIMIZE is a frequent exam topic, so make sure you can tell them apart.
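
The interaction between VACUUM and Time Travel described in this explanation can be traced with a small simulation. It is a toy model under simplifying assumptions: the file names, the dates, and the versions and removed_at structures are invented, and real Delta Lake tracks this information in the transaction log.

```python
from datetime import datetime, timedelta

# Version 0 consisted of part-a; version 1 rewrote it as part-b.
versions = {0: {"part-a.parquet"}, 1: {"part-b.parquet"}}
# part-a stopped being referenced when version 1 was committed.
removed_at = {"part-a.parquet": datetime(2026, 1, 1)}
storage = {"part-a.parquet", "part-b.parquet"}


def vacuum(storage, removed_at, now, retain_hours=168):
    """Return the files left after deleting those unreferenced past the cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    expired = {name for name, ts in removed_at.items() if ts < cutoff}
    return storage - expired


def can_time_travel(version, versions, storage):
    """A version is readable only while all of its files still exist."""
    return versions[version] <= storage


after_4_days = vacuum(storage, removed_at, datetime(2026, 1, 5))
after_9_days = vacuum(storage, removed_at, datetime(2026, 1, 10))

print(can_time_travel(0, versions, after_4_days))  # inside the 7-day window
print(can_time_travel(0, versions, after_9_days))  # file has been deleted
print(can_time_travel(1, versions, after_9_days))  # current version unaffected
```

Four days after the rewrite the old file is still inside the retention window, so version 0 remains readable; nine days after, VACUUM removes it and only version 1 can be queried.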

Frequently Asked Questions

In what order should I learn these terms?

Priority depends on which exam you're taking. For Data Engineer Associate (DEA), prioritize Delta Lake, Unity Catalog, and pipeline-related terms. For ML Associate, ML and MLflow terms matter most. The efficient approach is to first check the Exam Guide for your target exam, then start learning terms from the highest-weighted domains.

Should I memorize technical terms in English or in my native language?

Even if the exam is available in your language, we recommend learning the terms in English. Official documentation and error messages are in English, and most real-world work uses the English terms. Use translations as a comprehension aid, but memorize proper nouns like "Delta Lake", "Unity Catalog", and "MLflow" in English.

Can I pass the exam by memorizing the glossary alone?

Memorizing definitions alone is not enough to pass. Databricks exams focus on conceptual understanding and applied problem-solving, and many questions cannot be answered just by recalling definitions. Use the glossary as a foundation, then build on it by reading the official documentation and drilling with practice questions; that combination is the fastest route to passing.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
