
Databricks Glossary: 100 Essential Exam Terms

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Databricks certification exams frequently test specialized vocabulary around Delta Lake, Unity Catalog, MLflow, and Spark. A precise grasp of these terms is the foundation for passing. This article organizes 100 carefully selected exam-critical terms into 7 category tables. Use it as your study dictionary.

Delta Lake Terms (20)

Delta Lake is Databricks' storage layer and the single most heavily tested topic across all exams. Lock down the vocabulary around transaction management, data quality, and performance optimization.

| Term | Definition |
| --- | --- |
| Delta Lake | Open-source storage layer that adds ACID transactions, schema management, and Time Travel on top of Parquet files. The default table format on Databricks. |
| Transaction Log (_delta_log) | The core of Delta Lake; records every change to a table as JSON. A Parquet checkpoint is automatically created every 10 commits by default. |
| Time Travel | Feature that lets you query or restore past table states using VERSION AS OF or TIMESTAMP AS OF. Log history is retained for 30 days by default, but the data files needed to read an old version can be removed by VACUUM once they are older than 7 days. |
| OPTIMIZE | Command that compacts many small files into larger ones. Often combined with Z-ORDER to improve query performance. |
| Z-ORDER | Data layout optimization that co-locates rows with similar column values into the same files, improving file-skipping rates for filter queries. |
| Liquid Clustering | Evolution of Z-ORDER. Clustering keys are specified via a CLUSTER BY clause. Clustering is applied incrementally when OPTIMIZE runs, which can be automated with Predictive Optimization, so no manual Z-ORDER tuning is needed. |
| VACUUM | Command that physically deletes obsolete data files. Removes files older than 7 days by default; Time Travel to versions that depend on the deleted files becomes unavailable after running it. |
| Schema Evolution | Feature that automatically adds new columns when mergeSchema=true is set. Existing rows receive null for the new columns. |
| Schema Enforcement | Feature that rejects writes that don't match the table schema. Enabled by default in Delta Lake. |
| MERGE INTO | Upsert operation that matches source and target on a join condition, then updates/deletes matched rows and inserts unmatched ones. Also used to implement SCDs. |
| Change Data Feed (CDF) | Feature that exposes INSERT/UPDATE/DELETE operations on a table as a stream of change records. Used to build incremental processing pipelines. |
| Medallion Architecture | Design pattern that progressively improves data quality across three layers: Bronze (raw), Silver (cleansed), and Gold (aggregated and analytics-ready). |
| Delta Live Tables (DLT) | Declarative pipeline definition framework. Transformations are defined with the @dlt.table decorator, while dependency resolution and quality checks are automated. |
| Expectations | DLT data-quality constraints. Three variants: @dlt.expect (warn), @dlt.expect_or_drop (drop bad rows), and @dlt.expect_or_fail (fail the pipeline). |
| Photon | Vectorized query engine implemented in C++. Delivers up to 12x performance improvement over the standard Spark engine. |
| Delta Clone | Feature for cloning a table. Two variants: SHALLOW CLONE (metadata only) and DEEP CLONE (full copy of metadata and data). |
| Delta Constraints | Guarantees data quality via CHECK constraints (reject rows that don't satisfy a predicate) and NOT NULL constraints (reject null values). |
| Materialized View | A view whose query results are physically persisted. Reads return the precomputed result instead of recomputing on access, so they are fast. |
| Streaming Table | An append-only table defined in DLT. Processed incrementally via spark.readStream. |
| Predictive Optimization | Feature that learns a table's usage patterns and automatically runs OPTIMIZE, VACUUM, and statistics collection at the optimal time. |
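
The interplay between the transaction log, Time Travel, and VACUUM described in the table above can be sketched with a toy model. This is a minimal illustration in plain Python, not Delta Lake's implementation: `ToyDeltaTable`, its methods, and the file names are hypothetical, and the retention window is ignored for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy model of a versioned table; NOT the real Delta Lake implementation."""

    # One log entry per version: {"add": set of files, "remove": set of files}
    commits: list = field(default_factory=list)
    # Files that a vacuum has physically deleted
    deleted_from_disk: set = field(default_factory=set)

    def commit(self, add=(), remove=()):
        """Append a change to the log and return the new version number."""
        self.commits.append({"add": set(add), "remove": set(remove)})
        return len(self.commits) - 1

    def files_as_of(self, version):
        """Replay the log up to `version` (the idea behind VERSION AS OF)."""
        live = set()
        for entry in self.commits[: version + 1]:
            live |= entry["add"]
            live -= entry["remove"]
        missing = live & self.deleted_from_disk
        if missing:
            raise FileNotFoundError(f"version {version} needs deleted files: {sorted(missing)}")
        return live

    def vacuum(self):
        """Delete files the latest version no longer references.

        Simplification (assumption): the default 7-day retention is ignored.
        """
        latest = self.files_as_of(len(self.commits) - 1)
        ever_added = set().union(*(c["add"] for c in self.commits))
        self.deleted_from_disk |= ever_added - latest


table = ToyDeltaTable()
v0 = table.commit(add={"part-0.parquet"})
v1 = table.commit(add={"part-1.parquet"}, remove={"part-0.parquet"})
print(table.files_as_of(v0))  # reading version 0 works before vacuum
table.vacuum()                # after this, files_as_of(v0) raises FileNotFoundError
```

The point to take away for the exam is that the log still lists version 0 after the vacuum, but the files that version points to are gone, so the read fails.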

Unity Catalog Terms (15)

Unity Catalog is Databricks' unified governance layer and shows up frequently on the Data Engineer Associate (DEA) and Data Engineer Professional (DEP) exams. Understand the three-level namespace, access control, and data lineage concepts.

| Term | Definition |
| --- | --- |
| Unity Catalog | Unified governance solution that centrally manages access control, auditing, lineage, and discovery across data and AI assets. |
| Metastore | The top-level container in Unity Catalog. One metastore is created per region; it manages the catalog → schema → table hierarchy. |
| Catalog | Top level of the three-level namespace (catalog.schema.table). Commonly used to separate production from development environments. |
| Schema (Database) | Second level of the three-level namespace. Logically groups tables, views, and functions. Synonymous with "database" in SQL. |
| External Location | A cloud storage path under Unity Catalog management. Access to S3 or ADLS is mediated through Unity Catalog. |
| Storage Credential | The IAM role or service principal used to access cloud storage. Bound to an external location. |
| Managed Table | Table whose data and metadata are both managed by Databricks. Running DROP TABLE deletes the underlying data as well. |
| External Table | Table whose metadata is managed by Unity Catalog but whose data lives in external storage. DROP TABLE leaves the data intact. |
| Data Lineage | Feature that automatically tracks and visualizes data origin and transformation history. Records inter-table dependencies for impact analysis. |
| GRANT / REVOKE | GRANT gives a principal a privilege on an object and REVOKE removes it. For example, GRANT SELECT ON TABLE grants read access to a table. Privileges are inherited down the catalog → schema → table hierarchy. |
| Dynamic View | View that uses current_user() or is_member() to filter rows and columns per user, enabling row- and column-level access control. |
| Volume | Non-tabular file storage governed by Unity Catalog. Holds images, PDFs, CSVs, and so on. Comes in managed and external variants. |
| Row Filter / Column Mask | Access control where Row Filter limits visible rows and Column Mask masks column values (e.g., replacing part of an email with ****). |
| Information Schema | System schema that exposes metadata such as tables, columns, and privileges within a catalog so it can be queried via SQL. |
| Delta Sharing | Open protocol for securely sharing data across organizations without physical copies. Accessible even from non-Databricks environments. |
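
Privilege inheritance down the catalog → schema → table hierarchy is easier to remember with a small model. The sketch below is plain Python and purely illustrative: `ToyGrants`, `ancestors`, and the `prod.sales.orders` names are hypothetical, and it ignores everything else a real check involves, such as the USE CATALOG and USE SCHEMA privileges, ownership, and group membership.

```python
def ancestors(securable):
    """Yield the object and each parent: 'c.s.t' -> 'c.s.t', 'c.s', 'c'."""
    parts = securable.split(".")
    for n in range(len(parts), 0, -1):
        yield ".".join(parts[:n])


class ToyGrants:
    """Toy privilege store; a sketch of inheritance, not Unity Catalog itself."""

    def __init__(self):
        self._grants = set()  # (principal, privilege, securable) triples

    def grant(self, principal, privilege, securable):
        self._grants.add((principal, privilege, securable))

    def revoke(self, principal, privilege, securable):
        self._grants.discard((principal, privilege, securable))

    def is_allowed(self, principal, privilege, securable):
        """True if the privilege was granted on the object or any ancestor."""
        return any(
            (principal, privilege, level) in self._grants
            for level in ancestors(securable)
        )


grants = ToyGrants()
grants.grant("analysts", "SELECT", "prod.sales")  # granted at schema level
print(grants.is_allowed("analysts", "SELECT", "prod.sales.orders"))  # inherited by the table
print(grants.is_allowed("analysts", "SELECT", "prod.hr.salaries"))   # different schema
```

A grant at the schema level covers every table in that schema, while a table in another schema of the same catalog stays inaccessible until it, its schema, or the catalog receives its own grant.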

Spark / PySpark Terms (15)

Apache Spark is the execution engine behind Databricks. It is mandatory knowledge for the Spark Developer exam, and the DEA and ML Associate (MLA) exams also test the fundamentals.

| Term | Definition |
| --- | --- |
| Apache Spark | Distributed processing engine for large-scale data. A unified, in-memory framework that supports both batch and streaming workloads. |
| SparkSession | The entry point of a Spark application. Available automatically as the spark variable in Databricks. Used to create DataFrames, run SQL, and manage configuration. |
| DataFrame | Distributed dataset composed of named columns. PySpark's primary data structure, supporting API operations such as select, filter, groupBy, and join. |
| Transformation | Lazily evaluated operation (select, filter, groupBy, join, etc.). Doesn't execute until an Action is called. Categorized as Narrow or Wide. |
| Action | Operation that triggers computation (show, count, collect, write). Calling an Action causes all upstream Transformations to execute. |
| Partition | Chunk into which a dataset is split. Spark parallelizes processing at the partition level. Use repartition() and coalesce() to change the partition count. |
| Shuffle | Redistributes data across workers. A common bottleneck triggered by Wide Transformations such as groupBy, join, and repartition. |
| Catalyst Optimizer | Unified engine that optimizes queries through four stages: analysis, logical optimization, physical planning, and code generation. |
| Adaptive Query Execution (AQE) | Spark 3.0+ optimization that uses runtime statistics to dynamically switch join strategies, coalesce partitions, and split skewed partitions. |
| Broadcast Join | Join strategy that copies a small table to every Executor so the join can be performed without a shuffle. Can be requested explicitly via the broadcast() hint. |
| Cache (persist) | Feature that caches a DataFrame in memory or on disk for reuse. For DataFrames, cache() uses the default storage level (memory, spilling to disk); persist() lets you choose the storage level. |
| Spark SQL | Module for manipulating Spark data using SQL syntax. spark.sql() runs a query and returns the result as a DataFrame. |
| Window Functions | Functions that perform ranking, moving aggregates, and cumulative calculations within partitions. ROW_NUMBER, RANK, LAG, and LEAD are common examples. |
| UDF (User Defined Function) | User-defined function. Plain Python UDFs carry high serialization overhead, so Pandas UDFs (vectorized UDFs) are recommended where possible. |
| Pandas API on Spark | Compatibility layer (import via pyspark.pandas) that runs essentially unmodified Pandas code in a distributed fashion. A different approach than Pandas UDFs. |
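
The Transformation/Action split in the table above is the concept exam questions probe most often, and a toy model makes the laziness visible. The sketch below is plain Python rather than PySpark: `LazyDataset` is a hypothetical stand-in for a DataFrame, and it only imitates the recording-then-running behavior, not distribution or optimization.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, in order

    # Transformations: return a new dataset with a longer plan; nothing runs.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def select(self, mapper):
        return LazyDataset(self._rows, self._plan + (("select", mapper),))

    # Actions: execute the whole recorded plan and return a concrete result.
    def collect(self):
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []


def is_even(x):
    calls.append(x)  # record every time the predicate actually runs
    return x % 2 == 0


ds = LazyDataset([1, 2, 3, 4]).filter(is_even).select(lambda x: x * 10)
print(calls)         # [] because no action has been called yet
print(ds.collect())  # [20, 40]; the action runs filter and select together
```

Until `collect()` is called the predicate never runs, which mirrors why a chain of Spark transformations returns immediately and the cost only shows up at the first action.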

ML / MLflow Terms (15)

Machine learning and MLflow vocabulary that appears frequently on the ML Associate and ML Professional exams.

| Term | Definition |
| --- | --- |
| MLflow | Open-source platform for managing the ML lifecycle. Its classic components are Tracking, Projects, Models, and Model Registry; on Databricks, registered models are deployed through Model Serving. |
| Experiment | Logical container that groups related Runs. Typically one Experiment per project, used to compare different approaches. |
| Run | A record of a single training execution. Captures parameters, metrics, artifacts, and tags. Started with mlflow.start_run(). |
| Model Registry | Versioned registry for trained models. The Unity Catalog-integrated version uses Champion/Challenger aliases. |
| Autolog | Automatic logging enabled via mlflow.autolog(). Supports scikit-learn, TensorFlow, and PyTorch. Enabled by default on Databricks. |
| Feature Store | Repository for managing and sharing ML features. Managed as Feature Tables under Unity Catalog. Supports both offline and online access. |
| AutoML | Automatically performs preprocessing, feature engineering, model selection, and tuning given just the input data. Results are logged to MLflow. |
| Hyperopt | Library for efficient hyperparameter search using Bayesian optimization (TPE). Can run distributed via SparkTrials. |
| Spark MLlib | Spark's distributed ML library. Builds ML workflows around the Pipeline, Transformer, and Estimator abstractions. |
| Pipeline (ML Pipeline) | Workflow that chains Transformers and Estimators together. pipeline.fit() runs everything at once, ensuring reproducibility and portability. |
| Model Signature | Defines the input/output schema of an MLflow model. Can be inferred automatically with infer_signature(). Used by Model Serving for input validation. |
| Model Flavor | MLflow's model storage format. Examples include sklearn, tensorflow, pytorch, and pyfunc. pyfunc is the generic flavor. |
| Lakehouse Monitoring | Feature that automatically detects drift in table statistics and degradation in ML model prediction performance, then sends alerts. |
| Model Serving | Deploys registered models as REST API endpoints. Scales serverlessly and supports A/B testing. |
| Pandas UDF | Vectorized UDF built on Apache Arrow. Has lower serialization overhead than a regular UDF and runs faster. |
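
The Pipeline, Transformer, and Estimator abstractions from the table above fit together in a specific way: an Estimator's fit() produces a Transformer, and fitting a Pipeline produces a model made only of Transformers. The following is a rough sketch of that pattern in plain Python. The class names (`MeanCenterer`, `Shift`, `Pipeline`, `PipelineModel`) are hypothetical and operate on lists of numbers, not on Spark DataFrames.

```python
class Shift:
    """Transformer: adds a fixed offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, values):
        return [v + self.offset for v in values]


class MeanCenterer:
    """Estimator: learns the mean from data and returns a fitted Transformer."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return Shift(-mean)


class PipelineModel:
    """Result of fitting a Pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class Pipeline:
    """Chains stages; fit() turns every Estimator into a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return PipelineModel(fitted)


model = Pipeline([Shift(1.0), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [2.0]: shifted to 5.0, then centered on the learned mean 3.0
```

Because the learned mean is stored inside the fitted model, the same preprocessing is replayed on new data, which is the reproducibility benefit the Pipeline entry refers to.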

Compute / Cluster Terms (10)

Vocabulary for Databricks compute resources. Cluster types, configuration, and cost management appear on every exam.

| Term | Definition |
| --- | --- |
| Cluster | Unit of Spark compute. Composed of a driver node plus worker nodes. Supports autoscaling and auto-termination. |
| Driver Node | Node that hosts the SparkSession and plans/coordinates jobs. The result of collect() is held in the driver's memory. |
| Worker Node | Node that runs Executors (the execution processes), parallelizing data processing at the task level. Cluster throughput scales with the worker count. |
| All-purpose Cluster | Cluster for interactive notebook development. Can be shared by multiple users. Billed as DBU plus underlying infrastructure cost. |
| Job Cluster | Cluster created automatically when a Workflows job runs and deleted automatically when the job ends. Roughly 30% cheaper than an all-purpose cluster. |
| SQL Warehouse | Compute resource for Databricks SQL. Comes in three tiers: Serverless, Pro, and Classic. Serverless has the fastest startup and is recommended. |
| Serverless Compute | Compute model where Databricks fully manages the infrastructure. Startup, scaling, and patching are all automated. |
| Cluster Policy | Rules an administrator uses to restrict and standardize cluster configuration. Controls instance type, max worker count, and runtime version. |
| Shared Cluster | Cluster that can be used concurrently by multiple users. Some features (such as dbutils.credentials) are restricted. |
| Single User Cluster | Cluster dedicated to a single user. All features are available. Unity Catalog Table ACLs work on both shared and single-user clusters. |

ETL / Pipeline Terms (15)

Pipeline construction and operations vocabulary that appears frequently on the DEA and DEP exams.

| Term | Definition |
| --- | --- |
| Databricks Workflows | Orchestration service that defines tasks as a DAG (directed acyclic graph) and handles scheduling, dependency management, and error handling. |
| Auto Loader (cloudFiles) | Structured Streaming source that automatically detects and incrementally processes new files in cloud storage. Supports schema inference and evolution. |
| Structured Streaming | Spark engine for stream processing using the same DataFrame API as batch. Reads and writes via readStream/writeStream. Provides exactly-once guarantees when combined with checkpointing and idempotent sinks. |
| ETL / ELT | ETL transforms before loading; ELT loads first and transforms after. Databricks, as a data lakehouse, recommends the ELT pattern. |
| Checkpoint | Mechanism that records Structured Streaming progress. Required for fault recovery and exactly-once guarantees. Specified via checkpointLocation. |
| Trigger | Controls when a streaming query fires. Options include processingTime (periodic) and availableNow (process all available data, then stop; recommended). |
| SCD (Slowly Changing Dimension) | Pattern for managing change history in dimension tables. Type 1 overwrites; Type 2 retains history. Implemented in DLT via APPLY CHANGES INTO. |
| COPY INTO | Idempotent SQL statement that loads data from cloud storage into Delta Lake. Re-loading the same file produces no duplicates. |
| Data Skew | Condition where data concentrates on specific key values, causing uneven work across partitions. Mitigated by AQE skew splitting or salting keys. |
| Idempotency | Property that running the same operation any number of times yields the same result. Necessary to prevent duplicate data on pipeline re-runs. |
| Multi-task Job | Workflows configuration that runs multiple tasks with dependencies. Supports inter-task parameter passing (task values), conditional branching, and retries. |
| Databricks Asset Bundle (DAB) | CI/CD tool that packages code, configuration, and resources via YAML. Automates deployments across environments. |
| Repos (Git Integration) | Feature for operating on Git repositories directly inside Databricks. Supports GitHub, GitLab, and Bitbucket. Used for notebook version control. |
| Secret | Sensitive values such as API keys and passwords stored securely in a Secret Scope. Retrieved with dbutils.secrets.get(scope, key). |
| dbutils | Suite of notebook utilities providing file operations (fs), secrets, widgets, and notebook control (notebook). |
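
Idempotency, which the COPY INTO, Checkpoint, and Idempotency entries above all rely on, comes down to remembering what has already been processed. The sketch below shows that idea in plain Python. `ToyIdempotentLoader` and the file names are hypothetical, and this is an illustration of the principle rather than how COPY INTO or Auto Loader is implemented.

```python
class ToyIdempotentLoader:
    """Toy loader that remembers which files it has already ingested."""

    def __init__(self):
        self.loaded_files = set()  # plays the role of load history / checkpoint state
        self.table = []            # rows of the target table

    def load(self, files):
        """Ingest rows from unseen files only; return how many files were new.

        `files` maps a file name to the list of rows it contains.
        """
        new_files = 0
        for name in sorted(files):
            if name in self.loaded_files:
                continue  # already ingested: skipping is what makes re-runs safe
            self.table.extend(files[name])
            self.loaded_files.add(name)
            new_files += 1
        return new_files


landing_zone = {
    "2026-03-01.csv": [("a", 1)],
    "2026-03-02.csv": [("b", 2)],
}

loader = ToyIdempotentLoader()
print(loader.load(landing_zone))  # 2 new files on the first run
print(loader.load(landing_zone))  # 0 on the re-run, so no duplicates
print(len(loader.table))          # 2 rows in total
```

A loader without the `loaded_files` set would append the same rows again on every retry, which is exactly the duplicate-data failure mode that exam questions about pipeline re-runs describe.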

GenAI Terms (10)

Generative AI vocabulary tested on the GenAI Engineer Associate exam and the ML Professional exam.

| Term | Definition |
| --- | --- |
| RAG (Retrieval-Augmented Generation) | Technique that retrieves information from an external knowledge base to augment an LLM's responses. Reduces hallucinations and brings in up-to-date information. |
| Vector Search | Databricks' managed vector database that vectorizes text and performs similarity search. Used as the retriever in RAG. |
| Embedding | Process of converting text, images, and similar inputs into fixed-length numeric vectors. Semantically similar data maps to nearby vectors. |
| Foundation Model API | Calls external LLMs (GPT-4, Claude, etc.) and open-source models (Llama, DBRX, etc.) through a unified API. Accessed via Model Serving. |
| Prompt Engineering | Discipline of optimizing prompts to elicit the desired output from an LLM. Techniques include Zero-shot, Few-shot, and Chain-of-Thought prompting. |
| Fine-tuning | Technique of further training a pretrained LLM for a specific domain. Parameter-efficient methods such as LoRA and QLoRA are in scope for the exam. |
| DBRX | Open-source LLM developed by Databricks. Uses a Mixture of Experts (MoE) architecture to combine high performance with efficient inference. |
| LLM Chain | Pattern that chains multiple LLM calls and tool invocations together. Implemented with frameworks like LangChain and logged/traced via MLflow. |
| Guardrails | Safety mechanism that controls and filters LLM output. Harmful-content blocking and output-format constraints can be built into Model Serving. |
| MLflow Tracing | Feature that traces and visualizes the execution flow of an LLM application. Records per-step latency and inputs/outputs in a RAG pipeline for debugging. |
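
The retrieval step of RAG, where embeddings and similarity search meet, can be sketched in a few lines. This is a minimal illustration in plain Python, not the Databricks Vector Search API: the function names are hypothetical, and the three-dimensional vectors are made up by hand, whereas real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query_vector, index[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, doc_ids, documents):
    """Augment the question with the retrieved text before calling an LLM."""
    context = "\n".join(documents[d] for d in doc_ids)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hand-made toy embeddings (assumption: a real system would compute these).
index = {
    "vacuum": [1.0, 0.0, 0.1],
    "optimize": [0.0, 1.0, 0.1],
    "mlflow": [0.0, 0.1, 1.0],
}
documents = {
    "vacuum": "VACUUM physically deletes obsolete data files.",
    "optimize": "OPTIMIZE compacts small files into larger ones.",
    "mlflow": "MLflow manages the ML lifecycle.",
}

query_vector = [0.9, 0.1, 0.0]  # pretend embedding of "How do I delete old files?"
top = retrieve(query_vector, index, k=1)
print(top)  # ['vacuum']
print(build_prompt("How do I delete old files?", top, documents))
```

Only the retrieved text is placed in the prompt, which is how RAG grounds the model's answer in the knowledge base instead of relying on what the model memorized during training.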

Check Your Understanding


Question 1

Which statement about the Delta Lake VACUUM command is correct?

  A. It reduces the size of the transaction log and improves metadata read performance
  B. It physically deletes obsolete data files older than 7 days (168 hours) by default, after which Time Travel into those versions is no longer possible
  C. It recomputes table statistics and improves the accuracy of Catalyst Optimizer optimizations
  D. It compacts small files into larger ones to optimize read performance

Correct answer: B

VACUUM physically deletes obsolete data files from a Delta Lake table. By default, files older than 7 days (168 hours) are deleted, and after VACUUM you can no longer Time Travel (VERSION AS OF / TIMESTAMP AS OF) into versions that depend on those files. Option A is wrong: VACUUM does not manage the transaction log; fast metadata reads come from the checkpoint mechanism (a Parquet file written every 10 commits by default). Option C describes ANALYZE TABLE ... COMPUTE STATISTICS. Option D describes OPTIMIZE (file compaction). The distinction between VACUUM and OPTIMIZE is a frequent exam topic, so make sure you can tell them apart.

Frequently Asked Questions

In what order should I learn these terms?

Priority depends on which exam you're taking. For Data Engineer Associate (DEA), prioritize Delta Lake, Unity Catalog, and pipeline-related terms. For ML Associate, ML and MLflow terms matter most. The efficient approach is to first check the Exam Guide for your target exam, then start learning terms from the highest-weighted domains.

Should I memorize technical terms in English or in my native language?

Even if the exam is available in your language, we recommend learning the terms in English. Official documentation and error messages are in English, and most real-world work uses the English terms. Use translations as a comprehension aid, but memorize proper nouns like "Delta Lake", "Unity Catalog", and "MLflow" in English.

Can I pass the exam by memorizing the glossary alone?

Memorizing definitions alone is not enough to pass. Databricks exams focus on conceptual understanding and applied problem-solving, and many questions cannot be answered just by recalling definitions. Use the glossary as a foundation, then build on it by reading the official documentation and drilling with practice questions — that's the fastest route to passing.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


© 2026 NicheeLab All rights reserved.