Snowpark Explained: Python/Java/Scala on Snowflake (2026)

Snowpark is a developer framework for processing data inside Snowflake programmatically from Python, Scala, and Java. DataFrame operations are translated to SQL and run on a Snowflake warehouse, so your data never has to leave Snowflake. Evaluation is lazy: no SQL is issued until you call an action such as collect() or show(), at which point the Snowflake Optimizer plans the resulting query.

SnowPro Specialty: Snowpark Exam Overview

Item	Details
Questions	65
Time limit	115 minutes
Exam fee	$225 USD
Passing score	750 / 1000
Prerequisite	SnowPro Core Certification
Validity	2 years (recertification available)

Exam Domains and Weights

Domain	Weight	Key topics
Snowflake Core Knowledge	~15%	Architecture, warehouses, caching, security
DataFrame API	~30%	Session, DataFrame operations, lazy eval, window functions
UDF / UDTF / Stored Procedure	~25%	Scalar UDF, Vectorized UDF, UDTF, Sproc, permissions
Data Engineering	~15%	File operations, Dynamic Tables, Tasks, Streams
ML Integration	~15%	Snowpark ML, Model Registry, Feature Store
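
To turn these weights into a rough study budget, the arithmetic can be sketched as follows. The question count and time limit are copied from the overview table above. The per-domain counts are only estimates, since the weights are approximate and the exam need not allocate questions in exact proportion.

```python
# Figures copied from the tables above; the weights are approximate ("~").
TOTAL_QUESTIONS = 65
TIME_LIMIT_MIN = 115

DOMAIN_WEIGHTS = {
    "Snowflake Core Knowledge": 0.15,
    "DataFrame API": 0.30,
    "UDF / UDTF / Stored Procedure": 0.25,
    "Data Engineering": 0.15,
    "ML Integration": 0.15,
}


def estimated_questions(weights, total):
    """Return the expected (fractional) number of questions per domain."""
    return {domain: weight * total for domain, weight in weights.items()}


def seconds_per_question(time_limit_min, total):
    """Average time available per question, in seconds."""
    return time_limit_min * 60 / total


estimates = estimated_questions(DOMAIN_WEIGHTS, TOTAL_QUESTIONS)
pace = seconds_per_question(TIME_LIMIT_MIN, TOTAL_QUESTIONS)

for domain, n in estimates.items():
    print(f"{domain}: about {n:.1f} questions")
print(f"Average pace: about {pace:.0f} seconds per question")
```

With these figures, the DataFrame API domain works out to roughly 19 or 20 questions, and the overall pace to a little under two minutes per question.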

Core DataFrame API Methods

Snowpark's DataFrame API lets you write data processing as Spark-style method chains. Transformations compile down to SQL, so the Snowflake Optimizer handles query optimization for you; only actions such as collect() actually run the query.

Method	Purpose	SQL equivalent
select()	Column selection	SELECT col1, col2
filter() / where()	Row filtering	WHERE condition
group_by().agg()	Group aggregation	GROUP BY + aggregate functions
join()	Table join	JOIN ... ON
with_column()	Add or transform column	SELECT ..., expr AS alias
sort() / order_by()	Sort	ORDER BY
collect()	Fetch results (action)	Execute query + return results
show()	Display results (action)	Execute query + console output
write.save_as_table()	Save table (action)	CREATE TABLE AS SELECT

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_, avg, count

# connection_params is a dict of connection settings (account, user, and so on) defined elsewhere
session = Session.builder.configs(connection_params).create()

# Lazy evaluation: no SQL is executed at this point
df = session.table("SALES")
result = (
    df.filter(col("SALE_DATE") >= "2026-01-01")
      .group_by("REGION", "PRODUCT_CATEGORY")
      .agg(
          sum_("AMOUNT").alias("TOTAL_SALES"),
          avg("AMOUNT").alias("AVG_SALE"),
          count("ORDER_ID").alias("ORDER_COUNT"),
      )
      .sort(col("TOTAL_SALES").desc())
)

# collect() issues the SQL and runs it on Snowflake
rows = result.collect()
for row in rows:
    print(f"{row['REGION']}: {row['TOTAL_SALES']}")
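
The lazy-evaluation behavior in the example above can be modeled with a small toy class. This is a minimal sketch in plain Python, not Snowpark's implementation: LazyFrame and its methods are hypothetical, and the SQL that Snowpark really generates is structured differently. It only illustrates that transformations accumulate state and that a statement is produced when an action is called.

```python
class LazyFrame:
    """Toy stand-in for a lazy DataFrame: records steps, builds SQL on demand."""

    def __init__(self, table, where=(), group=(), aggs=(), order=(), log=None):
        self.table = table
        self.where = tuple(where)
        self.group = tuple(group)
        self.aggs = tuple(aggs)
        self.order = tuple(order)
        # Shared list of SQL statements that were actually "sent"
        self.log = log if log is not None else []

    def _copy(self, **changes):
        state = dict(table=self.table, where=self.where, group=self.group,
                     aggs=self.aggs, order=self.order, log=self.log)
        state.update(changes)
        return LazyFrame(**state)

    # Transformations: return a new frame and send nothing
    def filter(self, condition):
        return self._copy(where=self.where + (condition,))

    def group_by(self, *cols):
        return self._copy(group=cols)

    def agg(self, *exprs):
        return self._copy(aggs=exprs)

    def sort(self, *exprs):
        return self._copy(order=exprs)

    def to_sql(self):
        select_list = ", ".join(self.group + self.aggs) or "*"
        sql = f"SELECT {select_list} FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        if self.group:
            sql += " GROUP BY " + ", ".join(self.group)
        if self.order:
            sql += " ORDER BY " + ", ".join(self.order)
        return sql

    # Action: the only place a statement is "sent"
    def collect(self):
        sql = self.to_sql()
        self.log.append(sql)
        return sql


sent = []
df = LazyFrame("SALES", log=sent)
result = (
    df.filter("SALE_DATE >= '2026-01-01'")
      .group_by("REGION")
      .agg("SUM(AMOUNT) AS TOTAL_SALES")
      .sort("TOTAL_SALES DESC")
)
statements_before_action = len(sent)   # still zero: nothing has been sent
generated_sql = result.collect()       # the action produces the statement
print(generated_sql)
```

The chain maps onto SQL clauses in the same way as the method table: filter becomes WHERE, group_by and agg become GROUP BY with aggregates, and sort becomes ORDER BY.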

UDF / UDTF / Stored Procedure Comparison

Aspect	UDF	UDTF	Stored Procedure
Invocation	Inside SELECT (scalar function)	FROM clause (TABLE(func()))	CALL statement
Return value	Scalar value (one value per row)	Table (multiple rows)	Arbitrary (side effects are the main goal)
Side effects	Not allowed (pure function)	Not allowed (pure function)	Allowed (DDL/DML supported)
Permission model	Caller Rights	Caller Rights	Caller Rights / Owner Rights
Languages	Python / Scala / Java / SQL / JavaScript	Python / Scala / Java / SQL / JavaScript	Python / Scala / Java / SQL / JavaScript

UDF Definition Example (Python)

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, IntegerType

# Inline UDF
@udf(name="calculate_tax", return_type=IntegerType(),
     input_types=[IntegerType()], replace=True, is_permanent=False)
def calculate_tax(amount: int) -> int:
    return int(amount * 0.1)

# SQL call: SELECT calculate_tax(AMOUNT) FROM ORDERS;

# --- UDTF definition example ---
from snowflake.snowpark.functions import udtf
from snowflake.snowpark.types import StructType, StructField

class SplitTags:
    def process(self, tags: str):
        for tag in tags.split(","):
            yield (tag.strip(),)

session.udtf.register(
    SplitTags,
    output_schema=StructType([StructField("TAG", StringType())]),
    input_types=[StringType()],
    name="split_tags",
    replace=True,
)
# SQL call: SELECT * FROM TABLE(split_tags('ml,ai,data'));
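
Because a UDTF handler is an ordinary Python class, its process method can be exercised locally before it is registered. The sketch below repeats the SplitTags handler from above and adds a hypothetical helper, explode_rows, that imitates how a table function emits zero or more output rows per input row. It does not call Snowflake.

```python
class SplitTags:
    """Same handler logic as the UDTF above: one output row per tag."""

    def process(self, tags: str):
        for tag in tags.split(","):
            yield (tag.strip(),)


def explode_rows(handler, inputs):
    """Hypothetical local helper: pair each input with every row it yields."""
    out = []
    for value in inputs:
        for row in handler.process(value):
            out.append((value,) + row)
    return out


rows = explode_rows(SplitTags(), ["ml,ai,data", "sql"])
for row in rows:
    print(row)
```

The first input produces three output rows and the second produces one, which is the difference from a scalar UDF that returns exactly one value per input row.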

Stored Procedure Definition Example

from snowflake.snowpark import Session
from snowflake.snowpark.functions import count, sproc

@sproc(name="refresh_summary", replace=True, is_permanent=True,
       stage_location="@DEPLOY_STAGE",
       packages=["snowflake-snowpark-python"])
def refresh_summary(session: Session) -> str:
    source = session.table("RAW_EVENTS")
    summary = (
        source.group_by("EVENT_TYPE")
              .agg(count("*").alias("CNT"))
    )
    summary.write.save_as_table("EVENT_SUMMARY", mode="overwrite")
    return "EVENT_SUMMARY refreshed"

# Execution: CALL refresh_summary();

Vectorized UDF (Pandas UDF)

A Vectorized UDF takes pandas.Series as input and output and processes rows in batches instead of one at a time, which typically makes it faster than a regular UDF for numerical work. It is well suited to performance-sensitive numerical computation and ML inference.

from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, IntegerType
import pandas as pd

@pandas_udf(name="batch_normalize", return_type=PandasSeriesType(IntegerType()),
            input_types=[PandasSeriesType(IntegerType())],
            replace=True)
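def batch_normalize(series: pd.Series) -> pd.Series:
    # Note: mean/std are computed per batch, not over the whole column,
    # so scores depend on how rows are split into batches.
    mean = series.mean()
    std = series.std()
    if pd.isna(std) or std == 0:
        # Single-row or constant batch: no spread to normalize against
        return pd.Series(0, index=series.index, dtype="int64")
    return ((series - mean) / std * 100).astype(int)

# SQL: SELECT batch_normalize(SCORE) FROM EXAM_RESULTS;
# Batch processing is typically faster than a row-wise UDF; the speedup depends on the workload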
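
One caveat with the batch_normalize example: a vectorized UDF receives rows in batches whose boundaries you do not control, so a statistic computed inside the function (such as the mean or standard deviation) describes only the current batch. The pure-Python sketch below uses the statistics module instead of pandas and a hypothetical normalize_batch helper to show how the same value can receive different scores depending on the batch split.

```python
import statistics


def normalize_batch(batch):
    """Z-score each value against its own batch, scaled by 100 (as in the UDF)."""
    mean = statistics.mean(batch)
    std = statistics.stdev(batch)  # sample standard deviation, like Series.std()
    return [int((x - mean) / std * 100) for x in batch]


scores = [50, 60, 70, 80, 90, 100]

# Case 1: all rows arrive in a single batch
one_batch = normalize_batch(scores)

# Case 2: the same rows arrive as two batches of three
two_batches = normalize_batch(scores[:3]) + normalize_batch(scores[3:])

print(one_batch)
print(two_batches)
```

If a whole-column statistic is needed, one option is to compute it once with a DataFrame aggregation and pass it to the function, rather than deriving it inside each batch.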

Snowpark ML

Snowpark ML is a library that lets you train and serve ML models entirely inside Snowflake. It exposes a scikit-learn-compatible API for fit/predict, and models are registered to the Snowflake Model Registry for production deployment.

from snowflake.ml.modeling.linear_model import LogisticRegression
from snowflake.ml.registry import Registry

# Training
train_df = session.table("TRAINING_DATA")
model = LogisticRegression(
    input_cols=["FEATURE_A", "FEATURE_B", "FEATURE_C"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
model.fit(train_df)

# Inference
test_df = session.table("TEST_DATA")
predictions = model.predict(test_df)
predictions.write.save_as_table("PREDICTIONS", mode="overwrite")

# Model registration
reg = Registry(session)
mv = reg.log_model(
    model,
    model_name="churn_classifier",
    version_name="v1",
    sample_input_data=train_df.limit(10),
)

Snowpark ML component	Role
snowflake.ml.modeling	scikit-learn-compatible ML API (preprocessing, training, inference)
Model Registry	Model version management and deployment
Feature Store	Feature management, sharing, and point-in-time joins
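
The "point-in-time join" listed for the Feature Store can be illustrated without any Snowflake API. The sketch below is a plain-Python illustration of the idea, not the Feature Store's implementation: each label row is matched with the most recent feature value recorded at or before the label's timestamp, so no future information leaks into the training data. All names and data are hypothetical, and timestamps are plain integers for simplicity.

```python
from bisect import bisect_right


def point_in_time_join(labels, features):
    """Attach to each label row the latest feature with feature_ts <= ts.

    labels:   list of (entity_id, ts, label)
    features: list of (entity_id, feature_ts, value)
    """
    # Index the feature history per entity, sorted by timestamp
    history = {}
    for entity, ts, value in sorted(features, key=lambda f: (f[0], f[1])):
        times, values = history.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity, ts, label in labels:
        times, values = history.get(entity, ([], []))
        # Position of the last feature timestamp that is <= ts
        idx = bisect_right(times, ts) - 1
        feature = values[idx] if idx >= 0 else None
        joined.append((entity, ts, label, feature))
    return joined


features = [
    ("u1", 1, 10.0),
    ("u1", 5, 20.0),
    ("u2", 3, 7.0),
]
labels = [
    ("u1", 4, 0),   # sees the value from ts=1, not the later one from ts=5
    ("u1", 5, 1),   # a feature recorded at the same timestamp is included
    ("u2", 2, 0),   # no feature recorded yet
]

training_rows = point_in_time_join(labels, features)
for row in training_rows:
    print(row)
```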

Snowpark Container Services (SPCS)

SPCS is a managed service for running Docker containers inside Snowflake. While Snowpark focuses on DataFrame operations and UDF/Sproc execution, SPCS runs general-purpose workloads such as GPU-based model training, custom REST APIs, and full-stack web applications.

Aspect	Snowpark UDF/Sproc	SPCS
Execution model	SQL function / CALL statement	Long-running container service
GPU	Not supported	Supported
External network	Blocked by default; allowed via External Access Integration	Blocked by default; allowed via External Access Integration
Use cases	Data transformation, lightweight ML inference	Large-scale ML training, custom APIs, web apps

Study Roadmap

Period	Phase	What to study
Months 1-2	Core review + Python fundamentals	Snowflake architecture, basic Python/pandas operations
Months 3-4	Focused DataFrame API study	Session setup, DataFrame operations, window functions, file I/O
Months 5-6	UDF / UDTF / Sproc	Different definition styles, Vectorized UDF, permission model, package management
Months 7-8	Data Engineering + ML	Dynamic Tables, Streams/Tasks, Snowpark ML, Model Registry
Months 9-12	Mock exams + targeted review	Take multiple full-length 65-question mock exams and revisit the domains you missed

Sample Question

Question 1

You want to build a Python UDF in Snowpark and apply it to a column in a SELECT statement to perform a custom transformation on each row's text data. The UDF returns a single scalar value. To process a large number of rows quickly, you want batch execution instead of row-by-row. Which implementation is most appropriate?

A. Use a Vectorized UDF (@pandas_udf) with pandas.Series as input and output
B. Define it with the regular @udf decorator and speed it up by setting is_permanent=True
C. Define it as a UDTF and call it via TABLE() in the FROM clause
D. Define it as a Stored Procedure and run it with a CALL statement

Correct answer: A

A Vectorized UDF (@pandas_udf) is the best fit for batch processing many rows. Using pandas.Series for input and output removes the per-row function-call overhead and is typically much faster than a regular UDF. Option B's is_permanent simply controls whether the UDF is persisted and has no effect on speed. Option C's UDTF returns a table (multiple rows), which does not match the scalar return requirement. Option D's Stored Procedure is invoked with a CALL statement and cannot be applied per row inside a SELECT statement.

Frequently Asked Questions

How does Snowpark's DataFrame API differ from Spark's DataFrame API?

Snowpark's DataFrame API offers Spark-like syntax (select, filter, group_by, join, and so on), but where the work runs is fundamentally different. Spark distributes processing across a driver and executors, whereas Snowpark translates DataFrame operations into SQL that runs on a Snowflake warehouse. The data stays in Snowflake during processing and the query benefits from the Snowflake Optimizer. Snowpark also uses lazy evaluation: no SQL is actually issued until you call an action such as collect() or show().

When should you use UDF vs UDTF vs Stored Procedure?

A UDF (User-Defined Function) is a scalar function used inside a SELECT statement, ideal for per-row transformations and custom calculations. A UDTF (User-Defined Table Function) is called in the FROM clause and can return multiple rows per input row, which is useful for JSON expansion or log parsing. A Stored Procedure is invoked with a CALL statement and can issue DDL/DML or orchestrate multi-step workflows. Anything with side effects (creating tables, mutating data) belongs in a Stored Procedure. The exam frequently tests the differences in invocation style and return values across these three.

What is Snowpark Container Services, and how does it relate to Snowpark?

Snowpark Container Services (SPCS) is a managed service for running Docker containers inside Snowflake. While Snowpark handles DataFrame operations and UDF/UDTF/Stored Procedure execution, SPCS runs more general-purpose workloads (full-stack applications, GPU-based ML model training, custom REST APIs), all within Snowflake. A common integration pattern is to invoke an SPCS service from a Snowpark Stored Procedure. On the exam, SPCS is typically the right answer when a scenario calls for container-based processing.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
