SnowPro Specialty: Snowpark Complete Guide — Python/Scala Data Processing

2026-03-21
Updated: 2026-03-27
NicheeLab Editorial Team

Snowpark is a development framework for processing data inside Snowflake programmatically from Python, Scala, or Java. DataFrame operations are translated to SQL and run on a Snowflake warehouse, so the data does not have to leave Snowflake. Evaluation is lazy: no SQL is issued until you call an action, at which point the Snowflake optimizer plans the generated query.

SnowPro Specialty: Snowpark Exam Overview

| Item | Details |
| --- | --- |
| Questions | 65 |
| Time limit | 115 minutes |
| Exam fee | $225 USD |
| Passing score | 750 / 1000 |
| Prerequisite | SnowPro Core Certification |
| Validity | 2 years (recertification available) |

Exam Domains and Weights

| Domain | Weight | Key topics |
| --- | --- | --- |
| Snowflake Core Knowledge | ~15% | Architecture, warehouses, caching, security |
| DataFrame API | ~30% | Session, DataFrame operations, lazy eval, window functions |
| UDF / UDTF / Stored Procedure | ~25% | Scalar UDF, Vectorized UDF, UDTF, Sproc, permissions |
| Data Engineering | ~15% | File operations, Dynamic Tables, Tasks, Streams |
| ML Integration | ~15% | Snowpark ML, Model Registry, Feature Store |
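
As a rough planning aid, the approximate weights above can be turned into expected question counts. The sketch below assumes the 65 questions are spread in proportion to the listed weights, which the exam does not guarantee, and the weights themselves are only approximate.

```python
# Approximate domain weights from the table above (percent).
DOMAIN_WEIGHTS = {
    "Snowflake Core Knowledge": 15,
    "DataFrame API": 30,
    "UDF / UDTF / Stored Procedure": 25,
    "Data Engineering": 15,
    "ML Integration": 15,
}
TOTAL_QUESTIONS = 65


def expected_questions(weights: dict[str, int], total: int) -> dict[str, float]:
    """Assumption: questions are distributed proportionally to the weights."""
    return {domain: total * pct / 100 for domain, pct in weights.items()}


estimates = expected_questions(DOMAIN_WEIGHTS, TOTAL_QUESTIONS)
for domain, n in estimates.items():
    print(f"{domain}: about {n:.1f} questions")
```

Under that assumption, the DataFrame API and UDF / UDTF / Stored Procedure domains together account for more than half of the exam, which is a reasonable guide for how to split study time.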

Core DataFrame API Methods

Snowpark's DataFrame API lets you write data processing as Spark-style method chains. Every method compiles down to SQL, so the Snowflake Optimizer handles the optimization for you.

| Method | Purpose | SQL equivalent |
| --- | --- | --- |
| select() | Column selection | SELECT col1, col2 |
| filter() / where() | Row filtering | WHERE condition |
| group_by().agg() | Group aggregation | GROUP BY + aggregate functions |
| join() | Table join | JOIN ... ON |
| with_column() | Add or transform column | SELECT ..., expr AS alias |
| sort() / order_by() | Sort | ORDER BY |
| collect() | Fetch results (action) | Execute query + return results |
| show() | Display results (action) | Execute query + console output |
| write.save_as_table() | Save table (action) | CREATE TABLE AS SELECT |

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_, avg, count

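# connection_params: a dict of connection settings (account, user, role, warehouse, ...) defined elsewhere
session = Session.builder.configs(connection_params).create()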

# Lazy evaluation: no SQL is executed at this point
df = session.table("SALES")
result = (
    df.filter(col("SALE_DATE") >= "2026-01-01")
      .group_by("REGION", "PRODUCT_CATEGORY")
      .agg(
          sum_("AMOUNT").alias("TOTAL_SALES"),
          avg("AMOUNT").alias("AVG_SALE"),
          count("ORDER_ID").alias("ORDER_COUNT"),
      )
      .sort(col("TOTAL_SALES").desc())
)

# collect() issues the SQL and runs it on Snowflake
rows = result.collect()
for row in rows:
    print(f"{row['REGION']}: {row['TOTAL_SALES']}")
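
To make the lazy-evaluation behavior concrete, here is a toy model in plain Python. It is not Snowpark's implementation: the `LazyFrame` class and its `log` attribute are hypothetical names invented for illustration. Transformations only extend a recorded plan, and the SQL text is produced and "sent" only when the action `collect()` is called.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not Snowpark's code)."""

    def __init__(self, table, filters=(), order_by=None, log=None):
        self.table = table
        self.filters = tuple(filters)
        self.order_by = order_by
        # Shared list recording every SQL statement that was "sent".
        self.log = log if log is not None else []

    def filter(self, condition):
        # Transformation: returns a new plan, sends nothing.
        return LazyFrame(self.table, self.filters + (condition,),
                         self.order_by, self.log)

    def sort(self, column):
        # Transformation: returns a new plan, sends nothing.
        return LazyFrame(self.table, self.filters, column, self.log)

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.order_by:
            sql += f" ORDER BY {self.order_by}"
        return sql

    def collect(self):
        # Action: only now is the accumulated plan turned into SQL and "sent".
        self.log.append(self.to_sql())
        return []  # a real engine would return rows here


df = LazyFrame("SALES")
plan = df.filter("SALE_DATE >= '2026-01-01'").filter("AMOUNT > 0").sort("AMOUNT")
statements_before = len(plan.log)  # still zero: nothing has been sent yet
plan.collect()
print(statements_before, plan.log)
```

The chained calls collapse into a single statement at action time, which is the property that lets the database plan the whole pipeline at once.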

UDF / UDTF / Stored Procedure Comparison

| Aspect | UDF | UDTF | Stored Procedure |
| --- | --- | --- | --- |
| Invocation | Inside SELECT (scalar function) | FROM clause (TABLE(func())) | CALL statement |
| Return value | Scalar value (one value per row) | Table (multiple rows) | Arbitrary (side effects are the main goal) |
| Side effects | Not allowed (pure function) | Not allowed (pure function) | Allowed (DDL/DML supported) |
| Permission model | Caller Rights | Caller Rights | Caller Rights / Owner Rights |
| Languages | Python / Scala / Java / SQL | Python / Scala / Java / SQL | Python / Scala / Java / SQL / JavaScript |

UDF Definition Example (Python)

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType, IntegerType

# Inline UDF
@udf(name="calculate_tax", return_type=IntegerType(),
     input_types=[IntegerType()], replace=True, is_permanent=False)
def calculate_tax(amount: int) -> int:
    return int(amount * 0.1)

# SQL call: SELECT calculate_tax(AMOUNT) FROM ORDERS;

# --- UDTF definition example ---
# Registered through session.udtf.register below, so the udtf decorator import is not needed
from snowflake.snowpark.types import StructType, StructField

class SplitTags:
    def process(self, tags: str):
        for tag in tags.split(","):
            yield (tag.strip(),)

session.udtf.register(
    SplitTags,
    output_schema=StructType([StructField("TAG", StringType())]),
    input_types=[StringType()],
    name="split_tags",
    replace=True,
)
# SQL call: SELECT * FROM TABLE(split_tags('ml,ai,data'));
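
Because the handlers above are ordinary Python callables, their logic can be exercised locally before anything is registered in Snowflake. A minimal sketch, assuming plain-Python copies of the two handlers: the `_local` / `Local` names are hypothetical, and the NULL guards are an addition that the original handlers do not have.

```python
def calculate_tax_local(amount):
    """Scalar handler: one input value in, one value out."""
    if amount is None:  # SQL NULL arrives as None in a Python handler
        return None
    return int(amount * 0.1)


class SplitTagsLocal:
    """Table-function handler: process() yields zero or more output rows."""

    def process(self, tags):
        if tags is None:
            return  # no output rows for a NULL input
        for tag in tags.split(","):
            yield (tag.strip(),)


# Scalar: exactly one result per input row.
taxes = [calculate_tax_local(a) for a in (1000, 250, None)]

# Table function: one input row can fan out into several output rows.
rows = list(SplitTagsLocal().process("ml, ai ,data"))
print(taxes, rows)
```

This also shows the contrast the comparison table draws: the scalar handler maps each input to exactly one value, while the table-function handler can emit any number of rows per input.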

Stored Procedure Definition Example

from snowflake.snowpark.functions import sproc

@sproc(name="refresh_summary", replace=True, is_permanent=True,
       stage_location="@DEPLOY_STAGE",
       packages=["snowflake-snowpark-python"])
def refresh_summary(session: Session) -> str:
    source = session.table("RAW_EVENTS")
    summary = (
        source.group_by("EVENT_TYPE")
              .agg(count("*").alias("CNT"))
    )
    summary.write.save_as_table("EVENT_SUMMARY", mode="overwrite")
    return "EVENT_SUMMARY refreshed"

# Execution: CALL refresh_summary();

Vectorized UDF (Pandas UDF)

A Vectorized UDF takes pandas.Series as input and output and processes rows in batches instead of one at a time, which typically makes it faster than a regular row-wise UDF. It is well-suited to performance-sensitive numerical computation and ML inference.

from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, IntegerType
import pandas as pd

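@pandas_udf(name="batch_normalize", return_type=PandasSeriesType(IntegerType()),
            input_types=[PandasSeriesType(IntegerType())],
            replace=True)
def batch_normalize(series: pd.Series) -> pd.Series:
    # Note: mean and std describe the current batch only, not the whole column
    mean = series.mean()
    std = series.std()
    if pd.isna(std) or std == 0:
        # Single-row or constant batch: avoid dividing by zero or casting NaN to int
        return pd.Series(0, index=series.index)
    return ((series - mean) / std * 100).astype(int)

# SQL: SELECT batch_normalize(SCORE) FROM EXAM_RESULTS;
# Batch processing avoids per-row call overhead and is usually much faster than a row-wise UDF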
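
One thing to keep in mind with batch-style handlers: a vectorized UDF sees one batch at a time, and the batch boundaries are chosen by Snowflake, so statistics computed inside the handler describe only that batch. The sketch below uses plain Python lists and the standard library instead of pandas to illustrate the effect. The scores and the batch split are made-up values, and it uses the population standard deviation for simplicity.

```python
from statistics import mean, pstdev


def normalize(values):
    """Z-score scaled by 100, using only the values passed in."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0 for _ in values]
    return [int((v - mu) / sigma * 100) for v in values]


scores = [50, 60, 70, 80]  # hypothetical column values

# Whole column normalized in one call.
whole = normalize(scores)

# Same column, but the handler receives two batches of two rows each.
batched = normalize(scores[:2]) + normalize(scores[2:])

print(whole)
print(batched)
```

The two results differ, so a normalization that must be consistent across the whole column is better driven by statistics computed up front (for example with a DataFrame aggregation) and passed into the function, rather than derived per batch.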

Snowpark ML

Snowpark ML is a library that lets you train and serve ML models entirely inside Snowflake. It exposes a scikit-learn-compatible API for fit/predict, and models are registered to the Snowflake Model Registry for production deployment.

from snowflake.ml.modeling.linear_model import LogisticRegression
from snowflake.ml.registry import Registry

# Training
train_df = session.table("TRAINING_DATA")
model = LogisticRegression(
    input_cols=["FEATURE_A", "FEATURE_B", "FEATURE_C"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
model.fit(train_df)

# Inference
test_df = session.table("TEST_DATA")
predictions = model.predict(test_df)
predictions.write.save_as_table("PREDICTIONS", mode="overwrite")

# Model registration
reg = Registry(session)
mv = reg.log_model(
    model,
    model_name="churn_classifier",
    version_name="v1",
    sample_input_data=train_df.limit(10),
)

| Snowpark ML component | Role |
| --- | --- |
| snowflake.ml.modeling | scikit-learn-compatible ML API (preprocessing, training, inference) |
| Model Registry | Model version management and deployment |
| Feature Store | Feature management, sharing, and point-in-time joins |
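
The point-in-time join mentioned for Feature Store can be illustrated without any Snowflake API: for each labeled event, take the most recent feature value recorded at or before the event time, so that no future information leaks into the training data. A minimal sketch in plain Python follows. The entity IDs, timestamps, and values are invented, and this is not how Feature Store itself is implemented.

```python
from bisect import bisect_right

# Feature history per entity, sorted by timestamp: (timestamp, value).
feature_history = {
    "C1": [(1, 10.0), (5, 12.5), (9, 20.0)],
    "C2": [(3, 7.0)],
}

# Labeled events: (entity, event timestamp).
events = [("C1", 4), ("C1", 9), ("C2", 2)]


def point_in_time_value(history, event_ts):
    """Latest feature value with timestamp <= event_ts, or None if there is none."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx else None


joined = [
    (entity, ts, point_in_time_value(feature_history[entity], ts))
    for entity, ts in events
]
print(joined)
```

The event for `C2` at time 2 gets no feature value because the only recorded value arrives later, at time 3. A plain equality or latest-value join would have leaked it.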

Snowpark Container Services (SPCS)

SPCS is a managed service for running Docker containers inside Snowflake. While Snowpark focuses on DataFrame operations and UDF/Sproc execution, SPCS runs general-purpose workloads such as GPU-based model training, custom REST APIs, and full-stack web applications.

| Aspect | Snowpark UDF/Sproc | SPCS |
| --- | --- | --- |
| Execution model | SQL function / CALL statement | Long-running container service |
| GPU | Not supported | Supported |
| External network | Restricted | Allowed via External Access Integration |
| Use cases | Data transformation, lightweight ML inference | Large-scale ML training, custom APIs, web apps |

Study Roadmap

| Period | Phase | What to study |
| --- | --- | --- |
| Months 1-2 | Core review + Python fundamentals | Snowflake architecture, basic Python/pandas operations |
| Months 3-4 | Focused DataFrame API study | Session setup, DataFrame operations, window functions, file I/O |
| Months 5-6 | UDF / UDTF / Sproc | Different definition styles, Vectorized UDF, permission model, package management |
| Months 7-8 | Data Engineering + ML | Dynamic Tables, Streams/Tasks, Snowpark ML, Model Registry |
| Months 9-12 | Mock exams + targeted review | Take multiple full-length 65-question mock exams and revisit the domains you missed |

Sample Question

Question 1

You want to build a Python UDF in Snowpark and apply it to a column in a SELECT statement to perform a custom transformation on each row's text data. The UDF returns a single scalar value. To process a large number of rows quickly, you want batch execution instead of row-by-row. Which implementation is most appropriate?

  A. Use a Vectorized UDF (@pandas_udf) with pandas.Series as input and output
  B. Define it with the regular @udf decorator and speed it up by setting is_permanent=True
  C. Define it as a UDTF and call it via TABLE() in the FROM clause
  D. Define it as a Stored Procedure and run it with a CALL statement

Correct answer: A

A Vectorized UDF (@pandas_udf) is the best fit for batch processing many rows. Using pandas.Series for input and output reduces per-row function-call overhead and is usually much faster than a regular UDF. Option B's is_permanent only controls whether the UDF is persisted and has no effect on speed. Option C's UDTF returns a table (multiple rows), which does not match the scalar return requirement. Option D's Stored Procedure is invoked with a CALL statement and cannot be applied per row inside a SELECT statement.

Frequently Asked Questions

How does Snowpark's DataFrame API differ from Spark's DataFrame API?

Snowpark's DataFrame API offers Spark-like syntax (select, filter, group_by, join, and so on), but where the work runs is fundamentally different. Spark distributes processing across drivers and executors, whereas Snowpark translates every DataFrame operation into SQL that runs on a Snowflake warehouse. The data never leaves Snowflake and benefits from the Snowflake Optimizer. Snowpark also uses lazy evaluation: no SQL is actually issued until you call an action such as collect() or show().

When should you use UDF vs UDTF vs Stored Procedure?

A UDF (User-Defined Function) is a scalar function used inside a SELECT statement, ideal for per-row transformations and custom calculations. A UDTF (User-Defined Table Function) is called in the FROM clause and returns multiple rows per input row — useful for JSON expansion or log parsing. A Stored Procedure is invoked with a CALL statement and can issue DDL/DML or orchestrate multi-step workflows. Anything with side effects (creating tables, mutating data) belongs in a Stored Procedure. The exam frequently tests the differences in invocation style and return values across these three.
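
The rule of thumb in this answer can be written down as a tiny decision helper. This is a simplification for study purposes: `choose_construct` and its two parameters are hypothetical names, and real designs weigh more factors (permissions, language support, performance).

```python
def choose_construct(has_side_effects: bool, returns_multiple_rows: bool) -> str:
    """Pick UDF, UDTF, or Stored Procedure from two yes/no questions."""
    if has_side_effects:
        # DDL/DML or multi-step orchestration: invoked with CALL.
        return "Stored Procedure"
    if returns_multiple_rows:
        # Several output rows per input row: called via TABLE(...) in FROM.
        return "UDTF"
    # One value per input row: used inside SELECT.
    return "UDF"


print(choose_construct(has_side_effects=False, returns_multiple_rows=False))
print(choose_construct(has_side_effects=False, returns_multiple_rows=True))
print(choose_construct(has_side_effects=True, returns_multiple_rows=False))
```

Side effects are checked first because they rule out both function types regardless of the shape of the result.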

What is Snowpark Container Services, and how does it relate to Snowpark?

Snowpark Container Services (SPCS) is a managed service for running Docker containers inside Snowflake. While Snowpark handles DataFrame operations and UDF/UDTF/Stored Procedure execution, SPCS runs more general-purpose workloads (full-stack applications, GPU-based ML model training, custom REST APIs), all within Snowflake. A common integration pattern is to invoke an SPCS service from a Snowpark Stored Procedure. On the exam, SPCS is typically the expected answer when a scenario calls for container-based processing.
