PySpark Performance: 10 Optimization Tips for Data Engineers
Master PySpark performance optimization with these essential tips covering partitioning, caching, joins, and avoiding common pitfalls.
Writing PySpark code that works is one thing. Writing PySpark code that runs fast at scale is another. This guide covers the optimization techniques that separate junior data engineers from senior ones.
Spark transformations are lazy: they only build up a query plan, and nothing executes until an action is called.
# All these transformations are lazy: Spark only builds a query plan
from pyspark.sql.functions import col

df = spark.read.parquet("/data/events")
df_filtered = df.filter(col("event_type") == "purchase")
df_selected = df_filtered.select("user_id", "amount")

# NOW everything executes, because show() is an action
df_selected.show()
| Action | Description |
|--------|-------------|
| show() | Display rows |
| collect() | Return all data to driver |
| count() | Count rows |
| write() | Write to storage |
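For example, a write is an action that triggers the whole pipeline built above (the output path here is just a placeholder):

# Writing is an action: it executes every lazy transformation that feeds it
df_selected.write.mode("overwrite").parquet("/data/output/purchases")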
Avoid collect() on Large DataFrames

collect() brings every row back to the driver node, which can easily cause out-of-memory (OOM) errors on large datasets.
# BAD: pulls the entire DataFrame onto the driver
all_data = df.collect()

# GOOD: limit how many rows you bring back
sample = df.limit(100).collect()
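If you only need a handful of rows, take() is an equally safe option; for full results, write to storage instead of collecting (the output path is a placeholder):

# Also safe: take() returns at most n rows to the driver
preview = df.take(100)

# For full results, write to distributed storage instead of collecting
df.write.mode("overwrite").parquet("/data/output/events_clean")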
# Broadcast join: ship the small table to every executor so the
# large table never has to be shuffled
from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "customer_id")

# Verify the strategy in the physical plan
result.explain()  # Look for BroadcastHashJoin
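Spark also broadcasts small tables automatically when they fall below spark.sql.autoBroadcastJoinThreshold (10 MB by default); a sketch of raising it, with 50 MB as an arbitrary example value:

# Raise the automatic broadcast threshold to ~50 MB (default is 10 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Setting it to -1 disables automatic broadcasting entirely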
# SLOW: Python UDF, rows are serialized to a Python worker one at a time
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

@udf(StringType())
def upper_udf(s):
    return s.upper()

df.select(upper_udf(col("name")))

# FAST: built-in function runs inside the JVM with no serialization overhead
from pyspark.sql.functions import upper

df.select(upper(col("name")))
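When no built-in covers your logic, a vectorized pandas UDF is usually a better fallback than a plain Python UDF because it processes whole batches through Arrow; a minimal sketch (assumes pyarrow is installed):

import pandas as pd
from pyspark.sql.functions import pandas_udf, col

@pandas_udf("string")
def upper_pandas(s: pd.Series) -> pd.Series:
    # Operates on a whole batch at once instead of row by row
    return s.str.upper()

df.select(upper_pandas(col("name")))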
Cache only when:

- the DataFrame is reused by more than one action
- recomputing it is expensive
- it fits comfortably in the available memory
df_processed.cache()
df_processed.count() # Force materialization
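Once downstream work is finished, release the cache so executors can reclaim that memory:

# Release the cached data when it is no longer needed
df_processed.unpersist()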
Rules of thumb:

- aim for partitions of roughly 100-200 MB
- use repartition() to increase the partition count or rebalance skewed data
- use coalesce() to reduce the partition count cheaply
# Repartition: full shuffle, use to increase partitions or rebalance
df = df.repartition(100)

# Coalesce: narrow operation, merges existing partitions without a full shuffle
df = df.coalesce(10)
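Two related calls worth knowing: repartitioning by a key column before a join or partitioned write, and checking the current partition count (the column name here is illustrative):

# Repartition by a join/write key so related rows land in the same partition
df = df.repartition(200, "customer_id")

# Inspect how many partitions a DataFrame currently has
print(df.rdd.getNumPartitions())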
# GOOD: filter first, so far less data is shuffled into the join
df_filtered = df.filter(col("date") >= "2024-01-01")
df_joined = df_filtered.join(other_df, "id")

# BAD: join the full tables, then throw most of the rows away
df_joined = df.join(other_df, "id")
df_filtered = df_joined.filter(col("date") >= "2024-01-01")
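You can confirm that the filter actually runs before the join by inspecting the physical plan:

# The filter should appear below the join in the plan; for Parquet sources,
# look for the predicate under PushedFilters in the scan node
df_joined.explain()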
Enable Adaptive Query Execution (on by default since Spark 3.2, but worth verifying):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
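A related AQE setting lets Spark merge small shuffle partitions at runtime:

# Let AQE coalesce small shuffle partitions after each stage
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")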
# Parquet is columnar: only the needed columns and partitions are read
df = spark.read.parquet("/data/events") \
    .filter(col("date") == "2024-01-15") \
    .select("user_id", "event_type")
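Partition pruning only works if the data was written partitioned by that column in the first place; a sketch of producing such a layout (the path is a placeholder):

# Write partitioned by date so later date filters skip whole directories
df.write.mode("overwrite").partitionBy("date").parquet("/data/events_partitioned")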
Key things to check when a job runs slowly:

- the physical plan from explain() (broadcast vs. shuffle joins)
- partition counts and sizes
- Python UDFs that could be replaced with built-ins or pandas UDFs
- filters and column selections that could be pushed earlier
Level up your PySpark skills! ⚔️