Fabric Pandas Performance Troubleshooting

Diagnose and resolve pandas-related performance issues in Microsoft Fabric Spark notebooks, including memory exhaustion, slow conversions, and suboptimal pandas API on Spark usage.

When to Use This Skill

Notebook cells hang or timeout during pandas operations
toPandas() fails with OutOfMemoryError or Java heap space errors
collect() crashes the driver node
Pandas API on Spark (pyspark.pandas / ps) runs slower than expected
DataFrame conversion between Spark and pandas causes memory spikes
Notebook kernel restarts unexpectedly during data processing
Large dataset operations exhaust driver memory on Fabric capacity
Need to choose between pandas, Spark DataFrame, or pandas API on Spark

Prerequisites

Microsoft Fabric workspace with Data Engineering experience
Fabric capacity F2 or higher (F64+ recommended for large datasets)
PySpark notebook with Spark session active
Basic familiarity with pandas and PySpark DataFrames

Quick Diagnosis

Symptom-to-Solution Map

Symptom	Likely Cause	Jump To
`toPandas()` OOM error	Dataset too large for driver	toPandas Optimization
Kernel restart during pandas op	Driver memory exhausted	Driver Memory Tuning
`pyspark.pandas` slower than native pandas	Spark overhead on small data	Right-Size Your Approach
Slow groupby/merge in pandas API on Spark	Excessive shuffling	Shuffle Optimization
Cell timeout on DataFrame conversion	Large collect to driver	Incremental Processing
`ArrowInvalid` or conversion errors	Schema mismatch / nulls	Arrow Conversion Fixes
High memory but slow pandas operations	GC pressure / fragmentation	Memory Profiling

Right-Size Your Approach

Critical Decision: Choose the right DataFrame API for your data size and workload.

Dataset Size Decision Tree:
─────────────────────────────────────────────────────────
< 100 MB          → Native pandas (pd.DataFrame)
100 MB - 1 GB     → pandas API on Spark (ps.DataFrame)
> 1 GB            → PySpark DataFrame (spark.DataFrame)
> 10 GB           → PySpark + partitioning + Delta optimization

Mixed workload?   → Process in Spark, convert final aggregation to pandas
Visualization?    → Aggregate in Spark first, toPandas() on summary only
ML feature eng?   → Spark for transforms, pandas for final model input

API Comparison

Operation	Native pandas	pandas API on Spark	PySpark DataFrame
Memory model	Single-node (driver)	Distributed	Distributed
Max practical size	~2-4 GB	10s-100s GB	TB+
Startup overhead	None	Spark session	Spark session
groupby speed (small)	Fast	Slower (shuffle)	Slower (shuffle)
groupby speed (large)	OOM risk	Fast	Fast
Interop with Spark	`.toPandas()`	`.to_spark()`	Native

toPandas Optimization

Problem

toPandas() collects the entire distributed DataFrame to the single driver node. This is the #1 cause of OOM in Fabric notebooks.

Solutions (Progressive)

1. Reduce data BEFORE conversion

# BAD - converts entire table
pdf = spark_df.toPandas()

# GOOD - filter and select first
pdf = (spark_df
    .filter("date >= '2024-01-01'")
    .select("customer_id", "revenue", "region")
    .toPandas())

2. Aggregate in Spark, convert summary

# BAD - convert raw data then aggregate in pandas
pdf = spark_df.toPandas()
result = pdf.groupby('region')['revenue'].sum()

# GOOD - aggregate in Spark first
summary = spark_df.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))
pdf = summary.toPandas()  # Only converting small aggregated result

3. Enable Apache Arrow for faster conversion

# Enable Arrow-based columnar transfer (3-100x faster)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

# Now toPandas() uses Arrow columnar format
pdf = spark_df.toPandas()

4. Use sampling for exploration

# Sample before converting (for EDA/visualization)
pdf = spark_df.sample(fraction=0.01, seed=42).toPandas()

# Or limit rows
pdf = spark_df.limit(100000).toPandas()

5. Chunk large conversions

# Process in chunks using partitioning
def process_in_chunks(spark_df, chunk_col="date", process_fn=None):
    """Convert Spark DF to pandas in manageable chunks."""
    chunks = [row[chunk_col] for row in spark_df.select(chunk_col).distinct().collect()]
    results = []
    for chunk_val in chunks:
        chunk_pdf = spark_df.filter(F.col(chunk_col) == chunk_val).toPandas()
        if process_fn:
            chunk_pdf = process_fn(chunk_pdf)
        results.append(chunk_pdf)
    return pd.concat(results, ignore_index=True)

Driver Memory Tuning

Fabric Driver Memory by Node Size

Node Size	vCores	Memory	Recommended Max toPandas()
Small	4	32 GB	~4-6 GB
Medium	8	64 GB	~10-12 GB
Large	16	128 GB	~20-25 GB
X-Large	32	256 GB	~40-50 GB

Rule of thumb: toPandas() safe limit ≈ 15-20% of total driver memory (pandas creates copies during operations).

Configure Driver Memory

# Check current driver memory
print(f"Driver memory: {spark.conf.get('spark.driver.memory', 'default')}")

# Set via environment Spark properties (before session starts)
# In Fabric Environment > Spark properties:
# spark.driver.memory = 28g  (for Medium nodes)

# Or override in notebook (must be first cell)
%%configure
{
    "driverMemory": "28g",
    "driverCores": 8
}

Resource Profile Selection for Pandas Workloads

# For notebooks heavy on pandas operations (read-heavy pattern)
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")

# Key settings this enables:
# - spark.databricks.delta.optimizeWrite.enabled = true
# - Optimized read paths for Delta tables

Shuffle Optimization

Tune for pandas API on Spark Operations

# Reduce shuffle partitions for smaller datasets
# Default 200 is too high for datasets < 10 GB
spark.conf.set("spark.sql.shuffle.partitions", "auto")  # Adaptive (AQE)

# Or set explicitly based on data size
# Rule: ~128 MB per partition
data_size_gb = 5
optimal_partitions = max(1, int(data_size_gb * 1024 / 128))
spark.conf.set("spark.sql.shuffle.partitions", str(optimal_partitions))

# Enable Adaptive Query Execution (on by default in Fabric)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Broadcast Join Optimization

# Increase broadcast threshold for pandas API on Spark joins
# Default: 10 MB - increase for medium lookup tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")  # 100 MB

# Force broadcast for known small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "key_col")

Enable Autotune

# Let Fabric auto-optimize shuffle, broadcast, and partition settings
spark.conf.set("spark.ms.autotune.enabled", "true")

# Autotune requires:
# - Runtime 1.1 or 1.2
# - Queries > 15 seconds
# - Not in high concurrency mode
# - ~20-25 iterations to learn optimal settings

Arrow Conversion Fixes

Common Errors and Solutions

Error	Cause	Fix
`ArrowInvalid: Could not convert X`	Unsupported type	Cast column before conversion
`ArrowNotImplementedError`	Nested types	Flatten struct/array columns
`pyarrow.lib.ArrowMemoryError`	OOM during Arrow transfer	Reduce data size

fabric-pandas-perf-remediate

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

xlsx

mem-search

weekly-digests

how-it-works

Recibe nuevas skills de Dados e Análise todos los lunes