Fabric Pandas Performance Troubleshooting
Diagnose and resolve pandas-related performance issues in Microsoft Fabric Spark notebooks, including memory exhaustion, slow conversions, and suboptimal pandas API on Spark usage.
When to Use This Skill
- Notebook cells hang or timeout during pandas operations
toPandas()fails with OutOfMemoryError or Java heap space errorscollect()crashes the driver node- Pandas API on Spark (
pyspark.pandas/ps) runs slower than expected - DataFrame conversion between Spark and pandas causes memory spikes
- Notebook kernel restarts unexpectedly during data processing
- Large dataset operations exhaust driver memory on Fabric capacity
- Need to choose between pandas, Spark DataFrame, or pandas API on Spark
Prerequisites
- Microsoft Fabric workspace with Data Engineering experience
- Fabric capacity F2 or higher (F64+ recommended for large datasets)
- PySpark notebook with Spark session active
- Basic familiarity with pandas and PySpark DataFrames
Quick Diagnosis
Symptom-to-Solution Map
| Symptom | Likely Cause | Jump To |
|---|---|---|
toPandas() OOM error | Dataset too large for driver | toPandas Optimization |
| Kernel restart during pandas op | Driver memory exhausted | Driver Memory Tuning |
pyspark.pandas slower than native pandas | Spark overhead on small data | Right-Size Your Approach |
| Slow groupby/merge in pandas API on Spark | Excessive shuffling | Shuffle Optimization |
| Cell timeout on DataFrame conversion | Large collect to driver | Incremental Processing |
ArrowInvalid or conversion errors | Schema mismatch / nulls | Arrow Conversion Fixes |
| High memory but slow pandas operations | GC pressure / fragmentation | Memory Profiling |
Right-Size Your Approach
Critical Decision: Choose the right DataFrame API for your data size and workload.
Dataset Size Decision Tree:
─────────────────────────────────────────────────────────
< 100 MB → Native pandas (pd.DataFrame)
100 MB - 1 GB → pandas API on Spark (ps.DataFrame)
> 1 GB → PySpark DataFrame (spark.DataFrame)
> 10 GB → PySpark + partitioning + Delta optimization
Mixed workload? → Process in Spark, convert final aggregation to pandas
Visualization? → Aggregate in Spark first, toPandas() on summary only
ML feature eng? → Spark for transforms, pandas for final model input
API Comparison
| Operation | Native pandas | pandas API on Spark | PySpark DataFrame |
|---|---|---|---|
| Memory model | Single-node (driver) | Distributed | Distributed |
| Max practical size | ~2-4 GB | 10s-100s GB | TB+ |
| Startup overhead | None | Spark session | Spark session |
| groupby speed (small) | Fast | Slower (shuffle) | Slower (shuffle) |
| groupby speed (large) | OOM risk | Fast | Fast |
| Interop with Spark | .toPandas() | .to_spark() | Native |
toPandas Optimization
Problem
toPandas() collects the entire distributed DataFrame to the single driver node. This is the #1 cause of OOM in Fabric notebooks.
Solutions (Progressive)
1. Reduce data BEFORE conversion
# BAD - converts entire table
pdf = spark_df.toPandas()
# GOOD - filter and select first
pdf = (spark_df
.filter("date >= '2024-01-01'")
.select("customer_id", "revenue", "region")
.toPandas())
2. Aggregate in Spark, convert summary
# BAD - convert raw data then aggregate in pandas
pdf = spark_df.toPandas()
result = pdf.groupby('region')['revenue'].sum()
# GOOD - aggregate in Spark first
summary = spark_df.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))
pdf = summary.toPandas() # Only converting small aggregated result
3. Enable Apache Arrow for faster conversion
# Enable Arrow-based columnar transfer (3-100x faster)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
# Now toPandas() uses Arrow columnar format
pdf = spark_df.toPandas()
4. Use sampling for exploration
# Sample before converting (for EDA/visualization)
pdf = spark_df.sample(fraction=0.01, seed=42).toPandas()
# Or limit rows
pdf = spark_df.limit(100000).toPandas()
5. Chunk large conversions
# Process in chunks using partitioning
def process_in_chunks(spark_df, chunk_col="date", process_fn=None):
"""Convert Spark DF to pandas in manageable chunks."""
chunks = [row[chunk_col] for row in spark_df.select(chunk_col).distinct().collect()]
results = []
for chunk_val in chunks:
chunk_pdf = spark_df.filter(F.col(chunk_col) == chunk_val).toPandas()
if process_fn:
chunk_pdf = process_fn(chunk_pdf)
results.append(chunk_pdf)
return pd.concat(results, ignore_index=True)
Driver Memory Tuning
Fabric Driver Memory by Node Size
| Node Size | vCores | Memory | Recommended Max toPandas() |
|---|---|---|---|
| Small | 4 | 32 GB | ~4-6 GB |
| Medium | 8 | 64 GB | ~10-12 GB |
| Large | 16 | 128 GB | ~20-25 GB |
| X-Large | 32 | 256 GB | ~40-50 GB |
Rule of thumb: toPandas() safe limit ≈ 15-20% of total driver memory (pandas creates copies during operations).
Configure Driver Memory
# Check current driver memory
print(f"Driver memory: {spark.conf.get('spark.driver.memory', 'default')}")
# Set via environment Spark properties (before session starts)
# In Fabric Environment > Spark properties:
# spark.driver.memory = 28g (for Medium nodes)
# Or override in notebook (must be first cell)
%%configure
{
"driverMemory": "28g",
"driverCores": 8
}
Resource Profile Selection for Pandas Workloads
# For notebooks heavy on pandas operations (read-heavy pattern)
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")
# Key settings this enables:
# - spark.databricks.delta.optimizeWrite.enabled = true
# - Optimized read paths for Delta tables
Shuffle Optimization
Tune for pandas API on Spark Operations
# Reduce shuffle partitions for smaller datasets
# Default 200 is too high for datasets < 10 GB
spark.conf.set("spark.sql.shuffle.partitions", "auto") # Adaptive (AQE)
# Or set explicitly based on data size
# Rule: ~128 MB per partition
data_size_gb = 5
optimal_partitions = max(1, int(data_size_gb * 1024 / 128))
spark.conf.set("spark.sql.shuffle.partitions", str(optimal_partitions))
# Enable Adaptive Query Execution (on by default in Fabric)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Broadcast Join Optimization
# Increase broadcast threshold for pandas API on Spark joins
# Default: 10 MB - increase for medium lookup tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m") # 100 MB
# Force broadcast for known small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "key_col")
Enable Autotune
# Let Fabric auto-optimize shuffle, broadcast, and partition settings
spark.conf.set("spark.ms.autotune.enabled", "true")
# Autotune requires:
# - Runtime 1.1 or 1.2
# - Queries > 15 seconds
# - Not in high concurrency mode
# - ~20-25 iterations to learn optimal settings
Arrow Conversion Fixes
Common Errors and Solutions
| Error | Cause | Fix |
|---|---|---|
ArrowInvalid: Could not convert X | Unsupported type | Cast column before conversion |
ArrowNotImplementedError | Nested types | Flatten struct/array columns |
pyarrow.lib.ArrowMemoryError | OOM during Arrow transfer | Reduce data size |