Microsoft Fabric Apache Spark Performance Remediation
Systematic workflows for diagnosing, analyzing, and resolving Apache Spark performance problems in Microsoft Fabric Data Engineering and Data Science workloads.
When to Use This Skill
Activate when encountering any of the following scenarios:
- Spark notebooks or jobs running slower than expected
- Capacity throttling errors (HTTP 430 / TooManyRequestsForCapacity)
- Data skew detected by Spark Advisor in notebook cells
- Excessive shuffle read/write in Spark UI stages
- Small-file accumulation in Delta Lake tables
- Streaming ingestion throughput degradation
- Need to select or tune a Fabric Spark resource profile
- VOrder vs. Optimized Write decision-making
- Autotune configuration and validation
- Right-sizing Spark pools, node counts, or Fabric capacity SKUs
Prerequisites
- Access to a Microsoft Fabric workspace with Data Engineering enabled
- Contributor or higher role on the workspace
- Familiarity with PySpark or Spark SQL
- PowerShell 7+ (for diagnostic scripts)
- Fabric REST API access token (for API-based diagnostics)
Quick Diagnosis Decision Tree
Start here when a Spark job is slow:
- Is the job queued or throttled? Check Monitoring Hub for HTTP 430.
  - Yes → See Capacity and Concurrency Tuning
  - No → Continue
- Did the Spark Advisor flag warnings? Check notebook cell indicators.
  - Data Skew detected → See Data Skew Resolution
  - No warnings → Continue
- Is a single stage disproportionately slow? Open Spark UI → Stages tab.
  - Yes, shuffle stage → See Shuffle Optimization
  - Yes, scan stage → See File Scan Optimization
  - No → Continue
- Are executors underutilized? Check Resources tab in monitoring detail.
  - High idle cores → See Pool and Executor Sizing
  - All cores busy → See Partitioning Strategy
- Is the issue write-related? Check write duration in Spark UI.
  - Yes → See Delta Write Optimization
  - No → See General Spark SQL Tuning
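The decision tree above is a chain of checks evaluated in order, which can be sketched in Python. The function name `next_guide` and its parameters are hypothetical; you fill them in by hand from what Monitoring Hub, Spark Advisor, and the Spark UI show. Treating an inconclusive executor check as "continue to the write check" is an assumption, since the tree itself does not say what to do in that case.

```python
from typing import Optional


def next_guide(
    throttled: bool,
    skew_flagged: bool,
    slow_stage: Optional[str] = None,   # "shuffle", "scan", or None if no stage stands out
    idle_cores: Optional[bool] = None,  # True = high idle cores, False = all busy, None = inconclusive
    write_bound: bool = False,
) -> str:
    """Return the name of the guide section the decision tree points to."""
    if throttled:
        return "Capacity and Concurrency Tuning"
    if skew_flagged:
        return "Data Skew Resolution"
    if slow_stage == "shuffle":
        return "Shuffle Optimization"
    if slow_stage == "scan":
        return "File Scan Optimization"
    if idle_cores is True:
        return "Pool and Executor Sizing"
    if idle_cores is False:
        return "Partitioning Strategy"
    # Assumption: only reached when the executor check was inconclusive.
    if write_bound:
        return "Delta Write Optimization"
    return "General Spark SQL Tuning"
```

For example, `next_guide(False, False, slow_stage="scan")` points to File Scan Optimization. The order of the checks matters: throttling is ruled out first because no amount of query tuning helps a job that cannot get capacity.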
Core Spark Configuration Quick Reference
These are the three settings Fabric Autotune manages automatically. If autotune is disabled, tune them manually:
| Setting | Default | Purpose | Tuning Guidance |
|---|---|---|---|
| spark.sql.shuffle.partitions | 200 | Partition count during joins/aggregations | Set to 2-3x total executor cores for your pool |
| spark.sql.autoBroadcastJoinThreshold | 10 MB | Max table size for broadcast joins | Increase to 100-256 MB for star-schema joins |
| spark.sql.files.maxPartitionBytes | 128 MB | Max bytes per file-read partition | Increase for large sequential scans, decrease for high parallelism |
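The "2-3x total executor cores" guidance for shuffle partitions is simple arithmetic. A minimal sketch follows, assuming you know the executor count and cores per executor for your pool. The helper names `recommended_shuffle_partitions` and `apply_shuffle_partitions` are hypothetical, and the default multiplier of 2.5 is just the midpoint of the suggested range.

```python
import math


def recommended_shuffle_partitions(
    executors: int, cores_per_executor: int, multiplier: float = 2.5
) -> int:
    """Return multiplier x total executor cores, rounded up to a whole partition."""
    if executors < 1 or cores_per_executor < 1:
        raise ValueError("executors and cores_per_executor must be positive")
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return math.ceil(executors * cores_per_executor * multiplier)


def apply_shuffle_partitions(spark, partitions: int) -> None:
    """Set the value on an existing session (in a Fabric notebook, `spark` is predefined)."""
    spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```

With 8 executors of 8 cores each, multipliers of 2 and 3 bracket the range at 128 and 192 partitions. Treat the result as a starting point and confirm it against task durations in the Spark UI.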
Resource Profiles Quick Reference
Fabric provides predefined profiles that bundle optimized Spark settings:
| Profile | Best For | VOrder | Key Characteristics |
|---|---|---|---|
| writeHeavy | ETL, batch ingestion, streaming | Disabled | Default for new workspaces; optimized write throughput |
| readHeavyForSpark | Interactive Spark queries, analytics | Enabled | Optimized read paths for Spark workloads |
| readHeavyForPBI | Power BI dashboards, DW queries | Enabled | Optimized for DirectLake and cross-engine reads |
Apply a profile at the environment level or override per-session:
# Per-session override example
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")
Autotune Quick Start
Enable autotune to let Fabric automatically optimize shuffle partitions, broadcast thresholds, and partition bytes:
# Enable in a notebook session
spark.conf.set("spark.ms.autotune.enabled", "true")
# Or set in Environment > Spark Properties
# spark.ms.autotune.enabled = true
Requirements: Autotune is supported on Runtime 1.1 and 1.2 only, and it does not work with high concurrency mode or private endpoints. A query needs 20-25 iterations before autotune converges on optimal settings, so do not judge it on the first few runs.
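The requirements above can be written as a quick pre-flight check. This is a minimal sketch: `autotune_blockers` is a hypothetical helper, not a Fabric API, and the three inputs are values you look up yourself in the workspace and environment settings.

```python
from typing import List

# Runtimes named in the requirements above.
SUPPORTED_RUNTIMES = {"1.1", "1.2"}


def autotune_blockers(
    runtime: str, high_concurrency: bool, private_endpoint: bool
) -> List[str]:
    """Return the reasons autotune cannot activate; an empty list means no known blocker."""
    blockers = []
    if runtime not in SUPPORTED_RUNTIMES:
        blockers.append(f"runtime {runtime} is not 1.1 or 1.2")
    if high_concurrency:
        blockers.append("high concurrency mode is enabled")
    if private_endpoint:
        blockers.append("a private endpoint is in use")
    return blockers
```

An empty result only means none of the listed requirements is violated; autotune must still be enabled with `spark.ms.autotune.enabled`.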
Check autotune status after a query:
# View autotune decisions in Spark UI SQL tab
# Status values: QUERY_TUNING_SUCCEED, QUERY_TUNING_DISABLED,
# QUERY_PATTERN_NOT_MATCH, QUERY_DURATION_TOO_SHORT
Common Error Patterns
| Error / Symptom | Root Cause | Quick Fix |
|---|---|---|
| HTTP 430: TooManyRequestsForCapacity | All Spark VCores consumed | Cancel idle jobs in Monitoring Hub or upgrade SKU |
| Stage with 200 tasks, 1 task 100x slower | Data skew on join/group key | Add salting or use broadcast join |
| OOM on executor | Partition too large or broadcast too big | Increase partitions or lower broadcast threshold |
| Write takes >60% of total job time | Small files or missing optimization | Enable Optimized Write or run table maintenance |
| Streaming micro-batch latency increasing | Checkpoint overhead or partition mismatch | Tune trigger interval and Event Hub partitions |
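The salting fix for data skew works by appending a random suffix to the hot key, so its rows hash to several partitions instead of one. The sketch below has two parts, and every name in it is hypothetical: `partition_sizes` is a pure-Python simulation of the effect, using CRC32 as a stand-in for Spark's partitioner, and `salted_join` is one possible PySpark formulation that assumes an inner join and that neither input already has a `_salt` column.

```python
import random
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions, num_salts=1, seed=0):
    """Count rows per partition when each key gets a random salt in [0, num_salts)."""
    rng = random.Random(seed)
    sizes = Counter()
    for key in keys:
        salt = rng.randrange(num_salts)  # num_salts=1 means no salting
        bucket = zlib.crc32(f"{key}|{salt}".encode()) % num_partitions
        sizes[bucket] += 1
    return sizes


def salted_join(large_df, small_df, key, num_salts=8):
    """Inner-join a skewed large_df to small_df on `key` using num_salts salt buckets."""
    # Imported lazily so the simulation above runs without Spark installed.
    from pyspark.sql import functions as F

    # Each row of the large side gets one random salt in [0, num_salts).
    salted_large = large_df.withColumn(
        "_salt", F.floor(F.rand() * num_salts).cast("int")
    )
    # The small side is replicated once per salt value so every salted row finds its match.
    all_salts = F.array(*[F.lit(i) for i in range(num_salts)])
    salted_small = small_df.withColumn("_salt", F.explode(all_salts))
    return salted_large.join(salted_small, on=[key, "_salt"], how="inner").drop("_salt")
```

Salting multiplies the small side by `num_salts`, so it trades extra data volume for balance. If the small side fits under the broadcast threshold, a broadcast join avoids the shuffle entirely and is usually the simpler fix.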
Step-by-Step Workflows
For detailed procedures, see the reference guides:
- Spark Configuration Tuning — Shuffle, broadcast, skew, pool sizing, capacity planning
- Monitoring and Diagnostics — Spark UI navigation, Monitoring Hub, APIs, log analysis
- Delta Lake Optimization — VOrder, Optimized Write, table maintenance, partitioning, streaming
Available Scripts
Run the Fabric Spark diagnostics script to collect Spark application metrics via the Fabric REST API:
./scripts/Get-FabricSparkDiagnostics.ps1 -WorkspaceId "<guid>" -Token "<bearer-token>"
Available Templates
Use the performance analysis notebook template as a starting point for in-session diagnostics:
# Paste into a Fabric notebook to analyze current session performance
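The template body is not reproduced here. As a stand-in, the following is a minimal sketch of an in-session check that reads back the settings discussed in this guide. The helper names `snapshot_conf` and `format_snapshot` are hypothetical, and the sketch assumes the notebook's predefined `spark` session.

```python
# Settings covered in this guide; extend the tuple as needed.
TUNABLE_KEYS = (
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.files.maxPartitionBytes",
    "spark.ms.autotune.enabled",
)


def snapshot_conf(spark, keys=TUNABLE_KEYS):
    """Read each key from the session config, using '<unset>' when it has no value."""
    return {key: spark.conf.get(key, "<unset>") for key in keys}


def format_snapshot(snapshot):
    """Render the snapshot as aligned 'key  value' lines for printing."""
    width = max(len(key) for key in snapshot)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in snapshot.items())


# In a Fabric notebook cell:
# print(format_snapshot(snapshot_conf(spark)))
```

Capturing a snapshot before and after a tuning change gives a record of which values produced a given run time.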
Remediation
| Problem | Check | Resolution |
|---|---|---|
| Autotune not activating | Runtime version, HC mode, private endpoint | Switch to Runtime 1.1/1.2, disable HC mode |
| Resource profile not applying | Environment publish status | Republish environment after profile change |
| Pool autoscale not scaling up | Capacity SKU limits | Verify VCore headroom in Capacity Metrics app |
| Table maintenance job stuck | Concurrent maintenance on same table | Wait for previous job or cancel via API |
| Notebook cell shows no Spark Advisor output | Spark version below 3.4 | Upgrade to a runtime with Spark 3.4+ |