Microsoft Fabric Spark Compute Remediation

Structured diagnostic workflows for resolving Apache Spark compute issues in Microsoft Fabric Data Engineering and Data Science workloads.

When to Use This Skill

Spark jobs fail with HTTP 430 throttling or TooManyRequestsForCapacity errors
Notebook or Spark job sessions are slow to start (>30 seconds)
Environment publishing fails or hangs
Autoscale is not scaling up or down as expected
Jobs stay queued indefinitely or expire after 24 hours
Custom pool creation fails or pools are undersized
Library installation causes session startup delays
Capacity appears exhausted despite low job counts
VNet/Private Link provisioning adds unexpected delays
Burst factor or job-level bursting behavior is unclear

Prerequisites

Workspace Admin or Capacity Admin role in Microsoft Fabric
Access to the Monitoring Hub for active Spark sessions
Access to Workspace Settings > Data Engineering/Science
Knowledge of current Fabric capacity SKU (F2 through F2048)

Quick Diagnostic: Identify Your Issue

Start here. Match your symptom to a diagnostic path:

Symptom	Diagnostic Path
HTTP 430 error	See Throttling and Concurrency
Jobs stuck in queue	See Throttling and Concurrency
Slow session startup	See Session and Environment
Environment publish fails	See Session and Environment
Autoscale not working	See Pool Configuration
Pool sizing questions	See Pool Configuration
Library conflicts	See Session and Environment
VNet delay on first job	See Session and Environment

Core Concepts

Capacity Unit to VCore Mapping

Every Fabric capacity SKU provides Spark VCores at a fixed ratio with a 3x burst factor:

1 Capacity Unit = 2 Spark VCores

For an F64 SKU: 64 CU × 2 = 128 base VCores; applying the 3x burst factor gives 128 × 3 = 384 max Spark VCores.
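
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the stated ratio and burst factor; the function name `spark_vcore_limits` is hypothetical and is not part of any Fabric SDK or of the scripts shipped with this skill.

```python
from typing import Tuple

SPARK_VCORES_PER_CU = 2  # ratio stated above: 1 Capacity Unit = 2 Spark VCores
BURST_FACTOR = 3         # burst factor stated above


def spark_vcore_limits(capacity_units: int) -> Tuple[int, int]:
    """Return (base_vcores, max_burst_vcores) for a capacity of the given size."""
    if capacity_units <= 0:
        raise ValueError("capacity_units must be positive")
    base = capacity_units * SPARK_VCORES_PER_CU
    return base, base * BURST_FACTOR


if __name__ == "__main__":
    # Print the limits for a few SKU sizes (the number after "F" is the CU count).
    for cu in (2, 64, 2048):
        base, burst = spark_vcore_limits(cu)
        print(f"F{cu}: {base} base VCores, {burst} max VCores with burst")
```

For F64 this reproduces the 128 base and 384 burst VCores worked out above.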

Job Admission Model

Fabric Spark uses optimistic job admission: a job is admitted based on its minimum core requirement, which is determined by the pool's minimum node setting. An admitted job starts with the minimum nodes and scales up toward the maximum as cores become available. If the capacity cannot supply even the minimum requirement, the job is either rejected with HTTP 430 (interactive notebook sessions) or placed in the queue (pipeline and scheduler-triggered jobs).
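
As a rough mental model, the admission rule described above can be written as the following Python sketch. It is a simplification made for illustration: the `PoolConfig` class and both functions are hypothetical names, and the real scheduler considers more than this single comparison.

```python
from dataclasses import dataclass


@dataclass
class PoolConfig:
    """Hypothetical description of a pool, used only in this sketch."""
    min_nodes: int
    max_nodes: int
    vcores_per_node: int


def is_admitted(pool: PoolConfig, free_vcores: int) -> bool:
    """Admit the job if the pool's minimum core requirement fits in the free VCores."""
    return pool.min_nodes * pool.vcores_per_node <= free_vcores


def nodes_after_scale_up(pool: PoolConfig, free_vcores: int) -> int:
    """Nodes the job could reach: 0 if not admitted, otherwise capped by max_nodes and free cores."""
    if not is_admitted(pool, free_vcores):
        return 0
    affordable_nodes = free_vcores // pool.vcores_per_node
    return min(pool.max_nodes, affordable_nodes)


if __name__ == "__main__":
    medium_pool = PoolConfig(min_nodes=1, max_nodes=10, vcores_per_node=8)
    print(is_admitted(medium_pool, free_vcores=8))           # enough for the minimum node
    print(nodes_after_scale_up(medium_pool, free_vcores=40))  # limited by free cores
```

The point of the sketch is that lowering a pool's minimum node count makes admission easier, while the maximum only limits how far an admitted job can grow.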

Two Pool Types

Starter Pools: Pre-warmed, medium nodes only, 5-10 second startup, always available
Custom Pools: User-configured node sizes (Small through XX-Large), 2-5 minute cold start, full flexibility

Remediation Workflows

Workflow 1: HTTP 430 Throttling

Confirm the error: HTTP Response code 430: This Spark job can't be run because you have hit a Spark compute or API rate limit
Open the Monitoring Hub and count active Spark sessions
Calculate your capacity's max VCores: SKU CU × 2 × 3 (burst) = max VCores
Compare active usage against max VCores
Resolve by canceling idle sessions, upgrading SKU, or enabling job queueing for pipeline/scheduler jobs
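
The calculate-and-compare steps above can be sketched as a small Python helper. It assumes you have already read the VCores held by each active session from the Monitoring Hub and typed them in by hand; it calls no Fabric API, and the name `capacity_headroom` is hypothetical.

```python
from typing import Dict, List, Union


def capacity_headroom(sku_cu: int, session_vcores: List[int]) -> Dict[str, Union[int, bool]]:
    """Compare VCores held by active sessions against the capacity's burst limit."""
    max_vcores = sku_cu * 2 * 3  # SKU CU x 2 VCores per CU x 3 burst factor
    used_vcores = sum(session_vcores)
    return {
        "max_vcores": max_vcores,
        "used_vcores": used_vcores,
        "free_vcores": max(max_vcores - used_vcores, 0),
        "exhausted": used_vcores >= max_vcores,
    }


if __name__ == "__main__":
    # Example: an F64 capacity with three sessions holding 128 VCores each.
    report = capacity_headroom(64, [128, 128, 128])
    print(report)
```

If `exhausted` is true, new interactive sessions are likely to hit HTTP 430 until a session is canceled or the SKU is upgraded.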

See throttling-and-concurrency.md for the full SKU limits table and queue configuration.

Workflow 2: Slow Session Startup

Determine pool type (Starter vs Custom)
If Starter Pool with no custom libraries: expect 5-10 seconds; if slower, check capacity utilization
If custom libraries or Spark properties are attached via environment: expect 30 seconds to 5 minutes
If using non-Medium node size: Starter Pool fast-start is unavailable; expect 2-5 minutes (on-demand)
If Private Link is enabled and this is the first job: expect 10-15 minute VNet provisioning delay
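
The expected ranges in the steps above can be encoded as a lookup to compare against a measured startup time. This is a sketch under assumptions: the function and parameter names are hypothetical, and the precedence (Private Link first, then pool or node size, then environment customizations) is a choice made here so that the longest applicable range wins; the source does not state how the conditions combine.

```python
from typing import Tuple


def expected_startup_seconds(
    pool_type: str,
    node_size: str = "Medium",
    has_environment_customizations: bool = False,
    private_link_first_job: bool = False,
) -> Tuple[int, int]:
    """Return the (low, high) expected session startup time in seconds."""
    if pool_type not in ("starter", "custom"):
        raise ValueError("pool_type must be 'starter' or 'custom'")
    if private_link_first_job:
        return (600, 900)   # 10-15 minute VNet provisioning delay
    if pool_type == "custom" or node_size != "Medium":
        return (120, 300)   # 2-5 minutes, on-demand start
    if has_environment_customizations:
        return (30, 300)    # 30 seconds to 5 minutes
    return (5, 10)          # Starter Pool fast start


def is_slower_than_expected(measured_seconds: float, **session_traits) -> bool:
    """True when the measured startup exceeds the upper bound for this session type."""
    return measured_seconds > expected_startup_seconds(**session_traits)[1]


if __name__ == "__main__":
    print(expected_startup_seconds("starter"))
    print(is_slower_than_expected(45, pool_type="starter"))
```

A result of true for a plain Starter Pool session is the cue to check capacity utilization, as the second step describes.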

See session-and-environment.md for detailed diagnosis.

Workflow 3: Environment Publishing Failure

Check if another Publish action is already in progress (only one at a time)
Verify library compatibility with the selected Spark runtime version
If runtime was recently changed, remove incompatible libraries and republish
If Private Link is enabled, the first publish may trigger VNet provisioning (10-15 min delay)
Review the error notification for specific failure details

See session-and-environment.md for resolution steps.

Available Scripts

Run the Spark capacity calculator to quickly determine VCore limits, max nodes, and queue limits for any Fabric SKU.

# Calculate capacity for F64 SKU
./scripts/Get-FabricSparkCapacity.ps1 -SkuSize 64

# Compare multiple SKUs
./scripts/Get-FabricSparkCapacity.ps1 -SkuSize 64 -CompareWith 128,256

Key Decision Points

When to Use Starter Pools vs Custom Pools

Use Starter Pools when: you need fast startup (<10s), workloads fit Medium nodes (8 VCores, 64 GB), and you have no heavy library dependencies.

Use Custom Pools when: you need Large/X-Large/XX-Large nodes for memory-intensive workloads, you need precise control over min/max node counts, or you need to limit autoscale behavior.
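
The two rules above reduce to a short decision function. The sketch below is illustrative only: `recommend_pool` and its parameters are hypothetical names, and it treats any one Custom Pool trigger as decisive, which is an assumption rather than official guidance.

```python
def recommend_pool(
    required_node_size: str = "Medium",
    heavy_library_dependencies: bool = False,
    needs_node_count_control: bool = False,
    needs_autoscale_limit: bool = False,
) -> str:
    """Return 'starter' or 'custom' based on the criteria listed above."""
    needs_custom = (
        required_node_size != "Medium"   # Starter Pools offer Medium nodes only
        or heavy_library_dependencies    # heavy libraries remove the fast-start benefit
        or needs_node_count_control      # precise min/max node counts
        or needs_autoscale_limit         # capped autoscale behavior
    )
    return "custom" if needs_custom else "starter"


if __name__ == "__main__":
    print(recommend_pool())                               # fits a Starter Pool
    print(recommend_pool(required_node_size="XX-Large"))  # memory-intensive workload
```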

When to Enable Job-Level Bursting

Leave enabled (the default) when: you run large batch jobs that benefit from consuming all available burst VCores, and concurrency is low.

Disable when: you have a multi-tenant environment with many concurrent users and fairness across teams matters more than single-job throughput.

Admin Portal path: Capacity Settings > Data Engineering/Science > Disable Job-Level Bursting toggle.

References

Throttling and Concurrency Guide — SKU limits, queue sizes, HTTP 430 resolution
Session and Environment Guide — Startup delays, publishing, libraries, VNet
Pool Configuration Guide — Node sizing, autoscale, custom pool setup, billing
