Infrastructure Verification Skill
Tech Stack: AWS CLI, Terraform, VPC, CloudWatch, bash
Source: Extracted from PDF S3 upload timeout investigation (2026-01-05) and Infrastructure-Application Contract principle.
When to Use This Skill
Use the infrastructure-verification skill when:
- ✓ Before deploying Lambda-in-VPC code
- ✓ Investigating Lambda connection timeouts
- ✓ Debugging deterministic failure patterns (first N succeed, last M fail)
- ✓ Validating network path to AWS services (S3, DynamoDB, RDS)
- ✓ After adding VPC endpoints
- ✓ Before running Lambda functions at high concurrency
DO NOT use this skill for:
- ✗ Application code debugging (use error-investigation)
- ✗ Performance optimization (different focus)
- ✗ IAM permission issues (use AWS CLI directly)
Core Verification Principles
Principle 1: Infrastructure Dependency Validation
From CLAUDE.md Principle #15:
"Before deploying code that depends on AWS infrastructure (S3, VPC endpoints, NAT Gateway), verify infrastructure exists and is correctly configured. Network path issues cause deterministic failure patterns."
When to validate:
- Before deploying Lambda functions that make AWS service calls
- After Terraform infrastructure changes
- When investigating Lambda timeout patterns
- Before increasing concurrency limits
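As a minimal sketch of this principle, a deploy script could gate on the S3 endpoint state before shipping code. The function name require_s3_endpoint and its exit codes are hypothetical; only the aws ec2 describe-vpc-endpoints call comes from the workflows in this skill. It assumes the AWS CLI prints None for a missing value in text output, and it compares the state case-insensitively.

```shell
#!/usr/bin/env bash
# require_s3_endpoint: hypothetical pre-deploy gate (not part of the original skill).
# $1 = VPC ID, $2 = region
# Returns 0 if an S3 VPC endpoint in that VPC is "available",
# 1 if it is missing or in another state, 2 if the AWS CLI call itself fails.
require_s3_endpoint() {
  local vpc_id="$1" region="$2" state
  state=$(aws ec2 describe-vpc-endpoints \
    --region "$region" \
    --filters "Name=vpc-id,Values=${vpc_id}" \
              "Name=service-name,Values=com.amazonaws.${region}.s3" \
    --query 'VpcEndpoints[0].State' \
    --output text) || return 2
  # Normalize case so "Available" and "available" are treated the same.
  state=$(printf '%s' "$state" | tr '[:upper:]' '[:lower:]')
  if [ "$state" = "available" ]; then
    echo "OK: S3 endpoint available in ${vpc_id}"
  else
    # "none" here means no endpoint matched the filters.
    echo "BLOCK: S3 endpoint state is '${state}' in ${vpc_id}" >&2
    return 1
  fi
}

# Usage (hypothetical deploy script):
#   require_s3_endpoint vpc-xxx ap-southeast-1 || exit 1
```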
Principle 2: Pattern Recognition
Failure Pattern Types:
| Pattern | Root Cause | Investigation Priority |
|---|---|---|
| First N succeed, last M fail | Infrastructure bottleneck (NAT, connection limits) | HIGH - VPC endpoint missing |
| Random scattered failures | Performance issue (slow API, memory) | MEDIUM - Optimize code |
| All operations fail | Configuration issue (permissions, endpoint) | HIGH - Fix config |
| Intermittent failures | Rate limiting, transient network | LOW - Add retries |
A deterministic pattern (first N succeed, last M fail) is the strongest signal of an infrastructure bottleneck.
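The pattern table can be applied mechanically to one batch of invocations. The sketch below is illustrative only: classify_failure_pattern and its S/F input encoding are hypothetical, and a single batch cannot separate "random scattered" from "intermittent" failures, so those two rows share one label.

```shell
#!/usr/bin/env bash
# classify_failure_pattern: hypothetical helper (not part of the original skill).
# $1 = per-invocation outcomes in start order, S = success, F = failure,
#      for example "SSSSSFFF".
classify_failure_pattern() {
  local outcomes="$1"
  if [[ -z "$outcomes" || "$outcomes" =~ [^SF] ]]; then
    echo "invalid-input"
  elif [[ "$outcomes" =~ ^S+$ ]]; then
    echo "no-failures"
  elif [[ "$outcomes" =~ ^F+$ ]]; then
    echo "all-fail"                      # configuration issue row
  elif [[ "$outcomes" =~ ^S+F+$ ]]; then
    echo "first-n-succeed-last-m-fail"   # infrastructure bottleneck row
  else
    echo "scattered-or-intermittent"     # needs more runs to tell apart
  fi
}

# Usage:
#   classify_failure_pattern "SSSSSFFF"
```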
Verification Workflows
Workflow 1: VPC Endpoint Verification
Use when: Lambda-in-VPC needs to access S3 or DynamoDB
Steps:
# 1. Check if VPC endpoint exists
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-xxx" \
"Name=service-name,Values=com.amazonaws.ap-southeast-1.s3" \
--query 'VpcEndpoints[*].{ID:VpcEndpointId,State:State,Service:ServiceName}' \
--output table
# Expected output (if endpoint exists):
# -----------------------------------------
# | DescribeVpcEndpoints |
# +-------+-------+------------------------+
# | ID | State | Service |
# +-------+-------+------------------------+
# | vpce-xxx | available | com.amazonaws.ap-southeast-1.s3 |
# +-------+-------+------------------------+
# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)
# 2. Verify endpoint state
aws ec2 describe-vpc-endpoints \
--vpc-endpoint-ids vpce-xxx \
--query 'VpcEndpoints[0].State' \
--output text
# Expected: "available"
# If "pending" → Wait for creation
# If "failed" → Check Terraform logs
# 3. Verify route table attachment
aws ec2 describe-vpc-endpoints \
--vpc-endpoint-ids vpce-xxx \
--query 'VpcEndpoints[0].RouteTableIds' \
--output table
# Expected: List of route table IDs (must include Lambda subnet route tables)
# 4. Check Lambda subnet route tables
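aws lambda get-function-configuration \
--function-name my-function \
--query 'VpcConfig.SubnetIds' \
--output text | tr '\t' '\n' | xargs -I {} aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values={}" \
--query 'RouteTables[*].RouteTableId' \
--output text
# Empty output for a subnet means it has no explicit association and uses the VPC's main route table
# Compare: Lambda subnets' route tables should be in VPC endpoint's RouteTableIds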
# 5. Verify S3 prefix list in route tables
# Note: RouteTables[0] is only the first route table in the VPC; repeat with each Lambda subnet's route table ID
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-xxx" \
--query 'RouteTables[0].RouteTableId' \
--output text)
aws ec2 describe-route-tables \
--route-table-ids "$ROUTE_TABLE_ID" \
--query 'RouteTables[*].Routes[?GatewayId==`vpce-xxx`]'
# Expected: Route with DestinationPrefixListId (S3 prefix list)
Verification checklist:
- VPC endpoint exists (describe-vpc-endpoints returns a result)
- State is "available" (not "pending" or "failed")
- Route tables attached (includes Lambda subnet route tables)
- S3 prefix list routes created (check route tables)
Common issues:
- Missing VPC endpoint → Create with Terraform
- State "pending" → Wait 2-3 minutes
- Route tables not attached → Update Terraform route_table_ids
- Lambda subnets not covered → Verify subnet route table IDs
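The comparison in step 4 (Lambda subnet route tables versus the endpoint's RouteTableIds) is easy to get wrong by eye. A small sketch of that check follows; uncovered_route_tables is a hypothetical helper that takes the two ID lists as whitespace-separated strings, such as the text output of the commands above.

```shell
#!/usr/bin/env bash
# uncovered_route_tables: hypothetical helper (not part of the original skill).
# $1 = whitespace-separated route table IDs used by the Lambda subnets
# $2 = whitespace-separated RouteTableIds reported by the VPC endpoint
# Prints each Lambda route table that the endpoint does not cover, one per line.
# Returns 1 if any route table is uncovered, 0 otherwise.
uncovered_route_tables() {
  local lambda_rtbs="$1" endpoint_rtbs="$2" rtb missing=0
  for rtb in $lambda_rtbs; do
    # Word splitting inside $(echo ...) collapses tabs and newlines to single spaces.
    case " $(echo $endpoint_rtbs) " in
      *" $rtb "*) ;;                      # covered by the endpoint
      *) echo "$rtb"; missing=1 ;;        # not covered
    esac
  done
  return "$missing"
}

# Usage (IDs are placeholders):
#   uncovered_route_tables "rtb-aaa rtb-bbb" "rtb-aaa" || echo "endpoint misses some subnets"
```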
Workflow 2: NAT Gateway Diagnosis
Use when: Investigating Lambda connection timeouts with external services
Steps:
# 1. Check NAT Gateway exists
aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=vpc-xxx" \
--query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIp:NatGatewayAddresses[0].PublicIp}' \
--output table
# Expected: State "available"
# 2. Check route tables using NAT Gateway
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-xxx" \
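--query 'RouteTables[].{RouteTableId:RouteTableId,NatRoutes:Routes[?NatGatewayId!=`null`].[DestinationCidrBlock,NatGatewayId]}' \
--output json
# RouteTableId belongs to the route table, not to each route, so it is selected at the table level
# Expected: Route 0.0.0.0/0 → nat-xxx (default route through NAT)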
# 3. Analyze connection saturation pattern
# Run this during concurrent Lambda executions
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--start-time $(date -d '5 minutes ago' +%s)000 \
--filter-pattern "START RequestId" \
--query 'events[*].timestamp' \
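--output text | tr '\t' '\n' | awk '{print int($1/1000)}' | xargs -I {} date -d @{}
# Timestamps are epoch milliseconds; they are converted to seconds for GNU date
# Check execution pattern:
# - All start within 1 second → Concurrent execution
# - Some time out after 600s → Likely NAT Gateway saturation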
# 4. Check for connection timeout errors
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "ConnectTimeoutError" \
--query 'events[*].message' \
--output text
# If errors found → Connections are not being established; NAT Gateway saturation is a likely cause
# 5. Calculate concurrent connection demand
CONCURRENT_LAMBDAS=$(aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--start-time $(date -d '1 minute ago' +%s)000 \
--filter-pattern "START RequestId" \
--query 'length(events)' \
--output text)
echo "Concurrent Lambdas: $CONCURRENT_LAMBDAS"
echo "NAT Gateway connection limit: ~55,000 (but establishment rate limited)"
NAT Gateway saturation indicators:
- ✅ Deterministic pattern (first N succeed, last M fail)
- ✅ ConnectTimeoutError in logs
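- ✅ Long execution times (600s in the source investigation; botocore's default connect and read timeouts are 60s each, so a 600s duration reflects retries, a custom timeout, or the function's configured timeout)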
- ✅ Timeline shows concurrent starts → split success/failure
Solution: Add VPC Gateway Endpoint for S3/DynamoDB to bypass NAT
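To make the "concurrent starts" indicator less subjective, the START timestamps from step 3 can be reduced to a single spread figure. This is a sketch under assumptions: start_spread_ms is a hypothetical helper that reads epoch-millisecond timestamps on stdin (tab, space, or newline separated, as filter-log-events prints them with text output).

```shell
#!/usr/bin/env bash
# start_spread_ms: hypothetical helper (not part of the original skill).
# Reads epoch-millisecond timestamps from stdin and prints how many
# invocations started and the gap between the earliest and latest start.
start_spread_ms() {
  tr -s '[:space:]' '\n' | awk '
    NF {
      if (n == 0 || $1 < min) min = $1
      if (n == 0 || $1 > max) max = $1
      n++
    }
    END {
      if (n == 0) { print "no-data"; exit }
      printf "%d starts within %d ms\n", n, max - min
    }'
}

# Usage (same query as step 3, piped into the helper):
#   aws logs filter-log-events ... --query 'events[*].timestamp' --output text | start_spread_ms
```

A spread of roughly one second or less across many starts matches the "All start within 1 second" case in step 3.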
Workflow 3: Network Path Validation
Use when: Verifying Lambda can reach AWS services
Steps:
# 1. Identify Lambda VPC configuration
aws lambda get-function-configuration \
--function-name my-function \
--query 'VpcConfig.{VpcId:VpcId,SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds}' \
--output json
# Save VPC ID, Subnet IDs, Security Group IDs
# 2. Check security group egress rules
aws ec2 describe-security-groups \
--group-ids sg-xxx \
--query 'SecurityGroups[*].IpPermissionsEgress[*].{Proto:IpProtocol,Port:FromPort,Dest:IpRanges[0].CidrIp}' \
--output table
# Expected: 0.0.0.0/0 allowed (all egress)
# If restricted → Add rule for destination service
# 3. Check route table for Lambda subnet
SUBNET_ID=$(aws lambda get-function-configuration \
--function-name my-function \
--query 'VpcConfig.SubnetIds[0]' \
--output text)
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=$SUBNET_ID" \
--query 'RouteTables[0].RouteTableId' \
--output text)
aws ec2 describe-route-tables \
--route-table-ids "$ROUTE_TABLE_ID" \
--query 'RouteTables[*].Routes[*].[DestinationCidrBlock,DestinationPrefixListId,GatewayId,NatGatewayId]' \
--output table
# Expected routes:
# - local → vpc-xxx (VPC internal)
# - 0.0.