CloudWatch Log Debugger
Query, filter, and analyze AWS CloudWatch logs for production debugging. Auto-configures to any AWS environment.
Current State
- Current timestamp (epoch seconds): !`date +%s`
- Current time (human-readable): !`date '+%Y-%m-%d %H:%M:%S %Z'`
First-Time Setup
If config.json does not exist in this skill's directory, tell the user:
This skill needs to discover your AWS infrastructure first. Run /cloudwatch configure or let me auto-configure now.
Then read and follow the instructions in scripts/configure.sh to generate config.json.
Configuration
Read config.json from this skill's directory for all environment-specific values. The config contains:
- aws_cli: path to the AWS CLI binary (e.g., aws or /snap/bin/aws)
- region: AWS region
- log_groups: discovered log groups with their purpose and stream prefixes
- default_log_group: which log group to query when the user doesn't specify
- ecs_clusters: ECS clusters, if any
- alarms: CloudWatch alarms, if any
- output_dir: where to save log files (default: logs/)
Use these values in all commands instead of hardcoded strings.
Command Dispatch
Parse $ARGUMENTS to determine which command to run:
| If $ARGUMENTS starts with... | Action |
|---|---|
| configure | Run configuration (see First-Time Setup) |
| status | Jump to Status Check below |
| report | Jump to Report below (remaining args = time range) |
| alarms | Jump to Alarms below |
| diff | Jump to Error Rate Comparison below (remaining args = time windows) |
| anything else | Jump to Workflow below (reactive debugging) |
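The dispatch table can be sketched as a shell case statement. This is only an illustration, assuming the raw argument string arrives as a single parameter; the function name dispatch_command and the printed section labels are hypothetical and not part of the skill.

```shell
# Hypothetical helper: map the first word of the arguments to a section label.
# Prints the section on line 1 and the remaining arguments on line 2.
dispatch_command() {
  local args="$1"
  local first="${args%% *}"      # first space-delimited word
  local rest=""
  if [ "$first" != "$args" ]; then
    rest="${args#* }"            # everything after the first space
  fi

  case "$first" in
    configure) echo "configure" ;;
    status)    echo "status" ;;
    report)    echo "report" ;;
    alarms)    echo "alarms" ;;
    diff)      echo "diff" ;;
    *)         echo "workflow"; rest="$args" ;;  # reactive debugging keeps the full text
  esac
  echo "$rest"
}
```

For example, `dispatch_command "report last 6h"` prints `report` followed by `last 6h`, while any unrecognized first word falls through to the reactive debugging workflow with the whole request intact.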
Status Check
Quick health dashboard. No arguments needed.
Read config.json, then run these queries:
1. Error Count (last 30 min)
Run a Logs Insights query against app log groups (priority <= 2). Use --log-group-names to batch:
QUERY_ID=$($AWS_CLI logs start-query \
--log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
--start-time $(date -d '30 minutes ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count by @logStream' \
--region $REGION --output text --query 'queryId')
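Then sleep 3 and call get-query-results. If the returned status is still Scheduled or Running, wait and fetch again until it is Complete.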
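A fixed three-second wait may not be enough for larger log groups, so the wait step can be written as a polling loop. The sketch below assumes AWS_CLI and REGION are already set from config.json; the function name wait_for_query and its attempt and delay parameters are hypothetical.

```shell
# Poll a Logs Insights query until it reaches a terminal status.
# Arguments: query ID, max attempts (default 20), delay in seconds (default 3).
# Assumes AWS_CLI and REGION were loaded from config.json.
wait_for_query() {
  local query_id="$1" max_attempts="${2:-20}" delay="${3:-3}"
  local attempt=0 status

  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$($AWS_CLI logs get-query-results \
      --query-id "$query_id" \
      --region "$REGION" --output text --query 'status')
    case "$status" in
      Complete) return 0 ;;   # results are ready to fetch
      Failed|Cancelled|Timeout)
        echo "query ended with status: $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  echo "query still running after $max_attempts attempts" >&2
  return 1
}
```

Once `wait_for_query "$QUERY_ID"` returns successfully, a second get-query-results call without the `--query 'status'` filter retrieves the rows.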
2. Alarm States (live)
Fetch current alarm states from the API — do NOT use cached values from config:
$AWS_CLI cloudwatch describe-alarms \
--region $REGION --output json \
--query 'MetricAlarms[].{name:AlarmName,state:StateValue,metric:MetricName,namespace:Namespace,threshold:Threshold}'
3. ECS Service Health
For each cluster/service listed under ecs_clusters in config.json:
$AWS_CLI ecs describe-services \
--cluster $CLUSTER \
--services $SERVICE_ARN \
--region $REGION --output json \
--query 'services[].{name:serviceName,desired:desiredCount,running:runningCount,pending:pendingCount}'
4. Recently Stopped Tasks
$AWS_CLI ecs list-tasks --cluster $CLUSTER --desired-status STOPPED --region $REGION --output json
If any stopped tasks exist, describe them for crash reasons:
$AWS_CLI ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ARNS \
--region $REGION --output json \
--query 'tasks[].{taskArn:taskArn,stoppedReason:stoppedReason,stopCode:stopCode,stoppedAt:stoppedAt,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'
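The two calls above can be chained so that describe-tasks runs only when list-tasks returns something. This is a sketch under the assumption that AWS_CLI and REGION are set from config.json; the function name describe_stopped_tasks is hypothetical, and the query is shortened for readability.

```shell
# List stopped tasks in a cluster and describe them only if any exist.
# Assumes AWS_CLI and REGION were loaded from config.json.
describe_stopped_tasks() {
  local cluster="$1" task_arns
  task_arns=$($AWS_CLI ecs list-tasks --cluster "$cluster" \
    --desired-status STOPPED --region "$REGION" \
    --output text --query 'taskArns[]')

  # Treat both an empty string and the literal "None" as "no tasks",
  # since text output may render an empty result either way.
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "no stopped tasks in $cluster"
    return 0
  fi

  # ARNs are whitespace-separated, so $task_arns is left unquoted to split them.
  $AWS_CLI ecs describe-tasks --cluster "$cluster" --tasks $task_arns \
    --region "$REGION" --output json \
    --query 'tasks[].{taskArn:taskArn,stoppedReason:stoppedReason,stopCode:stopCode}'
}
```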
5. CPU/Memory Utilization
$AWS_CLI cloudwatch get-metric-statistics \
--namespace AWS/ECS --metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=$CLUSTER \
--start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 --statistics Average Maximum \
--region $REGION --output json
Same for MemoryUtilization.
Output Format
Present as a dashboard summary:
## System Status (as of YYYY-MM-DD HH:MM:SS)
### Errors (last 30 min)
- app-backend: 12 errors
- app-frontend: 0 errors
### Alarms
- OK: my-app-ECS-CPU-High (CPUUtilization < 80)
- **ALARM: my-app-ApplicationErrors-High** (ErrorCount > 50)
### ECS Services
- my-app-web: 2/2 running, 0 pending
- my-app-worker: 1/1 running, 0 pending
### Resource Utilization (30-min avg)
- CPU: 45% avg, 62% max
- Memory: 71% avg, 78% max
Save to $OUTPUT_DIR/YYYYMMDD_HHMMSS_status.txt.
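The timestamped file name can be generated with date. The helper below is a minimal sketch: save_output is a hypothetical name, and it assumes OUTPUT_DIR holds the output_dir value from config.json, falling back to logs/ when unset. The same helper would work for the report and alarms files by changing the suffix argument.

```shell
# Hypothetical helper: write stdin to $OUTPUT_DIR/YYYYMMDD_HHMMSS_<kind>.txt
# and print the path that was written. Falls back to logs/ if OUTPUT_DIR is unset.
save_output() {
  local kind="$1"
  local dir="${OUTPUT_DIR:-logs}"
  local path
  dir="${dir%/}"                 # drop a trailing slash so paths stay tidy
  mkdir -p "$dir"
  path="$dir/$(date '+%Y%m%d_%H%M%S')_${kind}.txt"
  cat > "$path"
  echo "$path"
}
```

Usage: pipe the rendered dashboard into `save_output status` and report the printed path back to the user.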
Report
Periodic summary over a configurable time range. Parse the time range from the remaining arguments after report (e.g., last 24 hours, last 6h, today). Default: last 1 hour.
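One way to turn the time-range phrase into something usable for --start-time is to convert it to a number of seconds and subtract that from the current epoch time. The sketch below is an assumption about how the parsing could work, not the skill's defined behavior: range_to_seconds is a hypothetical name, it handles only the three example forms plus the one-hour default, and the today branch relies on GNU date.

```shell
# Hypothetical helper: convert a time-range phrase into a number of seconds.
# Handles "last N <unit>" (minutes, hours, days), "today", and an empty
# argument, which falls back to the default of 1 hour.
range_to_seconds() {
  local phrase="${1:-last 1 hour}"
  local n unit

  if [ "$phrase" = "today" ]; then
    # Seconds elapsed since local midnight (GNU date syntax).
    echo $(( $(date +%s) - $(date -d 'today 00:00' +%s) ))
    return 0
  fi

  phrase="${phrase#last }"       # "last 6h" -> "6h"
  n="${phrase%%[!0-9]*}"         # leading digits, e.g. "6"
  unit="${phrase#"$n"}"          # remainder, e.g. "h" or " hours"
  unit="${unit# }"
  [ -n "$n" ] || n=1             # "last hour" means one hour

  case "$unit" in
    m|min|mins|minute|minutes) echo $(( n * 60 )) ;;
    h|hr|hrs|hour|hours)       echo $(( n * 3600 )) ;;
    d|day|days)                echo $(( n * 86400 )) ;;
    *) echo "unrecognized time range: $1" >&2; return 1 ;;
  esac
}
```

The start time for a query is then `$(( $(date +%s) - $(range_to_seconds "last 6h") ))`.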
Run these Logs Insights queries against app log groups:
1. Top Errors
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"message": "*"' as error_msg
| stats count() as occurrences by error_msg
| sort occurrences desc
| limit 10
2. Error Trend
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
| sort @timestamp asc
For time ranges > 6 hours, use bin(30m) instead of bin(5m).
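The bin-width rule can be applied when building the query string. A minimal sketch, assuming the time range is already known in seconds; bin_for_range and error_trend_query are hypothetical names.

```shell
# Pick the Logs Insights bin width for a time range given in seconds:
# ranges longer than 6 hours use 30-minute bins, everything else 5-minute bins.
bin_for_range() {
  local range_seconds="$1"
  if [ "$range_seconds" -gt $((6 * 3600)) ]; then
    echo "30m"
  else
    echo "5m"
  fi
}

# Build the Error Trend query on one line with the chosen bin width.
error_trend_query() {
  local bin
  bin=$(bin_for_range "$1")
  printf 'fields @timestamp, @message | filter @message like /ERROR/ | stats count() as errors by bin(%s) | sort @timestamp asc' "$bin"
}
```

The result of `error_trend_query 86400` can be passed directly to --query-string.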
3. P95 Latency
fields @timestamp, @message
| filter @message like /request completed|duration/
| parse @message '"duration": *,' as duration_ms
| stats avg(duration_ms) as avg_ms, max(duration_ms) as max_ms, pct(duration_ms, 95) as p95_ms by bin(5m)
| sort @timestamp asc
4. Most Affected Endpoints
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"path": "*"' as endpoint
| stats count() as errors by endpoint
| sort errors desc
| limit 10
Output Format
## Report: Last 1 Hour (HH:MM - HH:MM)
### Top Errors
| # | Error | Count |
|---|-------|-------|
| 1 | ConnectionRefused: DB pool exhausted | 23 |
| 2 | TokenExpiredError | 8 |
### Error Trend (5-min bins)
HH:00 ██████████ 23
HH:05 ████ 8
HH:10 ██ 4
...
### Latency
- Average: 120ms
- P95: 450ms
- Max: 2300ms
### Most Affected Endpoints
| Endpoint | Errors |
|----------|--------|
| /api/auth/callback | 15 |
| /api/users/profile | 8 |
Save to $OUTPUT_DIR/YYYYMMDD_HHMMSS_report.txt.
Alarms
List all CloudWatch alarms with their current state.
1. Fetch Live Alarm Data
$AWS_CLI cloudwatch describe-alarms \
--region $REGION --output json
2. Present Grouped by State
Group alarms by state. Show ALARM state first (highlighted), then OK, then INSUFFICIENT_DATA.
For each alarm, show:
- Alarm name
- Metric and namespace
- Threshold and comparison operator
- Evaluation periods and period length
- State reason (for alarms not in OK state)
3. Map to Log Groups
Map alarm namespaces to log group categories for investigation suggestions:
- AWS/ApplicationELB → ecs-app → suggest /cloudwatch 500 errors
- AWS/ECS → container-insights → suggest /cloudwatch ECS task crashes
- AWS/RDS → rds → suggest /cloudwatch database errors
- AWS/Lambda → lambda → suggest /cloudwatch lambda errors
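The namespace mapping above is a straightforward lookup. A sketch, with suggest_for_namespace as a hypothetical name and the output format ("category: suggestion") chosen only for illustration:

```shell
# Map an alarm namespace to its log group category and suggested follow-up.
# Returns non-zero for namespaces that have no mapping.
suggest_for_namespace() {
  case "$1" in
    AWS/ApplicationELB) echo "ecs-app: /cloudwatch 500 errors" ;;
    AWS/ECS)            echo "container-insights: /cloudwatch ECS task crashes" ;;
    AWS/RDS)            echo "rds: /cloudwatch database errors" ;;
    AWS/Lambda)         echo "lambda: /cloudwatch lambda errors" ;;
    *)                  return 1 ;;
  esac
}
```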
Output Format
## CloudWatch Alarms
### ALARM (1)
- **my-app-ApplicationErrors-High**
Metric: AWS/ApplicationELB > ErrorCount
Condition: ErrorCount > 50 for 1 period(s) of 300s
Reason: Threshold crossed...
→ Investigate: /cloudwatch 500 errors in the last hour
### OK (2)
- my-app-ECS-CPU-High
Metric: AWS/ECS > CPUUtilization
Condition: CPUUtilization > 80 for 2 period(s) of 300s
### INSUFFICIENT_DATA (0)
None.
Save to $OUTPUT_DIR/YYYYMMDD_HHMMSS_alarms.txt.
Error Rate Comparison
Compare error rates between two time windows to detect regressions or confirm fixes.
1. Parse Time Windows
From the remaining arguments after diff. Defaults:
- Window A (current): last 30 minutes
- Window B (baseline): 30–60 minutes ago
Support natural language like:
- last 1h vs yesterday same time
- last 30m vs 2h ago
- post-deploy vs pre-deploy (user should provide timestamps)
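The default windows reduce to simple epoch arithmetic. The sketch below covers only the defaults, not natural-language parsing; default_windows and error_change_pct are hypothetical names, and reporting the difference as an integer percentage is an assumption about how the comparison might be presented.

```shell
# Print the default comparison windows as epoch seconds, one per line:
#   line 1: A_START A_END   (last 30 minutes)
#   line 2: B_START B_END   (30 to 60 minutes ago)
# An explicit "now" can be passed in; otherwise the current time is used.
default_windows() {
  local now="${1:-$(date +%s)}"
  echo "$((now - 1800)) $now"
  echo "$((now - 3600)) $((now - 1800))"
}

# Integer percentage change in error count from baseline (B) to current (A).
# Prints "n/a" when the baseline is zero, since the ratio is undefined.
error_change_pct() {
  local current="$1" baseline="$2"
  if [ "$baseline" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( (current - baseline) * 100 / baseline ))
  fi
}
```

Each pair of values feeds the --start-time and --end-time options of one start-query call, so the two windows share the same query string and differ only in their time bounds.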
2. Run Error Count for Both Windows
Use --log-group-names to batch app log groups in