Server Log Analysis

Analyzes local server log files to break down how crawl budget is spent. Requires no MCP or external calls.

Inputs

file: Absolute path to the server log file (Apache Combined, Apache Common, or Nginx access log). If the user provides a relative path, resolve it with Bash: realpath <path>.
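
The path resolution can be sketched as a small shell helper. This is a minimal sketch, assuming a `realpath` command is installed (GNU coreutils ships one); the function name `resolve_log_path` is hypothetical and not part of the skill.

```shell
# resolve_log_path is a hypothetical helper name used only for illustration.
resolve_log_path() {
  # realpath prints the absolute form of the path given as $1.
  resolved=$(realpath "$1") || return 1
  # Reject anything that is not a readable regular file.
  if [ ! -f "$resolved" ] || [ ! -r "$resolved" ]; then
    return 1
  fi
  printf '%s\n' "$resolved"
}
```

A call such as `file=$(resolve_log_path logs/access.log) || echo "log file not found"` would then hand an absolute path to the later steps.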

Execution

Step 1: Format Detection

Read the first 10 lines of the log file to detect format:

Apache Combined: %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i" — 9+ fields, has referer and UA in quotes
Apache Common: %h %l %u %t "%r" %>s %b — 7 fields, no referer/UA
Nginx: the default combined log_format matches Apache Combined field for field; a custom log_format may differ
Check for compressed files (.gz) — if detected, inform user to decompress first
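
One way to sketch the detection is to count quote-delimited fields in the first 10 lines: a Combined line has three quoted fields (request, referer, UA), a Common line has one. The helper name `detect_log_format` and the labels it prints are assumptions made for this example.

```shell
# detect_log_format is a hypothetical helper; the labels it prints are illustrative.
detect_log_format() {
  # Compressed logs are reported without being read.
  case "$1" in
    *.gz) echo "compressed"; return 0 ;;
  esac
  # Splitting on double quotes yields 7 pieces for Combined and 3 for Common.
  head -n 10 "$1" | awk -F'"' '
    NF >= 7 { combined++ }   # three quoted fields: request, referer, user agent
    NF == 3 { common++ }     # one quoted field: request only
    END {
      if (combined > 0 && combined >= common) print "combined"
      else if (common > 0) print "common"
      else print "unknown"
    }'
}
```

Because the default Nginx format is laid out like Apache Combined, this sketch reports both as `combined`; a custom Nginx `log_format` would need its own rule.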

Step 2: Parse Log Lines

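Use Bash awk to extract fields. awk splits on whitespace, so for Apache Combined/Nginx format the client IP is $1, the URL is $7, and the status is $9; the UA is the last quoted field:

awk 'BEGIN { OFS = "\t" } {
  ip = $1; url = $7; status = $9
  n = split($0, q, "\"")             # q[2]=request, q[4]=referer, q[6]=UA
  ua = (n >= 6 && q[6] != "") ? q[6] : "unknown"
  print ip, url, status, ua          # tab-separated, since UAs contain spaces
}' logfile

For Apache Common (no referer or UA): ip=$1, url=$7, status=$9, ua="unknown". The same command covers this case, because a Common line has no sixth quote-delimited field.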

Step 3: Classify User-Agents

Group each request into categories:

Googlebot: Googlebot, Googlebot-Image, Googlebot-News, AdsBot-Google
Bingbot: bingbot, BingPreview, MicrosoftPreview
Other search bots: Slurp (Yahoo), DuckDuckBot, Baiduspider, YandexBot, Sogou
AI crawlers: GPTBot, ClaudeBot, PerplexityBot, Bytespider, CCBot, anthropic-ai
Monitoring tools: Pingdom, UptimeRobot, StatusCake, NewRelic, Datadog
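Real users: browser UAs (Mozilla, Chrome, Safari, Firefox, Edge) that match none of the bot patterns above. Test the bot patterns first, since many bot UAs also start with Mozilla/5.0.
Unknown: empty UA, "-", or a UA that matches no category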
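
The categories above can be sketched as an awk function that lowercases the UA and tests bot patterns before browser patterns. The helper name `classify_requests` and the category labels are assumptions made for this example; it expects a Combined-format log, where the UA is the sixth quote-delimited field.

```shell
# classify_requests is a hypothetical helper: it reads a Combined-format log
# and prints one category label per request line.
classify_requests() {
  awk -F'"' '
    function classify(ua,   u) {
      u = tolower(ua)
      if (u == "" || u == "-") return "unknown"
      # Bot patterns come first because many bot UAs also contain "Mozilla".
      if (u ~ /googlebot|adsbot-google/) return "googlebot"
      if (u ~ /bingbot|bingpreview|microsoftpreview/) return "bingbot"
      if (u ~ /slurp|duckduckbot|baiduspider|yandexbot|sogou/) return "other_search"
      if (u ~ /gptbot|claudebot|perplexitybot|bytespider|ccbot|anthropic-ai/) return "ai_crawler"
      if (u ~ /pingdom|uptimerobot|statuscake|newrelic|datadog/) return "monitoring"
      if (u ~ /mozilla|chrome|safari|firefox|edge/) return "real_user"
      return "unknown"
    }
    { print classify($6) }
  ' "$1"
}
```

Piping the result through `sort | uniq -c | sort -rn` gives the per-category counts. Matching on UA strings alone trusts what the client sends, so a spoofed UA is counted as the bot it claims to be.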

Step 4: Calculate Metrics

Using awk/grep on the log file:

Total request count
Requests by bot category (count per category, % of total)
Requests by HTTP status code (200, 301, 302, 404, 500, etc.)
Top 20 crawled URLs by frequency — sort by count descending
Top 10 crawled path prefixes (leading 1 to 2 URL segments, e.g., /blog/, /products/); aggregate by prefix
Requests by hour-of-day (extract the hour from the timestamp field, [DD/Mon/YYYY:HH:MM:SS +ZZZZ])
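
Several of these metrics reduce to short awk pipelines. The sketch below assumes whitespace-split fields as in Step 2 (URL in $7, status in $9, timestamp in $4); the function names are hypothetical, and `top_prefixes` groups on the first segment only, matching the /blog/ example, so it would need extending to group on two segments.

```shell
# Hypothetical helpers; each takes the log path as $1.

# Requests per HTTP status code, most frequent first.
status_counts() {
  awk '{ count[$9]++ } END { for (s in count) print count[s], s }' "$1" | sort -rn
}

# Top N URLs by request count (N is $2, default 20).
top_urls() {
  awk '{ print $7 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-20}"
}

# Top N path prefixes by request count (N is $2, default 10).
top_prefixes() {
  awk '{
    url = $7
    sub(/\?.*/, "", url)            # drop the query string
    n = split(url, seg, "/")        # seg[1] is empty because the URL starts with "/"
    if (n > 2) prefix = "/" seg[2] "/"
    else prefix = url
    count[prefix]++
  } END { for (p in count) print count[p], p }' "$1" | sort -rn | head -n "${2:-10}"
}

# Requests per hour of day; $4 looks like [DD/Mon/YYYY:HH:MM:SS.
requests_by_hour() {
  awk '{ split($4, t, ":"); count[t[2]]++ }
       END { for (h in count) print h, count[h] }' "$1" | sort
}
```

The total request count is `wc -l < logfile`. To restrict any of these to crawler traffic, filter the log first, for example with `grep -i googlebot logfile > googlebot.log`.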

Step 5: Identify Crawl Budget Concerns

Flag these patterns:

4xx error rate >5%: crawlers wasting budget on broken URLs
5xx error rate >1%: server errors burning crawl budget
Duplicate crawl patterns: same URL crawled >10x without apparent content change
Low-value paths: bots crawling /wp-admin, /search?, ?sort=, ?page=, session URLs
302 redirect overuse: a temporary redirect tells crawlers to keep the original URL, so it continues to be re-crawled; use 301 for permanent moves
Non-canonical crawls: ?utm_ or tracking parameters in crawled URLs
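
The thresholds and URL patterns above can be checked in a single awk pass. This is a rough sketch: the helper name `crawl_concerns` and its output wording are assumptions, it looks at every line it is given (so pass it bot-only lines to measure crawler behavior), and it cannot tell whether content changed between repeated crawls.

```shell
# crawl_concerns is a hypothetical helper; thresholds mirror the list above.
crawl_concerns() {
  awk '
    { total++; hits[$7]++ }
    $9 ~ /^4[0-9][0-9]$/ { e4++ }
    $9 ~ /^5[0-9][0-9]$/ { e5++ }
    $9 == 302 { r302++ }
    $7 ~ /^\/wp-admin|^\/search\?|[?&](sort|page)=/ { lowvalue++ }
    $7 ~ /[?&]utm_/ { tracking++ }
    END {
      if (total == 0) exit
      if (e4 / total > 0.05) printf "4xx error rate %.1f%% exceeds 5%%\n", 100 * e4 / total
      if (e5 / total > 0.01) printf "5xx error rate %.1f%% exceeds 1%%\n", 100 * e5 / total
      # Count distinct URLs requested more than 10 times.
      for (u in hits) if (hits[u] > 10) dup++
      printf "urls_over_10_hits=%d low_value=%d tracking_params=%d redirects_302=%d\n", dup, lowvalue, tracking, r302
    }' "$1"
}
```

The final summary line is always printed for a non-empty log; the two error-rate lines appear only when their threshold is exceeded.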

Output Format

## Server Log Analysis: [filename]

**File:** [path] | **Format:** [Apache Combined/Common/Nginx] | **Total Requests:** [N]

### Crawl Budget Summary

| Metric | Value |
|--------|-------|
| Total requests | N |
| Bot traffic | N (X%) |
| Human traffic | N (X%) |
| Crawl error rate | X% (4xx+5xx) |
| Date range | [first log entry] to [last log entry] |

### Bot Traffic Breakdown

| Bot Category | Requests | % of Total | Top URL |
|---|---|---|---|
| Googlebot | N | X% | /path |
| Bingbot | N | X% | /path |
| AI Crawlers | N | X% | /path |
| Monitoring | N | X% | /path |
| Real Users | N | X% | — |
| Other/Unknown | N | X% | — |

### Top 20 Crawled URLs

| Rank | URL | Requests | Status Codes |
|------|-----|----------|--------------|
| 1 | /path | N | 200: N, 404: N |

### Crawl Frequency by Path

| Path Prefix | Requests | % of Bot Traffic |
|---|---|---|
| /blog/ | N | X% |

### Status Code Distribution

| Status | Count | % | Interpretation |
|--------|-------|---|----------------|
| 200 | N | X% | OK |
| 301 | N | X% | Permanent redirect |
| 404 | N | X% | Not found (crawl waste) |

### Crawl Budget Recommendations

[Prioritized list of issues found — Critical/High/Medium/Low]

## Data Sources

- Source: Local server log file (no external calls)
