Technical SEO Audit
Categories
1. Crawlability
- robots.txt: exists, valid, not blocking important resources
- XML sitemap: exists, referenced in robots.txt, valid format
- Noindex tags: intentional vs accidental
- Crawl depth: important pages within 3 clicks of homepage
- JavaScript rendering: check if critical content requires JS execution
- Crawl budget: for large sites (>10k pages), efficiency matters
AI Crawler Management
As of 2025-2026, AI companies actively crawl the web to train models and power AI search. Managing these crawlers via robots.txt is a critical technical SEO consideration.
Known AI crawlers:
| Crawler | Company | robots.txt token | Purpose |
|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Model training |
| ChatGPT-User | OpenAI | ChatGPT-User | Real-time browsing |
| ClaudeBot | Anthropic | ClaudeBot | Model training |
| PerplexityBot | Perplexity | PerplexityBot | Search index + training |
| Bytespider | ByteDance | Bytespider | Model training |
| Google-Extended | Google-Extended | Gemini training (NOT search) | |
| CCBot | Common Crawl | CCBot | Open dataset |
Key distinctions:
- Blocking
Google-Extendedprevents Gemini training use but does NOT affect Google Search indexing or AI Overviews (those useGooglebot) - Blocking
GPTBotprevents OpenAI training but does NOT prevent ChatGPT from citing your content via browsing (ChatGPT-User) - ~3-5% of websites now use AI-specific robots.txt rules
Example, selective AI crawler blocking:
# Allow search indexing, block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow all other crawlers (including Googlebot for search)
User-agent: *
Allow: /
Recommendation: Consider your AI visibility strategy before blocking. Being cited by AI systems drives brand awareness and referral traffic. Cross-reference the seo-geo skill for full AI visibility optimization.
2. Indexability
- Canonical tags: self-referencing, no conflicts with noindex
- Duplicate content: near-duplicates, parameter URLs, www vs non-www
- Thin content: pages below minimum word counts per type
- Pagination: rel=next/prev or load-more pattern
- Hreflang: correct for multi-language/multi-region sites
- Index bloat: unnecessary pages consuming crawl budget
3. Security
- HTTPS: enforced, valid SSL certificate, no mixed content
- Security headers:
- Content-Security-Policy (CSP)
- Strict-Transport-Security (HSTS)
- X-Frame-Options
- X-Content-Type-Options
- Referrer-Policy
- HSTS preload: check preload list inclusion for high-security sites
4. URL Structure
- Clean URLs: descriptive, hyphenated, no query parameters for content
- Hierarchy: logical folder structure reflecting site architecture
- Redirects: no chains (max 1 hop), 301 for permanent moves
- URL length: flag >100 characters
- Trailing slashes: consistent usage
5. Mobile Optimization
- Responsive design: viewport meta tag, responsive CSS
- Touch targets: minimum 48x48px with 8px spacing
- Font size: minimum 16px base
- No horizontal scroll
- Mobile-first indexing: Google indexes mobile version. Mobile-first indexing is 100% complete as of July 5, 2024. Google now crawls and indexes ALL websites exclusively with the mobile Googlebot user-agent.
6. Core Web Vitals
- LCP (Largest Contentful Paint): target <2.5s
- INP (Interaction to Next Paint): target <200ms
- INP replaced FID on March 12, 2024. FID was fully removed from all Chrome tools (CrUX API, PageSpeed Insights, Lighthouse) on September 9, 2024. Do NOT reference FID anywhere.
- CLS (Cumulative Layout Shift): target <0.1
- Evaluation uses 75th percentile of real user data
- Use PageSpeed Insights API or CrUX data if MCP available
7. Structured Data
- Detection: JSON-LD (preferred), Microdata, RDFa
- Validation against Google's supported types
- See seo-schema skill for full analysis
8. JavaScript Rendering
- Check if content visible in initial HTML vs requires JS
- Identify client-side rendered (CSR) vs server-side rendered (SSR)
- Flag SPA frameworks (React, Vue, Angular) that may cause indexing issues
- Verify dynamic rendering setup if applicable
JavaScript SEO: Canonical & Indexing Guidance (December 2025)
Google updated its JavaScript SEO documentation in December 2025 with critical clarifications:
- Canonical conflicts: If a canonical tag in raw HTML differs from one injected by JavaScript, Google may use EITHER one. Ensure canonical tags are identical between server-rendered HTML and JS-rendered output.
- noindex with JavaScript: If raw HTML contains
<meta name="robots" content="noindex">but JavaScript removes it, Google MAY still honor the noindex from raw HTML. Serve correct robots directives in the initial HTML response. - Non-200 status codes: Google does NOT render JavaScript on pages returning non-200 HTTP status codes. Any content or meta tags injected via JS on error pages will be invisible to Googlebot.
- Structured data in JavaScript: Product, Article, and other structured data injected via JS may face delayed processing. For time-sensitive structured data (especially e-commerce Product markup), include it in the initial server-rendered HTML.
Best practice: Serve critical SEO elements (canonical, meta robots, structured data, title, meta description) in the initial server-rendered HTML rather than relying on JavaScript injection.
9. IndexNow Protocol
- Check if site supports IndexNow for Bing, Yandex, Naver
- Supported by search engines other than Google
- Recommend implementation for faster indexing on non-Google engines
Agent-Friendly Pages (forward-looking)
AI agents (not just AI summarizers) increasingly read sites through three
channels: vision models on screenshots, raw HTML/DOM, and the accessibility
tree (the cleanest signal). Audit criteria — semantic HTML (real <button>
and <a>, not <div onclick>), label associations, interactive target sizing,
layout stability across templates, cursor: pointer correctness — live in
references/agent-friendly-pages.md.
Audit command
# Render with Playwright + capture accessibility tree, then score
python scripts/agent_ux_check.py https://example.com --json
The scanner outputs an Agent-UX score (0-100) plus itemized issues:
- HTML findings: real buttons / anchors,
<div onclick>widgets, semantic landmarks, inputs without<label for>, inputs without ARIA labels - Accessibility tree findings: total nodes, interactive nodes, unnamed
interactive elements,
role="generic"ratio
The accessibility-tree snapshot uses Playwright's
page.accessibility.snapshot(interesting_only=False). To capture the tree
without scoring, use python scripts/render_page.py <url> --a11y-tree --json.
Surface findings as opportunities, not failures. The standards (WebMCP, agent UX heuristics) are early — don't gate audits on a sub-100 score.
Output
Technical Score: XX/100
Category Breakdown
| Category | Status | Score |
|---|---|---|
| Crawlability | pass/warn/fail | XX/100 |
| Indexability | pass/warn/fail | XX/100 |
| Security | pass/warn/fail | XX/100 |
| URL Structure | pass/warn/fail | XX/100 |
| Mobile | pass/warn/fail | XX/100 |
| Core Web Vitals | pass/warn/fail | XX/100 |
| Structured Data | pass/warn/fail | XX/100 |
| JS Rendering | pass/warn/fail | XX/100 |
| IndexNow | pass/warn/fail | XX/100 |
Critical Issues (fix immediately)
High Priority (fix within 1 week)
Medium Priority (fix within 1 month)
Low Priority (backlog)
DataForSEO Integration (Optional)
If DataForSEO MCP tools are available, use on_page_instant_pages for real page analysis (status codes, page timing, broken links, on-page checks), on_page_lighthouse for Lighthouse audits (performance, accessibility, SEO s