Ansible: Production Configuration Management
Write, review, and architect Ansible automation - from single playbooks to multi-tier, compliance-hardened infrastructure management. The goal is idempotent, auditable, maintainable automation that works the same locally and in CI/CD.
Target versions (May 2026):
- ansible-core 2.20.x LTS (Python 3.12+ controller, 3.9+ target, EOL May 2027)
- ansible (community package) 13.x (depends on ansible-core 2.20)
- molecule 26.x (CalVer), ansible-lint 26.x (CalVer), ansible-navigator 26.x (CalVer)
- ansible-builder 3.1.x (EE definition v3)
- AWX 24.6.1 (last formal release Jul 2024; upstream AWX releases paused for a major refactor, devel branch active - track ansible/awx; awx-operator ~2.12.x still ships for K8s deploys). Verify current AWX/AAP release status before recommending a specific version or install path.
- AAP 2.6 (Oct 2025 - last RPM-installable release; AAP 2.7+ containerized-only)
This skill covers four domains depending on context:
- Playbooks - tasks, handlers, variables, conditions, loops, blocks, templates, Jinja2
- Roles & Collections - role structure, collection packaging, Galaxy/Automation Hub, Molecule testing
- Operations - inventory, Execution Environments, CI/CD integration, Vault, ansible-navigator
- Compliance - PCI-DSS 4.0 hardening, CIS benchmarks, Ansible-Lockdown, audit logging
When to use
- Writing or reviewing Ansible playbooks, roles, or collections
- Configuring servers after Terraform provisions them (day-2 operations)
- OS hardening (CIS benchmarks, STIG, PCI-DSS configuration requirements)
- Managing packages, services, users, firewall rules, cron jobs, config files
- Testing automation with Molecule or tox-ansible
- Setting up Ansible Vault for secrets management
- Designing inventory structures (static, dynamic, multi-environment)
- Building Execution Environments for consistent runtime
- Integrating Ansible into CI/CD pipelines (GitLab CI, GitHub Actions)
- Reviewing AI-generated playbooks for correctness and idiomatic patterns
When NOT to use
- Infrastructure provisioning (VPCs, RDS, EC2, cloud resources) - use terraform
- Kubernetes manifests, Helm charts, cluster architecture - use kubernetes
- Dockerfiles, Compose stacks, container image optimization - use docker
- CI/CD pipeline design (stages, runners, caching) - use ci-cd
- Security audits of application code (SAST, dependency scanning) - use security-audit
- Shell scripting or one-off commands - use command-prompt
- Firewall appliance management (OPNsense/pfSense) - use firewall-appliance
- Single-machine OS-level admin questions (package setup, user management, service config without automation context) - use the appropriate distro skill: debian-ubuntu, rhel-fedora, kali-linux, or arch-btw
AI Self-Check
AI tools consistently produce the same Ansible mistakes. Before returning any generated playbook, role, or task, verify against this list:
- FQCNs used everywhere (ansible.builtin.copy, not copy). AI almost never does this unprompted.
- become: true present where privilege escalation is needed (AI often forgets this)
- no_log: true on every task handling secrets, passwords, tokens, or API keys (CVE-2024-8775 proved this matters)
- Every task has a descriptive name: field (AI sometimes omits names on simple tasks)
- Handler names are unique and notify: strings match exactly (typos = silent failures)
- Variables use {{ var }} with quotes: "{{ my_var }}" not {{ my_var }} (bare Jinja2 without quotes breaks YAML parsing)
- No command/shell/raw when an Ansible module exists for the operation
- Tasks are idempotent - running twice produces the same result (watch command/shell tasks without creates/removes)
- No hardcoded values - IPs, paths, package versions, usernames go in variables with defaults
- ansible.builtin.apt/ansible.builtin.dnf uses state: present, not state: latest (unless explicitly upgrading)
- Loop variable is item (default) or renamed via loop_var in nested loops (AI conflates loop variables)
- block/rescue/always used for error handling, not bare ignore_errors: true
- No ansible.builtin.template with src: pointing to a non-.j2 file (confusing, even if it works)
- changed_when/failed_when set on command/shell tasks to prevent false change reports
- Tags present on logical task groups for selective execution
Run generated playbooks through ansible-lint (production profile) when available.
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Routing overlap checked: overlapping skills, trigger terms, and "When NOT to use" boundaries are checked before returning guidance
- Spec claims verified: claims about tool behavior, output contracts, or repo conventions are checked against current docs, scripts, or skill files
- Collection docs checked: module arguments and return values match the installed collection version
- Idempotence proven: changed/ok behavior is verified with check mode or a second run where practical
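Several of the checklist items above can be seen together in one small play. The following is a minimal sketch rather than a reference implementation: the webservers group, every app_* variable, the app.conf.j2 template, and the handler name are hypothetical, and app_api_token is assumed to be supplied from Vault-encrypted variables.

```yaml
---
# Hypothetical play: group, variable, and file names are illustrative only.
- name: Configure the example application
  hosts: webservers
  become: true                            # privilege escalation declared explicitly
  vars:
    app_package: nginx                    # values live in variables, not in tasks
    app_service: nginx
    app_config_path: /etc/nginx/conf.d/app.conf
    app_token_path: /etc/nginx/app.token
  tasks:
    - name: Install the application package
      ansible.builtin.package:
        name: "{{ app_package }}"         # quoted Jinja2 expression
        state: present                    # present, not latest
      tags:
        - packages

    - name: Render the application config from a .j2 template
      ansible.builtin.template:
        src: app.conf.j2
        dest: "{{ app_config_path }}"
        owner: root
        group: root
        mode: "0644"
      notify: Reload application service  # must match the handler name exactly
      tags:
        - config

    - name: Write the API token file
      ansible.builtin.copy:
        content: "{{ app_api_token }}"    # assumed to come from Vault
        dest: "{{ app_token_path }}"
        owner: root
        group: root
        mode: "0600"
      no_log: true                        # keep the secret out of task output
      tags:
        - config

    - name: Validate the configuration syntax
      ansible.builtin.command:
        cmd: nginx -t
      changed_when: false                 # a read-only check never reports "changed"
      tags:
        - config

  handlers:
    - name: Reload application service
      ansible.builtin.service:
        name: "{{ app_service }}"
        state: reloaded
```

Running the play a second time should report ok rather than changed for every task if the sketch is idempotent in your environment.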
Performance
- Use targeted inventories, tags, and --limit for large fleets; avoid full-fleet runs while iterating on a single role.
- Gather only required facts and cache facts where supported for slow or high-latency environments.
- Prefer native modules over shell loops so Ansible can batch work, diff safely, and report idempotence.
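As one possible illustration of limited fact gathering, the sketch below turns off implicit gathering and collects only the minimal fact subset. The play and task names are hypothetical, and it assumes the tasks that follow need nothing beyond what the minimal subset provides (such as the OS family).

```yaml
---
# Hypothetical play: gathers only the minimal fact subset before doing any work.
- name: Apply a single role with reduced fact gathering
  hosts: all
  gather_facts: false               # skip the implicit full fact run
  tasks:
    - name: Gather only the minimal fact subset
      ansible.builtin.setup:
        gather_subset:
          - "!all"                  # excludes everything except the minimal set
      tags:
        - always                    # facts stay available under any --tags selection

    - name: Report the OS family of each host
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} is {{ ansible_facts['os_family'] }}"
      tags:
        - report
```

While iterating, a run might then be narrowed further with something like ansible-playbook site.yml --limit web01 --tags report, where site.yml and web01 are placeholder names.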
Best Practices
- Pin collection versions in requirements.yml for production automation.
- Run destructive playbooks with --check --diff first and require a human-reviewed limit for production hosts.
- Keep Vault values out of diffs, logs, callback output, and generated examples.
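Pinning collections in requirements.yml, as recommended above, could look like the sketch below. The collection names are real, but the version numbers are placeholders chosen for illustration, not recommendations; check which releases your automation has actually been tested against.

```yaml
---
# requirements.yml (version numbers are placeholders, not recommendations)
collections:
  - name: community.general
    version: "12.0.0"               # exact pin: installs this release only
  - name: ansible.posix
    version: ">=2.0.0,<3.0.0"       # range pin: allows minor and patch updates
```

The file can be installed with ansible-galaxy collection install -r requirements.yml. An exact pin gives the most reproducible result; a bounded range trades some reproducibility for easier patch uptake.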
Workflow
Step 1: Determine the domain
Based on the request:
- "Write a playbook to configure X" -> Playbooks
- "Create a reusable role for X" -> Roles & Collections
- "Set up inventory" / "CI/CD" / "vault" / "EE" -> Operations
- "Harden this server" / "CIS benchmark" / "PCI compliance" -> Compliance
- "Review this playbook/role" -> Apply production checklist + critical rules + AI self-check
Most real tasks blend domains. Start with the playbook, extract to roles when reuse is clear, wire into operations last.
Step 2: Gather requirements
Before writing YAML, determine:
- Target OS: RHEL/CentOS, Ubuntu/Debian, Alpine, Windows - affects module choices
- Python version on targets: ansible-core 2.20 requires Python 3.9+ on managed nodes
- Privilege escalation: become method (sudo, su, doas, runas for Windows)
- Connection: SSH (default), WinRM (Windows), local, network_cli (network devices)
- Idempotency: every task must be safe to run multiple times
- Secrets: Ansible Vault, HashiCorp Vault, CI/CD secrets, environment variables
- Testing: Molecule scenario? tox-ansible matrix? Integration tests?
- Compliance: PCI-DSS scope? CIS benchmark level? STIG profile?
- Inventory: static, dynamic (cloud), or hybrid? Multi-environment?
- Execution: ansible-playbook (direct), ansible-navigator (EE), AWX/AAP (platform)?
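Some of these answers (connection type, become method, Python interpreter) can be recorded directly in the inventory. The sketch below is a hypothetical YAML inventory: the host names and group names are invented, and the interpreter path is an assumption that varies by distribution.

```yaml
---
# inventory.yml (hypothetical hosts and groups)
all:
  children:
    linux:
      hosts:
        web01.example.com:
      vars:
        ansible_connection: ssh                       # default, shown for clarity
        ansible_become: true
        ansible_become_method: sudo
        ansible_python_interpreter: /usr/bin/python3  # assumed path on the target
    windows:
      hosts:
        win01.example.com:
      vars:
        ansible_connection: winrm
        ansible_become_method: runas                  # Windows privilege escalation
```

Running ansible-inventory -i inventory.yml --graph is one way to confirm the group structure parses as intended before any playbook targets it.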
Step 3: Build
Follow the domain-specific section below. Always apply the production checklist (Step 4) and AI self-check before finishing.
Step 4: Validate
# Syntax check (fast, no connection needed)
ansible-playbook playbook.yml --syntax-check
# Lint (use production profile for strictest checks)
ansible-lint --p