CPS 234 wants evidence that security gaps get closed, not just detected. Auto-remediation is how you produce that evidence at scale — a finding appears, a fix fires, and you have the execution history to prove it. It's also how you break production at 2am if you do it wrong. Here's the safe pattern.
Practitioner guidance, not legal or audit advice.
The pattern: detect → decide → remediate → record
    AWS Config rule (non-compliant) ──► EventBridge ──► SSM Automation / Lambda
                                                                   │
                                                                   ▼
                                                         remediation + record

Two engines do the work:
- Config auto-remediation (SSM Automation documents attached to a Config rule and run against its non-compliant resources): the cleanest option for well-known fixes AWS already publishes runbooks for (e.g. enable S3 encryption, remove a public ACL).
- EventBridge → Lambda — for custom logic Config's built-in remediations don't cover.
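To make the two engines concrete, here is a minimal sketch of the configuration each one needs: a request entry for Config's `PutRemediationConfigurations` API, and an EventBridge event pattern that matches Config compliance changes. The choice of the `AWS-DisableS3BucketPublicReadWrite` runbook, the retry values, and the function name are illustrative assumptions, not something this post prescribes. Check the parameter names against the runbook you actually attach.

```python
def build_remediation_config(rule_name: str, role_arn: str) -> dict:
    """Build one entry for put_remediation_configurations (Config engine).

    Assumes the AWS-managed runbook AWS-DisableS3BucketPublicReadWrite;
    swap TargetId and Parameters for the runbook you use.
    """
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "AWS-DisableS3BucketPublicReadWrite",
        "Parameters": {
            # Role the automation assumes to make the change.
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # RESOURCE_ID is filled in by Config with the non-compliant resource.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": True,               # fire without a human clicking "remediate"
        "MaximumAutomaticAttempts": 3,   # illustrative retry budget
        "RetryAttemptSeconds": 60,
    }


# EventBridge engine: route only NON_COMPLIANT evaluations to the Lambda.
NON_COMPLIANT_PATTERN = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
}
```

With boto3, the first dict would be passed as `boto3.client("config").put_remediation_configurations(RemediationConfigurations=[cfg])`, and the second serialised with `json.dumps` as the `EventPattern` of an EventBridge rule. Building the payloads as plain data keeps them easy to review and test before anything touches an account.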
Start with the safe, boring fixes
Auto-remediation earns trust fastest on changes that are always correct:
| Finding | Safe auto-remediation |
|---|---|
| New security group opens 0.0.0.0/0 on SSH/RDP | Revoke the rule |
| S3 bucket missing default encryption | Enable SSE |
| EBS volume created unencrypted | (Prevent via account default; alert on existing) |
| IAM access key older than policy | Notify + flag (don't auto-delete) |
Notice the last one: notify, don't auto-delete. Which brings us to the rule that matters.
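As an illustration of the first row in the table, the sketch below picks out only the world-open SSH/RDP entries from a security group's `IpPermissions` (the shape returned by EC2's `describe_security_groups`) and narrows each one to the offending CIDR, so other sources on the same rule are left alone. The function and constant names are hypothetical, and treating "all protocols" rules as covering SSH/RDP is an assumption you may want to change.

```python
WORLD_V4 = "0.0.0.0/0"
WORLD_V6 = "::/0"
ADMIN_PORTS = (22, 3389)  # SSH, RDP


def _covers_admin_port(perm: dict) -> bool:
    """True if the permission's protocol/port range includes SSH or RDP."""
    if perm.get("IpProtocol") == "-1":  # all protocols, all ports
        return True
    if perm.get("IpProtocol") != "tcp":
        return False
    low, high = perm.get("FromPort"), perm.get("ToPort")
    if low is None or high is None:
        return False
    return any(low <= port <= high for port in ADMIN_PORTS)


def world_open_admin_rules(ip_permissions: list[dict]) -> list[dict]:
    """Return IpPermissions entries that revoke only the world-open sources."""
    to_revoke = []
    for perm in ip_permissions:
        if not _covers_admin_port(perm):
            continue
        v4 = any(r.get("CidrIp") == WORLD_V4 for r in perm.get("IpRanges", []))
        v6 = any(r.get("CidrIpv6") == WORLD_V6 for r in perm.get("Ipv6Ranges", []))
        if not (v4 or v6):
            continue
        # Keep protocol and ports, drop every source except the open one.
        narrowed = {k: perm[k] for k in ("IpProtocol", "FromPort", "ToPort") if k in perm}
        if v4:
            narrowed["IpRanges"] = [{"CidrIp": WORLD_V4}]
        if v6:
            narrowed["Ipv6Ranges"] = [{"CidrIpv6": WORLD_V6}]
        to_revoke.append(narrowed)
    return to_revoke
```

The result is shaped to be passed as `IpPermissions` to `ec2.revoke_security_group_ingress(GroupId=..., IpPermissions=...)`. One caveat: a world-open rule with a wide port range (say 0-65535) is revoked as a whole, which also closes any legitimate port inside that range, so wide ranges may be better routed to a human.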
The rule: never auto-remediate something that can break production
The fastest way to lose the right to run automation is to auto-"fix" a public S3 bucket that was meant to be public (a static site) and take down production. Guardrails:
- Allowlist what's auto-remediable. Only the changes that are always safe fire automatically. Everything else alerts a human.
- Honour exceptions. A documented, intentional configuration (your exceptions register) must be excluded from auto-remediation — tag it and check the tag before acting.
- Make remediations reversible and logged. Every action writes to CloudTrail/CloudWatch; you can answer "what changed, when, why."
- Test in lower environments first. A remediation Lambda is production code. Treat it like it.
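Two of these guardrails are small enough to sketch directly: the exception-tag check and the "what changed, when, why" record. The tag key `remediation-exception` and the record fields below are hypothetical; use whatever your exceptions register and logging standard define. Capturing the `before` state is what makes the change reversible.

```python
import json
from datetime import datetime, timezone

EXCEPTION_TAG = "remediation-exception"  # hypothetical tag key


def has_exception_tag(tags: dict) -> bool:
    """True if the resource is tagged as an intentional, documented exception."""
    return str(tags.get(EXCEPTION_TAG, "")).strip().lower() == "true"


def audit_record(resource_id, rule, action, before, after, now=None) -> str:
    """Serialise one remediation as a JSON line: what changed, when, and why."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "resource_id": resource_id,
            "rule": rule,      # why: the control that was violated
            "action": action,  # what: the fix that ran
            "before": before,  # state needed to roll the change back
            "after": after,
        },
        sort_keys=True,
    )
```

In a Lambda, printing the returned string is enough to land it in CloudWatch Logs as a structured entry that can be queried later.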
A minimal safe Lambda shape
    def handler(event, context):
        resource = parse_config_event(event)
        if has_exception_tag(resource):        # intentional config: leave it
            return notify_only(resource)
        if resource.type in AUTO_REMEDIABLE:   # allowlisted safe fixes only
            result = remediate(resource)
            log_remediation(resource, result)  # evidence trail
            return result
        return notify_human(resource)          # everything else: alert, don't act

The allowlist and the exception check are the whole game. Without them you have a robot that occasionally deletes production.
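One way to make that shape runnable and testable without an AWS account is to inject the side effects. The sketch below assumes the event is a Config "Config Rules Compliance Change" event delivered by EventBridge (so `resourceType`, `resourceId`, and `configRuleName` sit under `detail`). `Resource`, `make_handler`, and the callback signatures are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Resource:
    type: str
    id: str
    rule: str


def parse_config_event(event: dict) -> Resource:
    """Extract the non-compliant resource from a Config compliance-change event."""
    detail = event["detail"]
    return Resource(
        type=detail["resourceType"],
        id=detail["resourceId"],
        rule=detail["configRuleName"],
    )


def make_handler(
    remediations: dict[str, Callable[[Resource], str]],  # the allowlist
    is_excepted: Callable[[Resource], bool],
    notify: Callable[[Resource, str], None],
    log: Callable[[Resource, str], None],
):
    """Build a Lambda-style handler around injected side effects."""

    def handler(event, context=None):
        resource = parse_config_event(event)
        if is_excepted(resource):  # intentional config: leave it
            notify(resource, "exception on file; no action taken")
            return "skipped"
        fix = remediations.get(resource.type)
        if fix is None:  # not allowlisted: alert, don't act
            notify(resource, "not allowlisted; needs a human")
            return "escalated"
        result = fix(resource)
        log(resource, result)  # evidence trail
        return result

    return handler
```

In production the callbacks would wrap real clients (a tag lookup, an SNS publish, a structured log write). In tests they can be plain lambdas that append to lists, which makes it cheap to prove that excepted and non-allowlisted resources are never touched.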
Why this is the AIOps payoff
This is the operational core of "AI-powered AWS operations": the system detects, decides within safe bounds, fixes what's safe, and escalates what isn't — with a complete record. For CPS 234's testing-effectiveness clauses, the remediation execution history is direct evidence that control gaps get closed, with the before/after to prove it.
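To turn execution history into something an assessor can read, a small summary is often enough. The sketch below assumes each record carries `State`, `InvocationTime`, and `LastUpdatedTime`, which mirrors my understanding of the entries returned by Config's `DescribeRemediationExecutionStatus` API; verify the field names against the current API reference. The function name and output keys are hypothetical.

```python
from datetime import datetime
from statistics import median


def summarize_executions(statuses: list[dict]) -> dict:
    """Summarise remediation executions into counts and a median fix time."""
    succeeded = [s for s in statuses if s["State"] == "SUCCEEDED"]
    durations = [
        (s["LastUpdatedTime"] - s["InvocationTime"]).total_seconds()
        for s in succeeded
    ]
    return {
        "total": len(statuses),
        "succeeded": len(succeeded),
        "failed": sum(1 for s in statuses if s["State"] == "FAILED"),
        # Invocation-to-completion time, not detection-to-fix time.
        "median_seconds_to_fix": median(durations) if durations else None,
    }
```

Note the limitation in the comment: this measures how long the fix took once invoked. For the full "gap detected to gap closed" figure you would pair it with the compliance-change timestamps from Config.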
Building this safely — the allowlist, the exception handling, the evidence trail — is exactly the kind of thing a readiness assessment scopes before you point automation at production.
Primary sources: AWS Config conformance packs · Security Hub FSBP standard