Automated Remediation on AWS: EventBridge + Lambda

CPS 234 wants evidence that security gaps get closed, not just detected. Auto-remediation is how you produce that evidence at scale — a finding appears, a fix fires, and you have the execution history to prove it. It's also how you break production at 2am if you do it wrong. Here's the safe pattern.

Practitioner guidance, not legal or audit advice.

The pattern: detect → decide → remediate → record

AWS Config rule (non-compliant)  ──►  EventBridge  ──►  SSM Automation / Lambda
                                                              │
                                                              ▼
                                                     remediation + record

Two engines do the work:

Config auto-remediation (SSM Automation documents bound to a Config rule and run when a resource is non-compliant): the cleanest option for well-known fixes that AWS already ships runbooks for (e.g. enable S3 encryption, remove a public ACL). Config invokes the runbook itself, so this path needs no EventBridge rule.
EventBridge → Lambda: for custom logic that Config's built-in remediations don't cover.
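
To make the detect step concrete, here is a minimal sketch of an EventBridge event pattern for Config compliance changes, plus a parser for the event it matches. The field names follow the "Config Rules Compliance Change" event shape as commonly documented; `Finding` is an illustrative type, and `parse_config_event` simply reuses the helper name from the Lambda shape later in this post. Check both against a real event from your own account before relying on them.

```python
from dataclasses import dataclass
from typing import Optional

# EventBridge pattern: only Config rule evaluations that are NON_COMPLIANT.
NON_COMPLIANT_PATTERN = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
}


@dataclass(frozen=True)
class Finding:
    """Illustrative container for the fields a remediation needs."""
    rule_name: str
    resource_type: str
    resource_id: str
    account_id: str
    region: str


def parse_config_event(event: dict) -> Optional[Finding]:
    """Return a Finding for a NON_COMPLIANT evaluation, otherwise None."""
    detail = event.get("detail", {})
    result = detail.get("newEvaluationResult") or {}
    if result.get("complianceType") != "NON_COMPLIANT":
        return None  # compliant or malformed events are ignored
    return Finding(
        rule_name=detail["configRuleName"],
        resource_type=detail["resourceType"],
        resource_id=detail["resourceId"],
        account_id=detail["awsAccountId"],
        region=detail["awsRegion"],
    )
```

The parser re-checks the compliance type even though the pattern already filters on it, so a misconfigured rule upstream cannot push a compliant resource into the remediation path.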

Start with the safe, boring fixes

Auto-remediation earns trust fastest on changes that are always correct:

Finding	Safe auto-remediation
New security group opens 0.0.0.0/0 on SSH/RDP	Revoke the rule
S3 bucket missing default encryption	Enable SSE
EBS volume created unencrypted	(Prevent via account default; alert on existing)
IAM access key older than the rotation policy allows	Notify + flag (don't auto-delete)

Notice the last one: notify, don't auto-delete. Which brings us to the rule that matters.
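
As a sketch of the first row in the table above, the code below picks out only the world-open SSH/RDP ingress rules on a security group and revokes those, leaving every other rule alone. `describe_security_groups` and `revoke_security_group_ingress` are boto3 EC2 client methods. The helper names and the choice to pass the client in as a parameter are assumptions made so the block stays import-free and easy to test.

```python
ADMIN_PORTS = (22, 3389)  # SSH, RDP
WORLD_V4, WORLD_V6 = "0.0.0.0/0", "::/0"


def covers_admin_port(perm: dict) -> bool:
    """True if the permission allows TCP 22 or 3389, or all traffic."""
    protocol = perm.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    low, high = perm.get("FromPort"), perm.get("ToPort")
    if low is None or high is None:
        return False
    return any(low <= port <= high for port in ADMIN_PORTS)


def world_open_admin_rules(ip_permissions: list) -> list:
    """Narrow each offending permission to just its world-open CIDR ranges."""
    to_revoke = []
    for perm in ip_permissions:
        if not covers_admin_port(perm):
            continue
        v4 = [{"CidrIp": WORLD_V4} for r in perm.get("IpRanges", [])
              if r.get("CidrIp") == WORLD_V4]
        v6 = [{"CidrIpv6": WORLD_V6} for r in perm.get("Ipv6Ranges", [])
              if r.get("CidrIpv6") == WORLD_V6]
        if not (v4 or v6):
            continue  # restricted to specific CIDRs: not our finding
        narrowed = {"IpProtocol": perm["IpProtocol"]}
        if "FromPort" in perm and "ToPort" in perm:
            narrowed["FromPort"] = perm["FromPort"]
            narrowed["ToPort"] = perm["ToPort"]
        if v4:
            narrowed["IpRanges"] = v4
        if v6:
            narrowed["Ipv6Ranges"] = v6
        to_revoke.append(narrowed)
    return to_revoke


def revoke_world_open_admin(ec2, group_id: str) -> list:
    """Revoke world-open SSH/RDP ingress; return what was revoked."""
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    rules = world_open_admin_rules(group.get("IpPermissions", []))
    if rules:
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=rules)
    return rules
```

In a Lambda you would call it as `revoke_world_open_admin(boto3.client("ec2"), "sg-...")`. The returned list doubles as the "before" half of the evidence trail. Note that a rule opening a wide port range (say 0-65535) to the world is revoked as a whole, because the range includes the admin ports.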

The rule: never auto-remediate something that can break production

The fastest way to lose the right to run automation is to auto-"fix" a public S3 bucket that was meant to be public (a static site) and take down production. Guardrails:

Allowlist what's auto-remediable. Only the changes that are always safe fire automatically. Everything else alerts a human.
Honour exceptions. A documented, intentional configuration (your exceptions register) must be excluded from auto-remediation — tag it and check the tag before acting.
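Make remediations reversible and logged. Capture the prior state before changing anything so the change can be rolled back, and make sure every action lands in CloudTrail and CloudWatch Logs so you can answer "what changed, when, and why."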
Test in lower environments first. A remediation Lambda is production code. Treat it like it.
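
One way to implement "check the tag before acting" for a security group is sketched below. `describe_tags` is a boto3 EC2 client method. The tag key `remediation-exception`, the convention that its value holds an exceptions-register ID, and the helper names are all hypothetical. The sketch also fails closed: if the tags cannot be read, it refuses to act and leaves the decision to a human.

```python
# Hypothetical tag key; the value is assumed to be an exceptions-register ID.
EXCEPTION_TAG = "remediation-exception"


def security_group_tags(ec2, group_id: str) -> dict:
    """Fetch the tags on one resource as a plain {key: value} dict."""
    resp = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [group_id]}]
    )
    return {t["Key"]: t["Value"] for t in resp.get("Tags", [])}


def has_exception_tag(tags: dict) -> bool:
    """An exception counts only if the tag carries a non-blank reference."""
    return bool(tags.get(EXCEPTION_TAG, "").strip())


def may_auto_remediate(ec2, group_id: str) -> bool:
    """True only if tags were read successfully and no exception is recorded."""
    try:
        tags = security_group_tags(ec2, group_id)
    except Exception:
        # Broad catch is deliberate in this sketch: if we cannot confirm
        # the resource has no exception, we do not touch it.
        return False
    return not has_exception_tag(tags)
```

Requiring a non-blank value means someone cannot silence automation with an empty tag; the value should point back to an entry in the exceptions register.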

A minimal safe Lambda shape

def handler(event, context):
    resource = parse_config_event(event)
    if has_exception_tag(resource):      # intentional config — leave it
        return notify_only(resource)
    if resource.type in AUTO_REMEDIABLE: # allowlisted safe fixes only
        result = remediate(resource)
        log_remediation(resource, result)  # evidence trail
        return result
    return notify_human(resource)        # everything else: alert, don't act

The allowlist + exception check are the whole game. Without them you have a robot that occasionally deletes production.
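
To try the shape above locally, here is a runnable version with every helper stubbed out. The handler body matches the original; everything around it is a stand-in. `Resource`, `TAG_STORE`, `AUDIT_LOG`, the tag key, and the sample resource IDs are all hypothetical, and a real handler would look up tags through the service API and call AWS to remediate.

```python
from dataclasses import dataclass

AUTO_REMEDIABLE = {"AWS::EC2::SecurityGroup", "AWS::S3::Bucket"}  # allowlist
EXCEPTION_TAG = "remediation-exception"                           # hypothetical
# Stand-in for a tag lookup: resource id -> tags.
TAG_STORE = {"static-site-bucket": {EXCEPTION_TAG: "EXC-0042"}}
AUDIT_LOG = []  # stand-in for CloudWatch Logs


@dataclass(frozen=True)
class Resource:
    type: str
    id: str


def parse_config_event(event):
    detail = event["detail"]
    return Resource(type=detail["resourceType"], id=detail["resourceId"])


def has_exception_tag(resource):
    return bool(TAG_STORE.get(resource.id, {}).get(EXCEPTION_TAG))


def notify_only(resource):
    return {"action": "notify_only", "resource": resource.id}


def notify_human(resource):
    return {"action": "notify_human", "resource": resource.id}


def remediate(resource):
    # A real implementation would call the AWS API here.
    return {"action": "remediated", "resource": resource.id}


def log_remediation(resource, result):
    AUDIT_LOG.append({"resource": resource.id, "type": resource.type, **result})


def handler(event, context):
    resource = parse_config_event(event)
    if has_exception_tag(resource):       # intentional config: leave it
        return notify_only(resource)
    if resource.type in AUTO_REMEDIABLE:  # allowlisted safe fixes only
        result = remediate(resource)
        log_remediation(resource, result)  # evidence trail
        return result
    return notify_human(resource)         # everything else: alert, don't act
```

The ordering matters: the exception check runs before the allowlist check, so `static-site-bucket` is left alone even though S3 buckets are allowlisted.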

Why this is the AIOps payoff

This is the operational core of "AI-powered AWS operations": the system detects, decides within safe bounds, fixes what's safe, and escalates what isn't — with a complete record. For CPS 234's testing-effectiveness clauses, the remediation execution history is direct evidence that control gaps get closed, with the before/after to prove it.
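
A minimal sketch of what one entry in that execution history might look like is below. The field names are assumptions, not a format CPS 234 or AWS prescribes; the point is only that each record carries the rule, the resource, the action, and the before/after state. In Lambda, anything printed to stdout ends up in CloudWatch Logs.

```python
import json
from datetime import datetime, timezone


def evidence_record(rule_name, resource_id, action, before, after, now=None):
    """Build one audit entry; `before` and `after` are JSON-serialisable state."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "rule": rule_name,
        "resource": resource_id,
        "action": action,
        "before": before,
        "after": after,
        "changed": before != after,  # False means the fix was a no-op
    }


def emit(record: dict) -> str:
    """Write the record as a single JSON line and return it."""
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

One JSON object per line keeps the records easy to query later, for example with CloudWatch Logs Insights.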

Building this safely — the allowlist, the exception handling, the evidence trail — is exactly the kind of thing a readiness assessment scopes before you point automation at production.

Primary sources: AWS Config conformance packs · Security Hub FSBP standard
