Inspector to SSM Vulnerability Patching Automation

AWS Inspector finds vulnerabilities but doesn’t patch them automatically. The obvious solution seems simple: wire Inspector → EventBridge → Systems Manager together and let automation handle the rest.
Except it breaks spectacularly in production.
A recent blog post demonstrated this pattern with a critical caveat buried in the middle: “full auto-patching didn’t complete due to overlapping runs + reboots when multiple findings existed.” That’s code for “it doesn’t work when you actually use it.” Which honestly surprised me at first.
Here’s what actually happens. Inspector scans an EC2 instance and finds 15 vulnerabilities. It fires 15 separate EventBridge events. Your automation tries to patch the same instance 15 times simultaneously.
Chaos.
The first patch operation starts, reboots the instance, and the other 14 operations fail. You get partial patches, broken services, and instances stuck in weird states.
The Race Condition Problem
Most examples ignore concurrency entirely. They assume one vulnerability per instance or perfect timing - fantasy scenarios that don’t exist in enterprise environments. (I still don’t understand why the basic examples skip this entirely.)
The solution requires deduplication with state management. DynamoDB provides this through conditional writes:
table.put_item(
    Item=item,
    ConditionExpression='attribute_not_exists(instance_id) OR #status <> :in_progress',
    ExpressionAttributeNames={'#status': 'status'},
    ExpressionAttributeValues={':in_progress': 'IN_PROGRESS'}
)
First vulnerability wins the lock. Others get queued. No more simultaneous patch operations destroying your infrastructure.
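The "others get queued" half deserves a sketch of its own. A minimal version, assuming the lock item carries a queued_vulnerabilities list (as shown later in the state-management section) and that a failed condition check means another event already holds the lock:

from botocore.exceptions import ClientError

def acquire_or_queue(table, instance_id, vulnerability, item):
    """Try to take the patch lock; losers append their finding to the queue."""
    try:
        table.put_item(
            Item=item,
            ConditionExpression='attribute_not_exists(instance_id) OR #status <> :in_progress',
            ExpressionAttributeNames={'#status': 'status'},
            ExpressionAttributeValues={':in_progress': 'IN_PROGRESS'}
        )
        return True  # lock acquired, safe to start patching
    except ClientError as e:
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
        # Another event holds the lock: queue this vulnerability on the existing item
        table.update_item(
            Key={'instance_id': instance_id},
            UpdateExpression='SET queued_vulnerabilities = '
                             'list_append(if_not_exists(queued_vulnerabilities, :empty), :v)',
            ExpressionAttributeValues={':v': [vulnerability], ':empty': []}
        )
        return False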
What Production Actually Requires
The blog post approach patches immediately, regardless of business hours or system health. That’s like performing surgery without checking if the patient can survive it. Not great.
Pre-Patch Safety Validation
Production systems need safety nets before any patches get applied:
import boto3

ssm = boto3.client('ssm')

def validate_instance_health(instance_id):
    """Pre-patch validation implemented in Python via SSM Run Command"""
    # Execute health checks via SSM
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            'commands': [
                # Disk space validation
                'DISK_USAGE=$(df -h / | awk \'NR==2 {print $5}\' | sed \'s/%//\')',
                'if [ $DISK_USAGE -gt 85 ]; then echo "ERROR: Insufficient disk space ($DISK_USAGE%)"; exit 1; fi',
                # Memory validation
                'FREE_MEM=$(free -m | awk \'NR==2{printf "%.1f", $7/$2*100}\')',
                'if (( $(echo "$FREE_MEM < 10.0" | bc -l) )); then echo "ERROR: Low memory"; exit 1; fi',
                # Critical services check
                'for service in sshd systemd-networkd; do',
                '  if ! systemctl is-active --quiet $service; then',
                '    echo "ERROR: Critical service $service not running"; exit 1',
                '  fi',
                'done'
            ]
        }
    )
    return wait_for_command_completion(response['Command']['CommandId'])
This Python-based validation runs via SSM Run Command, preventing patches on systems with insufficient resources or failed critical services. (Learned this one the hard way.)
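wait_for_command_completion is referenced but never defined. A minimal sketch, reusing the ssm client above and polling list_command_invocations until the command settles:

import time

def wait_for_command_completion(command_id, timeout=300):
    """Poll SSM until the Run Command invocation finishes; True only on Success."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        invocations = ssm.list_command_invocations(CommandId=command_id)['CommandInvocations']
        if invocations:
            status = invocations[0]['Status']
            if status == 'Success':
                return True
            if status in ('Failed', 'TimedOut', 'Cancelled'):
                return False
        time.sleep(5)
    return False

boto3 also exposes a command_executed waiter if you'd rather not hand-roll the polling loop.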
Automated Backup Strategy
EBS snapshots before every patch operation provide a recovery path when things go wrong:
import boto3
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)
ec2 = boto3.client('ec2')

def create_snapshots(instance_id):
    """Create EBS snapshots before patching with 30-day TTL"""
    try:
        volumes = ec2.describe_volumes(
            Filters=[{'Name': 'attachment.instance-id', 'Values': [instance_id]}]
        )['Volumes']
        snapshot_ids = []
        for volume in volumes:
            response = ec2.create_snapshot(
                VolumeId=volume['VolumeId'],
                Description=f'Pre-patch snapshot for {instance_id} - {volume["Attachments"][0]["Device"]}',
                TagSpecifications=[{
                    'ResourceType': 'snapshot',
                    'Tags': [
                        {'Key': 'AutoDelete', 'Value': 'true'},
                        {'Key': 'Purpose', 'Value': 'PrePatchBackup'},
                        {'Key': 'InstanceId', 'Value': instance_id},
                        {'Key': 'DeleteAfter', 'Value': (datetime.now() + timedelta(days=30)).isoformat()}
                    ]
                }]
            )
            snapshot_ids.append(response['SnapshotId'])
        return snapshot_ids
    except Exception as e:
        logger.error(f"Failed to create snapshots for {instance_id}: {str(e)}")
        raise
Snapshots take seconds to create but can save hours of recovery time. Auto-delete tags ensure they clean themselves up after 30 days. Simple but effective.
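The tags alone don't delete anything; the assumption is a small scheduled cleanup job that honors them. A minimal sketch of what that could look like, reusing the ec2 client from above:

from datetime import datetime

def cleanup_expired_snapshots():
    """Delete pre-patch snapshots whose DeleteAfter timestamp has passed."""
    paginator = ec2.get_paginator('describe_snapshots')
    pages = paginator.paginate(
        OwnerIds=['self'],
        Filters=[{'Name': 'tag:AutoDelete', 'Values': ['true']},
                 {'Name': 'tag:Purpose', 'Values': ['PrePatchBackup']}]
    )
    now = datetime.now()
    for page in pages:
        for snapshot in page['Snapshots']:
            tags = {t['Key']: t['Value'] for t in snapshot.get('Tags', [])}
            delete_after = tags.get('DeleteAfter')
            if delete_after and datetime.fromisoformat(delete_after) < now:
                ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'])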
Maintenance Window Awareness
The original approach patches whenever Inspector finds something. Fine for development, terrible for production workloads serving real users. Obviously.
Production systems have two scheduling options:
- AWS SSM Maintenance Windows: Native scheduling with powerful controls (concurrency, error thresholds, cross-account execution)
- Custom Lambda scheduler: Better for complex business logic, custom windows, or multi-account orchestration patterns
We chose the Lambda approach for flexibility, but SSM Maintenance Windows work great for simpler cases.
from datetime import datetime, timezone

def is_in_maintenance_window():
    """Check if current time is within maintenance window (2-6 AM UTC)"""
    current_hour = datetime.now(timezone.utc).hour
    MAINTENANCE_START = 2  # 2 AM UTC
    MAINTENANCE_END = 6    # 6 AM UTC
    if MAINTENANCE_START > MAINTENANCE_END:
        # Window wraps past midnight (e.g. 22:00-04:00)
        return current_hour >= MAINTENANCE_START or current_hour < MAINTENANCE_END
    else:
        return MAINTENANCE_START <= current_hour < MAINTENANCE_END

if not is_in_maintenance_window():
    schedule_for_maintenance(instance_id, vulnerability)
The scheduler Lambda runs on an EventBridge cron (0 2-6 * * ? *) during the 2-6 AM UTC maintenance window, processing queued patches without disrupting business operations. (SSM Automation also supports rate controls with up to ~500 concurrent executions if you go the native route.)
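The scheduler side isn't shown in the snippets above. A minimal sketch of the drain loop, assuming schedule_for_maintenance marks lock items with a QUEUED_FOR_MAINTENANCE status, table is the DynamoDB Table resource, and start_patch_operation is the entry point into the patch workflow (all three names are placeholders):

from boto3.dynamodb.conditions import Attr

def lambda_handler(event, context):
    """Scheduler Lambda: fired by the 2-6 AM UTC cron, drains queued patch work."""
    if not is_in_maintenance_window():
        return  # the cron fires on the window edges, so re-check before acting
    # Pagination omitted for brevity; use LastEvaluatedKey for larger tables
    queued = table.scan(
        FilterExpression=Attr('status').eq('QUEUED_FOR_MAINTENANCE')
    )['Items']
    for entry in queued:
        start_patch_operation(entry['instance_id'], entry.get('queued_vulnerabilities', []))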
Network Security Architecture
Most examples run Lambda functions with default networking, creating security gaps. Production systems require proper network isolation. This caught me off guard initially.
# VPC isolation for both Lambda functions
resource "aws_lambda_function" "patch_deduplication" {
  vpc_config {
    subnet_ids         = aws_subnet.lambda_subnet[*].id
    security_group_ids = [aws_security_group.lambda_sg.id]
  }
}

resource "aws_lambda_function" "patch_maintenance_scheduler" {
  vpc_config {
    subnet_ids         = aws_subnet.lambda_subnet[*].id
    security_group_ids = [aws_security_group.lambda_sg.id]
  }
}

# VPC endpoints for secure AWS service communication
resource "aws_vpc_endpoint" "ssm" {
  vpc_id            = aws_vpc.patch_automation_vpc.id
  service_name      = "com.amazonaws.${var.aws_region}.ssm"
  vpc_endpoint_type = "Interface"
  subnet_ids        = aws_subnet.lambda_subnet[*].id
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.patch_automation_vpc.id
  service_name      = "com.amazonaws.${var.aws_region}.dynamodb"
  vpc_endpoint_type = "Gateway"
}
Lambda functions communicate with AWS services through VPC endpoints without internet access. NAT Gateway provides outbound connectivity only when needed.
The Real Scalability Challenges
Three limitations surface when moving from proof-of-concept to production scale.
DynamoDB Throttling Under Load
Conditional writes create contention when hundreds of vulnerability events arrive simultaneously. Multiple Lambda functions competing for locks on the same partition key causes throttling.
Implementation detail: The current architecture uses attribute_not_exists(instance_id) OR #status <> :in_progress conditions. Under load, this creates a bottleneck where only one vulnerability event per instance can acquire the lock.
Scaling limits: Testing shows failures starting around 50+ simultaneous vulnerability events for the same instance. The solution involves exponential backoff with jitter, but there’s still a theoretical limit at ~200 events/second per instance.
Massive environments need sharding across multiple DynamoDB tables or alternative coordination mechanisms like SQS FIFO queues. Not ideal, but it works.
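A minimal sketch of that backoff-with-jitter retry around the conditional write (the attempt count and base delay are illustrative, not the tested limits mentioned above):

import random
import time
from botocore.exceptions import ClientError

def put_with_backoff(table, item, max_attempts=5, base_delay=0.1):
    """Retry the conditional write on throttling, backing off with full jitter."""
    for attempt in range(max_attempts):
        try:
            table.put_item(
                Item=item,
                ConditionExpression='attribute_not_exists(instance_id) OR #status <> :in_progress',
                ExpressionAttributeNames={'#status': 'status'},
                ExpressionAttributeValues={':in_progress': 'IN_PROGRESS'}
            )
            return True
        except ClientError as e:
            code = e.response['Error']['Code']
            if code == 'ConditionalCheckFailedException':
                return False  # lock held by another event; queue instead of retrying
            if code not in ('ProvisionedThroughputExceededException', 'ThrottlingException'):
                raise
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise RuntimeError('Conditional write still throttled after retries')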
Inspector vs SSM Data Mismatches
Inspector uses Amazon’s vulnerability database, updated constantly. SSM Patch Manager relies on OS vendor repositories with different update schedules.
Real-world example: Inspector2 detects CVE-2024-1234 on an Ubuntu instance. The automation triggers but SSM finds no applicable patches because Ubuntu hasn’t released the security update yet. The operation completes “successfully” but the vulnerability remains unpatched. Frustrating.
Frequency: In our testing environment, ~15-20% of Inspector findings initially lacked available patches (this may vary by organization), particularly for:
- Zero-day vulnerabilities where patches lag detection
- Distribution-specific packages with different naming conventions
- Backported security fixes that don’t match CVE identifiers exactly
You can mitigate mismatches using tailored patch baselines with auto-approval settings, maintenance windows to delay patching until vendors release fixes, and Security Hub workflows for manual review. Still no out-of-the-box CVE-to-patch resolver though.
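As an illustration of the patch-baseline piece, a sketch using boto3 for an Amazon Linux 2 fleet (auto-approval delays aren't available on every OS; the 7-day delay, severity filter, and patch-group name are example values, not settings from the original post):

import boto3

ssm = boto3.client('ssm')

# Auto-approve Critical/Important security updates after a 7-day delay, giving
# the vendor repository time to publish the fix Inspector already flagged.
baseline = ssm.create_patch_baseline(
    Name='autopatch-amazonlinux-security',
    OperatingSystem='AMAZON_LINUX_2',
    ApprovalRules={
        'PatchRules': [{
            'PatchFilterGroup': {
                'PatchFilters': [
                    {'Key': 'CLASSIFICATION', 'Values': ['Security']},
                    {'Key': 'SEVERITY', 'Values': ['Critical', 'Important']}
                ]
            },
            'ApproveAfterDays': 7
        }]
    }
)

# Attach the baseline to instances tagged with this hypothetical Patch Group value
ssm.register_patch_baseline_for_patch_group(
    BaselineId=baseline['BaselineId'],
    PatchGroup='autopatch'
)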
EventBridge Throughput Limits
Inspector can generate thousands of findings during initial scans. EventBridge has default quotas of 18,750 invocations/sec and 10,000 PutEvents/sec in us-east-1 that cause throttling (delayed delivery, not silent loss) during these spikes.
You need quota increases, retry logic, and Dead Letter Queues for proper handling. The basic examples never encounter this because they test with single instances. Which is… unhelpful.
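A minimal sketch of wiring retries and a DLQ onto the EventBridge target with boto3 (the rule name, function ARN, and queue ARN are placeholders):

import boto3

events = boto3.client('events')

# Attach the Lambda target with an explicit retry policy and a dead letter queue,
# so throttled or failed deliveries get retried and eventually captured, not lost.
events.put_targets(
    Rule='inspector-finding-rule',  # placeholder rule name
    Targets=[{
        'Id': 'patch-deduplication-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:patch-deduplication',  # placeholder
        'RetryPolicy': {
            'MaximumRetryAttempts': 8,
            'MaximumEventAgeInSeconds': 3600
        },
        'DeadLetterConfig': {
            'Arn': 'arn:aws:sqs:us-east-1:123456789012:patch-events-dlq'  # placeholder
        }
    }]
)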
State Management
Traditional patching approaches are stateless. Each vulnerability event triggers independently with no coordination.
item = {
    'instance_id': instance_id,
    'status': 'IN_PROGRESS',
    'started_at': datetime.now().isoformat(),
    'vulnerabilities': [vulnerability],
    'queued_vulnerabilities': [],
    'expiration_time': int(time.time() + 7200),  # 2 hour TTL
    'lock_id': lock_id,
    'request_id': request_id
}
This tracks patch operations, prevents re-patching recently updated instances, queues additional vulnerabilities, and provides operational visibility. TTL ensures failed operations don’t block instances forever. (Two hour timeout seemed reasonable through testing.)
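One detail worth calling out: DynamoDB only honors expiration_time if TTL is switched on for the table. A minimal one-time setup sketch, assuming the table is named PatchExecutionState as in the rollback step later:

import boto3

dynamodb = boto3.client('dynamodb')

# One-time setup: tell DynamoDB which attribute holds the epoch-seconds expiry.
# Expired lock items are then removed automatically (typically within a couple of days).
dynamodb.update_time_to_live(
    TableName='PatchExecutionState',
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expiration_time'
    }
)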
Instance Opt-In Control
Automated patching without consent can destroy critical systems. The solution requires explicit opt-in through tagging:
# Require explicit opt-in via tags - NO instances patched without consent
response = ec2.describe_instances(InstanceIds=[instance_id])
if not response['Reservations']:
    logger.error(f"Instance {instance_id} not found")
    return False

tags = response['Reservations'][0]['Instances'][0].get('Tags', [])
patch_enabled = any(tag['Key'] == 'AutoPatch' and tag['Value'].lower() == 'true' for tag in tags)
if not patch_enabled:
    logger.info(f"Instance {instance_id} not opted in for auto-patching (missing AutoPatch=true tag)")
    return False
Only instances tagged AutoPatch=true participate in automated patching. This prevents accidental patching of critical systems. Zero exceptions.
Post-Patch Validation
Patches aren’t successful just because they installed without errors. Services need to remain functional. Obviously.
# Post-patch service validation
for service in sshd apache2 nginx; do
    if ! systemctl is-active --quiet $service; then
        FAILED_SERVICES="$FAILED_SERVICES $service"
    fi
done

if [ -n "$FAILED_SERVICES" ]; then
    echo "ERROR: Critical services not running:$FAILED_SERVICES"
    exit 1
fi
The automation validates that critical services restart correctly after patching. Failed validation triggers immediate alerts instead of waiting for monitoring systems to notice. Much faster.
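A minimal sketch of that alerting path, assuming an SNS topic for patch alerts (the topic ARN is a placeholder) and that the failed-service list is passed back from the validation script:

import boto3

sns = boto3.client('sns')

def alert_failed_validation(instance_id, failed_services):
    """Publish an immediate alert when post-patch validation fails."""
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:patch-automation-alerts',  # placeholder ARN
        Subject=f'Post-patch validation FAILED on {instance_id}',
        Message=(
            f'Instance {instance_id} finished patching but these critical services '
            f'did not come back: {", ".join(failed_services)}. '
            'Pre-patch EBS snapshots are available for rollback.'
        )
    )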
Error Handling and Recovery
When things go wrong (and they do), recovery mechanisms are essential:
# Rollback assessment on failure (step in the SSM Automation document)
- name: RollbackAssessment
  action: aws:executeScript
  onFailure: step:NotifyFailure
  inputs:
    Runtime: python3.11
    Handler: script_handler
    InputPayload:
      InstanceId: '{{ InstanceId }}'
    Script: |
      import boto3

      def script_handler(events, context):
          # Update state to failed and assess rollback options
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('PatchExecutionState')
          table.update_item(
              Key={'instance_id': events['InstanceId']},
              UpdateExpression='SET #status = :failed',
              ExpressionAttributeNames={'#status': 'status'},
              ExpressionAttributeValues={':failed': 'FAILED'}
          )
Dead letter queues capture failed Lambda invocations. Detailed failure notifications provide actionable context. Automatic retry with exponential backoff handles transient failures. Standard stuff.
Monitoring and Observability
CloudWatch dashboards provide real-time visibility:
resource "aws_cloudwatch_dashboard" "patch_monitoring" {
dashboard_body = jsonencode({
widgets = [
{
properties = {
metrics = [
["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.patch_deduplication.function_name],
[".", "Errors", ".", "."],
[".", "Duration", ".", "."]
]
title = "Lambda Function Metrics"
}
}
]
})
}
Proactive alerting catches problems before they cascade. Success rates, failure reasons, and queue depths provide operational insight. Essential for debugging.
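Dashboards show state; alarms act on it. A minimal sketch of one such alarm on the deduplication function's error metric (the function name and SNS ARN are placeholders):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm on any deduplication Lambda errors within a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName='patch-deduplication-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'patch-deduplication'}],  # placeholder name
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:patch-automation-alerts']  # placeholder ARN
)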
The Economics
Serverless architecture scales costs with usage. Quiet periods cost almost nothing. Major vulnerability events spike costs but remain predictable.
VPC endpoints eliminate NAT Gateway data transfer charges for AWS service calls. This saves hundreds of dollars monthly for organizations processing thousands of patch operations.
More importantly, automation provides consistency. Manual processes have human error, sick days, and vacation gaps. Automation doesn’t. Big win.
Architecture Layout
The blog post pattern is correct: Inspector → EventBridge → Systems Manager. But production requires comprehensive enhancements.
Flexible scheduling architecture works with custom Lambda for complex logic, or SSM Maintenance Windows for simpler cases. Deduplication prevents race conditions through DynamoDB conditional writes. Safety validation protects critical systems with pre-patch checks, while backup strategy enables recovery with automated EBS snapshots.
Maintenance windows respect business operations during 2-6 AM UTC. Network isolation follows security best practices with VPC endpoints. State management provides operational visibility with TTL cleanup. Explicit opt-in requires AutoPatch=true tag for safety.
Error handling includes DLQs, retry logic, and comprehensive logging. Quota awareness handles EventBridge throttling and proper scaling.
Building reliable infrastructure automation isn't just about getting the happy path working; it's about handling all the edge cases that could break production systems.
The key insight? Treat automation as a production system that requires the same engineering rigor as any other critical infrastructure component.
Security tools that break production systems create more risk than they mitigate. This architecture proves you can achieve both security objectives and operational reliability simultaneously. Which is the whole point.
Lessons Learned & Next Iteration
After sharing this with a couple of security engineers, several important improvements emerged that are worth implementing in future iterations.
Technical Corrections: EBS snapshots are asynchronous and can take hours depending on changed blocks, not “seconds” as stated. AWS-RunPatchBaseline reboots by default unless you specify RebootOption=NoReboot, which often leaves instances in pending-reboot states. Inspector rescans continuously, so deduplication should key off finding ARN + instance ID + maintenance window, not just timestamps.
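For reference, the reboot behavior is controlled by a document parameter on the Run Command call; a minimal sketch (the instance ID is a placeholder):

import boto3

ssm = boto3.client('ssm')

# Install approved patches but defer the reboot; the instance then shows as
# pending-reboot until a reboot is scheduled separately in a maintenance window.
ssm.send_command(
    InstanceIds=['i-0123456789abcdef0'],  # placeholder instance ID
    DocumentName='AWS-RunPatchBaseline',
    Parameters={
        'Operation': ['Install'],
        'RebootOption': ['NoReboot']
    }
)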
Architecture Enhancements: An SQS FIFO queue between EventBridge and Lambda would provide better deduplication than conditional DynamoDB writes alone. Pre/post-hooks for load balancer drainage enable zero-downtime patching. Structured observability events for decision points (skipped, blocked, throttled) provide better operational insight than basic logs.
Security Hardening: IAM permissions need resource-level conditions for least privilege (e.g., tag-based ssm:SendCommand only on AutoPatch=true instances). Validation should verify configurations persist post-patch, not just service availability.
Scope Expansions: Container vulnerability “patching” means rebuilding images from patched base images, not OS-level patching. Windows environments need different SSM documents and approval workflows. ASG workloads might benefit more from AMI bake + Instance Refresh than in-place patching.
I believe these insights highlight the iterative nature of building production systems. The current architecture handles the core problem well, but there’s always room for improvement based on feedback and evolving requirements.
You can view the complete GitHub project via the link below - review it, contribute, comment, or use it as a starting point for your own implementation:
GitHub Repository: Autopatch-Vuln
The repo includes all Terraform configurations, Lambda functions, SSM automation documents, and deployment scripts discussed in this article.