Implementing Vertical Autoscaling for Aurora Databases Using Lambda Functions
AWS provides horizontal scaling for Aurora out of the box – add read replicas, distribute load, done. Vertical scaling? You’re on your own. Aurora PostgreSQL supports a single writer instance, so when that writer needs more horsepower, you can’t just throw more nodes at it.
This guide covers a production-ready implementation of vertical autoscaling for Aurora using Lambda functions, CloudWatch Alarms, SNS, and RDS Event Subscriptions. The approach minimises downtime through coordinated reader-first scaling and automated failover.
Architecture Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│                          Aurora Vertical Autoscaling                         │
└──────────────────────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────────────────────────┐
│  CloudWatch  │────▶│     SNS      │────▶│           Alarm Lambda           │
│    Alarm     │     │    Topic     │     │  • Validates cluster state       │
│ (CPU > 80%)  │     │              │     │  • Tags instance as 'modifying'  │
└──────────────┘     └──────────────┘     │  • Initiates modify-db-instance  │
                                          └──────────────────────────────────┘
                                                            │
                                                            ▼
                                          ┌──────────────────────────────────┐
                                          │          Aurora Cluster          │
                                          │  ┌────────┐      ┌────────┐      │
                                          │  │ Writer │      │ Reader │      │
                                          │  │db.r6g. │      │db.r6g. │      │
                                          │  │xlarge  │      │xlarge  │      │ ◀─ Scale
                                          │  └────────┘      └────────┘      │
                                          └──────────────────────────────────┘
                                                            │
                                                            │ RDS Event
                                                            │ (RDS-EVENT-0014)
                                                            ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                            RDS Event Subscription                            │
│                       (filters: modification complete)                       │
└──────────────────────────────────────────────────────────────────────────────┘
                                                            │
                                                            ▼
                                          ┌──────────────────────────────────┐
                                          │           Event Lambda           │
                                          │  • Removes 'modifying' tag       │
                                          │  • Checks for size parity        │
                                          │  • Scales next smallest instance │
                                          │  • Triggers failover if needed   │
                                          └──────────────────────────────────┘
Prerequisites
- Aurora PostgreSQL or MySQL cluster with at least one reader
- IAM permissions for Lambda to modify RDS instances and manage tags
- SNS topic for alarm notifications
- Terraform (or CloudFormation if you must)
Code Repository: All code from this post is available at github.com/moabukar/blog-code/vertical-scaling-aurora
Repository Structure
aurora-vertical-autoscaling/
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── lambda.tf
│ ├── cloudwatch.tf
│ ├── sns.tf
│   ├── rds_events.tf
│   └── iam.tf
├── lambda/
│ ├── alarm_handler/
│ │ ├── handler.py
│ │ └── requirements.txt
│ └── event_handler/
│ ├── handler.py
│ └── requirements.txt
├── scripts/
│ └── package_lambda.sh
└── README.md
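The tree references scripts/package_lambda.sh, whose contents aren't shown in this post. A rough Python equivalent of what it needs to do — zip each handler directory so handler.py sits at the zip root, as Lambda expects — might look like this (names and paths here are illustrative, not the actual script):

```python
# package_lambdas.py – hypothetical stand-in for scripts/package_lambda.sh:
# zips each lambda/<name>/ directory into .build/<name>.zip.
import pathlib
import zipfile

def package(handler_dir: pathlib.Path, build_dir: pathlib.Path) -> pathlib.Path:
    """Zip every file under handler_dir into build_dir/<dirname>.zip."""
    build_dir.mkdir(parents=True, exist_ok=True)
    zip_path = build_dir / f"{handler_dir.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(handler_dir.rglob("*")):
            if file.is_file():
                # Paths relative to the handler dir, so handler.py
                # lands at the zip root rather than nested.
                zf.write(file, file.relative_to(handler_dir))
    return zip_path

if __name__ == "__main__":
    root = pathlib.Path("lambda")
    if root.exists():
        for d in sorted(root.iterdir()):
            if d.is_dir():
                print(package(d, pathlib.Path(".build")))
```

In practice the Terraform `archive_file` data sources below do the same job at plan time; a standalone packaging step mainly matters once you add pip dependencies to requirements.txt.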
IAM Configuration
The Lambda functions need granular RDS permissions. Avoid rds:* – specify exactly what’s required.
# terraform/iam.tf
data "aws_iam_policy_document" "lambda_assume_role" {
statement {
effect = "Allow"
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
actions = ["sts:AssumeRole"]
}
}
resource "aws_iam_role" "aurora_autoscaler" {
name = "aurora-vertical-autoscaler-${var.environment}"
assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
data "aws_iam_policy_document" "aurora_autoscaler" {
# RDS describe permissions
statement {
effect = "Allow"
actions = [
"rds:DescribeDBClusters",
"rds:DescribeDBInstances",
"rds:ListTagsForResource"
]
resources = ["*"]
}
# RDS modify permissions – scoped to specific cluster
statement {
effect = "Allow"
actions = [
"rds:ModifyDBInstance",
"rds:FailoverDBCluster",
"rds:AddTagsToResource",
"rds:RemoveTagsFromResource"
]
resources = [
"arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:cluster:${var.cluster_identifier}",
"arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:db:${var.cluster_identifier}-*"
]
}
# CloudWatch Logs
statement {
effect = "Allow"
actions = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
resources = ["arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"]
}
# SNS publish for notifications
statement {
effect = "Allow"
actions = ["sns:Publish"]
resources = [aws_sns_topic.scaling_notifications.arn]
}
}
resource "aws_iam_role_policy" "aurora_autoscaler" {
name = "aurora-autoscaler-policy"
role = aws_iam_role.aurora_autoscaler.id
policy = data.aws_iam_policy_document.aurora_autoscaler.json
}
CloudWatch Alarm Configuration
CPU utilisation is the trigger here. You could substitute any CloudWatch metric – DatabaseConnections, FreeableMemory, ReadIOPS, etc.
# terraform/cloudwatch.tf
resource "aws_cloudwatch_metric_alarm" "aurora_cpu_high" {
alarm_name = "aurora-${var.cluster_identifier}-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = var.cpu_threshold # Default: 80
alarm_description = "CPU utilisation exceeded ${var.cpu_threshold}% for 3 consecutive minutes"
dimensions = {
DBClusterIdentifier = var.cluster_identifier
}
alarm_actions = [aws_sns_topic.scaling_trigger.arn]
ok_actions = [] # Optional: notify when alarm clears
treat_missing_data = "notBreaching"
tags = var.tags
}
Why 3 evaluation periods? Single spikes shouldn’t trigger scaling. Sustained load over 3 minutes indicates genuine capacity pressure. Adjust based on your workload characteristics.
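As a toy illustration of the windowing (a loose model only — real CloudWatch alarms track alarm state per datapoint and have missing-data semantics), a lone spike never satisfies three consecutive breaching periods:

```python
def alarm_fires(samples, threshold=80, evaluation_periods=3):
    """Fire only if the last `evaluation_periods` one-minute averages
    all breach the threshold – a rough model of the CloudWatch alarm."""
    if len(samples) < evaluation_periods:
        return False
    return all(s > threshold for s in samples[-evaluation_periods:])

print(alarm_fires([40, 95, 42, 38]))  # single spike: False
print(alarm_fires([70, 85, 88, 91]))  # sustained load: True
```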
SNS Topics
Two topics: one for triggering the alarm Lambda, one for operational notifications.
# terraform/sns.tf
resource "aws_sns_topic" "scaling_trigger" {
name = "aurora-scaling-trigger-${var.environment}"
}
resource "aws_sns_topic" "scaling_notifications" {
name = "aurora-scaling-notifications-${var.environment}"
}
resource "aws_sns_topic_subscription" "alarm_lambda" {
topic_arn = aws_sns_topic.scaling_trigger.arn
protocol = "lambda"
endpoint = aws_lambda_function.alarm_handler.arn
}
# Optional: email notifications for scaling events
resource "aws_sns_topic_subscription" "email" {
count = var.notification_email != "" ? 1 : 0
topic_arn = aws_sns_topic.scaling_notifications.arn
protocol = "email"
endpoint = var.notification_email
}
RDS Event Subscription
This triggers the Event Lambda when an instance modification completes.
# terraform/rds_events.tf
resource "aws_db_event_subscription" "modification_complete" {
name = "aurora-modification-complete-${var.environment}"
sns_topic = aws_sns_topic.event_trigger.arn
source_type = "db-instance"
source_ids = data.aws_rds_cluster.target.cluster_members
event_categories = ["configuration change"]
tags = var.tags
}
resource "aws_sns_topic" "event_trigger" {
name = "aurora-event-trigger-${var.environment}"
}
resource "aws_sns_topic_subscription" "event_lambda" {
topic_arn = aws_sns_topic.event_trigger.arn
protocol = "lambda"
endpoint = aws_lambda_function.event_handler.arn
}
Lambda Functions
Alarm Handler
This function receives the CloudWatch Alarm, validates cluster state, and initiates scaling.
# lambda/alarm_handler/handler.py
import boto3
import json
import os
import random
from datetime import datetime, timezone, timedelta
from typing import Optional
rds = boto3.client('rds')
sns = boto3.client('sns')
# Instance size ordering for comparison
INSTANCE_SIZE_ORDER = {
'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
'2xlarge': 5, '4xlarge': 6, '8xlarge': 7, '12xlarge': 8,
'16xlarge': 9, '24xlarge': 10
}
# Allowed instance families for scaling (configure per cluster)
ALLOWED_FAMILIES = os.environ.get('ALLOWED_FAMILIES', 'db.r6g,db.r7g').split(',')
MAX_INSTANCE_CLASS = os.environ.get('MAX_INSTANCE_CLASS', 'db.r6g.4xlarge')
COOLDOWN_MINUTES = int(os.environ.get('COOLDOWN_MINUTES', '15'))
NOTIFICATION_TOPIC = os.environ['NOTIFICATION_TOPIC_ARN']
def handler(event, context):
"""
Handles CloudWatch Alarm via SNS.
Validates cluster state and initiates vertical scaling if conditions are met.
"""
try:
# Parse SNS message
sns_message = json.loads(event['Records'][0]['Sns']['Message'])
alarm_name = sns_message.get('AlarmName', '')
# Extract cluster identifier from alarm dimensions
cluster_id = extract_cluster_id(sns_message)
if not cluster_id:
return {'statusCode': 400, 'body': 'Could not determine cluster ID'}
        print(f"Alarm '{alarm_name}': processing cluster {cluster_id}")
# Get cluster details
cluster = get_cluster_details(cluster_id)
if not cluster:
return {'statusCode': 404, 'body': f'Cluster {cluster_id} not found'}
# Validation checks
validation_result = validate_cluster_state(cluster)
if not validation_result['can_scale']:
print(f"Scaling blocked: {validation_result['reason']}")
return {'statusCode': 200, 'body': validation_result['reason']}
# Execute scaling
result = execute_scaling(cluster)
# Send notification
notify(result)
return {'statusCode': 200, 'body': json.dumps(result)}
except Exception as e:
print(f"Error: {str(e)}")
notify({'status': 'error', 'message': str(e)})
raise
def extract_cluster_id(alarm_message: dict) -> Optional[str]:
"""Extract cluster identifier from alarm dimensions."""
trigger = alarm_message.get('Trigger', {})
dimensions = trigger.get('Dimensions', [])
for dim in dimensions:
if dim.get('name') == 'DBClusterIdentifier':
return dim.get('value')
return None
def get_cluster_details(cluster_id: str) -> Optional[dict]:
"""Fetch cluster and instance details from RDS."""
try:
cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
cluster = cluster_resp['DBClusters'][0]
# Get instance details
instances = []
for member in cluster['DBClusterMembers']:
instance_resp = rds.describe_db_instances(
DBInstanceIdentifier=member['DBInstanceIdentifier']
)
instance = instance_resp['DBInstances'][0]
# Get tags
tags_resp = rds.list_tags_for_resource(
ResourceName=instance['DBInstanceArn']
)
instance['Tags'] = {t['Key']: t['Value'] for t in tags_resp['TagList']}
instance['IsWriter'] = member['IsClusterWriter']
instances.append(instance)
cluster['Instances'] = instances
return cluster
except rds.exceptions.DBClusterNotFoundFault:
return None
def validate_cluster_state(cluster: dict) -> dict:
"""
Check if scaling is permitted:
1. No instances currently being modified
2. No instances tagged as 'modifying'
3. Cooldown period has elapsed
"""
instances = cluster['Instances']
# Check for active modifications
for instance in instances:
if instance['DBInstanceStatus'] != 'available':
return {
'can_scale': False,
'reason': f"Instance {instance['DBInstanceIdentifier']} is {instance['DBInstanceStatus']}"
}
# Check for modifying tag
for instance in instances:
if instance['Tags'].get('aurora-autoscaler-modifying') == 'true':
return {
'can_scale': False,
'reason': f"Instance {instance['DBInstanceIdentifier']} has modifying tag"
}
# Check cooldown period
latest_modification = get_latest_modification_timestamp(instances)
if latest_modification:
cooldown_end = latest_modification + timedelta(minutes=COOLDOWN_MINUTES)
if datetime.now(timezone.utc) < cooldown_end:
return {
'can_scale': False,
'reason': f"Cooldown period active until {cooldown_end.isoformat()}"
}
return {'can_scale': True, 'reason': None}
def get_latest_modification_timestamp(instances: list) -> Optional[datetime]:
"""Get the most recent modification timestamp from instance tags."""
timestamps = []
for instance in instances:
ts_str = instance['Tags'].get('aurora-autoscaler-modification-timestamp')
if ts_str:
try:
timestamps.append(datetime.fromisoformat(ts_str.replace('Z', '+00:00')))
except ValueError:
pass
return max(timestamps) if timestamps else None
def execute_scaling(cluster: dict) -> dict:
"""
Scaling algorithm:
1. Find smallest reader instances
2. Scale one reader to match largest instance
3. If all instances same size, scale to next tier
4. If writer is smallest, scale writer (triggers failover)
"""
instances = cluster['Instances']
readers = [i for i in instances if not i['IsWriter']]
writer = next(i for i in instances if i['IsWriter'])
# Parse instance classes
for instance in instances:
instance['_parsed'] = parse_instance_class(instance['DBInstanceClass'])
# Sort by size
instances_by_size = sorted(instances, key=lambda x: get_size_rank(x['_parsed']))
smallest_size = get_size_rank(instances_by_size[0]['_parsed'])
largest_size = get_size_rank(instances_by_size[-1]['_parsed'])
# Check if at maximum
max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
if smallest_size >= get_size_rank(max_parsed):
return notify_max_reached(cluster['DBClusterIdentifier'])
# Determine target instance and size
if smallest_size < largest_size:
# Scale smallest to match largest
target_class = instances_by_size[-1]['DBInstanceClass']
smallest_readers = [r for r in readers if get_size_rank(r['_parsed']) == smallest_size]
if smallest_readers:
target_instance = random.choice(smallest_readers)
        else:
            # Writer is the smallest instance – scale it. Note that
            # modifying the writer directly incurs downtime; the
            # failover-based mitigation is covered under 'Downtime
            # Characteristics'.
            target_instance = writer
else:
# All same size – scale to next tier
target_class = get_next_instance_class(instances_by_size[0]['DBInstanceClass'])
if not target_class:
return notify_max_reached(cluster['DBClusterIdentifier'])
if readers:
target_instance = random.choice(readers)
else:
target_instance = writer
# Tag and modify
tag_instance_as_modifying(target_instance['DBInstanceArn'])
rds.modify_db_instance(
DBInstanceIdentifier=target_instance['DBInstanceIdentifier'],
DBInstanceClass=target_class,
ApplyImmediately=True
)
return {
'status': 'scaling_initiated',
'cluster': cluster['DBClusterIdentifier'],
'instance': target_instance['DBInstanceIdentifier'],
'from_class': target_instance['DBInstanceClass'],
'to_class': target_class
}
def parse_instance_class(instance_class: str) -> dict:
"""Parse db.r6g.xlarge into components."""
parts = instance_class.split('.')
return {
'prefix': parts[0],
'family': parts[1],
'size': parts[2] if len(parts) > 2 else 'medium'
}
def get_size_rank(parsed: dict) -> int:
"""Get numeric rank for instance size."""
return INSTANCE_SIZE_ORDER.get(parsed['size'], 0)
def get_next_instance_class(current_class: str) -> Optional[str]:
    """Get the next larger instance class, if the family is allowed."""
    parsed = parse_instance_class(current_class)
    # Only scale within explicitly allowed families
    family = f"{parsed['prefix']}.{parsed['family']}"
    if family not in ALLOWED_FAMILIES:
        return None
    sizes = list(INSTANCE_SIZE_ORDER.keys())
    current_idx = sizes.index(parsed['size'])
    if current_idx >= len(sizes) - 1:
        return None
    next_size = sizes[current_idx + 1]
    next_class = f"{parsed['prefix']}.{parsed['family']}.{next_size}"
    # Validate against max
    max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
    if get_size_rank({'size': next_size}) > get_size_rank(max_parsed):
        return None
    return next_class
def tag_instance_as_modifying(instance_arn: str):
"""Tag instance to prevent concurrent modifications."""
rds.add_tags_to_resource(
ResourceName=instance_arn,
Tags=[
{'Key': 'aurora-autoscaler-modifying', 'Value': 'true'},
{'Key': 'aurora-autoscaler-modification-timestamp',
'Value': datetime.now(timezone.utc).isoformat()}
]
)
def notify_max_reached(cluster_id: str) -> dict:
"""Send high-priority notification when max size reached."""
message = {
'status': 'max_size_reached',
'cluster': cluster_id,
'message': f"Cluster {cluster_id} has reached maximum instance size {MAX_INSTANCE_CLASS}",
'priority': 'high'
}
notify(message)
return message
def notify(message: dict):
"""Send notification to SNS topic."""
sns.publish(
TopicArn=NOTIFICATION_TOPIC,
Subject=f"Aurora Autoscaler: {message.get('status', 'update')}",
Message=json.dumps(message, indent=2)
)
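The dimension parsing and the size ladder are the easiest things to get subtly wrong, and both are pure functions you can exercise without AWS. A standalone sketch — the sample payload is trimmed to the fields the handler reads (CloudWatch's `Trigger.Dimensions` entries use lowercase `name`/`value` keys):

```python
# Illustrative CloudWatch alarm payload, trimmed to the fields the
# alarm handler reads; real SNS messages carry many more keys.
SAMPLE_ALARM = {
    "AlarmName": "aurora-prod-cluster-cpu-high",
    "Trigger": {
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"name": "DBClusterIdentifier", "value": "prod-cluster"}
        ],
    },
}

INSTANCE_SIZE_ORDER = {
    'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
    '2xlarge': 5, '4xlarge': 6, '8xlarge': 7, '12xlarge': 8,
    '16xlarge': 9, '24xlarge': 10,
}

def extract_cluster_id(alarm_message):
    """Same logic as the handler: find the DBClusterIdentifier dimension."""
    for dim in alarm_message.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "DBClusterIdentifier":
            return dim.get("value")
    return None

def next_class(current):
    """Walk one rung up the size ladder, or None at the top."""
    prefix, family, size = current.split(".")
    sizes = list(INSTANCE_SIZE_ORDER)
    idx = sizes.index(size)
    return None if idx == len(sizes) - 1 else f"{prefix}.{family}.{sizes[idx + 1]}"

print(extract_cluster_id(SAMPLE_ALARM))  # prod-cluster
print(next_class("db.r6g.xlarge"))       # db.r6g.2xlarge
print(next_class("db.r6g.24xlarge"))     # None
```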
Event Handler
This function processes RDS modification completion events and continues the scaling chain.
# lambda/event_handler/handler.py
import boto3
import json
import os
import random
from datetime import datetime, timezone
rds = boto3.client('rds')
sns = boto3.client('sns')
NOTIFICATION_TOPIC = os.environ['NOTIFICATION_TOPIC_ARN']
def handler(event, context):
"""
Handles RDS Event Subscription notifications (modification complete).
Removes modifying tag and continues scaling if needed.
"""
try:
# Parse SNS message from RDS Event Subscription
sns_message = json.loads(event['Records'][0]['Sns']['Message'])
# RDS events have different structure
source_id = sns_message.get('Source ID')
event_message = sns_message.get('Event Message', '')
        # Only process completion events. RDS-EVENT-0014 reports
        # "Finished applying modification to DB instance class".
        if 'finished applying modification' not in event_message.lower():
            print(f"Ignoring event: {event_message}")
            return {'statusCode': 200, 'body': 'Ignored non-completion event'}
print(f"Processing completion for instance: {source_id}")
# Get instance details
instance_resp = rds.describe_db_instances(DBInstanceIdentifier=source_id)
instance = instance_resp['DBInstances'][0]
cluster_id = instance['DBClusterIdentifier']
# Get cluster details
cluster = get_cluster_details(cluster_id)
# Remove modifying tag
remove_modifying_tag(instance)
# Check if more scaling needed
if should_continue_scaling(cluster):
result = continue_scaling(cluster)
else:
result = {
'status': 'scaling_complete',
'cluster': cluster_id,
'message': 'All instances are now the same size'
}
notify(result)
return {'statusCode': 200, 'body': json.dumps(result)}
except Exception as e:
print(f"Error: {str(e)}")
notify({'status': 'error', 'message': str(e)})
raise
def get_cluster_details(cluster_id: str) -> dict:
"""Fetch cluster and instance details."""
cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
cluster = cluster_resp['DBClusters'][0]
instances = []
for member in cluster['DBClusterMembers']:
instance_resp = rds.describe_db_instances(
DBInstanceIdentifier=member['DBInstanceIdentifier']
)
instance = instance_resp['DBInstances'][0]
instance['IsWriter'] = member['IsClusterWriter']
tags_resp = rds.list_tags_for_resource(ResourceName=instance['DBInstanceArn'])
instance['Tags'] = {t['Key']: t['Value'] for t in tags_resp['TagList']}
instances.append(instance)
cluster['Instances'] = instances
return cluster
def remove_modifying_tag(instance: dict):
"""Remove the modifying tag from an instance."""
rds.remove_tags_from_resource(
ResourceName=instance['DBInstanceArn'],
TagKeys=['aurora-autoscaler-modifying']
)
print(f"Removed modifying tag from {instance['DBInstanceIdentifier']}")
def should_continue_scaling(cluster: dict) -> bool:
"""Check if instances still need equalisation."""
classes = set(i['DBInstanceClass'] for i in cluster['Instances'])
return len(classes) > 1
def continue_scaling(cluster: dict) -> dict:
"""Scale the next smallest reader to match the largest instance."""
instances = cluster['Instances']
readers = [i for i in instances if not i['IsWriter']]
writer = next(i for i in instances if i['IsWriter'])
# Find smallest and largest
instances_by_size = sorted(instances, key=lambda x: get_instance_rank(x['DBInstanceClass']))
smallest = instances_by_size[0]
largest = instances_by_size[-1]
# Prefer readers for scaling
smallest_class = smallest['DBInstanceClass']
smallest_readers = [r for r in readers if r['DBInstanceClass'] == smallest_class]
    if smallest_readers:
        target = random.choice(smallest_readers)
    else:
        # Writer is the smallest instance. Fail over to the largest
        # (already-scaled) reader first so the modification hits a
        # reader rather than the live writer, then scale the demoted
        # writer. Requires the Lambda timeout to cover the wait.
        rds.failover_db_cluster(
            DBClusterIdentifier=cluster['DBClusterIdentifier'],
            TargetDBInstanceIdentifier=largest['DBInstanceIdentifier']
        )
        rds.get_waiter('db_instance_available').wait(
            DBInstanceIdentifier=writer['DBInstanceIdentifier'],
            WaiterConfig={'Delay': 10, 'MaxAttempts': 12}
        )
        target = writer  # a reader now, after the failover
# Tag and modify
tag_instance_as_modifying(target['DBInstanceArn'])
rds.modify_db_instance(
DBInstanceIdentifier=target['DBInstanceIdentifier'],
DBInstanceClass=largest['DBInstanceClass'],
ApplyImmediately=True
)
return {
'status': 'scaling_continued',
'cluster': cluster['DBClusterIdentifier'],
'instance': target['DBInstanceIdentifier'],
'from_class': target['DBInstanceClass'],
'to_class': largest['DBInstanceClass']
}
def get_instance_rank(instance_class: str) -> int:
"""Get numeric rank for sorting."""
size_order = {
'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
'2xlarge': 5, '4xlarge': 6, '8xlarge': 7, '12xlarge': 8,
'16xlarge': 9, '24xlarge': 10
}
size = instance_class.split('.')[-1]
return size_order.get(size, 0)
def tag_instance_as_modifying(instance_arn: str):
"""Tag instance to prevent concurrent modifications."""
rds.add_tags_to_resource(
ResourceName=instance_arn,
Tags=[
{'Key': 'aurora-autoscaler-modifying', 'Value': 'true'},
{'Key': 'aurora-autoscaler-modification-timestamp',
'Value': datetime.now(timezone.utc).isoformat()}
]
)
def notify(message: dict):
"""Send notification to SNS topic."""
sns.publish(
TopicArn=NOTIFICATION_TOPIC,
Subject=f"Aurora Autoscaler: {message.get('status', 'update')}",
Message=json.dumps(message, indent=2)
)
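Both branches of the event handler are easy to exercise locally. The payload below is illustrative — the `Source ID` / `Event Message` keys match what the handler reads, other fields are trimmed, and the message text for RDS-EVENT-0014 is given to the best of my knowledge:

```python
import json

# Illustrative RDS event notification as delivered through SNS.
SAMPLE_EVENT = json.dumps({
    "Event Source": "db-instance",
    "Source ID": "prod-cluster-instance-1",
    "Event ID": "RDS-EVENT-0014",
    "Event Message": "Finished applying modification to DB instance class",
})

def is_completion_event(raw: str) -> bool:
    """Filter for modification-complete events, as in the handler."""
    msg = json.loads(raw)
    return "finished applying modification" in msg.get("Event Message", "").lower()

def should_continue_scaling(instance_classes) -> bool:
    """Size parity check: more than one distinct class means keep going."""
    return len(set(instance_classes)) > 1

print(is_completion_event(SAMPLE_EVENT))                              # True
print(should_continue_scaling(["db.r6g.xlarge", "db.r6g.2xlarge"]))   # True
print(should_continue_scaling(["db.r6g.2xlarge", "db.r6g.2xlarge"]))  # False
```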
Lambda Terraform Configuration
# terraform/lambda.tf
data "archive_file" "alarm_handler" {
type = "zip"
source_dir = "${path.module}/../lambda/alarm_handler"
output_path = "${path.module}/../.build/alarm_handler.zip"
}
data "archive_file" "event_handler" {
type = "zip"
source_dir = "${path.module}/../lambda/event_handler"
output_path = "${path.module}/../.build/event_handler.zip"
}
resource "aws_lambda_function" "alarm_handler" {
function_name = "aurora-autoscaler-alarm-${var.environment}"
filename = data.archive_file.alarm_handler.output_path
source_code_hash = data.archive_file.alarm_handler.output_base64sha256
handler = "handler.handler"
runtime = "python3.11"
timeout = 30
memory_size = 256
role = aws_iam_role.aurora_autoscaler.arn
environment {
variables = {
NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
ALLOWED_FAMILIES = join(",", var.allowed_instance_families)
MAX_INSTANCE_CLASS = var.max_instance_class
COOLDOWN_MINUTES = tostring(var.cooldown_minutes)
}
}
tags = var.tags
}
resource "aws_lambda_function" "event_handler" {
function_name = "aurora-autoscaler-event-${var.environment}"
filename = data.archive_file.event_handler.output_path
source_code_hash = data.archive_file.event_handler.output_base64sha256
handler = "handler.handler"
runtime = "python3.11"
  timeout          = 120  # headroom for failover coordination when the writer is scaled
memory_size = 256
role = aws_iam_role.aurora_autoscaler.arn
environment {
variables = {
NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
}
}
tags = var.tags
}
# Lambda permissions for SNS invocation
resource "aws_lambda_permission" "alarm_sns" {
statement_id = "AllowSNSInvoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.alarm_handler.function_name
principal = "sns.amazonaws.com"
source_arn = aws_sns_topic.scaling_trigger.arn
}
resource "aws_lambda_permission" "event_sns" {
statement_id = "AllowSNSInvoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.event_handler.function_name
principal = "sns.amazonaws.com"
source_arn = aws_sns_topic.event_trigger.arn
}
Variables
# terraform/variables.tf
variable "environment" {
type = string
description = "Environment name (dev, staging, prod)"
}
variable "region" {
type = string
description = "AWS region"
}
variable "cluster_identifier" {
type = string
description = "Aurora cluster identifier"
}
variable "cpu_threshold" {
type = number
default = 80
description = "CPU utilisation percentage to trigger scaling"
}
variable "cooldown_minutes" {
type = number
default = 15
description = "Minutes to wait between scaling operations"
}
variable "allowed_instance_families" {
type = list(string)
default = ["db.r6g", "db.r7g"]
description = "Allowed instance families for scaling"
}
variable "max_instance_class" {
type = string
default = "db.r6g.4xlarge"
description = "Maximum instance class to scale to"
}
variable "notification_email" {
type = string
default = ""
description = "Email address for scaling notifications"
}
variable "tags" {
type = map(string)
default = {}
description = "Tags to apply to resources"
}
Scaling Behaviour Summary
| Scenario | Action |
|---|---|
| CPU alarm fires, all instances same size | Scale random reader to next tier |
| CPU alarm fires, instances different sizes | Scale smallest reader to match largest |
| Writer is smallest instance | Scale writer (triggers automatic failover) |
| All instances at max size | High-priority notification, no scaling |
| Instance modification in progress | Skip, wait for completion |
| Within cooldown period | Skip, wait for cooldown |
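The table condenses to a pure decision function, which is convenient for unit-testing the policy offline. A sketch — the function and tuple shape are mine, not lifted from the handlers; size ranks follow the same ordering table:

```python
import random

def pick_scaling_action(instances, max_rank=6):
    """Return (action, target_name) per the behaviour table.
    `instances` is a list of (name, is_writer, size_rank) tuples;
    max_rank=6 corresponds to 4xlarge in the handlers' ordering."""
    ranks = [r for _, _, r in instances]
    smallest, largest = min(ranks), max(ranks)
    if smallest >= max_rank:
        return ("notify_max_reached", None)
    readers = [(n, r) for n, w, r in instances if not w]
    if smallest < largest:
        # Mixed sizes: bring a smallest reader up to the largest class.
        small_readers = [n for n, r in readers if r == smallest]
        if small_readers:
            return ("scale_to_largest", random.choice(small_readers))
        # No reader at the smallest size: the writer is the smallest.
        writer = next(n for n, w, _ in instances if w)
        return ("scale_writer", writer)
    # All equal: bump a random reader (or the writer) one tier.
    if readers:
        return ("scale_next_tier", random.choice([n for n, _ in readers]))
    writer = next(n for n, w, _ in instances if w)
    return ("scale_next_tier", writer)
```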
Downtime Characteristics
Reader scaling: No client-visible downtime. The reader being modified drops out of the pool for roughly 2–5 minutes (size-dependent), but the reader endpoint routes connections to the remaining readers.
Writer scaling: Requires failover. When the writer needs scaling:
- A reader is scaled first
- Failover promotes the scaled reader to writer (~10–30 seconds)
- Original writer (now reader) is scaled
With RDS Proxy in front of the cluster, observed downtime drops to 1–3 seconds for the failover.
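Either way, clients should expect a brief window of failed connections during failover and retry through it. A driver-agnostic sketch — the connect function is injected, so nothing here assumes psycopg2 or any particular library:

```python
import time

def with_failover_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `connect()` with exponential backoff, riding out the
    10-30 second window while the cluster endpoint flips writers."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...

# Demo: a fake connection that fails twice, then succeeds.
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("writer unavailable during failover")
    return "connected"

print(with_failover_retry(flaky_connect, sleep=lambda s: None))  # connected
```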
Trade-offs
Pros:
- No external dependencies beyond AWS services
- Automatic coordination prevents concurrent modifications
- Scales readers first to minimise writer disruption
- Configurable cooldown prevents thrashing
Cons:
- No downscaling – once scaled up, instances stay large
- RDS modification times can be unpredictable (5–15 minutes)
- Failover still causes brief connection drops
- CloudWatch Alarm delays add latency to scaling response
Gotchas
- RDS Proxy connection limits: If using RDS Proxy, ensure max_connections on the proxy can handle the scaled instance. Proxy doesn’t auto-adjust.
- Parameter groups: Scaling to a different instance family might require a compatible parameter group. Aurora usually handles this, but verify memory-related parameters.
- Reserved instances: Scaling to larger instances may exceed your reserved instance coverage. Monitor RI utilisation.
- Multi-AZ considerations: Ensure your VPC subnets in each AZ can accommodate the larger instance types.
- Performance Insights: Scaling resets Performance Insights history. Export metrics before scaling if you need them.
Observability
Add CloudWatch dashboards and alerts:
resource "aws_cloudwatch_dashboard" "aurora_scaling" {
dashboard_name = "aurora-autoscaling-${var.cluster_identifier}"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
width = 12
height = 6
properties = {
title = "CPU Utilisation"
region = var.region
metrics = [
["AWS/RDS", "CPUUtilization", "DBClusterIdentifier", var.cluster_identifier]
]
annotations = {
horizontal = [{
value = var.cpu_threshold
label = "Scaling threshold"
color = "#ff7f0e"
}]
}
}
},
{
type = "metric"
width = 12
height = 6
properties = {
title = "Lambda Invocations"
region = var.region
metrics = [
["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.alarm_handler.function_name],
["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.event_handler.function_name]
]
}
}
]
})
}
Downscaling (Future Work)
The current implementation only scales up. For FinOps-conscious environments, consider:
- Scheduled downscaling: Lambda triggered by EventBridge schedule during known low-traffic periods
- Metric-based downscaling: Separate alarm for sustained low CPU (<20% for 30+ minutes)
- Manual approval gate: SNS → approval workflow → Lambda execution
Downscaling is riskier – you need to ensure the smaller instance can handle baseline load before committing.
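The metric-based variant boils down to the inverse of the scale-up check; a minimal sketch of the decision, using the thresholds suggested above:

```python
def should_downscale(cpu_samples, threshold=20, sustained_periods=30):
    """True only if the last `sustained_periods` one-minute averages
    all sit below `threshold` – sustained low load, not a quiet blip."""
    if len(cpu_samples) < sustained_periods:
        return False
    return all(s < threshold for s in cpu_samples[-sustained_periods:])

quiet = [12] * 45
busy = [12] * 25 + [70] + [12] * 10
print(should_downscale(quiet))  # True
print(should_downscale(busy))   # False
```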
Conclusion
This approach leverages native AWS primitives (CloudWatch, SNS, Lambda, RDS Events) to implement vertical autoscaling without third-party dependencies. The coordination logic using tags and cooldown periods prevents race conditions and thrashing.
For workloads with predictable scaling patterns, consider pairing this reactive approach with proactive scheduled scaling. And if you’re hitting the maximum instance size regularly, it’s time to evaluate Aurora Serverless v2 or architectural changes to reduce write pressure.
Source code: the snippets above live at github.com/moabukar/blog-code/vertical-scaling-aurora; a standalone repository is planned at github.com/moabukar/aurora-vertical-autoscaling.