
Implementing Vertical Autoscaling for Aurora Databases Using Lambda Functions


AWS provides horizontal scaling for Aurora out of the box – add read replicas, distribute load, done. Vertical scaling? You’re on your own. Aurora PostgreSQL supports a single writer instance, so when that writer needs more horsepower, you can’t just throw more nodes at it.

This guide covers a production-ready implementation of vertical autoscaling for Aurora using Lambda functions, CloudWatch Alarms, SNS, and RDS Event Subscriptions. The approach minimises downtime through coordinated reader-first scaling and automated failover.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Aurora Vertical Autoscaling                         │
└─────────────────────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────────────────────────┐
│  CloudWatch  │────▶│     SNS      │────▶│        Alarm Lambda              │
│    Alarm     │     │    Topic     │     │  • Validates cluster state       │
│ (CPU > 80%)  │     │              │     │  • Tags instance as 'modifying'  │
└──────────────┘     └──────────────┘     │  • Initiates modify-db-instance  │
                                          └──────────────────────────────────┘


                                          ┌──────────────────────────────────┐
                                          │      Aurora Cluster              │
                                          │  ┌────────┐  ┌────────┐          │
                                          │  │ Writer │  │ Reader │          │
                                          │  │db.r6g. │  │db.r6g. │          │
                                          │  │xlarge  │  │xlarge  │ ◀─ Scale │
                                          │  └────────┘  └────────┘          │
                                          └──────────────────────────────────┘

                                                         │ RDS Event
                                                         │ (RDS-EVENT-0014)

┌──────────────────────────────────────────────────────────────────────────────┐
│                        RDS Event Subscription                                │
│                    (filters: modification complete)                          │
└──────────────────────────────────────────────────────────────────────────────┘


                                          ┌──────────────────────────────────┐
                                          │        Event Lambda              │
                                          │  • Removes 'modifying' tag       │
                                          │  • Checks for size parity        │
                                          │  • Scales next smallest instance │
                                          │  • Triggers failover if needed   │
                                          └──────────────────────────────────┘

Prerequisites

  • Aurora PostgreSQL or MySQL cluster with at least one reader
  • IAM permissions for Lambda to modify RDS instances and manage tags
  • SNS topic for alarm notifications
  • Terraform (or CloudFormation if you must)

Code Repository: All code from this post is available at github.com/moabukar/blog-code/vertical-scaling-aurora

Repository Structure

aurora-vertical-autoscaling/
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── lambda.tf
│   ├── cloudwatch.tf
│   ├── sns.tf
│   └── iam.tf
├── lambda/
│   ├── alarm_handler/
│   │   ├── handler.py
│   │   └── requirements.txt
│   └── event_handler/
│       ├── handler.py
│       └── requirements.txt
├── scripts/
│   └── package_lambda.sh
└── README.md

IAM Configuration

The Lambda functions need granular RDS permissions. Avoid rds:* – specify exactly what’s required.

# terraform/iam.tf

# Referenced in the policy ARNs below – define once per module.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_role" "aurora_autoscaler" {
  name               = "aurora-vertical-autoscaler-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}

data "aws_iam_policy_document" "aurora_autoscaler" {
  # RDS describe permissions
  statement {
    effect = "Allow"
    actions = [
      "rds:DescribeDBClusters",
      "rds:DescribeDBInstances",
      "rds:ListTagsForResource"
    ]
    resources = ["*"]
  }

  # RDS modify permissions – scoped to specific cluster
  statement {
    effect = "Allow"
    actions = [
      "rds:ModifyDBInstance",
      "rds:FailoverDBCluster",
      "rds:AddTagsToResource",
      "rds:RemoveTagsFromResource"
    ]
    resources = [
      "arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:cluster:${var.cluster_identifier}",
      "arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:db:${var.cluster_identifier}-*"
    ]
  }

  # CloudWatch Logs
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"]
  }

  # SNS publish for notifications
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.scaling_notifications.arn]
  }
}

resource "aws_iam_role_policy" "aurora_autoscaler" {
  name   = "aurora-autoscaler-policy"
  role   = aws_iam_role.aurora_autoscaler.id
  policy = data.aws_iam_policy_document.aurora_autoscaler.json
}

CloudWatch Alarm Configuration

CPU utilisation is the trigger here. You could substitute any CloudWatch metric – DatabaseConnections, FreeableMemory, ReadIOPS, etc.

# terraform/cloudwatch.tf

resource "aws_cloudwatch_metric_alarm" "aurora_cpu_high" {
  alarm_name          = "aurora-${var.cluster_identifier}-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Average"
  threshold           = var.cpu_threshold  # Default: 80
  alarm_description   = "CPU utilisation exceeded ${var.cpu_threshold}% for 3 consecutive minutes"

  dimensions = {
    DBClusterIdentifier = var.cluster_identifier
  }

  alarm_actions = [aws_sns_topic.scaling_trigger.arn]
  ok_actions    = []  # Optional: notify when alarm clears

  treat_missing_data = "notBreaching"

  tags = var.tags
}

Why 3 evaluation periods? Single spikes shouldn’t trigger scaling. Sustained load over 3 minutes indicates genuine capacity pressure. Adjust based on your workload characteristics.
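The effect of `evaluation_periods` can be sketched with a simplified model (this is illustrative, not CloudWatch's actual evaluation engine): with `period = 60` and `evaluation_periods = 3`, the alarm only fires once three consecutive one-minute averages breach the threshold.

```python
# Simplified model of CloudWatch alarm evaluation (illustrative only):
# the alarm transitions to ALARM when `evaluation_periods` consecutive
# datapoints breach the threshold.

def alarm_fires(datapoints: list, threshold: float, evaluation_periods: int) -> bool:
    """Return True if the series contains a long enough breaching streak."""
    streak = 0
    for value in datapoints:
        streak = streak + 1 if value > threshold else 0
        if streak >= evaluation_periods:
            return True
    return False

# A single spike does not trigger; sustained load does.
print(alarm_fires([85, 40, 45, 50], threshold=80, evaluation_periods=3))  # False
print(alarm_fires([85, 88, 91], threshold=80, evaluation_periods=3))      # True
```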

SNS Topics

Two topics: one for triggering the alarm Lambda, one for operational notifications.

# terraform/sns.tf

resource "aws_sns_topic" "scaling_trigger" {
  name = "aurora-scaling-trigger-${var.environment}"
}

resource "aws_sns_topic" "scaling_notifications" {
  name = "aurora-scaling-notifications-${var.environment}"
}

resource "aws_sns_topic_subscription" "alarm_lambda" {
  topic_arn = aws_sns_topic.scaling_trigger.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.alarm_handler.arn
}

# Optional: email notifications for scaling events
resource "aws_sns_topic_subscription" "email" {
  count     = var.notification_email != "" ? 1 : 0
  topic_arn = aws_sns_topic.scaling_notifications.arn
  protocol  = "email"
  endpoint  = var.notification_email
}

RDS Event Subscription

This triggers the Event Lambda when an instance modification completes.

# terraform/rds_events.tf

# Look up the cluster so the subscription can target its member instances.
data "aws_rds_cluster" "target" {
  cluster_identifier = var.cluster_identifier
}

resource "aws_db_event_subscription" "modification_complete" {
  name      = "aurora-modification-complete-${var.environment}"
  sns_topic = aws_sns_topic.event_trigger.arn

  source_type = "db-instance"
  source_ids  = data.aws_rds_cluster.target.cluster_members

  event_categories = ["configuration change"]

  tags = var.tags
}

resource "aws_sns_topic" "event_trigger" {
  name = "aurora-event-trigger-${var.environment}"
}

resource "aws_sns_topic_subscription" "event_lambda" {
  topic_arn = aws_sns_topic.event_trigger.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.event_handler.arn
}

Lambda Functions

Alarm Handler

This function receives the CloudWatch Alarm, validates cluster state, and initiates scaling.

# lambda/alarm_handler/handler.py

import boto3
import json
import os
import random
from datetime import datetime, timezone, timedelta
from typing import Optional

rds = boto3.client('rds')
sns = boto3.client('sns')

# Instance size ordering for comparison
INSTANCE_SIZE_ORDER = {
    'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
    '2xlarge': 5, '4xlarge': 6, '8xlarge': 7, '12xlarge': 8,
    '16xlarge': 9, '24xlarge': 10
}

# Allowed instance families for scaling (configure per cluster).
# Note: read here but not yet enforced below – wire this into
# get_next_instance_class if you need strict family validation.
ALLOWED_FAMILIES = os.environ.get('ALLOWED_FAMILIES', 'db.r6g,db.r7g').split(',')
MAX_INSTANCE_CLASS = os.environ.get('MAX_INSTANCE_CLASS', 'db.r6g.4xlarge')
COOLDOWN_MINUTES = int(os.environ.get('COOLDOWN_MINUTES', '15'))
NOTIFICATION_TOPIC = os.environ['NOTIFICATION_TOPIC_ARN']


def handler(event, context):
    """
    Handles CloudWatch Alarm via SNS.
    Validates cluster state and initiates vertical scaling if conditions are met.
    """
    try:
        # Parse SNS message
        sns_message = json.loads(event['Records'][0]['Sns']['Message'])
        alarm_name = sns_message.get('AlarmName', '')
        
        # Extract cluster identifier from alarm dimensions
        cluster_id = extract_cluster_id(sns_message)
        if not cluster_id:
            return {'statusCode': 400, 'body': 'Could not determine cluster ID'}
        
        print(f"Processing alarm for cluster: {cluster_id}")
        
        # Get cluster details
        cluster = get_cluster_details(cluster_id)
        if not cluster:
            return {'statusCode': 404, 'body': f'Cluster {cluster_id} not found'}
        
        # Validation checks
        validation_result = validate_cluster_state(cluster)
        if not validation_result['can_scale']:
            print(f"Scaling blocked: {validation_result['reason']}")
            return {'statusCode': 200, 'body': validation_result['reason']}
        
        # Execute scaling
        result = execute_scaling(cluster)
        
        # Send notification
        notify(result)
        
        return {'statusCode': 200, 'body': json.dumps(result)}
        
    except Exception as e:
        print(f"Error: {str(e)}")
        notify({'status': 'error', 'message': str(e)})
        raise


def extract_cluster_id(alarm_message: dict) -> Optional[str]:
    """Extract cluster identifier from alarm dimensions."""
    trigger = alarm_message.get('Trigger', {})
    dimensions = trigger.get('Dimensions', [])
    
    for dim in dimensions:
        if dim.get('name') == 'DBClusterIdentifier':
            return dim.get('value')
    return None


def get_cluster_details(cluster_id: str) -> Optional[dict]:
    """Fetch cluster and instance details from RDS."""
    try:
        cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
        cluster = cluster_resp['DBClusters'][0]
        
        # Get instance details
        instances = []
        for member in cluster['DBClusterMembers']:
            instance_resp = rds.describe_db_instances(
                DBInstanceIdentifier=member['DBInstanceIdentifier']
            )
            instance = instance_resp['DBInstances'][0]
            
            # Get tags
            tags_resp = rds.list_tags_for_resource(
                ResourceName=instance['DBInstanceArn']
            )
            instance['Tags'] = {t['Key']: t['Value'] for t in tags_resp['TagList']}
            instance['IsWriter'] = member['IsClusterWriter']
            instances.append(instance)
        
        cluster['Instances'] = instances
        return cluster
        
    except rds.exceptions.DBClusterNotFoundFault:
        return None


def validate_cluster_state(cluster: dict) -> dict:
    """
    Check if scaling is permitted:
    1. No instances currently being modified
    2. No instances tagged as 'modifying'
    3. Cooldown period has elapsed
    """
    instances = cluster['Instances']
    
    # Check for active modifications
    for instance in instances:
        if instance['DBInstanceStatus'] != 'available':
            return {
                'can_scale': False,
                'reason': f"Instance {instance['DBInstanceIdentifier']} is {instance['DBInstanceStatus']}"
            }
    
    # Check for modifying tag
    for instance in instances:
        if instance['Tags'].get('aurora-autoscaler-modifying') == 'true':
            return {
                'can_scale': False,
                'reason': f"Instance {instance['DBInstanceIdentifier']} has modifying tag"
            }
    
    # Check cooldown period
    latest_modification = get_latest_modification_timestamp(instances)
    if latest_modification:
        cooldown_end = latest_modification + timedelta(minutes=COOLDOWN_MINUTES)
        if datetime.now(timezone.utc) < cooldown_end:
            return {
                'can_scale': False,
                'reason': f"Cooldown period active until {cooldown_end.isoformat()}"
            }
    
    return {'can_scale': True, 'reason': None}


def get_latest_modification_timestamp(instances: list) -> Optional[datetime]:
    """Get the most recent modification timestamp from instance tags."""
    timestamps = []
    for instance in instances:
        ts_str = instance['Tags'].get('aurora-autoscaler-modification-timestamp')
        if ts_str:
            try:
                timestamps.append(datetime.fromisoformat(ts_str.replace('Z', '+00:00')))
            except ValueError:
                pass
    return max(timestamps) if timestamps else None


def execute_scaling(cluster: dict) -> dict:
    """
    Scaling algorithm:
    1. Find smallest reader instances
    2. Scale one reader to match largest instance
    3. If all instances same size, scale to next tier
    4. If writer is smallest, scale writer (triggers failover)
    """
    instances = cluster['Instances']
    readers = [i for i in instances if not i['IsWriter']]
    writer = next(i for i in instances if i['IsWriter'])
    
    # Parse instance classes
    for instance in instances:
        instance['_parsed'] = parse_instance_class(instance['DBInstanceClass'])
    
    # Sort by size
    instances_by_size = sorted(instances, key=lambda x: get_size_rank(x['_parsed']))
    smallest_size = get_size_rank(instances_by_size[0]['_parsed'])
    largest_size = get_size_rank(instances_by_size[-1]['_parsed'])
    
    # Check if at maximum
    max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
    if smallest_size >= get_size_rank(max_parsed):
        return notify_max_reached(cluster['DBClusterIdentifier'])
    
    # Determine target instance and size
    if smallest_size < largest_size:
        # Scale smallest to match largest
        target_class = instances_by_size[-1]['DBInstanceClass']
        smallest_readers = [r for r in readers if get_size_rank(r['_parsed']) == smallest_size]
        
        if smallest_readers:
            target_instance = random.choice(smallest_readers)
        else:
            # Writer is smallest – scale it
            target_instance = writer
    else:
        # All same size – scale to next tier
        target_class = get_next_instance_class(instances_by_size[0]['DBInstanceClass'])
        if not target_class:
            return notify_max_reached(cluster['DBClusterIdentifier'])
        
        if readers:
            target_instance = random.choice(readers)
        else:
            target_instance = writer
    
    # Tag and modify
    tag_instance_as_modifying(target_instance['DBInstanceArn'])
    
    rds.modify_db_instance(
        DBInstanceIdentifier=target_instance['DBInstanceIdentifier'],
        DBInstanceClass=target_class,
        ApplyImmediately=True
    )
    
    return {
        'status': 'scaling_initiated',
        'cluster': cluster['DBClusterIdentifier'],
        'instance': target_instance['DBInstanceIdentifier'],
        'from_class': target_instance['DBInstanceClass'],
        'to_class': target_class
    }


def parse_instance_class(instance_class: str) -> dict:
    """Parse db.r6g.xlarge into components."""
    parts = instance_class.split('.')
    return {
        'prefix': parts[0],
        'family': parts[1],
        'size': parts[2] if len(parts) > 2 else 'medium'
    }


def get_size_rank(parsed: dict) -> int:
    """Get numeric rank for instance size."""
    return INSTANCE_SIZE_ORDER.get(parsed['size'], 0)


def get_next_instance_class(current_class: str) -> Optional[str]:
    """Get the next larger instance class."""
    parsed = parse_instance_class(current_class)
    sizes = list(INSTANCE_SIZE_ORDER.keys())
    current_idx = sizes.index(parsed['size'])
    
    if current_idx >= len(sizes) - 1:
        return None
    
    next_size = sizes[current_idx + 1]
    next_class = f"{parsed['prefix']}.{parsed['family']}.{next_size}"
    
    # Validate against max
    max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
    if get_size_rank({'size': next_size}) > get_size_rank(max_parsed):
        return None
    
    return next_class


def tag_instance_as_modifying(instance_arn: str):
    """Tag instance to prevent concurrent modifications."""
    rds.add_tags_to_resource(
        ResourceName=instance_arn,
        Tags=[
            {'Key': 'aurora-autoscaler-modifying', 'Value': 'true'},
            {'Key': 'aurora-autoscaler-modification-timestamp', 
             'Value': datetime.now(timezone.utc).isoformat()}
        ]
    )


def notify_max_reached(cluster_id: str) -> dict:
    """Send high-priority notification when max size reached."""
    message = {
        'status': 'max_size_reached',
        'cluster': cluster_id,
        'message': f"Cluster {cluster_id} has reached maximum instance size {MAX_INSTANCE_CLASS}",
        'priority': 'high'
    }
    notify(message)
    return message


def notify(message: dict):
    """Send notification to SNS topic."""
    sns.publish(
        TopicArn=NOTIFICATION_TOPIC,
        Subject=f"Aurora Autoscaler: {message.get('status', 'update')}",
        Message=json.dumps(message, indent=2)
    )
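For local testing, the handler's entry parsing can be exercised against a synthetic SNS envelope. The payload below mirrors the shape CloudWatch publishes to SNS (note the lowercase `name`/`value` keys in `Trigger.Dimensions`); the cluster name is a placeholder:

```python
import json
from typing import Optional

def extract_cluster_id(alarm_message: dict) -> Optional[str]:
    """Same logic as the handler: pull DBClusterIdentifier from alarm dimensions."""
    for dim in alarm_message.get('Trigger', {}).get('Dimensions', []):
        if dim.get('name') == 'DBClusterIdentifier':
            return dim.get('value')
    return None

# Synthetic Lambda event mimicking SNS -> CloudWatch alarm delivery
event = {
    'Records': [{
        'Sns': {
            'Message': json.dumps({
                'AlarmName': 'aurora-demo-cluster-cpu-high',
                'Trigger': {
                    'MetricName': 'CPUUtilization',
                    'Dimensions': [
                        {'name': 'DBClusterIdentifier', 'value': 'demo-cluster'}
                    ]
                }
            })
        }
    }]
}

sns_message = json.loads(event['Records'][0]['Sns']['Message'])
print(extract_cluster_id(sns_message))  # demo-cluster
```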

Event Handler

This function processes RDS modification completion events and continues the scaling chain.

# lambda/event_handler/handler.py

import boto3
import json
import os
import random
from datetime import datetime, timezone

rds = boto3.client('rds')
sns = boto3.client('sns')

NOTIFICATION_TOPIC = os.environ['NOTIFICATION_TOPIC_ARN']


def handler(event, context):
    """
    Handles RDS Event Subscription notifications (modification complete).
    Removes modifying tag and continues scaling if needed.
    """
    try:
        # Parse SNS message from RDS Event Subscription
        sns_message = json.loads(event['Records'][0]['Sns']['Message'])
        
        # RDS events have different structure
        source_id = sns_message.get('Source ID')
        event_message = sns_message.get('Event Message', '')
        
        # Only process completion events
        if 'has been modified' not in event_message.lower():
            print(f"Ignoring event: {event_message}")
            return {'statusCode': 200, 'body': 'Ignored non-completion event'}
        
        print(f"Processing completion for instance: {source_id}")
        
        # Get instance details
        instance_resp = rds.describe_db_instances(DBInstanceIdentifier=source_id)
        instance = instance_resp['DBInstances'][0]
        cluster_id = instance['DBClusterIdentifier']
        
        # Get cluster details
        cluster = get_cluster_details(cluster_id)
        
        # Remove modifying tag
        remove_modifying_tag(instance)
        
        # Check if more scaling needed
        if should_continue_scaling(cluster):
            result = continue_scaling(cluster)
        else:
            result = {
                'status': 'scaling_complete',
                'cluster': cluster_id,
                'message': 'All instances are now the same size'
            }
        
        notify(result)
        return {'statusCode': 200, 'body': json.dumps(result)}
        
    except Exception as e:
        print(f"Error: {str(e)}")
        notify({'status': 'error', 'message': str(e)})
        raise


def get_cluster_details(cluster_id: str) -> dict:
    """Fetch cluster and instance details."""
    cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
    cluster = cluster_resp['DBClusters'][0]
    
    instances = []
    for member in cluster['DBClusterMembers']:
        instance_resp = rds.describe_db_instances(
            DBInstanceIdentifier=member['DBInstanceIdentifier']
        )
        instance = instance_resp['DBInstances'][0]
        instance['IsWriter'] = member['IsClusterWriter']
        
        tags_resp = rds.list_tags_for_resource(ResourceName=instance['DBInstanceArn'])
        instance['Tags'] = {t['Key']: t['Value'] for t in tags_resp['TagList']}
        
        instances.append(instance)
    
    cluster['Instances'] = instances
    return cluster


def remove_modifying_tag(instance: dict):
    """Remove the modifying tag from an instance."""
    rds.remove_tags_from_resource(
        ResourceName=instance['DBInstanceArn'],
        TagKeys=['aurora-autoscaler-modifying']
    )
    print(f"Removed modifying tag from {instance['DBInstanceIdentifier']}")


def should_continue_scaling(cluster: dict) -> bool:
    """Check if instances still need equalisation."""
    classes = set(i['DBInstanceClass'] for i in cluster['Instances'])
    return len(classes) > 1


def continue_scaling(cluster: dict) -> dict:
    """Scale the next smallest reader to match the largest instance."""
    instances = cluster['Instances']
    readers = [i for i in instances if not i['IsWriter']]
    writer = next(i for i in instances if i['IsWriter'])
    
    # Find smallest and largest
    instances_by_size = sorted(instances, key=lambda x: get_instance_rank(x['DBInstanceClass']))
    smallest = instances_by_size[0]
    largest = instances_by_size[-1]
    
    # Prefer readers for scaling
    smallest_class = smallest['DBInstanceClass']
    smallest_readers = [r for r in readers if r['DBInstanceClass'] == smallest_class]
    
    if smallest_readers:
        target = random.choice(smallest_readers)
    else:
        # Writer is smallest
        target = writer
    
    # Tag and modify
    tag_instance_as_modifying(target['DBInstanceArn'])
    
    rds.modify_db_instance(
        DBInstanceIdentifier=target['DBInstanceIdentifier'],
        DBInstanceClass=largest['DBInstanceClass'],
        ApplyImmediately=True
    )
    
    return {
        'status': 'scaling_continued',
        'cluster': cluster['DBClusterIdentifier'],
        'instance': target['DBInstanceIdentifier'],
        'from_class': target['DBInstanceClass'],
        'to_class': largest['DBInstanceClass']
    }


def get_instance_rank(instance_class: str) -> int:
    """Get numeric rank for sorting."""
    size_order = {
        'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
        '2xlarge': 5, '4xlarge': 6, '8xlarge': 7, '12xlarge': 8,
        '16xlarge': 9, '24xlarge': 10
    }
    size = instance_class.split('.')[-1]
    return size_order.get(size, 0)


def tag_instance_as_modifying(instance_arn: str):
    """Tag instance to prevent concurrent modifications."""
    rds.add_tags_to_resource(
        ResourceName=instance_arn,
        Tags=[
            {'Key': 'aurora-autoscaler-modifying', 'Value': 'true'},
            {'Key': 'aurora-autoscaler-modification-timestamp',
             'Value': datetime.now(timezone.utc).isoformat()}
        ]
    )


def notify(message: dict):
    """Send notification to SNS topic."""
    sns.publish(
        TopicArn=NOTIFICATION_TOPIC,
        Subject=f"Aurora Autoscaler: {message.get('status', 'update')}",
        Message=json.dumps(message, indent=2)
    )
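Unlike the alarm JSON, the RDS event payload uses space-separated keys ('Source ID', 'Event Message'). The exact message text varies by engine and event type – the strings below are illustrative and only need to match the handler's substring filter; the instance name is a placeholder:

```python
def is_completion_event(sns_payload: dict) -> bool:
    """Mirror the handler's filter: only act on modification-complete messages."""
    return 'has been modified' in sns_payload.get('Event Message', '').lower()

finished = {'Source ID': 'demo-cluster-instance-1',
            'Event Message': 'DB instance has been modified'}
in_progress = {'Source ID': 'demo-cluster-instance-1',
               'Event Message': 'Applying modification to database instance class'}

print(is_completion_event(finished))     # True
print(is_completion_event(in_progress))  # False
```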

Lambda Terraform Configuration

# terraform/lambda.tf

data "archive_file" "alarm_handler" {
  type        = "zip"
  source_dir  = "${path.module}/../lambda/alarm_handler"
  output_path = "${path.module}/../.build/alarm_handler.zip"
}

data "archive_file" "event_handler" {
  type        = "zip"
  source_dir  = "${path.module}/../lambda/event_handler"
  output_path = "${path.module}/../.build/event_handler.zip"
}

resource "aws_lambda_function" "alarm_handler" {
  function_name    = "aurora-autoscaler-alarm-${var.environment}"
  filename         = data.archive_file.alarm_handler.output_path
  source_code_hash = data.archive_file.alarm_handler.output_base64sha256
  handler          = "handler.handler"
  runtime          = "python3.11"
  timeout          = 30
  memory_size      = 256

  role = aws_iam_role.aurora_autoscaler.arn

  environment {
    variables = {
      NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
      ALLOWED_FAMILIES       = join(",", var.allowed_instance_families)
      MAX_INSTANCE_CLASS     = var.max_instance_class
      COOLDOWN_MINUTES       = tostring(var.cooldown_minutes)
    }
  }

  tags = var.tags
}

resource "aws_lambda_function" "event_handler" {
  function_name    = "aurora-autoscaler-event-${var.environment}"
  filename         = data.archive_file.event_handler.output_path
  source_code_hash = data.archive_file.event_handler.output_base64sha256
  handler          = "handler.handler"
  runtime          = "python3.11"
  timeout          = 30
  memory_size      = 256

  role = aws_iam_role.aurora_autoscaler.arn

  environment {
    variables = {
      NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
    }
  }

  tags = var.tags
}

# Lambda permissions for SNS invocation
resource "aws_lambda_permission" "alarm_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.alarm_handler.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.scaling_trigger.arn
}

resource "aws_lambda_permission" "event_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.event_handler.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.event_trigger.arn
}

Variables

# terraform/variables.tf

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
}

variable "region" {
  type        = string
  description = "AWS region"
}

variable "cluster_identifier" {
  type        = string
  description = "Aurora cluster identifier"
}

variable "cpu_threshold" {
  type        = number
  default     = 80
  description = "CPU utilisation percentage to trigger scaling"
}

variable "cooldown_minutes" {
  type        = number
  default     = 15
  description = "Minutes to wait between scaling operations"
}

variable "allowed_instance_families" {
  type        = list(string)
  default     = ["db.r6g", "db.r7g"]
  description = "Allowed instance families for scaling"
}

variable "max_instance_class" {
  type        = string
  default     = "db.r6g.4xlarge"
  description = "Maximum instance class to scale to"
}

variable "notification_email" {
  type        = string
  default     = ""
  description = "Email address for scaling notifications"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Tags to apply to resources"
}

Scaling Behaviour Summary

CPU alarm fires, all instances same size     → Scale a random reader to the next tier
CPU alarm fires, instances different sizes   → Scale the smallest reader to match the largest
Writer is the smallest instance              → Scale the writer (triggers automatic failover)
All instances at maximum size                → High-priority notification, no scaling
Instance modification in progress            → Skip and wait for completion
Within cooldown period                       → Skip and wait for cooldown
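The decision table above condenses to a pure function that is easy to unit test, using the same size ranking as the handlers (the instance names and the truncated size table are illustrative):

```python
from __future__ import annotations

# Truncated size ladder, same ordering convention as the handlers
SIZE_ORDER = {'small': 1, 'medium': 2, 'large': 3, 'xlarge': 4,
              '2xlarge': 5, '4xlarge': 6, '8xlarge': 7}

def rank(instance_class: str) -> int:
    """Numeric rank of the size suffix, e.g. 'db.r6g.xlarge' -> 4."""
    return SIZE_ORDER.get(instance_class.split('.')[-1], 0)

def pick_target(instances: list[dict], max_class: str) -> dict | None:
    """Return {'id', 'to_class'} for the next scaling step, or None if capped.

    Mirrors the handler logic: equalise smallest-to-largest first,
    preferring readers; if all sizes match, step one tier up.
    """
    ordered = sorted(instances, key=lambda i: rank(i['class']))
    smallest, largest = ordered[0], ordered[-1]
    if rank(smallest['class']) >= rank(max_class):
        return None  # already at the configured ceiling
    if rank(smallest['class']) < rank(largest['class']):
        readers = [i for i in ordered
                   if not i['writer'] and i['class'] == smallest['class']]
        target = readers[0] if readers else smallest  # smallest is the writer
        return {'id': target['id'], 'to_class': largest['class']}
    # All the same size: scale a reader (or the lone writer) to the next tier
    sizes = list(SIZE_ORDER)
    prefix = smallest['class'].rsplit('.', 1)[0]
    next_size = sizes[sizes.index(smallest['class'].split('.')[-1]) + 1]
    readers = [i for i in ordered if not i['writer']]
    target = readers[0] if readers else smallest
    return {'id': target['id'], 'to_class': f'{prefix}.{next_size}'}

fleet = [{'id': 'w1', 'class': 'db.r6g.xlarge', 'writer': True},
         {'id': 'r1', 'class': 'db.r6g.xlarge', 'writer': False}]
print(pick_target(fleet, 'db.r6g.4xlarge'))
# {'id': 'r1', 'to_class': 'db.r6g.2xlarge'}
```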

Downtime Characteristics

Reader scaling: No writer downtime. The reader being modified is unavailable for ~2–5 minutes (depending on instance size), but read traffic routes to the remaining readers.

Writer scaling: Requires failover. When the writer needs scaling:

  1. A reader is scaled first
  2. Failover promotes the scaled reader to writer (~10–30 seconds)
  3. Original writer (now reader) is scaled

With RDS Proxy in front of the cluster, observed downtime drops to 1–3 seconds for the failover.

Trade-offs

Pros:

  • No external dependencies beyond AWS services
  • Automatic coordination prevents concurrent modifications
  • Scales readers first to minimise writer disruption
  • Configurable cooldown prevents thrashing

Cons:

  • No downscaling – once scaled up, instances stay large
  • RDS modification times can be unpredictable (5–15 minutes)
  • Failover still causes brief connection drops
  • CloudWatch Alarm delays add latency to scaling response

Gotchas

  1. RDS Proxy connection limits: If using RDS Proxy, ensure max_connections on the proxy can handle the scaled instance. Proxy doesn’t auto-adjust.

  2. Parameter groups: Scaling to a different instance family might require a compatible parameter group. Aurora usually handles this, but verify memory-related parameters.

  3. Reserved instances: Scaling to larger instances may exceed your reserved instance coverage. Monitor RI utilisation.

  4. Multi-AZ considerations: Ensure your VPC subnets in each AZ can accommodate the larger instance types.

  5. Performance Insights: Scaling resets Performance Insights history. Export metrics before scaling if you need them.

Observability

Add CloudWatch dashboards and alerts:

resource "aws_cloudwatch_dashboard" "aurora_scaling" {
  dashboard_name = "aurora-autoscaling-${var.cluster_identifier}"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        width  = 12
        height = 6
        properties = {
          title  = "CPU Utilisation"
          region = var.region
          metrics = [
            ["AWS/RDS", "CPUUtilization", "DBClusterIdentifier", var.cluster_identifier]
          ]
          annotations = {
            horizontal = [{
              value = var.cpu_threshold
              label = "Scaling threshold"
              color = "#ff7f0e"
            }]
          }
        }
      },
      {
        type   = "metric"
        width  = 12
        height = 6
        properties = {
          title  = "Lambda Invocations"
          region = var.region
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.alarm_handler.function_name],
            ["AWS/Lambda", "Invocations", "FunctionName", aws_lambda_function.event_handler.function_name]
          ]
        }
      }
    ]
  })
}

Downscaling (Future Work)

The current implementation only scales up. For FinOps-conscious environments, consider:

  1. Scheduled downscaling: Lambda triggered by EventBridge schedule during known low-traffic periods
  2. Metric-based downscaling: Separate alarm for sustained low CPU (<20% for 30+ minutes)
  3. Manual approval gate: SNS → approval workflow → Lambda execution

Downscaling is riskier – you need to ensure the smaller instance can handle baseline load before committing.
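A metric-based downscale guard could start as a pure check over recent CPU datapoints: only downscale when a full window of samples stays under a conservative floor. The thresholds below are illustrative, not recommendations:

```python
def safe_to_downscale(cpu_samples: list, floor: float = 20.0,
                      min_samples: int = 30) -> bool:
    """True only when a complete window of one-minute samples stays under the floor.

    An incomplete window (e.g. after a recent failover) blocks downscaling.
    """
    return len(cpu_samples) >= min_samples and max(cpu_samples) < floor

print(safe_to_downscale([12.0] * 30))           # True
print(safe_to_downscale([12.0] * 29 + [45.0]))  # False – one hot sample
print(safe_to_downscale([12.0] * 10))           # False – window too short
```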

Conclusion

This approach leverages native AWS primitives (CloudWatch, SNS, Lambda, RDS Events) to implement vertical autoscaling without third-party dependencies. The coordination logic using tags and cooldown periods prevents race conditions and thrashing.

For workloads with predictable scaling patterns, consider pairing this reactive approach with proactive scheduled scaling. And if you’re hitting the maximum instance size regularly, it’s time to evaluate Aurora Serverless v2 or architectural changes to reduce write pressure.

Source code COMING SOON at: github.com/moabukar/aurora-vertical-autoscaling

