How to configure a Custom Termination Policy in an Auto Scaling group

A guide on how to create an Auto Scaling group configured with a custom termination policy using Terraform.

Custom termination policies are an exciting feature launched for EC2 Auto Scaling groups in 2021. The feature itself is described in an earlier post. This post is a step-by-step guide to configuring a custom termination policy in an Auto Scaling group.

First of all, you will need an Auto Scaling group. You can follow this blog post to learn how to configure an Auto Scaling group with Terraform. Here, I will primarily focus on the Lambda function.

TL;DR: you can find the code for this blog here.

Custom Termination Policy Lambda

Let us start by creating a Lambda function, which will serve as our custom termination policy. The input to the function will look something like this:

{
  "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1::autoScalingGroup:d4738357-2d40-4038-ae7e-b00ae0227003:autoScalingGroupName/my-asg",
  "AutoScalingGroupName": "my-asg",
  "CapacityToTerminate": [
    {
      "AvailabilityZone": "us-east-1c",
      "Capacity": 3,
      "InstanceMarketOption": "OnDemand"
    }
  ],
  "Instances": [
    {
      "AvailabilityZone": "us-east-1c",
      "InstanceId": "i-02e1c69383a3ed501",
      "InstanceType": "t2.nano",
      "InstanceMarketOption": "OnDemand"
    },
    {
      "AvailabilityZone": "us-east-1c",
      "InstanceId": "i-036bc44b6092c01c7",
      "InstanceType": "t2.nano",
      "InstanceMarketOption": "OnDemand"
    }
  ],
  "Cause": "SCALE_IN"
}

The expected response from the Lambda function should look something like this:

{
  "InstanceIDs": [
    "i-02e1c69383a3ed501",
    "i-036bc44b6092c01c7"
  ]
}

For this example, we will create a minimal function that returns all the instance IDs from the input. Here is example code to do this in Node.js.

exports.handler = function(event, context, callback) {
  console.log('lambda input: ', JSON.stringify(event))
  // Return every candidate instance for termination.
  const instanceIds = event.Instances.map(i => i.InstanceId)
  const response = {
    InstanceIDs: instanceIds
  }
  console.log('lambda output: ', response)
  callback(null, response)
}

Creating infrastructure using Terraform

Next up, we will deploy this Lambda function using Terraform. There are two parts to this: the Lambda function definition and our code. First, we will create a .zip file containing our code, using the archive_file data source in Terraform.

data "archive_file" "ctp_function" {
  type        = "zip"
  source_dir  = "${path.module}/lambda/"
  output_path = "${path.module}/output/lambda.zip"
}

Now we can create the Lambda function. Add this to your main.tf file:

resource "aws_cloudwatch_log_group" "example" {
  name              = "/aws/lambda/${local.ctp_lambda_name}"
  retention_in_days = 14
}

resource "aws_iam_policy" "lambda_logging" {
  name        = "lambda_logging"
  path        = "/"
  description = "IAM policy for logging from a lambda"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "lambda_logs" {
  role       = aws_iam_role.ctp_role.name
  policy_arn = aws_iam_policy.lambda_logging.arn
}

resource "aws_iam_role" "ctp_role" {
  name = "custom-termination-policy-lambda-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_lambda_function" "custom_termination_policy_lambda" {
  filename          = data.archive_file.ctp_function.output_path
  function_name     = local.ctp_lambda_name
  role              = aws_iam_role.ctp_role.arn
  handler           = "index.handler"
  runtime           = "nodejs12.x"
  source_code_hash  = data.archive_file.ctp_function.output_base64sha256
  tags              = local.tags

  depends_on = [
    aws_iam_role_policy_attachment.lambda_logs,
    aws_cloudwatch_log_group.example,
  ]
}

This code creates the execution role for the Lambda function and the function itself. We also need to allow Auto Scaling to invoke the function. Auto Scaling uses its service-linked role (SLR), AWSServiceRoleForAutoScaling, to get permission to invoke the Lambda function, so you must grant that role lambda:InvokeFunction permission. Add this to your Terraform file:

data "aws_caller_identity" "current" {}

resource "aws_lambda_permission" "allow_autoscaling" {
  statement_id  = "AllowExecutionFromAutoScaling"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.custom_termination_policy_lambda.function_name
  principal     = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"

  depends_on = [
    aws_lambda_function.custom_termination_policy_lambda,
  ]
}

Configuring an Auto Scaling Group

I have a detailed guide on creating an Auto Scaling group using Terraform, so I will not repeat myself here; please refer to that guide to understand what is going on. You can simply add the following code to your main.tf file:

module "auto-scaling-group-demo" {
  source  = "terraform-aws-modules/autoscaling/aws"

  name = "external-${local.name}"

  vpc_zone_identifier = module.vpc.private_subnets
  min_size            = 0
  max_size            = 1
  desired_capacity    = 1

  create_launch_template  = false
  launch_template         = aws_launch_template.this.name
  user_data               = base64encode(local.user_data)

  termination_policies = [
    aws_lambda_function.custom_termination_policy_lambda.arn
  ]

  tags = local.tags

  depends_on = [
    aws_launch_template.this,
    aws_lambda_function.custom_termination_policy_lambda,
  ]
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"

  name = local.name
  cidr = "10.99.0.0/18"

  azs             = ["${local.region}a", "${local.region}b", "${local.region}c"]
  public_subnets  = ["10.99.0.0/24", "10.99.1.0/24", "10.99.2.0/24"]
  private_subnets = ["10.99.3.0/24", "10.99.4.0/24", "10.99.5.0/24"]

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = local.tags
}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name = "name"

    values = [
      "amzn-ami-hvm-*-x86_64-gp2",
    ]
  }
}

resource "aws_launch_template" "this" {
  name_prefix   = "${local.name}-launch-template"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true
  }
}

Testing

To make things easier, we will export the group name from the Terraform template. Create a file named outputs.tf and add the following:

output "autoscaling_group_name" {
  value = module.auto-scaling-group-demo.autoscaling_group_name
}

To test that the custom termination policy works, we will set the desired capacity of the group down to 0. You can use the CLI, the console, or Terraform to do this. To do it via the CLI, run the following command:

aws autoscaling set-desired-capacity --auto-scaling-group-name <auto-scaling-group-name> --desired-capacity 0

Next, look at the group's scaling activities to validate that the instance is being terminated:

aws autoscaling describe-scaling-activities --auto-scaling-group-name <auto-scaling-group-name>

The description of the most recent activity should say something like Terminating EC2 instance: <instance-id>.

Cleanup

Lastly, run terraform destroy to clean up all the resources you have created. You can find the source code for this demo in my git repository.

Debugging

This is potentially the hardest part of integrating with this feature. There is a general lack of visibility into how the Lambda response is handled. Here are a few things you can do:

  1. The first place to look for anything wrong with Auto Scaling groups is the describe-scaling-activities call. If there is an issue invoking the Lambda function, or decoding its response, you will see an activity like Could not terminate instances due to invalid response from custom termination policy. Status Reason:.... If it is a permissions issue, ensure aws_lambda_permission is correctly set up to allow the Auto Scaling SLR to invoke the Lambda function.
  2. If Auto Scaling is unable to decode the response from the Lambda function, look at the function's logs to see which instances are being returned.
  3. You can also test the Lambda function via the console to ensure the response is in the correct format. Any deviation from the specified format will result in a Customer's response from lambda was not decodable error. Simply paste the sample input from the Custom Termination Policy Lambda section above into the event JSON block of the function in the console. This invocation will show the exact output that will be returned to Auto Scaling. Make sure your handler returns only the expected fields and does not add any additional information to the response body.
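
On that last point, you can sanity-check the response shape locally before deploying. The helper below (isValidResponse is a hypothetical name of my own, not part of any AWS SDK) accepts only a response whose single key is InstanceIDs holding instance ID strings, which matches the format shown earlier:

```javascript
// Checks that a termination-policy response matches the shape Auto Scaling
// expects: exactly one key, InstanceIDs, holding an array of instance ID strings.
// NOTE: isValidResponse is an illustrative helper, not an AWS API.
function isValidResponse(response) {
  if (typeof response !== 'object' || response === null) return false
  const keys = Object.keys(response)
  if (keys.length !== 1 || keys[0] !== 'InstanceIDs') return false
  return Array.isArray(response.InstanceIDs) &&
    response.InstanceIDs.every(id => typeof id === 'string' && id.startsWith('i-'))
}

// Example: an extra field is the kind of deviation that triggers the
// "not decodable" error above.
console.log(isValidResponse({ InstanceIDs: ['i-02e1c69383a3ed501'] }))        // true
console.log(isValidResponse({ InstanceIDs: ['i-02e1c69383a3ed501'], ok: 1 })) // false
```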

Conclusion

Custom termination policies are a great tool for gaining more fine-grained control over which instances Auto Scaling chooses for termination. This blog should give you a good starting point for playing around with the feature. Let me know if you would like more how-to blogs or more in-depth posts about this feature.