Controlling which instance to terminate in an Auto Scaling Group

Custom termination policies allow you to return a list of instances for EC2 Auto Scaling to choose from for termination.

EC2 Auto Scaling uses termination policies to decide which instance to terminate during scale-in. There is a set of predefined policies, such as OldestInstance and ClosestToNextInstanceHour, which use metadata available to AWS to select the instance for termination. However, you often have a better idea of which instance to terminate based on factors other than instance metadata, such as the number of open tasks on an instance. Or you may want to block terminations completely, e.g. in anticipation of a large influx of requests. This is where custom termination policies, a feature of EC2 Auto Scaling, come in handy.

You can specify a Lambda function in the list of termination policies given to an EC2 Auto Scaling group (ASG). ASG invokes this Lambda function when it determines the need to terminate instances, and it only terminates instances that the function returns. In this post, I will talk about this feature and how to configure it.

How do Custom Termination Policies work?

At a high level, the flow for groups configured with custom termination policies is very straightforward. ASG invokes the specified Lambda function when it determines the need to terminate instances. It then selects enough instances from the list returned in the Lambda response to reach the desired capacity.
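
To make the flow concrete, here is a minimal sketch of such a Lambda function. It assumes the event carries an Instances list and a CapacityToTerminate list keyed by Availability Zone and market option, and that the response is an object with an InstanceIDs list; verify these field names against the current AWS documentation before relying on them. The get_open_task_count helper is hypothetical and stands in for whatever your scheduler exposes.

```python
from collections import defaultdict


def get_open_task_count(instance_id):
    """Hypothetical lookup of open tasks on an instance; replace with your scheduler's API."""
    return 0


def select_instances(event, task_count=get_open_task_count):
    # How many instances ASG wants to terminate per (AZ, market option).
    wanted = defaultdict(int)
    for entry in event.get("CapacityToTerminate", []):
        key = (entry["AvailabilityZone"], entry["InstanceMarketOption"])
        wanted[key] += entry["Capacity"]

    # Group the candidate instances the same way.
    candidates = defaultdict(list)
    for instance in event.get("Instances", []):
        key = (instance["AvailabilityZone"], instance["InstanceMarketOption"])
        candidates[key].append(instance["InstanceId"])

    selected = []
    for key, count in wanted.items():
        # Prefer the instances with the fewest open tasks.
        ranked = sorted(candidates[key], key=task_count)
        selected.extend(ranked[:count])

    return {"InstanceIDs": selected}


def lambda_handler(event, context):
    # Entry point invoked by Lambda; delegates to the selection logic above.
    return select_instances(event)
```

Keeping the selection logic in its own function with an injectable task_count makes it easy to unit test without deploying anything.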

What happens when?

  • The Lambda function returns fewer instances than needed - ASG terminates all instances in the returned list, and keeps invoking the function to get more instances.
  • The Lambda function returns more instances than needed - ASG terminates only enough instances to reach the desired capacity. It orders the returned instances to keep terminations balanced across Availability Zones, and breaks ties using the other termination policies specified for the group.
  • The Lambda function returns an invalid or empty response, or does not return at all - ASG does not terminate anything. Effectively, all terminations are blocked until the custom termination policy is fixed.
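
The last case can also be used deliberately to block terminations. The sketch below returns an empty list during a freeze window and offers every candidate otherwise. The window dates are made up for illustration, and the InstanceIDs and Instances field names are assumptions to check against the AWS documentation.

```python
from datetime import datetime, timezone

# Hypothetical freeze window (UTC) during which nothing should be terminated.
FREEZE_START = datetime(2022, 11, 25, 0, 0, tzinfo=timezone.utc)
FREEZE_END = datetime(2022, 11, 26, 0, 0, tzinfo=timezone.utc)


def in_freeze_window(now):
    return FREEZE_START <= now < FREEZE_END


def lambda_handler(event, context, now=None):
    # 'now' is injectable for testing; Lambda itself only passes event and context.
    now = now or datetime.now(timezone.utc)
    if in_freeze_window(now):
        # An empty list means ASG has nothing it is allowed to terminate.
        return {"InstanceIDs": []}
    # Outside the window, let ASG pick from every candidate it offered.
    return {"InstanceIDs": [i["InstanceId"] for i in event.get("Instances", [])]}
```

Keep in mind that while terminations are blocked, ASG keeps invoking the function, so the window should be as short as you can make it.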

Is custom termination policy invoked for all terminations?

Not quite. The idea is that the policy is only invoked for terminations where you have a real choice between instances to terminate. That is not the case for terminations due to failed health checks and Spot rebalance notifications. The custom termination policy is invoked for all other terminations, including scale-in and instance refresh.

Configuring a Custom Termination Policy

You only need to specify the Lambda function that you want to use as a custom termination policy in the list of termination policies in the create or update Auto Scaling group call. This is an example input to the create/update Auto Scaling group call.
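
As a rough sketch, this is how the input could be built and sent with boto3. The group name and function ARN in the usage line are placeholders. Listing the Lambda ARN first, ahead of any fallback policies, is my understanding of what AWS requires, so treat it as an assumption to verify.

```python
def build_update_request(group_name, function_arn, fallback_policies=("OldestInstance",)):
    """Build the keyword arguments for update_auto_scaling_group."""
    return {
        "AutoScalingGroupName": group_name,
        # Assumption: the Lambda ARN goes first, followed by regular policies
        # that ASG can use to break ties.
        "TerminationPolicies": [function_arn, *fallback_policies],
    }


def apply_termination_policies(request):
    # Imported lazily so the request can be built and inspected without the SDK.
    import boto3

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(**request)
```

For example, apply_termination_policies(build_update_request("test-group", "arn:aws:lambda:us-east-1:123456789012:function:my-termination-policy:prod")) would update a group named test-group, assuming that function and alias exist in your account.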

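Auto Scaling uses its service-linked role to invoke this Lambda function. For this reason, it is vital that you allow AWSServiceRoleForAutoScaling to perform lambda:InvokeFunction on the function. Since you cannot edit the service-linked role itself, you grant this through a resource-based policy on the Lambda function. This is detailed in this guide.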
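
One way to grant this, sketched with boto3, is to add a resource-based policy statement on the function that names the service-linked role as the principal. The statement ID is arbitrary, the function name and account ID are placeholders, and the role ARN follows the usual path for service-linked roles; double-check it against the role in your own account.

```python
def build_invoke_permission(function_name, account_id):
    """Build the keyword arguments for Lambda's add_permission call."""
    role_arn = (
        f"arn:aws:iam::{account_id}:role/aws-service-role/"
        "autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
    )
    return {
        "FunctionName": function_name,
        "StatementId": "AllowAutoScalingInvoke",  # arbitrary label for this statement
        "Action": "lambda:InvokeFunction",
        "Principal": role_arn,
    }


def grant_invoke_permission(permission):
    # Imported lazily so the arguments can be built without the SDK installed.
    import boto3

    boto3.client("lambda").add_permission(**permission)
```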

Things to keep in mind when using this feature

  • Possibly the most important thing is that terminations in the group stop if the custom termination policy does not return a valid response. Auto Scaling keeps invoking the Lambda function periodically until it gets a valid response. In this scenario, you pay both for instances running longer than needed and for the Lambda invocations. You should monitor for this and fix the custom termination policy as soon as you can. When your group is in this state, you will see activities with a Cancelled status in the describe-scaling-activities output. An example activity looks like this:
    {
        "ActivityId": "7345fd80-c9da-9756-3a65-2f5026eab029",
        "AutoScalingGroupName": "test-group",
        "Description": "Could not terminate instances due to invalid response from custom termination policy.  Status Reason: Customer's response from lambda was not decodable",
        "Cause": "At 2022-03-06T17:06:21Z a user request update of AutoScalingGroup constraints to min: 0, max: 1, desired: 0 changing the desired capacity from 1 to 0.",
        "StartTime": "2022-03-06T17:06:34.277000+00:00",
        "EndTime": "2022-03-06T17:06:34+00:00",
        "StatusCode": "Cancelled",
        "StatusMessage": "Customer's response from lambda was not decodable",
        "Progress": 100,
        "Details": "{}",
        "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:<account-id>:autoScalingGroup:1f7d1c9a-4260-4dbf-acb3-fe71f98164b9:autoScalingGroupName/test-group"
    }
  • Returning an instance in the custom termination policy response does not guarantee that the instance is terminated. That means if your function marks the instance as unusable in your scheduler, the instance may keep running without doing any work.
  • For groups with multiple instance market options, Auto Scaling only terminates the number of instances specified in the CapacityToTerminate list for each market option. E.g. if Auto Scaling determines that it needs to terminate 5 On-Demand and 5 Spot instances in a given AZ, returning 10 On-Demand instances results in only 5 terminations.
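
For the monitoring mentioned in the first point, a simple starting place is to scan the group's scaling activities for cancelled terminations. The sketch below matches on the StatusCode and on the phrase "custom termination policy" in the description, which is a heuristic based on the example activity above rather than a documented contract. It also only reads the first page of results.

```python
def find_blocked_terminations(activities):
    """Return activities that look like terminations blocked by the custom policy."""
    blocked = []
    for activity in activities:
        text = f"{activity.get('Description', '')} {activity.get('StatusMessage', '')}"
        # Heuristic: Cancelled status plus a mention of the custom termination policy.
        if activity.get("StatusCode") == "Cancelled" and "custom termination policy" in text.lower():
            blocked.append(activity)
    return blocked


def fetch_activities(group_name):
    # Imported lazily so the filter above can be tested without the SDK.
    import boto3

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(AutoScalingGroupName=group_name)
    return response["Activities"]
```

Running find_blocked_terminations(fetch_activities("test-group")) on a schedule and alerting on a non-empty result is one way to catch a broken policy early.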

Conclusion

Overall, custom termination policies are an extremely useful feature. They build on the underlying principle of shared responsibility in the cloud. Application developers and operators have a lot more insight into which instances are better candidates for termination, and custom termination policies provide an easy interface for you to communicate this back to AWS so that it can make better scaling decisions.

As always, you can subscribe for more blogs on AWS features and services that can help you improve your cloud setup. I am always open to requests for things you would like me to write about :)