Monday, February 9, 2015

AWS - autoscaling and self healing NAT instance

Having your AWS hosted services maintain high availability is often a top priority, and sometimes it's not as straightforward as we would all like it to be.  Here I will describe how to create an "almost" highly available NAT server.

NOTE:  This configuration is not 100% highly available.  If you only have one NAT instance you will have downtime until the replacement NAT instance is in service.  For my use this was acceptable, because the NAT was used for an email service: any outgoing emails would be queued while the replacement NAT was launched.  It takes about 3 minutes for a new NAT to be put into service.  That met our SLA and kept me from being woken up in the middle of the night!

This configuration will restore services of a failed instance in approximately 3 minutes!

Amazon provides an example of how to configure NAT instances for high availability (see it here), but that configuration uses two NAT instances and only works if the instance is stopped and restarted.  !!The AWS example does not work for terminated instances!!

When you create your NAT instance using an auto-scaling group and launch configuration, each newly created instance will have a new network interface (ENI).  You must then update the route tables with the new ENI ID to direct traffic to the new NAT instance.  We can accomplish this by adding a few items to the launch configuration user data and properly configuring the role assigned to the instance.

Here are the steps to follow:

1.  Create a new Role (see example below).  Give it a useful name like:  NAT-update-route-table
   
  The role must grant DescribeNetworkInterfaces and ModifyNetworkInterfaceAttribute for all resources.  This is because we don't have an ARN for the newly launched instance.

  This role must also be allowed to modify the route table that is being used by your subnets.  The actions to allow are CreateRoute and ReplaceRoute.  These can be restricted to our specific route tables using their ARNs.


2.  Create a Launch Configuration

   For the launch configuration:
  Select an AMI to use for your NAT.  I recommend using Amazon's community AMI for a NAT; search the AMIs for "amzn-ami-vpc-nat"
  Assign the IAM Role created in the step above
  Assign the appropriate security groups, instance type, etc

  Finally, and most importantly, provide the User Data, which will configure the instance and update the route tables with the new instance ENI.  See full user data below.

I will walk through each step of the user data to explain what it does.  This example is for an Amazon Linux NAT instance, so we begin our script with:  #!/bin/bash

The first step is to enable IP forwarding.

echo 1 > /proc/sys/net/ipv4/ip_forward

Next we must obtain the instance ID.  We can get this from the meta-data provided by Amazon using this URL: http://169.254.169.254/latest/meta-data/instance-id

We set the variable, my_instance_id:

my_instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
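On instances where the metadata service is configured to require session tokens (IMDSv2), the plain request above is rejected.  Here is a minimal sketch of the token-based variant.  The function name get_instance_id and the 60 second token lifetime are my own choices and not part of the user data described in this post.

```shell
# Hypothetical helper (not in the original user data): fetch the instance ID
# using an IMDSv2 session token.
get_instance_id() {
  imds="http://169.254.169.254/latest"
  # Request a short-lived session token (60 seconds is an arbitrary choice).
  token=$(curl -s -X PUT "${imds}/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  # Present the token when reading the instance ID.
  curl -s -H "X-aws-ec2-metadata-token: ${token}" \
    "${imds}/meta-data/instance-id"
}
```

With this in place the assignment would become my_instance_id=$(get_instance_id).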

Next, we must obtain the ID for the network interface.  With the instance ID we can now get the ENI ID by running this command, which stores the ID in the variable my_eni_id (be sure to change the region to your own).

my_eni_id=$(aws ec2 describe-network-interfaces --region us-east-1 --filters Name=attachment.instance-id,Values=${my_instance_id} Name=attachment.device-index,Values=0 --output text | grep NETWORKINTERFACES | cut -f5)
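The grep and cut pipeline above depends on the column order of the text output, which I would not rely on across CLI versions.  A possible alternative is to let the CLI pick out the field itself with --query.  This is a sketch only; get_primary_eni_id is a name I made up for illustration.

```shell
# Hypothetical helper: look up the primary ENI (device index 0) of an instance.
# Arguments: $1 = instance ID, $2 = region.
get_primary_eni_id() {
  aws ec2 describe-network-interfaces \
    --region "$2" \
    --filters "Name=attachment.instance-id,Values=$1" \
              "Name=attachment.device-index,Values=0" \
    --query 'NetworkInterfaces[0].NetworkInterfaceId' \
    --output text
}
```

It would be used as my_eni_id=$(get_primary_eni_id "${my_instance_id}" us-east-1).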

We can now update our route table with the new network interface ID (be sure to change the route table ID and region to match your own).

aws ec2 replace-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --network-interface-id ${my_eni_id} --region us-east-1
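Note that replace-route fails when the route table has no 0.0.0.0/0 route yet, for example on the very first launch.  Since the role from step 1 also allows CreateRoute, one possible fallback looks like the sketch below.  The function name set_default_route is hypothetical.

```shell
# Hypothetical helper: point the default route at the given ENI.
# Arguments: $1 = route table ID, $2 = ENI ID, $3 = region.
set_default_route() {
  # Try to replace an existing default route first.
  if ! aws ec2 replace-route --route-table-id "$1" \
      --destination-cidr-block 0.0.0.0/0 \
      --network-interface-id "$2" --region "$3"; then
    # Nothing to replace (or the call failed): try to create the route.
    aws ec2 create-route --route-table-id "$1" \
      --destination-cidr-block 0.0.0.0/0 \
      --network-interface-id "$2" --region "$3"
  fi
}
```

For example: set_default_route rtb-xxxxxxxx "${my_eni_id}" us-east-1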

And finally, we disable the source/destination check on the network interface so that the instance can work properly as a NAT device.

aws ec2 modify-network-interface-attribute --network-interface-id ${my_eni_id} --no-source-dest-check --region us-east-1


3.  Last, create the auto-scaling group utilizing the launch configuration.  The auto-scaling group should be configured as:
Desired = 1
Min = 1
Max = 1

  In the event your NAT instance is terminated, the auto-scaling group will launch a new instance, and that instance's user data will update the route table with the new ENI ID.
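If you prefer the CLI over the console, the group from step 3 could be created roughly like this.  This is a sketch under assumptions: the function name, the group name, the launch configuration name and the subnet ID are placeholders of mine, and the subnet should be the public subnet the NAT lives in.

```shell
# Hypothetical helper: create a single-instance auto-scaling group for the NAT.
# Arguments: $1 = group name, $2 = launch configuration name,
#            $3 = subnet ID, $4 = region.
create_nat_asg() {
  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name "$1" \
    --launch-configuration-name "$2" \
    --min-size 1 --max-size 1 --desired-capacity 1 \
    --vpc-zone-identifier "$3" \
    --region "$4"
}
```

For example: create_nat_asg nat-asg nat-launch-config subnet-xxxxxxxx us-east-1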

There is one missing component in this setup: a Status Check Alarm.  The alarm should be configured to terminate the instance when it fails a status check.  When the instance is terminated, the auto-scaling group will launch a new instance.

(I have not yet written the code to create the Status Check Alarm.  This should be easily accomplished in the User Data, and I will hopefully find time to add it to this post later.)
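As a starting point, such an alarm might look like the untested sketch below.  The function name, the alarm name, and the period and evaluation settings are assumptions of mine.  The instance role would also need cloudwatch:PutMetricAlarm, which the role from step 1 does not grant, and alarms that trigger EC2 actions may need further permissions in your account.

```shell
# Hypothetical helper: alarm that terminates the instance once the
# StatusCheckFailed metric has been 1 for two consecutive minutes.
# Arguments: $1 = instance ID, $2 = region.
create_status_check_alarm() {
  aws cloudwatch put-metric-alarm \
    --region "$2" \
    --alarm-name "nat-status-check-$1" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "arn:aws:automate:$2:ec2:terminate"
}
```

In the user data this could be called as create_status_check_alarm "${my_instance_id}" us-east-1 after the instance ID has been looked up.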


Here is the complete IAM Role Policy
(change the ARN to match your region, account and route table)
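A minimal sketch of such a policy, reconstructed from the permissions listed in step 1 rather than copied from a tested setup.  The helper name render_nat_policy and its parameters (region, account ID, route table ID) are placeholders of mine; the function only prints the JSON document.

```shell
# Hypothetical helper: print the policy document for the NAT role.
# Arguments: $1 = region, $2 = AWS account ID, $3 = route table ID.
render_nat_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeNetworkInterfaces",
        "ec2:ModifyNetworkInterfaceAttribute"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateRoute",
        "ec2:ReplaceRoute"
      ],
      "Resource": "arn:aws:ec2:$1:$2:route-table/$3"
    }
  ]
}
EOF
}
```

For example, render_nat_policy us-east-1 123456789012 rtb-xxxxxxxx > nat-policy.json writes the document to a file that can then be attached to the NAT-update-route-table role.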


Here is the complete User Data to add to the launch configuration
(change the region and the route-table-id to match your environment)
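The script below is a sketch assembled from the walkthrough in step 2, not a verified copy of a production script.  It differs from the walkthrough in three ways that are my own choices: the steps are wrapped in a function named configure_nat so they can be dry-run with stubbed aws and curl commands, the ENI lookup uses --query instead of grep and cut, and the IP forwarding path can be overridden through IP_FORWARD_FILE.  REGION and ROUTE_TABLE_ID are placeholders.

```shell
#!/bin/bash
# Placeholders: change these to match your environment.
REGION="us-east-1"
ROUTE_TABLE_ID="rtb-xxxxxxxx"
# Overridable only so the script can be dry-run outside EC2 (my addition).
IP_FORWARD_FILE="${IP_FORWARD_FILE:-/proc/sys/net/ipv4/ip_forward}"

configure_nat() {
  # Enable IP forwarding.
  echo 1 > "$IP_FORWARD_FILE"

  # Look up this instance's ID from the metadata service.
  my_instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

  # Find the ENI attached at device index 0.
  my_eni_id=$(aws ec2 describe-network-interfaces --region "$REGION" \
    --filters "Name=attachment.instance-id,Values=${my_instance_id}" \
              "Name=attachment.device-index,Values=0" \
    --query 'NetworkInterfaces[0].NetworkInterfaceId' --output text)

  # Point the default route at the new ENI.
  aws ec2 replace-route --route-table-id "$ROUTE_TABLE_ID" \
    --destination-cidr-block 0.0.0.0/0 \
    --network-interface-id "$my_eni_id" --region "$REGION"

  # Disable the source/destination check so the instance can forward traffic.
  aws ec2 modify-network-interface-attribute \
    --network-interface-id "$my_eni_id" --no-source-dest-check \
    --region "$REGION"
}

# When used as real user data, uncomment the call below:
# configure_nat
```

The call on the last line is commented out so that the script does nothing when it is merely sourced or tested; remember to uncomment it before pasting the script into the launch configuration.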