EC2 Spot Interruptions - AWS Fault Injection Simulator

Abstract

  • AWS Fault Injection Simulator (FIS) now supports Spot interruptions, so you can trigger the interruption of an Amazon EC2 Spot Instance from an FIS experiment.

  • With FIS, you can test the resiliency of your workload and validate that your application is reacting to the interruption notices that EC2 sends before terminating your instances.

  • This post walks step by step through creating an FIS experiment template with AWS CDK.

🚀 Overview of EC2 Spot Instances

  • Amazon EC2 Spot Instances can cut costs by up to 90% compared with On-Demand pricing, but EC2 can reclaim them at any time, with only a two-minute warning.

  • aws-node-termination-handler helps ensure that the Kubernetes control plane responds appropriately to events that can make an EC2 instance unavailable, such as Spot interruptions.
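To make the two-minute warning concrete: on the instance itself, the notice appears in instance metadata under `spot/instance-action` as a small JSON document with an `action` and a `time`. The sketch below parses such a document and computes how much time is left. `seconds_until_interruption` is a hypothetical helper name, and the sample values are made up for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(instance_action_json, now=None):
    """Return (action, seconds_left) for a spot/instance-action document."""
    doc = json.loads(instance_action_json)
    # The time is reported in UTC, e.g. 2017-09-18T08:22:00Z
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return doc["action"], (deadline - now).total_seconds()


# Illustrative values only: a notice that fires two minutes after "now"
sample = '{"action": "terminate", "time": "2024-01-01T00:02:00Z"}'
fixed_now = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
action, seconds_left = seconds_until_interruption(sample, fixed_now)
print(action, seconds_left)
```

A handler such as aws-node-termination-handler uses this window to cordon and drain the node before the instance goes away.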

🚀 Architecture for simulating Spot interruptions

  • Starting the FIS experiment runs the send-spot-instance-interruptions action against the target Spot Instance.

  • A CloudWatch Events rule catches the EC2 Spot Instance Interruption Warning event and triggers a Lambda function that sends a Slack notification.

  • The aws-node-termination-handler Kubernetes DaemonSet also acts when it catches the event.


Now let's create the CDK stacks.

🚀 Create the Lambda function that sends Slack notifications

  • The Lambda handler parses the event and sends a Slack message containing the event detail-type, the instance ID, and the action.

    app.py

              import json
              from datetime import datetime

              import requests


              def send_slack(msg):
                  """Send the payload to Slack."""
                  # The webhook path is redacted; use your own incoming-webhook URL.
                  webhook_url = "https://hooks.slack.com/services**"
                  footer_icon = 'https://cdkworkshop.com/images/new-cdk-logo.png'
                  color = '#36C5F0'
                  level = ':white_check_mark: INFO :white_check_mark:'
                  curr_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                  payload = {
                      "username": "Test",
                      "attachments": [{
                          "pretext": level,
                          "color": color,
                          "text": f"{msg}",
                          "footer": f"{curr_time}",
                          "footer_icon": footer_icon,
                      }],
                  }
                  requests.post(webhook_url, data=json.dumps(payload),
                                headers={'Content-Type': 'application/json'})


              def handler(event, context):
                  # Pull the fields of interest out of the incoming event.
                  detail_type = event.get('detail-type', '')
                  instance_id = event['detail']['instance-id']
                  action = event['detail']['instance-action']
                  message = f'{detail_type}\nresource: {instance_id}, action: {action}'
                  send_slack(message)

  • Lambda stack

    lambda.ts

              const send_slack = new lambda.Function(this, 'slackLambda', {
                  description: 'Send Event message to slack',
                  runtime: lambda.Runtime.PYTHON_3_8,
                  // app.zip is expected to contain app.py and its dependencies (e.g. requests)
                  code: lambda.Code.fromAsset('lambda-code/app.zip'),
                  handler: 'app.handler',
                  functionName: 'send-slack-spot-event'
              });
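Before deploying, it can help to check the message formatting locally. The sketch below mirrors the handler's parsing logic without posting to Slack. `build_message` is a hypothetical helper, and the sample event is a trimmed-down illustration of an interruption warning, not a captured one.

```python
def build_message(event):
    """Format the Slack text the same way the handler does, without sending it."""
    detail_type = event.get("detail-type", "")
    instance_id = event["detail"]["instance-id"]
    action = event["detail"]["instance-action"]
    return f"{detail_type}\nresource: {instance_id}, action: {action}"


# Illustrative event; the instance ID is a placeholder
sample_event = {
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "terminate",
    },
}

message = build_message(sample_event)
print(message)
```

Keeping the formatting in a pure function like this makes it easy to unit test separately from the network call.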

🚀 Create the event rule for Spot interruptions

  • The rule matches the EC2 Spot Instance Interruption Warning event and triggers the Lambda function above.

    event.ts

              const spot_event = new event.Rule(this, 'SpotEventRule', {
                  description: 'Spot termination event rule',
                  ruleName: 'spot-event',
                  eventPattern: {
                      source: ['aws.ec2'],
                      detailType: ['EC2 Spot Instance Interruption Warning'],
                      detail: {
                          'instance-action': ['terminate']
                      }
                  }
              });
    
              spot_event.addTarget(new event_target.LambdaFunction(send_slack));
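To see which events a pattern like this lets through, here is a deliberately simplified matcher. It handles only exact-value matching (a list means "any of these values", a nested object means "match inside"), which is an assumption for illustration and far from the full EventBridge pattern language. `matches` is a hypothetical helper, and the sample events are illustrative.

```python
def matches(pattern, event):
    """Simplified exact-value matching in the style of an event pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: the event value must be one of the listed values
            return False
    return True


# The rule's pattern in serialized form (CDK's detailType becomes "detail-type")
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
    "detail": {"instance-action": ["terminate"]},
}

warning = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-1234567890abcdef0", "instance-action": "terminate"},
}
rebalance = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance Rebalance Recommendation",
    "detail": {"instance-id": "i-1234567890abcdef0"},
}

print(matches(pattern, warning), matches(pattern, rebalance))
```

Note that the `instance-action` filter means interruption warnings with a `stop` or `hibernate` action would not reach the Lambda function.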
    

🚀 Create the FIS service role

  • The IAM role grants AWS FIS the permissions it needs to act on the target resources, which here are EC2 instances.

    fis_role.ts

              const fis_role = new iam.Role(this, 'FisRole', {
                  roleName: 'spot-fis-test',
                  assumedBy: new iam.ServicePrincipal('fis.amazonaws.com')
              });
    
              const ec2_policy_sts = new iam.PolicyStatement({
                  sid: 'SpotFisTest',
                  effect: iam.Effect.ALLOW,
                  actions: [
                      'ec2:DescribeInstances',
                      'ec2:StopInstances',
                      'ec2:SendSpotInstanceInterruptions'
                  ],
                  resources: ['arn:aws:ec2:ap-northeast-1:*:instance/*'],
                  conditions: {
                      'StringEquals': {'aws:RequestedRegion': props?.env?.region}
                  }
              });
    
              fis_role.addToPolicy(ec2_policy_sts);
    

🚀 Create the FIS experiment template

  • The experiment template includes:

    • Action: aws:ec2:send-spot-instance-interruptions, with the parameter durationBeforeInterruption set to PT2M (two minutes)

    • Targets:

      • Resource type: aws:ec2:spot-instance

      • Resource filters: State.Name=running

      • Selection mode: COUNT(1)

  • Stack

    fis.ts

              const target: fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty = {
                  resourceType: 'aws:ec2:spot-instance',
                  resourceTags: {'eks:nodegroup-name': 'eks-airflow-nodegroup-pet'},
                  selectionMode: 'COUNT(1)',
                  filters: [{
                      path: 'State.Name',
                      values: ['running']
                  }]
              };
    
              const action: fis.CfnExperimentTemplate.ExperimentTemplateActionProperty = {
                  actionId: 'aws:ec2:send-spot-instance-interruptions',
                  parameters: {'durationBeforeInterruption': 'PT2M'},
                  targets: {'SpotInstances': 'spot-fis-target'}
              };
    
              const fis_exp = new fis.CfnExperimentTemplate(this, 'FisExperiment', {
                  description: 'Spot Interruption Simulate',
                  roleArn: fis_role.roleArn,
                  tags: {
                      'Name': 'spot-interrupt-test',
                      'cdk': 'fis-stack'
                  },
                  stopConditions: [
                      {source: 'none'}
                  ],
                  targets: {'spot-fis-target': target},
                  actions: {'send-spot-instance-interruptions': action}
              });
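The `durationBeforeInterruption` value is an ISO 8601 duration. The sketch below converts time-only durations such as `PT2M` to seconds and checks them against a range. The 2 to 15 minute bounds are an assumption based on my reading of the FIS action reference, so verify them before relying on this. Both function names are hypothetical.

```python
import re

# Time-only ISO 8601 durations: PT<hours>H<minutes>M<seconds>S, each part optional
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(value):
    """Convert a duration such as 'PT2M' or 'PT1M30S' to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def is_valid_notice(value, minimum=120, maximum=900):
    """Check the duration against assumed bounds of 2 to 15 minutes."""
    return minimum <= duration_seconds(value) <= maximum


print(duration_seconds("PT2M"), is_valid_notice("PT2M"), is_valid_notice("PT30S"))
```

With `PT2M`, the simulated notice matches the two-minute warning that real Spot interruptions give.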
    

🚀 Start the experiment

  • Start the experiment from the template

  • The experiment completes

  • Slack receives the event notification, and aws-node-termination-handler also acts on the event
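The experiment can also be started from code rather than the console. The sketch below assumes a boto3-style FIS client exposing `start_experiment` and `get_experiment`, and polls until the experiment reaches an end state. The set of end states, the `run_experiment` helper, and the IDs are assumptions for illustration. A stub client stands in for `boto3.client("fis")` so the sketch runs without AWS credentials.

```python
import time
import uuid

# Assumed end states; check the FIS API reference for your SDK version
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=10, max_polls=60):
    """Start an experiment from a template and poll until it ends."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    for _ in range(max_polls):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")


class StubFisClient:
    """Stand-in for boto3.client('fis') so the sketch runs offline."""

    def __init__(self):
        self._statuses = iter(["initiating", "running", "completed"])

    def start_experiment(self, **kwargs):
        return {"experiment": {"id": "EXP-hypothetical"}}

    def get_experiment(self, id):
        return {"experiment": {"state": {"status": next(self._statuses)}}}


final_status = run_experiment(StubFisClient(), "EXT-hypothetical", poll_seconds=0)
print(final_status)
```

Against a real account, you would pass `boto3.client("fis")` and the ID of the deployed experiment template instead of the stub.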

🚀 Conclusion

  • This kind of FIS experiment lets us rehearse a Spot interruption and verify both aws-node-termination-handler and the application's fault tolerance.

  • Keep FIS pricing in mind as well: AWS FIS costs $0.10 per action-minute.
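As a rough worked example of that rate, the sketch below estimates the cost of an experiment from the duration of each action. It assumes each action is billed for the minutes listed, which is a simplification. How FIS counts action-minutes for this particular action is not covered here, so treat the numbers as illustrative.

```python
from decimal import Decimal

PRICE_PER_ACTION_MINUTE = Decimal("0.10")  # USD, the rate quoted above


def experiment_cost(action_minutes):
    """Estimate cost in USD given the billed minutes of each action."""
    total_minutes = sum(Decimal(str(m)) for m in action_minutes)
    return PRICE_PER_ACTION_MINUTE * total_minutes


# One action assumed to be billed for two minutes
cost = experiment_cost([2])
print(cost)
```

Under that assumption, a single run of the experiment in this post costs well under a dollar.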