Getting Into ECS

The Journey Up to this point

Up until recently, I had been using Elastic Beanstalk to manage the infrastructure for different apps. It worked fairly well but took some getting used to. I initially chose it because I wanted as close to a Heroku experience as possible while keeping the flexibility of AWS, and because I wanted to spend as little time as possible managing infrastructure so I could focus on development.

I am fairly comfortable with Elastic Beanstalk (EB) and it does largely simplify managing the infrastructure for an app. But when trying to set it up for a Rails app that includes Sidekiq, I started hitting a brick wall.

At first, I tried to use EB's multi-container setup to handle this. I managed to get the apps onto EB, but they kept erroring out on startup and I found it incredibly difficult to debug. I tried SSHing into the EB host machine and running docker exec into the containers, but the exec simply failed with no obvious errors.

In the process of digging, I saw that multi-container EB uses Elastic Container Service (ECS) behind the scenes. I figured I would try using ECS directly, as it would potentially remove some layers of complexity and make it easier for me to debug any issues.

I was also getting really frustrated with how difficult it is to fix EB when it gets into an error state. In certain error states it is basically impossible to deploy changes to it. At one stage I had to manually delete the CloudFormation stack EB used behind the scenes because the environment would not terminate.

The key limitations holding me back on EB, which I could not find easy solutions to, were:

  • easily and transparently stopping and starting the environment/jobs (EB does not have a stop or start button like Heroku; instead you need to define custom jobs to work around this)
  • securely and easily injecting secrets into the environment (there seems to be a bespoke way of using S3, but it looked like a lot of work and was not perfect)
  • efficiently maximising the use of the underlying VM (multi-container Docker should allow this, but it was really painful to debug)
  • hooking this up nicely to CI without having to hack together shell scripts that use the AWS CLI to glue it all together (EB does seem to have GitHub Actions, but they seemed to require some investment to get working which I have not had a chance to do)

Delving into ECS

I have some experience working with Kubernetes, so I was fairly comfortable with the ECS terminology. The easiest way to go through some of the concepts is to walk through what happens when a request to a website comes in, goes through the different layers, gets processed by a container, and a response is sent back to the user. The rough layers this goes through are:

  • A user hits your website
  • This will hit your DNS
  • The DNS will resolve to an application load balancer (ALB)
  • This comes in either via port 80 (HTTP) or 443 (HTTPS)
  • This load balancer has listeners configured on it. An application load balancer seems to be Nginx behind the scenes.
  • These listeners are configured against a specific port, so to support HTTP and HTTPS you would have 2 listeners on this one ALB
  • The listeners can have various actions configured against them, like redirect, forward, authenticate and a few others
  • In our case the forward action is used, which sends traffic to a target group
  • On AWS a target group is a configuration you create in the EC2 console
  • A target group lets your ALB direct traffic to an ECS service and port (see the first sketch after this list)
  • So, for example, port 80 traffic could be forwarded to the book-website-target target group, which points to container port 8080 in your organisation's production ECS cluster
  • The ECS service acts as a load balancer in front of your containers (in the same way a k8s service works). It allows ECS to bring more containers in or take containers out to scale up or down for load without the layers higher up being affected.
  • You configure whether to make requests sticky at the ALB level and/or the target group level.
  • The request hits the container
  • The container is in a Virtual Private Cloud (VPC) and can access other servers and services in the same VPC
  • If the container needs to hit a DB, for example, it can as long as the DB is in the same VPC. In addition to being in the same VPC, the DB's security group has to be configured to allow inbound traffic from your container's security group on the DB's port (the second sketch after this list shows this rule)
    • The default for most security groups on AWS is to block all inbound traffic (requiring you to whitelist what to allow in) while allowing all outbound traffic.
  • The container then sends the response back to the user
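
To make the listener and target group relationship concrete, here is a minimal sketch using boto3. The VPC ID, ALB ARN, health check path and the book-website-target name are all assumptions for illustration; in a real setup you would use your own values.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Create a target group pointing at container port 8080.
# TargetType="ip" is what Fargate/awsvpc tasks register with.
target_group = elbv2.create_target_group(
    Name="book-website-target",        # hypothetical name from the example above
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",     # assumed VPC ID
    TargetType="ip",
    HealthCheckPath="/healthz",        # assumed health check path
)["TargetGroups"][0]

# Attach an HTTP listener to the ALB that forwards everything to the target group.
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/prod-alb/abc123",  # assumed ALB ARN
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group["TargetGroupArn"]}],
)
```

To support HTTPS you would add a second listener on port 443 with a certificate; the target group stays the same.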
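
The security group rule mentioned above is also only a couple of lines. This is a sketch under the assumption that the containers and the database each have their own security group; both group IDs below are made up.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow inbound Postgres traffic to the DB's security group, but only
# when it originates from the app containers' security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0dddddddddddddddd",  # assumed DB security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,        # Postgres port
            "ToPort": 5432,
            "UserIdGroupPairs": [
                {"GroupId": "sg-0aaaaaaaaaaaaaaaa"}  # assumed container security group ID
            ],
        }
    ],
)
```

No egress rule is needed on the container side, since security groups allow all outbound traffic by default.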

The core pieces involved in an ECS cluster are:

  • The cluster control plane.
    • As far as I can tell this is not charged for explicitly, or is really cheap (I assume it is shared across other customers and the cost is therefore amortised across them, making it cheaper)
    • This control plane is also out of your direct control, which reduces complexity
  • The machines/containers that the cluster has inside it
    • These are either standard EC2 VMs or Fargate containers. The difference is that with EC2 you explicitly choose and pay for the machine/s and their running time, whereas with Fargate you instead pay for the CPU and RAM used by your containers. Fargate also means you do not have to worry about managing updates on the underlying machine.
  • Task definition
    • This is basically the equivalent of a pod on Kubernetes. It defines one or more containers.
    • For each task definition you define:
      • The name of the task
      • The overall RAM and CPU to be used by the containers in that task
      • The IAM roles those containers will need to execute with (this gives you fine-grained control over which AWS services the containers can use)
      • The minimum, desired and maximum number of tasks that can run (basically the number of copies of the task you want at any given time)
      • One or more container definitions (a full task definition sketch follows this list).
    • For each container within a task definition you define:
      • the image, i.e. the repo URI, image name and version (tag). You can also specify private repo credentials; this is not necessary for AWS's Elastic Container Registry if your task has an IAM role with ECR read permissions.
      • the soft and hard memory needed by a container (you can define either, both or none of these). This lets you control whether a container is (in the k8s lingo) guaranteed, burstable or best effort.
      • the CPU requirements. AWS expresses CPU in units, where a full CPU is 1024 units. So 50% of a CPU is 512, 25% is 256, and so on.
      • environment variables
        • you can easily integrate this with AWS Secrets Manager or Systems Manager Parameter Store, which both let you store secrets securely
      • the command to run to check the health of the container (this seems to only support a CLI command, unlike k8s which allows several ways to check health, such as HTTP or exec probes)
      • the entry point (this lets you override the one from the Dockerfile definition)
      • the volume/s
      • logging
        • there are a tonne of options. For simplicity I generally opt for AWS CloudWatch but could investigate other options in future (there are proprietary options, for example Splunk and Datadog)
  • Service
    • This is a similar concept to Kubernetes' services
    • You configure it to point to a specific task definition revision
    • You can also configure how to automatically scale up or down based on different metrics (see the service sketch after this list)
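
To tie the task and container definition fields together, here is a hedged sketch of registering a task definition with boto3. The family name, role ARNs, image URI, secret ARN and log group are all made-up examples; the point is the shape of the definition.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="book-website",  # assumed task (family) name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",      # task-level CPU: 512 units = half a vCPU
    memory="1024",  # task-level RAM in MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # assumed: lets ECS pull the image and read secrets
    taskRoleArn="arn:aws:iam::123456789012:role/book-website-task",          # assumed: what the app itself is allowed to call
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/book-website:1.2.3",  # assumed ECR image
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "memoryReservation": 512,  # soft limit in MiB ("burstable")
            "memory": 1024,            # hard limit in MiB
            "cpu": 256,                # 256 units = a quarter of a vCPU
            "environment": [{"name": "RAILS_ENV", "value": "production"}],
            "secrets": [
                {
                    "name": "DATABASE_URL",
                    # assumed Parameter Store ARN; Secrets Manager ARNs work the same way
                    "valueFrom": "arn:aws:ssm:eu-west-1:123456789012:parameter/book-website/DATABASE_URL",
                }
            ],
            "healthCheck": {
                # health checks are a CLI-style command run inside the container
                "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/book-website",  # assumed CloudWatch log group
                    "awslogs-region": "eu-west-1",
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
)
```

Each call to register_task_definition creates a new revision of the family, which is what a service then points at.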
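
And a sketch of the service side, again with assumed cluster, subnet, security group and target group values: it runs two copies of the latest task definition revision on Fargate, registers them with the target group from earlier, and adds a simple CPU-based target-tracking scaling policy.

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("application-autoscaling")

# Run the task on Fargate behind the ALB target group.
ecs.create_service(
    cluster="production",           # assumed cluster name
    serviceName="book-website",
    taskDefinition="book-website",  # latest revision of the family
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0bbbbbbbbbbbbbbbb"],     # assumed subnet ID
            "securityGroups": ["sg-0aaaaaaaaaaaaaaaa"],  # assumed container security group
            "assignPublicIp": "DISABLED",
        }
    },
    loadBalancers=[
        {
            "targetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/book-website-target/abc123",  # assumed
            "containerName": "web",
            "containerPort": 8080,
        }
    ],
)

# Scale between 1 and 4 tasks, targeting roughly 60% average CPU.
resource_id = "service/production/book-website"
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="book-website-cpu",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```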

I have just started getting into ECS. In the coming weeks and months I will get more of a feel for it and will be able to update my understanding of it.