Auto Scaling Production Services on Titus
Over the past three years, Netflix has been investing in container technology. A large part of this investment has been around Titus, Netflix’s container management platform that was open sourced in April of 2018. Titus schedules application containers to be run across a fleet of thousands of Amazon EC2 instances.
Early on, Titus focused on supporting simple batch applications and workloads that had a limited set of feature and availability requirements. As several internal teams building microservices wanted to adopt containers, Titus began to build scheduling support for service applications. However, supporting services, especially those in the customer critical path, required Titus to provide a much richer set of production-ready features. Since Netflix’s migration to AWS began almost a decade earlier, microservices have been built atop EC2 and heavily leverage AWS and internal Netflix infrastructure services. The set of features used by an internal service then determined whether, and how, that service could leverage Titus.
One of the most commonly used service features is auto scaling. Many microservices are built to be horizontally scalable and leverage Amazon EC2 Auto Scaling to automatically add or remove EC2 instances as the workload changes. For example, as people on the east coast of the U.S. return home from work and turn on Netflix, services automatically scale up to meet this demand. Scaling dynamically with demand rather than static sizing helps ensure that services can automatically meet a variety of traffic patterns without service owners needing to size and plan their desired capacity. Additionally, dynamic scaling enables cloud resources that are not needed to be used for other purposes, such as encoding new content.
As services began looking at leveraging containers and Titus, Titus’s lack of an auto scaling feature became a major hurdle, and in some cases an outright blocker, for adoption. Around the time that we were investigating building our own solution, we engaged with the AWS Auto Scaling team to describe our use case. As a result of Netflix’s strong relationship with AWS, this discussion and several follow-ups led to the design of a new AWS Application Auto Scaling feature that allows the same auto scaling engine that powers services like EC2 and DynamoDB to power auto scaling in a system outside of AWS like Titus.
This design centered on three steps: the AWS Auto Scaling engine computes the desired capacity for a Titus service, relays that capacity information to Titus, and Titus adjusts capacity by launching new containers or terminating existing ones. There were several advantages to this approach. First, Titus was able to leverage the same proven auto scaling engine that powers AWS rather than having to build our own. Second, Titus users would get to use the same Target Tracking and Step Scaling policies that they were familiar with from EC2. Third, applications would be able to scale on their own metrics, such as requests per second or container CPU utilization, by publishing them to CloudWatch, as well as on AWS-specific metrics, such as SQS queue depth. Fourth, Titus users would benefit from the new auto scaling features and improvements that AWS introduces.
The key challenge was enabling the AWS Auto Scaling engine to call the Titus control plane running in Netflix’s AWS accounts. To address this, we leveraged AWS API Gateway, a service which provides an accessible API “front door” that AWS can call and a backend that can call Titus. API Gateway exposes a common API that AWS uses to adjust resource capacity and get capacity status, while allowing for pluggable backend implementations of the resources being scaled, such as services on Titus. When an auto scaling policy is configured on a Titus service, Titus creates a new scalable target with the AWS Auto Scaling engine. This target is associated with the Titus Job ID representing the service and a secure API Gateway endpoint URL that the AWS Auto Scaling engine can use. The API Gateway “front door” is protected via AWS Service Linked Roles, and the backend uses Mutual TLS to communicate with Titus.
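To make the scalable target concrete, here is a minimal sketch of the request that registers one. It is modeled on the publicly available Custom Resource Scaling feature rather than on Titus’s own code: the function name, the endpoint URL, and the job ID are hypothetical, and we assume the `custom-resource` namespace and dimension strings used by that public feature.

```python
from urllib.parse import urlparse


def scalable_target_request(endpoint_url, min_capacity, max_capacity):
    """Build keyword arguments for Application Auto Scaling's
    RegisterScalableTarget call (hypothetical helper, not Titus code)."""
    # The scalable target is identified by a secure API Gateway endpoint.
    if urlparse(endpoint_url).scheme != "https":
        raise ValueError("endpoint must be an HTTPS URL")
    if not 0 <= min_capacity <= max_capacity:
        raise ValueError("require 0 <= min_capacity <= max_capacity")
    return {
        # Assumed values from the public Custom Resource Scaling feature.
        "ServiceNamespace": "custom-resource",
        "ScalableDimension": "custom-resource:ResourceType:Property",
        "ResourceId": endpoint_url,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


# Hypothetical endpoint whose last path segment carries the Titus Job ID.
request = scalable_target_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod/"
    "scalableTargetDimensions/job-1234",
    min_capacity=2,
    max_capacity=20,
)
```

With AWS credentials in place, the resulting dictionary could be passed to `boto3.client("application-autoscaling").register_scalable_target(**request)`.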
Configuring auto scaling for a Titus service works as follows. A user creates a service application on Titus, in this example using Spinnaker, Netflix’s continuous delivery system. The figure below shows a Target Tracking policy being configured for a Node.js application in the Spinnaker UI.
The Spinnaker policy configuration also defines which metrics to forward to CloudWatch and the CloudWatch alarm settings. Titus is able to forward metrics to CloudWatch using Atlas, Netflix’s telemetry system. These metrics include those generated by the application and the container-level system metrics collected by Titus. When metrics are forwarded to Atlas they include information that associates them with the service’s Titus Job ID and whether Atlas should also forward them to CloudWatch.
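As a rough illustration of this forwarding step, the sketch below selects the metrics flagged for CloudWatch and shapes them as CloudWatch metric data. The tag names (`titus.jobId`, `forwardToCloudWatch`), the `TitusJobId` dimension, and the sample metrics are assumptions made for the example; they are not Atlas’s actual schema.

```python
def to_cloudwatch_data(metrics):
    """Convert flagged metrics into CloudWatch MetricData entries.

    Each input metric is a dict with "name", "value", and "tags".
    Tag names are hypothetical, chosen only for this sketch.
    """
    data = []
    for metric in metrics:
        tags = metric["tags"]
        # Only metrics explicitly flagged for forwarding are sent on.
        if not tags.get("forwardToCloudWatch"):
            continue
        data.append({
            "MetricName": metric["name"],
            # The Job ID dimension ties the metric to one Titus service.
            "Dimensions": [{"Name": "TitusJobId", "Value": tags["titus.jobId"]}],
            "Value": float(metric["value"]),
        })
    return data


sample = [
    {"name": "requestsPerSecond", "value": 120,
     "tags": {"titus.jobId": "job-1234", "forwardToCloudWatch": True}},
    {"name": "gcPauseMs", "value": 8,
     "tags": {"titus.jobId": "job-1234"}},
]
metric_data = to_cloudwatch_data(sample)
```

The resulting list has the shape expected by the `MetricData` parameter of CloudWatch’s `PutMetricData` API, alongside a namespace of the caller’s choosing.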
Once a user has selected policy settings on Spinnaker, Titus associates the service with a new scalable resource within the AWS Auto Scaling engine. This process is shown in the figure below. Titus configures both the AWS Auto Scaling policies and CloudWatch alarms for the service. Depending on the scaling policy type, Titus may explicitly create the CloudWatch alarm or, in the case of Target Tracking policies, AWS may create it automatically.
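A Target Tracking policy on a custom metric might look roughly like the following. This is a sketch of the parameters accepted by Application Auto Scaling’s `PutScalingPolicy` API, not Titus’s implementation; the helper name, the `Example/Titus` namespace, the `TitusJobId` dimension, the cooldown values, and the endpoint URL are all hypothetical.

```python
def target_tracking_policy(job_id, endpoint_url, metric_name, target_value):
    """Build keyword arguments for PutScalingPolicy (hypothetical helper)."""
    return {
        "PolicyName": f"{job_id}-{metric_name}-target-tracking",
        "ServiceNamespace": "custom-resource",
        "ResourceId": endpoint_url,
        "ScalableDimension": "custom-resource:ResourceType:Property",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # The engine adds or removes capacity to hold the metric here.
            "TargetValue": float(target_value),
            "CustomizedMetricSpecification": {
                "MetricName": metric_name,
                "Namespace": "Example/Titus",  # assumed namespace
                "Dimensions": [{"Name": "TitusJobId", "Value": job_id}],
                "Statistic": "Average",
            },
            # Illustrative cooldowns, in seconds.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }


policy = target_tracking_policy(
    job_id="job-1234",
    endpoint_url="https://example.execute-api.us-east-1.amazonaws.com/prod/"
                 "scalableTargetDimensions/job-1234",
    metric_name="requestsPerSecond",
    target_value=100,
)
```

With this policy type no alarm appears in the request: the alarms are created on the AWS side, which matches the Target Tracking case described above.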
As service apps running on Titus emit metrics, AWS analyzes the metrics to determine whether CloudWatch alarm thresholds are being breached. If an alarm threshold has been breached, AWS triggers the alarm’s associated scaling actions. These actions result in calls to the configured API Gateway endpoints to adjust instance counts. Titus responds to these calls by scaling the Job up or down accordingly. AWS monitors both the results of these scaling requests and how metrics change.
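The backend side of this loop can be sketched as two operations: report capacity status, and accept a new desired capacity. The `TitusJob` class below is a hypothetical in-memory stand-in for a Titus service job, and the JSON field names are assumptions modeled on the public Custom Resource Scaling feature, not Titus’s actual API. Clamping to the job’s bounds is a defensive choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TitusJob:
    """Hypothetical in-memory stand-in for a Titus service job."""
    job_id: str
    min_size: int
    max_size: int
    desired: int
    running: int


def describe(job):
    """Capacity status the scaling engine polls for (assumed field names)."""
    status = "Successful" if job.running == job.desired else "InProgress"
    return {
        "scalableTargetDimensionId": job.job_id,
        "desiredCapacity": job.desired,
        "actualCapacity": job.running,
        "scalingStatus": status,
    }


def handle_scale_request(job, body):
    """Apply a desired-capacity update, clamped to the job's bounds."""
    requested = int(body["desiredCapacity"])
    job.desired = max(job.min_size, min(job.max_size, requested))
    # Launching or terminating containers would happen asynchronously.
    return describe(job)


job = TitusJob(job_id="job-1234", min_size=2, max_size=10, desired=4, running=4)
response = handle_scale_request(job, {"desiredCapacity": 25})
```

Here the request for 25 containers is clamped to the job’s maximum of 10, and the status stays `InProgress` until the running count catches up with the desired count.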
Providing an auto scaling feature that allowed Titus users to configure scaling policies the same way they would on EC2 greatly simplified adoption. Rather than coupling the adoption of containers with new auto scaling technology, Titus was able to provide the benefits of using containers with well-tested auto scaling technology that users and their tools already understood. We followed the same pattern of leveraging existing AWS technology instead of building our own for several Titus features, such as networking, security groups, and load balancing. Additionally, auto scaling drove Titus availability improvements to ensure it was capable of making fast, online capacity adjustments. Today, this feature powers services that many Netflix customers interact with every day.
Up until today, Titus has leveraged this functionality as a private AWS feature. We are happy that AWS has recently made this feature generally available to all customers as Custom Resource Scaling. Beyond container management platforms like Titus, any resource that needs scaling, like databases or big data infrastructure, can now leverage AWS Auto Scaling. In addition to helping drive key functionality for Titus, we are excited to see Netflix’s collaboration with AWS yield new features for general AWS customers.