Creating Your Own EC2 Spot Market — Part 2
In Part 1 of this series, Creating Your Own EC2 Spot Market, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of 12,000 unused instances, which we named the “internal spot market.”
This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.
The Encoding team went through two iterations of internal spot market implementations. The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east AZ to reap immediate benefits. We then applied what we learned to the next iteration of the design, which is based on real-time availability.
The main challenge of using the spot instances effectively is handling the dynamic nature of our instance availability. With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price. In this post we will discuss how the real-time, availability-based internal spot market system works and efficiently uses the unused capacity.
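To make the timing problem concrete, here is a minimal sketch that compares the two borrowing strategies on a made-up day. It is not the actual implementation: the function names, the `safety_margin` parameter, and all the numbers are assumptions chosen for illustration. It only encodes the rule stated above, that borrowing inside the unused reserved capacity costs nothing extra, while anything beyond it is billed on-demand.

```python
def unused_capacity(reserved: int, production: int) -> int:
    """Reserved instances that production is not using this hour."""
    return max(reserved - production, 0)


def on_demand_overage(reserved: int, production: int, borrowed: int) -> int:
    """Instances that exceed the reservation and are billed on-demand."""
    return max(production + borrowed - reserved, 0)


def schedule_based_borrow(hour: int, window: tuple, fixed_count: int) -> int:
    """Hypothetical first iteration: borrow a fixed count inside a fixed window."""
    start, end = window
    return fixed_count if start <= hour < end else 0


def availability_based_borrow(reserved: int, production: int, safety_margin: int) -> int:
    """Hypothetical second iteration: borrow what is unused right now, minus a buffer."""
    return max(unused_capacity(reserved, production) - safety_margin, 0)


def compare(reserved, production_by_hour, window, fixed_count, safety_margin):
    """Total borrowed and on-demand instance-hours for each strategy."""
    totals = {
        "schedule": {"borrowed": 0, "overage": 0},
        "availability": {"borrowed": 0, "overage": 0},
    }
    for hour, production in enumerate(production_by_hour):
        borrows = {
            "schedule": schedule_based_borrow(hour, window, fixed_count),
            "availability": availability_based_borrow(reserved, production, safety_margin),
        }
        for name, borrowed in borrows.items():
            totals[name]["borrowed"] += borrowed
            totals[name]["overage"] += on_demand_overage(reserved, production, borrowed)
    return totals


if __name__ == "__main__":
    # Illustrative numbers only: production ramps up earlier than the schedule expects.
    print(compare(reserved=100, production_by_hour=[40, 30, 90, 95],
                  window=(0, 3), fixed_count=50, safety_margin=5))
```

In this toy scenario the schedule keeps borrowing 50 instances during an hour when production has already ramped up, so part of the fleet spills into on-demand pricing. The availability-based rule shrinks its borrowing as the trough closes and stays within the reservation.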
Benefits of Extra Capacity
The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix-supported devices. A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments. By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity. With the extra boost of computing resources, we have improved our ability to handle a sudden influx of work and to quickly reduce our backlog.
In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing. Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.
Computing resources are continuously evaluated and redistributed based on workload. With the boost of spot market instances, the total encoding throughput increases significantly. On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time they used to take. Our QA automation can broaden the coverage of our comprehensive continuous integration suite and run these jobs in less time.
Spot Market Borrowing in Action
We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
For instance, in one of our research projects, we triggered 12,000 video encoding jobs over a weekend. We had expected the work to take a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
The following graph captures that weekend’s activity.