DevOps: Decision Making: Applying ROI Based Analysis
Return on Investment (ROI) is a structured, objective form of analysis. DevOps is particularly well suited to it because DevOps deals with organizational and process efficiency, which can often be quantified by the amount of time or money saved.
This article will cover what ROI is, explain why it fits so well with DevOps, and illustrate ROI calculations. Finally, we’ll walk through how to use it to qualify work for consideration, compare potential units of work in order to maximize value, and quantify historical work in order to develop an idea of effectiveness.
Return on investment (ROI) is the ratio between the net profit and the cost of an investment of some resources. A high ROI means the investment’s gains compare favorably to its cost. As a performance measure, ROI is used to evaluate the efficiency of an investment or to compare the efficiencies of several different investments. In purely economic terms, it is one way of relating profits to capital invested.
There is a forward-looking, predictive view of investments and a historical performance analysis. Both are valuable and will be covered in this article. The predictive view allows for the comparison of multiple units of work, i.e. it helps to answer: which task should a team do first? Which has the greater predicted impact? The historical analysis is useful for gauging how effective an individual or a team actually was in increasing efficiency or cutting costs. We’ll dive into both of these in depth later in the article.
The following are some of the analyses that ROI calculations enable.
- Forecasting (Analysis): Visualize how decisions will impact the company: “We can free up 4 hours per sprint of engineering time with a half-sprint (20 hour) investment”
- Linting: Provides a quick filter for whether a potential unit of work is viable or not. If a change is proposed to make something more efficient or to save money, and it has a negative or zero ROI, then that can quickly invalidate the idea.
- Priority Comparisons (decision making): ROI gives objective hooks (time/money investment vs return) to discuss and compare options. Given a set of analyses, ROI lets us compare which one provides more benefit and schedule an optimal amount of work for an interval.
- Impact: Gives a historical view of how effective a team was. It’s the sum of all the returns over an interval and provides an objective dimension to compare effectiveness interval over interval.
(If the above seem abstract for right now, each will be covered in detail with real world examples later.)
The ROI calculation itself is simple:
return on investment = (gain from investment - cost of investment) / cost of investment
The gain here is the gain from the investment over the interval. (We’ll walk through a number of examples below to illustrate this). Another useful calculation is when the results of the time or money invested break even.
break even = time to complete / estimated savings
These two simple formulas create an objective base for quantifying and comparing potential units of work. ROI acts as a shared dimension, a common denominator for decision making.
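The two formulas above can be sketched as a pair of small Python helpers. The function names `roi` and `break_even` are illustrative, not from any library. The sample figures reuse the forecasting example (20 hours invested, 4 hours saved per sprint); the 10-sprint interval is an assumption made only for this sketch.

```python
def roi(gain, cost):
    """Return on investment: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


def break_even(cost, savings_per_period):
    """Number of periods until cumulative savings equal the cost."""
    if savings_per_period <= 0:
        raise ValueError("savings_per_period must be positive")
    return cost / savings_per_period


# 20 hours invested, 4 hours saved per sprint, judged over an assumed 10 sprints.
print(roi(gain=4 * 10, cost=20))                  # 1.0
print(break_even(cost=20, savings_per_period=4))  # 5.0 (sprints)
```

Gain and cost must be in the same unit (both hours or both dollars), and the break-even result is expressed in whatever period the savings are measured in.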
Why DevOps and ROI?
Much of DevOps work is based on analyzing processes and increasing their efficiency. This maps very easily onto ROI analysis. Suppose that we are considering a DevOps task that can reduce deployment times by M minutes per build, there are B builds per day, and the estimated cost to complete it is H hours. Tasks like this happen all the time in DevOps and fit perfectly (as we’ll see below) into ROI analysis.
The whole goal is to try and coalesce a complicated problem into something with objective properties in order to more easily understand the inputs and the gains.
An ROI based decision making process is an endless cycle of analysis, linting, comparison, execution, and then calculating impact. The rest of the article will cover each step.
How to perform DevOps ROI Analysis
The goal of analysis is to take a task (potential unit of work) as an input and to generate an output with quantifiable dimensions. This output is a pair of numbers: the investment, in time or money, and the return, in time or money.
DevOps is very focused on process efficiencies, which allows for a very straightforward ROI based analysis. Below are some common types of ROI calculations, with examples:
Simple Time: task -> analyze -> (time invested, time return)
This class of task involves investing time in order to achieve a time savings. This happens all the time with efficiency related tasks.
Automating Toil
SRE defines toil as:
The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Task Description: Since automation is an investment of time in order to save time, it is a great candidate for ROI. Suppose there is a manual deployment process which requires following a playbook and takes approximately 2 hours every week. The effort to automate it is estimated at ~10 hours.
{{ PICTURE OF DEPLOYMENT PROCESS VALUE STREAM}}
Let’s calculate the return over a quarter (13 weeks). Other common intervals are: sprint, multiple quarters, year.
+------------------------------------+-------------------------+
| Task                               | automate manual         |
|                                    | deployment step (rsync) |
| estimated_time_to_complete (hours) | 10.0                    |
| estimated_savings (hours/week)     | 2.0                     |
| return_interval                    | quarter (13 weeks)      |
| savings_per_interval (hours)       | 26.0                    |
| ROI                                | 1.6                     |
| Break Even                         | 5w                      |
+------------------------------------+-------------------------+
The table plugs in the values identified in the task description. The ROI is calculated from the equation above using those values:
return on investment = (26 hours - 10 hours) / 10 hours = 1.6
Similarly the break even is calculated:
break even = 10 hours / (2 hours/week) = 5 weeks
Increasing build efficiencies
Another common class of work is increasing the efficiency of builds or other organizational processes. Work that addresses process efficiency is another great candidate for ROI.
Task description: Suppose that there is a docker based build of a python web application. Each build takes ~7 minutes, with anywhere from 5 to 10 builds per day. At ~7 builds per day, this creates ~49 minutes of time spent in builds per day. There is a proposal to cache some of the most expensive shared python components in a base image, which was observed to reliably shave 3 minutes off each build. That would result in a daily savings of ~21 minutes, or 1.75 hours over a 5 day week. The work is estimated at ~40 hours to fully roll out (one engineer’s full one-week sprint).
+------------------------------------+--------------------------+
| Task                               | docker python base image |
|                                    | baked with shared python |
|                                    | packages.                |
| estimated_time_to_complete (hours) | 40.0                     |
| estimated_savings (hours/week)     | 1.75                     |
| return_interval                    | quarter (13 weeks)       |
| savings_per_interval (hours)       | 22.75                    |
| ROI                                | -0.4                     |
| Break Even                         | 23w                      |
+------------------------------------+--------------------------+
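As a rough check, the figures in the table can be reproduced from the per-build numbers in the task description. This is a minimal sketch: the 7 builds per day and the 5 day work week are assumptions chosen to match the ~21 minutes per day and 1.75 hours per week quoted above.

```python
MINUTES_SAVED_PER_BUILD = 3
BUILDS_PER_DAY = 7       # assumption: within the stated 5-10 range
WORKDAYS_PER_WEEK = 5    # assumption
WEEKS_PER_QUARTER = 13
COST_HOURS = 40.0        # estimated time to complete

# Convert per-build minutes into hours saved per week.
savings_per_week = MINUTES_SAVED_PER_BUILD * BUILDS_PER_DAY * WORKDAYS_PER_WEEK / 60
savings_per_quarter = savings_per_week * WEEKS_PER_QUARTER

roi = (savings_per_quarter - COST_HOURS) / COST_HOURS
break_even_weeks = COST_HOURS / savings_per_week

print(savings_per_week, savings_per_quarter)   # 1.75 22.75
print(round(roi, 1), round(break_even_weeks))  # -0.4 23
```

Changing `BUILDS_PER_DAY` to 10 shows how sensitive the result is to the build volume estimate.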
In this case we’re not able to reach a positive ROI within a quarter. If we’re looking at investment return only, this would not be a good investment. But there are many more dimensions to consider: does this remove a source of technical debt that causes outages or is a large risk factor? Is this a unit of work that HAS to be done, e.g. a deprecation or a version change? We’ll touch a bit more on these sorts of considerations below when we talk about caveats.
Simple Dollars (time=$, dollars)
This analysis type uses money as the lowest common denominator. Doing this allows cost savings work to be compared alongside time (efficiency) savings work. This is an application of the proverb: time is money. To perform this type of analysis, an additional conversion step takes place in which time estimates are converted to dollars using an hourly rate, and then ROI is applied using the same formulas as above.
This calculation can be used when either a time investment is made in order to achieve a return of money or when a money investment is made in order to achieve a return on time. By using money as the lowest common denominator, this further increases the scope of how ROI based calculations can be applied. When more and larger classes of work can be modeled using ROI calculations it allows for even further objective comparisons.
Task Description: A service is deployed to a fleet of 7 Amazon EC2 instances with a cost of $250 / month. The instance size initially chosen was over provisioned, and after a couple of months of operation it is determined that the service only needs 3 instances, which would result in a cost of $100 / month. The estimated time to complete is 10 hours, with an average DevOps engineer salary of ~$60 / hour. This service has been around for a couple of years and it’s expected to be around for at least another year.
+----------------------------------+---------------------------+
| task                             | Reprovision service EC2   |
|                                  | instance for smaller size |
| estimated_cost_to_complete ($)   | 600                       |
| estimated_cost_savings ($/month) | 150                       |
| return_interval                  | year (12 months)          |
| savings_per_interval ($)         | 1800                      |
| ROI                              | 2                         |
| Break Even                       | 4 months                  |
+----------------------------------+---------------------------+
The cost based estimations are plugged into the equations outlined above in the same way the time based estimations are:
return on investment = ($1800 - $600) / $600 = 2
And the break even:
break even = $600 / ($150 / month) = 4 months
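The conversion step from time to dollars can be sketched as follows, using the figures from the task description. The helper name `hours_to_dollars` and the constant names are illustrative only.

```python
HOURLY_RATE = 60.0       # $/hour, from the task description
HOURS_TO_COMPLETE = 10   # estimated engineering time
CURRENT_COST = 250.0     # $/month for the current fleet
PROPOSED_COST = 100.0    # $/month after resizing
INTERVAL_MONTHS = 12     # return interval: one year


def hours_to_dollars(hours, hourly_rate):
    """Convert an engineering-time estimate into a dollar cost."""
    return hours * hourly_rate


cost = hours_to_dollars(HOURS_TO_COMPLETE, HOURLY_RATE)
monthly_savings = CURRENT_COST - PROPOSED_COST
gain = monthly_savings * INTERVAL_MONTHS

roi = (gain - cost) / cost
break_even_months = cost / monthly_savings

print(cost, gain, roi, break_even_months)  # 600.0 1800.0 2.0 4.0
```

The same helper can turn a time savings (for example 2 hours per week) into dollars, so that time based and cost based tasks share one unit.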
Linting
Linting is the second stage of the analysis. This stage applies a filtering function to a potential unit of work in order to qualify or disqualify it for consideration.
In an ROI focused decision model, the rule would probably be that each potential unit of work has to provide a better than break even return to be considered.
In [1]: import collections

In [2]: Task = collections.namedtuple('Task', ('desc', 'cost', 'roi'))

In [3]: def isPositiveReturn(task):
   ...:     return task.roi > 0
   ...:

In [4]: isPositiveReturn(Task(desc='automate manual deployment step (rsync)', cost=10, roi=1.6))
Out[4]: True

In [5]: isPositiveReturn(Task(desc='docker python base image', cost=40, roi=-0.4))
Out[5]: False
Is this a good idea? Is there a return? Linting is the simplest type of analysis because it focuses on a single action and its ROI. It provides a filter function to help qualify work and is simple enough to be calculated mentally.
Caveat: This illustrates where ROI can be deficient. If a task that fails the filter addresses a significant source of risk or complexity, or enables more work, those concerns aren’t captured in ROI. If we were using only an ROI based model, it would rule this work out. I find ROI analysis works best alongside other formal decision making approaches, such as Expected Value (EV), risk assessment, or others that consider second and third degree decisions.
In my own personal framework, for a break even task to be considered it must remove technical debt, reduce risk, or affect other aspects. The really cool part about an ROI approach is that it gives an objective, quantifiable base. If linting fails for a particular task, a new task could be proposed, or the scope of the original task could be changed to try to optimize ROI. ROI provides the dimensions to see how a modification affects the return.
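The break even rule described above could be folded into the lint filter along these lines. This is only a sketch: the `LintTask` tuple and its `reduces_risk` and `removes_debt` flags are hypothetical additions, not part of the `Task` tuple used elsewhere in this article.

```python
from collections import namedtuple

# Hypothetical extension of Task with two qualitative flags.
LintTask = namedtuple(
    "LintTask", ("desc", "cost", "roi", "reduces_risk", "removes_debt")
)


def passes_lint(task):
    """Positive ROI always qualifies; break even needs a qualitative benefit."""
    if task.roi > 0:
        return True
    if task.roi == 0:
        return task.reduces_risk or task.removes_debt
    return False


candidates = [
    LintTask("automate manual deployment step (rsync)", 10, 1.6, False, False),
    LintTask("docker python base image", 40, -0.4, False, True),
    LintTask("hypothetical break even cleanup", 8, 0.0, False, True),
]

for task in candidates:
    print(task.desc, passes_lint(task))
```

A negative ROI task still fails here even when a flag is set, which matches the rule as stated; whether mandatory work such as deprecations should bypass the filter is a separate decision.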
How To Perform Comparisons — Guide Decision Making
This step is focused on determining what should be worked on and in what order, which is difficult. Comparisons are a way to maximize the amount of return for a given interval. There are a couple of heuristics that can guide comparisons, but we can also model this as a knapsack problem in order to get the most value out of an interval. Luckily, ROI quantifies decisions along the cost and return dimensions, allowing for easy, objective comparisons.
We’ll walk through two approaches for deciding what to work on and when. The first chooses the optimal set of tasks for a fixed period of time. The second is a heuristic used to determine what to work on at a specific moment in time.
In [1]: import random

In [2]: from pprint import pprint

In [3]: tasks = [Task(desc="task number {}".format(i), cost=random.randrange(1, 100), roi=random.randrange(1, 100) / 10.0) for i in range(10)]

In [4]: pprint(tasks)
[Task(desc='task number 0', cost=46, roi=8.5),
 Task(desc='task number 1', cost=93, roi=1.5),
 Task(desc='task number 2', cost=2, roi=3.9),
 Task(desc='task number 3', cost=12, roi=1.5),
 Task(desc='task number 4', cost=56, roi=0.1),
 Task(desc='task number 5', cost=60, roi=5.6),
 Task(desc='task number 6', cost=37, roi=2.1),
 Task(desc='task number 7', cost=97, roi=0.2),
 Task(desc='task number 8', cost=31, roi=5.0),
 Task(desc='task number 9', cost=88, roi=7.8)]
The knapsack problem is an optimization problem focused on finding the most valuable set of items that fits in a limited capacity. In this case the capacity is an amount of time (a sprint, a month, a quarter, etc.), and each item is an ROI analyzed unit of work. Modeling potential work in terms of cost and benefit unlocks the ability to apply the well-studied algorithms for the knapsack problem.
In [1]: import knapsack
In [2]: result = knapsack.knapsack(size=[t.cost for t in tasks], weight=[t.roi for t in tasks]).solve(100)

In [3]: result
Out[3]: (18.9, [0, 2, 3, 8])

In [4]: for i in result[1]:
   ...:     print(tasks[i])
   ...:
Task(desc='task number 0', cost=46, roi=8.5)
Task(desc='task number 2', cost=2, roi=3.9)
Task(desc='task number 3', cost=12, roi=1.5)
Task(desc='task number 8', cost=31, roi=5.0)
The above uses the knapsack python package. The size of each item is the cost of the task (hours/money) and the weight is the ROI. We then solve it by passing in the capacity; in this case we are pretending we only have 100 hours. The result is the maximum total ROI we can achieve in the 100 hours, along with the index of each task needed to achieve it.
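For readers who prefer not to depend on a third-party package, the same selection can be sketched with a standard 0/1 knapsack dynamic program. This is an illustrative implementation, not the package's internals: `knapsack_01` is a made-up name, and it assumes the costs are whole numbers, as in the sample task list above.

```python
def knapsack_01(costs, values, capacity):
    """Return (best total value, sorted indices chosen) for integer costs."""
    n = len(costs)
    # best[i][c] = max value using only the first i items within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = costs[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # skip item i-1
            if cost <= c:
                candidate = best[i - 1][c - cost] + value
                if candidate > best[i][c]:       # take item i-1
                    best[i][c] = candidate
    # Walk backwards through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= costs[i - 1]
    return best[n][capacity], sorted(chosen)


# Costs and ROI values copied from the sample task list above.
costs = [46, 93, 2, 12, 56, 60, 37, 97, 31, 88]
rois = [8.5, 1.5, 3.9, 1.5, 0.1, 5.6, 2.1, 0.2, 5.0, 7.8]

total, picked = knapsack_01(costs, rois, capacity=100)
print(round(total, 1), picked)  # 18.9 [0, 2, 3, 8]
```

The table has one row per task and one column per unit of capacity, so the running time grows with the number of tasks multiplied by the capacity, which is negligible for a backlog of this size.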
ROI analysis and the knapsack problem allow us to determine the optimal set of tasks to work on in a given interval in order to generate the greatest value. The results above can then be prioritized using the (least effort, most value) heuristic described below.
(least effort, most value)
Another strategy is a backlog prioritized by least effort and most value. This is as simple as ordering by effort increasing and value decreasing. The ordering is maintained so that when a new unit of work is added it is sorted into place, and the next task at any given time will be the lowest effort for the most value.
In [33]: import operator

In [34]: highest_roi = sorted(tasks, key=operator.attrgetter('roi'), reverse=True)

In [35]: pprint(sorted(highest_roi, key=operator.attrgetter('cost')))
[Task(desc='task number 2', cost=2, roi=3.9),
 Task(desc='task number 3', cost=12, roi=1.5),
 Task(desc='task number 8', cost=31, roi=5.0),
 Task(desc='task number 6', cost=37, roi=2.1),
 Task(desc='task number 0', cost=46, roi=8.5),
 Task(desc='task number 4', cost=56, roi=0.1),
 Task(desc='task number 5', cost=60, roi=5.6),
 Task(desc='task number 9', cost=88, roi=7.8),
 Task(desc='task number 1', cost=93, roi=1.5),
 Task(desc='task number 7', cost=97, roi=0.2)]
The above shows a simple two value sort. Because Python’s sort is stable, sorting by ROI descending and then by cost ascending orders the tasks primarily by lowest cost, with higher ROI breaking ties between tasks of equal cost. At any point the top value is the lowest effort task, and among equally cheap tasks the one with the most value.
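To keep the backlog ordered as new work arrives, without re-sorting the whole list each time, one option is a heap keyed on (cost, negated ROI). The `Backlog` class below is a hypothetical sketch using only the standard library; the third sample task is invented to show the tie-break.

```python
import heapq
from collections import namedtuple

Task = namedtuple("Task", ("desc", "cost", "roi"))


class Backlog:
    """Priority queue ordered by lowest cost, then highest ROI."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # insertion order; final tie-breaker

    def add(self, task):
        # Negate ROI so that a larger ROI sorts first among equal costs.
        heapq.heappush(self._heap, (task.cost, -task.roi, self._counter, task))
        self._counter += 1

    def next_task(self):
        """Remove and return the lowest effort, highest value task."""
        return heapq.heappop(self._heap)[3]

    def __len__(self):
        return len(self._heap)


backlog = Backlog()
backlog.add(Task("task number 3", cost=12, roi=1.5))
backlog.add(Task("task number 2", cost=2, roi=3.9))
backlog.add(Task("hypothetical task", cost=12, roi=4.0))

order = [backlog.next_task().desc for _ in range(3)]
print(order)  # ['task number 2', 'hypothetical task', 'task number 3']
```

Each insertion and removal costs time logarithmic in the backlog size, and the ordering matches the two value sort shown above.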