Risk Management for High Stakes Projects

by John R. Harris

A simplified form of risk management based on quantification and pooling of risk can be applied to software development projects. For most projects this level of risk management and tracking is unnecessary, and some would even call it an agile anti-pattern. But for high-stakes projects, where failure to deliver on time or failure to deliver a defined feature set would be catastrophic, an approach like this is warranted. It’s up to you which is more appropriate for your project: being so agile you can dance around all risks, or buying insurance by budgeting to manage predictable risks.

Risk Quantification and Pooling

Successful risk management requires a consistent approach to classification and quantification of risks. While most people can identify risks, quantifying them requires some practice. The approach described here is widely used in many industries and works well. A risk is an event or occurrence that has not yet happened but has a realistic possibility of happening during the project and would have a non-trivial impact on at least one project objective. This definition is somewhat convoluted, but the details are important. We don’t care about risks whose probability is vanishingly small or catastrophes from which we cannot possibly recover. The first is not worth tracking, and the second would lead to immediate cancelation of the project and so is not worth managing.

Given a list of risks that are reasonably likely to happen, and not so catastrophic that they would lead to immediate cancelation of the project, we can proceed with classification. To correctly classify each risk we must define five aspects.

Context - The frame of reference for the risk, usually a phase of a project or initiative
Hazard - The cause of an unplanned occurrence with negative consequences
Impact - The cost of recovering from the negative consequences of an occurrence
Probability - The likelihood of an occurrence
Exposure - The impact, adjusted for likelihood, of an occurrence

Example risk

Context - Commuting 20 miles each way, during rush-hour freeway traffic, in a sedan
Hazard - A fender-bender collision, being rear-ended by another vehicle (not at fault)
Impact - Up to $3000 bodywork. No medical fees
Probability - 1 in 100,000 per day (100,000 cars at rush-hour in the city with one accident a day)
Exposure - $3000 x 1/100,000 = $0.03 per day = $10.95 per year

If every commuter on the road buys collision insurance at $10.95 for the year, we can cover all likely collision risk for all commuters. Risks can also be aggregated across a portfolio of multiple different types of risk each with different probabilities. By pooling all the risks for a project in this way, we can calculate the aggregate exposure across the entire project and ensure that, while we may not anticipate the exact impact or probability of every risk, over the project lifetime we are accurate enough to manage the aggregate risk successfully.
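The arithmetic behind the commuting example can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the variable names are illustrative and the 365-day year is the same simplification the example makes.

```python
# Figures taken from the commuting example above.
impact_usd = 3000                   # bodywork cost of one collision
probability_per_day = 1 / 100_000   # one accident a day among 100,000 cars
commuters = 100_000

# Exposure is the impact adjusted for likelihood.
daily_exposure = impact_usd * probability_per_day
annual_exposure = daily_exposure * 365

# Pooling: every commuter pays the annual exposure as a premium,
# and the pool pays for one collision per day over the year.
pool = annual_exposure * commuters
expected_claims = impact_usd * 365

print(f"daily exposure:  ${daily_exposure:.2f}")
print(f"annual exposure: ${annual_exposure:.2f}")
print(f"pool: ${pool:,.0f}  expected claims: ${expected_claims:,.0f}")
```

No individual commuter knows whether they will be hit, but the pooled premiums match the expected claims, which is the property the project risk buffer relies on.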

In practice, the success of this approach comes down to our ability to identify the risks and accurately define exposure. We need to define reasonable exposure for each risk for the risk bundling approach to work. It takes practice and experience to estimate exposure successfully. For example, a typical poorly defined risk would be “The sub-contractor will be late delivering their components!” While this may be true, it is not quantified and is therefore not manageable. This risk should be redefined as follows:

Context - A three-month project with 6 staff
Hazard - The sub-contractor will be late delivering their billing component by up to 3 calendar days
Impact - A 2-day delay for 4 team members (8 project days)
Probability - 75%
Exposure - Impact x Probability = 8 x 0.75 = 6 days LoE

Usually, we express this as a single sentence and exclude the context as this tends not to change during the project - all risks are in the same project context. So the above risk would be expressed as:

Risk - The sub-contractor will be late delivering the billing component by up to 3 days leading to a 2-day delay for 4 team members
Impact - 8 LoE
Probability - 75%
Exposure - 8 x 0.75 = 6 LoE

The difference between the poorly defined risk and the well-defined risk is in the quantification of the impact and the likelihood. But what if the sub-contractor is even later? We can extend our coverage to handle further delays by adding another risk as follows:

Risk - The sub-contractor will be late delivering the billing component by between 4 and 8 days leading to an additional 4-day delay for 4 team members
Impact - 16 LoE
Probability - 20%
Exposure - 16 x 0.20 = 3.2 LoE (roughly 3)

Together the two risks account for uncertainty in the delay and spread in probability. The impact and exposure are both defined in terms of Level-of-Effort (LoE) in days for a single team member. If you don’t like estimating in LoE, you can use story points or T-shirt sizing or whatever metric you want to quantify tasks.
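As a minimal sketch, the two sub-contractor risks can be modeled as records whose exposure is the impact multiplied by the probability. The `Risk` class and its field names are hypothetical, chosen only for illustration; the numbers come from the two risks above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    description: str
    impact_loe: float    # person-days needed to recover if the risk occurs
    probability: float   # likelihood of occurrence, 0.0 to 1.0

    @property
    def exposure_loe(self) -> float:
        # Exposure = Impact x Probability
        return self.impact_loe * self.probability


risks = [
    Risk("Billing component up to 3 days late", impact_loe=8, probability=0.75),
    Risk("Billing component 4 to 8 days late", impact_loe=16, probability=0.20),
]

for risk in risks:
    print(f"{risk.description}: {risk.exposure_loe:.1f} LoE")

# Pooling the risks gives the aggregate exposure for the project.
aggregate_exposure = sum(risk.exposure_loe for risk in risks)
print(f"Aggregate exposure: {aggregate_exposure:.1f} LoE")
```

The same structure works with story points or any other unit, as long as impact and task estimates use the same one.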

Planning to Manage Risk

At the start of a project we identify and quantify all the risks we can and calculate the total Aggregate Risk Exposure in LoE. The total LoE for all tasks AND the aggregate risk exposure are summed and used to forecast the delivery date. By adding the risk exposure to the forecast we are effectively adding a risk buffer to the project that should cover all reasonable eventualities. If we have identified and estimated our risks correctly, the risk buffer will have been consumed by the time we finish the project - all tasks will have been completed, and all risks will have been managed.
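A rough sketch of how the buffer feeds into the forecast follows. The function name and the 360 person-day task estimate are assumptions made for illustration, and the sketch assumes work divides evenly across the team, which real projects rarely achieve.

```python
import math


def forecast_working_days(task_loe: float, risk_exposure_loe: float, team_size: int) -> int:
    """Working days to delivery when the risk buffer is added to the task LoE."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    total_loe = task_loe + risk_exposure_loe
    # Round up: a partly used day still pushes delivery to the next day.
    return math.ceil(total_loe / team_size)


# Hypothetical plan: 360 person-days of tasks for a team of 6,
# with 9.2 LoE of aggregate risk exposure.
without_buffer = forecast_working_days(360, 0, team_size=6)
with_buffer = forecast_working_days(360, 9.2, team_size=6)
print(without_buffer, with_buffer)
```

In this hypothetical plan the buffer moves the forecast from 60 to 62 working days.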

Before starting the project, we assess the aggregate exposure as a percentage of planned task LoE using the following rule of thumb.

Low Exposure - aggregate exposure below 5% of total project task LoE
Moderate Exposure - aggregate exposure between 5% and 10% of total project task LoE
High Exposure - aggregate exposure between 10% and 15% of total project task LoE

We aim to keep the project exposure well below 15% of the total estimated project task LoE. Experience has shown that while 15% is just about manageable, project exposures greater than this are hard to manage and generally indicate that we don’t understand the project well enough to commit to on-time or defined-scope delivery.
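The rule of thumb can be written as a small classifier. This is a sketch under stated assumptions: the text does not say which band an exact 5% or 10% belongs to, so the boundaries here are a guess, and the "Excessive" label for anything above 15% is an invented name rather than a term from the method.

```python
def classify_exposure(aggregate_exposure_loe: float, task_loe: float) -> str:
    """Classify aggregate risk exposure as a share of planned task LoE."""
    if task_loe <= 0:
        raise ValueError("task_loe must be positive")
    percent = 100 * aggregate_exposure_loe / task_loe
    if percent < 5:
        return "Low"
    if percent < 10:       # assumption: exactly 5% counts as Moderate
        return "Moderate"
    if percent <= 15:      # assumption: exactly 15% still counts as High
        return "High"
    return "Excessive"     # hypothetical label: re-plan before committing


# Hypothetical project with 360 person-days of planned tasks.
for exposure in (9.2, 30, 45, 72):
    print(exposure, classify_exposure(exposure, 360))
```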

If the risk exposure is higher than 15% of the total estimated project task LoE, we continue to adjust the plan until the risk exposure is reduced to an acceptable level. Various strategies for these adjustments are described below.

Risk Mitigation (Risk Backlog Grooming)

The following table is used to classify each risk and assign a mitigation strategy. Risks are assigned to the table based on their probability and impact. Impact is classified “Very Low” to “Very High” based on the normalized size of the impact compared to the total project task LoE.

All risks are assigned a mitigation strategy by assigning them to the table using their probability and impact

Risk Mitigation Strategy

Mitigation Strategies

Accept - Add the exposure to the project risk backlog
Reduce - Accept the risk but develop a plan to reduce the impact as much as possible
Transfer - Try to transfer the risk to someone else or buy “insurance”
Avoid - Don’t move forward with this risk. Add it to the plan as a scheduled task (assume it will happen and plan for it) or defer it until after delivery

This strategy is applied to all risks before the plan is finalized. The project proceeds when the project risk exposure is less than 15%, and all reasonable steps have been taken to manage the risks.

Managing the Risk Backlog

All risks are created as tasks and added to the bottom of the project backlog with an LoE set to the estimated exposure for the risk.

As each sprint is planned:

  • Newly identified risks should be added to the backlog
  • Any risk that has occurred should be moved into the sprint and should have its LoE adjusted to the full estimated impact
  • Any risk that is no longer possible should be closed/retired
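The sprint-planning steps above can be sketched as follows. The `Status` enum, the `RiskItem` class and the `plan_sprint` function are all hypothetical names, and the last two backlog entries are invented purely to exercise each rule.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"          # has not happened yet but still could
    OCCURRED = "occurred"  # has happened and must now be worked
    RETIRED = "retired"    # can no longer happen


@dataclass
class RiskItem:
    title: str
    impact_loe: float
    probability: float
    status: Status = Status.OPEN

    @property
    def loe(self) -> float:
        if self.status is Status.OCCURRED:
            return self.impact_loe                 # full estimated impact
        if self.status is Status.RETIRED:
            return 0.0                             # closed, costs nothing
        return self.impact_loe * self.probability  # exposure while open


def plan_sprint(risk_backlog, new_risks=()):
    """Return (risks moved into the sprint, risks left on the backlog)."""
    items = list(risk_backlog) + list(new_risks)   # add newly identified risks
    into_sprint = [r for r in items if r.status is Status.OCCURRED]
    still_open = [r for r in items if r.status is Status.OPEN]  # retired ones drop out
    return into_sprint, still_open


backlog = [
    RiskItem("Billing component up to 3 days late", 8, 0.75, Status.OCCURRED),
    RiskItem("Billing component 4 to 8 days late", 16, 0.20),
    RiskItem("Hypothetical risk that can no longer occur", 5, 0.50, Status.RETIRED),
]
new_risks = [RiskItem("Hypothetical newly identified risk", 10, 0.30)]

into_sprint, still_open = plan_sprint(backlog, new_risks)
print([(r.title, r.loe) for r in into_sprint])
print(f"Remaining risk backlog: {sum(r.loe for r in still_open):.1f} LoE")
```

A risk that has occurred is scheduled at its full 8 LoE impact rather than its 6 LoE exposure, so the difference is drawn from the buffer contributed by risks that never occur.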

Eventually, if our risk estimation is correct, we will reach the end of the project on time with no risks left.

At the start of each sprint, we should ensure the aggregate LoE for the outstanding risks is not too high. If the risk backlog exceeds 15% of the remaining task backlog, we should take steps to reduce the risk backlog using the mitigation strategies described above. In the worst case, this may lead to a change in the forecast completion date or adding staff to the project.
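A per-sprint check might look like the sketch below. The function name is hypothetical, and treating an empty task backlog with risks still open as a trigger is an assumption rather than something the method specifies.

```python
def risk_backlog_needs_grooming(
    open_risk_loe: float,
    remaining_task_loe: float,
    threshold: float = 0.15,
) -> bool:
    """True when open risk exposure exceeds the threshold share of remaining tasks."""
    if remaining_task_loe <= 0:
        # Assumption: any open risk with no task work left needs attention.
        return open_risk_loe > 0
    return open_risk_loe / remaining_task_loe > threshold


# The same 9.2 LoE of open risk early and late in a hypothetical project.
print(risk_backlog_needs_grooming(9.2, remaining_task_loe=360))
print(risk_backlog_needs_grooming(9.2, remaining_task_loe=40))
```

Because the comparison is against the remaining task backlog, an exposure that was comfortable at the start of the project can cross the threshold as the tasks burn down.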

This approach works provided there are enough risks to average out errors in our impact and probability estimates. As the project proceeds and the number of remaining risks falls, this method becomes erratic. Particular care should be taken at the end of a project if there are remaining risks with substantial impacts, as these can prevent on-time delivery.