When Automation Fails: Using Root Cause Analysis to Fix "Broken" Algorithms

Imagine for a moment that you’re in the middle of a major presentation to stakeholders. The dashboard you’re about to showcase goes dark. The system your team spent months tuning to reduce human error has become the point of failure. The automation isn’t functioning as intended: a bot is churning through tasks and producing outputs no one can explain.

The usual response in scenarios like this is predictable: blame the technology, escalate it to your IT department, and retreat to the old toolchains with your tail between your legs. Such a response treats your automation as a mystery rather than looking for the root cause of the issues.

As a Lean Six Sigma practitioner, you are more than equipped to handle the recovery effort. The same DMAIC process we use to boost efficiency and resolve bottlenecks applies directly to the problems that arise in automated systems. Algorithms aren’t black boxes, but processes with inputs, logic, and outputs. When they fail, they fail for discernible reasons.

Algorithms as Processes

Before you begin any sort of remediation process, you’ve got to reject a rather dangerous assumption. It is all too easy to think that algorithms are somehow inherently different from processes. To put it plainly, they aren’t.

An automated system can be mapped just like any other process. You’ve got suppliers like data sources, inputs like your parameters and thresholds, a transformation mechanism, outputs in the form of decisions or flags, and customers as your human users. As such, you’ve got a SIPOC just ready to be mapped. Tracing the steps from supplier to customer, even for something like an AI model, forces your team to see where in the workflow the defect is generated.

Without this sort of thinking, you’re likely to fall into the usual patterns: pointing fingers at vendors, skipping focused code review and testing, or accepting the near-mystical qualities that some members of upper management ascribe to a model without understanding the less obvious ways any automated system can fail.

Defining the Failure Mode

Vague statements produce vague solutions. Saying “The algorithm isn’t working” isn’t a problem statement. Apply your LSS training instead: when an output is wrong, specify what is wrong, under what conditions the system produces those results, and how long it has been doing so.

Start with the defect itself. Is your automation producing incorrect outputs uniformly, or only under a certain condition or within a certain input range? Analyze the failures before you even touch your models. In many investigations, what looked like a systemic failure turns out to be a localized breakdown. The model might never have been trained on data covering the conditions where it fails, which happens all too easily with something like seasonal data sets.
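One way to check whether failures are uniform or localized is to compute the defect rate per segment. The following is a minimal sketch, not a prescribed method: it assumes each logged output is a dictionary with a boolean "defect" field, and the "quarter" field and sample values are purely illustrative.

```python
from collections import defaultdict


def defect_rate_by_segment(records, segment_key):
    """Return {segment: defect rate} for a list of dict records.

    Assumption: every record carries a boolean "defect" field.
    """
    totals = defaultdict(int)
    defects = defaultdict(int)
    for rec in records:
        seg = rec[segment_key]
        totals[seg] += 1
        if rec["defect"]:
            defects[seg] += 1
    return {seg: defects[seg] / totals[seg] for seg in totals}


# Hypothetical failure log; field names and values are illustrative only.
records = [
    {"quarter": "Q1", "defect": False},
    {"quarter": "Q1", "defect": False},
    {"quarter": "Q2", "defect": False},
    {"quarter": "Q4", "defect": True},
    {"quarter": "Q4", "defect": True},
    {"quarter": "Q4", "defect": False},
]

rates = defect_rate_by_segment(records, "quarter")
for segment, rate in sorted(rates.items()):
    print(f"{segment}: {rate:.0%} defective")
```

If one segment carries nearly all of the defects, as Q4 does in this made-up sample, the investigation can focus there instead of on the model as a whole.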

From there, you want to establish a measurable definition for your defects. If an algorithm produces a risk score, for example, define exactly what counts as a defect and set the upper and lower limits an acceptable score must fall within.
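A measurable defect definition can be written down as a small, testable rule. The sketch below assumes a risk score that should fall between 0.0 and 1.0 and treats a missing score as a defect; both the limits and that choice are illustrative assumptions, as are the names `DefectSpec` and `dpmo`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectSpec:
    """Operational definition of a defect for a numeric output."""
    lower_limit: float
    upper_limit: float

    def is_defect(self, score):
        # Assumption: a missing score counts as a defect, as does any
        # score outside the specification limits.
        if score is None:
            return True
        return not (self.lower_limit <= score <= self.upper_limit)


def dpmo(scores, spec):
    """Defects per million opportunities; each score is one opportunity."""
    if not scores:
        raise ValueError("need at least one score")
    defects = sum(1 for s in scores if spec.is_defect(s))
    return 1_000_000 * defects / len(scores)


# Hypothetical limits and scores, for illustration only.
spec = DefectSpec(lower_limit=0.0, upper_limit=1.0)
scores = [0.12, 0.87, 1.4, None, 0.55]
print(dpmo(scores, spec))
```

With a rule like this in place, “the algorithm isn’t working” becomes a number your team can track before and after any fix.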

What the Algorithm Receives

I’ve said it before, but one of the oldest truths in computing is garbage in, garbage out. It is also one of the most frequently overlooked, and it applies just as much to modern algorithms and automation.

Before you evaluate the model’s logic, audit the inputs. Your main goal is to pull a representative sample of the data the algorithm actually receives and compare it to the data it was designed for. There are a few areas of focus here.

First, you’ll want to look for any shifts in the distribution of the input data. A model built on a 2019 data set of baseline outputs for automotive manufacturing is likely to struggle with the standards and models of the 2026 production year.
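One way to put a number on a distribution shift is the Population Stability Index (PSI), which compares how a baseline sample and a current sample fall across the same bins. The article does not prescribe this metric; it is used here as one common option. The sketch assumes equal-width bins taken from the baseline range, and the `eps` floor for empty bins is an implementation assumption.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` relative to `baseline`."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Synthetic data for illustration: the "current" sample is the baseline shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

print(psi(baseline, baseline))  # identical samples give 0.0
print(psi(baseline, shifted))   # a shifted sample gives a clearly positive value
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that cutoff is a convention rather than a standard, so set your own limit from your process history.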

Data pipelines fail silently, often going unnoticed until it’s too late. The usual culprits here are missing data values and encoding errors, much like a handoff error in a physical process. These sorts of issues don’t trigger alerts. Instead, they just feed your models junk data.
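Because these failures raise no alerts on their own, a simple audit of each incoming batch can make them visible. The sketch below counts missing values and text containing the Unicode replacement character (U+FFFD), which typically appears when bytes were decoded with the wrong encoding. The field names and rows are hypothetical, and a real pipeline would need checks suited to its own schema.

```python
import math


def audit_batch(rows, required_fields):
    """Return {field: {"missing": n, "garbled": n}} for a batch of dict rows."""
    report = {f: {"missing": 0, "garbled": 0} for f in required_fields}
    for row in rows:
        for f in required_fields:
            value = row.get(f)
            is_blank = isinstance(value, str) and value.strip() == ""
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_blank or is_nan:
                report[f]["missing"] += 1
            elif isinstance(value, str) and "\ufffd" in value:
                # U+FFFD usually marks text that was decoded incorrectly.
                report[f]["garbled"] += 1
    return report


# Hypothetical batch; field names and values are illustrative only.
rows = [
    {"supplier": "Acme", "amount": 10.0},
    {"supplier": "Caf\ufffd Nord", "amount": None},
    {"supplier": "", "amount": float("nan")},
]

report = audit_batch(rows, ["supplier", "amount"])
print(report)
```

Tracking these counts per batch turns a silent failure into a metric that can be charted and alerted on.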

Feature drift, the model’s equivalent of scope drift, is a real problem. You’ll want to treat every feature as a process variable. When a feature drifts outside its expected control limits, you’ve got a cause worth investigating.
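Treating a feature as a process variable can be as simple as deriving control limits from a stable baseline period and flagging new values that fall outside them. This sketch uses the baseline mean plus or minus three sample standard deviations, which is a simplification; a formal individuals chart would usually estimate sigma from the moving range. The readings are made up.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from a stable baseline sample."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)  # sample standard deviation
    return mean - sigmas * sd, mean + sigmas * sd


def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]


# Hypothetical readings of one input feature, for illustration only.
baseline = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.0]
limits = control_limits(baseline)
flagged = out_of_control([50.1, 51.2, 49.9], limits)
print(limits, flagged)
```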

Measurement System Analysis applies here, too. The data pipeline of any algorithm is a measurement system, so evaluate its repeatability and reproducibility. If the same transaction processed at different times produces different inputs, then the measurement system might be your pain point.
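A quick repeatability check is to push the same transaction through the pipeline several times and see whether any feature changes between runs. The sketch below assumes the pipeline can be called as a function that returns a dictionary of numeric features; `flaky_pipeline` is a made-up stand-in that deliberately leaks call-time state, not a real system.

```python
import itertools


def repeatability_check(extract_features, transaction, runs=5, tolerance=0.0):
    """Run one transaction repeatedly; return features whose range exceeds tolerance.

    Assumption: extract_features returns a dict of numeric features.
    """
    results = [extract_features(transaction) for _ in range(runs)]
    unstable = {}
    for name in results[0]:
        values = [r[name] for r in results]
        spread = max(values) - min(values)
        if spread > tolerance:
            unstable[name] = spread
    return unstable


_calls = itertools.count()


def flaky_pipeline(txn):
    # Hypothetical pipeline: "age_days" depends on when it is called,
    # so the same transaction yields a different input each time.
    return {"amount": txn["amount"], "age_days": next(_calls)}


unstable = repeatability_check(flaky_pipeline, {"amount": 120.0})
print(unstable)
```

Reproducibility can be probed the same way by running the identical transaction through different environments or pipeline versions and comparing the results.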

Fishbone the Algorithm

At this point, you should have a solid definition for your defects and clean measurement data, so you should be more than ready to conduct cause-and-effect analysis. A fishbone, or Ishikawa, diagram is ideal for organizing the potential causes of your algorithm’s failure across a few different categories.

First up are your data causes. These include training data that isn’t wholly representative of your use cases, labeling errors, and the distribution shifts seen in your algorithm’s inputs.

Model design causes cover an algorithm that is unsuited to the problem at hand. This might mean inadequate features or a model type that is mismatched to the task.

Parameter causes are the most common and the easiest of these categories to address. Decision thresholds, weighting, and confidence cutoffs are set during initial deployment. Your conditions are going to change. The parameters of the model aren’t likely to.

Integration causes are fairly easy to remediate and include some tell-tale signs like API changes, version mismatches, and timing issues for real-time systems.

Finally, you’ll want to look at change management causes. In many cases, the model itself hasn’t changed, but the environment has. A process change might have altered the input data, or a new product line might have introduced transactions that the model has never seen.

With each of these categories, you’ll want to run through the Five Whys to drill down to a root cause you can act on. Remember, simply saying “The model was wrong” isn’t adequate. Why is the model wrong? Keep asking the hard questions that will lead to a concrete course of action.

Targeted Remediation

One common mistake you’ll run into when dealing with algorithms is treating retraining the model itself as the default method of remediation. Retraining your model on bad data is going to make a bad model. Remember, garbage in, garbage out.

Instead, match the solution to the verified root cause. If the cause is threshold miscalibration, recalibrate rather than retrain. If it’s pipeline corruption, fix the pipeline, then evaluate whatever next steps might be necessary.
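When the verified root cause is a miscalibrated threshold, the fix can be made on the threshold alone, using recent outputs whose true outcomes are known. The sketch below picks the lowest threshold that keeps the false positive rate within a target. The target rate, the scores, and the rule that a case is flagged when its score is at or above the threshold are all illustrative assumptions.

```python
def recalibrate_threshold(scores, labels, max_false_positive_rate):
    """Return the lowest threshold whose false positive rate meets the target.

    A case is flagged when score >= threshold. labels[i] is True when
    case i was genuinely positive.
    """
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not negatives:
        raise ValueError("need at least one negative case")
    for threshold in sorted(set(scores)):
        false_positives = sum(1 for s in negatives if s >= threshold)
        if false_positives / len(negatives) <= max_false_positive_rate:
            return threshold
    return float("inf")  # no observed score is strict enough: flag nothing


# Hypothetical recent scores with verified outcomes, for illustration only.
scores = [0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, False, True, False, True]

new_threshold = recalibrate_threshold(scores, labels, max_false_positive_rate=0.25)
print(new_threshold)
```

The model itself is untouched here, which keeps the change small, easy to test, and easy to roll back.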

Retraining the model isn’t always your best course of action. You wouldn’t rely on a single method of remediation for every problem in your physical processes, so why default to one for your digital toolchains?

Building Ideal Process Controls

An algorithm requires a control plan, just like your physical processes do. You’ll want controls like continuous monitoring alongside clearly defined control limits for your input features and output distributions. Automated alerts can let your team know when metrics drift outside those limits. Assign ownership, as someone needs to be responsible for reviewing algorithm health at regular intervals.
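A control plan for an algorithm can start as a short list of monitored metrics, each with limits and a named owner. The following is a minimal sketch under assumed metric names, limits, and owners; none of these come from a real system, and alert delivery is reduced to building messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """One line of a control plan: what to watch, its limits, and who owns it."""
    metric: str
    lower: float
    upper: float
    owner: str


def check_controls(plan, latest):
    """Return alert messages for metrics that are missing or outside limits."""
    alerts = []
    for c in plan:
        value = latest.get(c.metric)
        if value is None:
            alerts.append(f"{c.metric}: no reading received, notify {c.owner}")
        elif not (c.lower <= value <= c.upper):
            alerts.append(
                f"{c.metric}: {value} outside [{c.lower}, {c.upper}], notify {c.owner}"
            )
    return alerts


# Hypothetical plan and readings; names and limits are illustrative only.
plan = [
    Control("input_null_rate", 0.0, 0.02, "data engineering"),
    Control("mean_risk_score", 0.30, 0.45, "model owner"),
    Control("daily_volume", 900, 1100, "process owner"),
]
latest = {"input_null_rate": 0.07, "mean_risk_score": 0.38}

alerts = check_controls(plan, latest)
for message in alerts:
    print(message)
```

Treating a missing reading as an alert is a deliberate choice in this sketch: a monitor that stops reporting is itself a silent failure.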

A documented, formalized change management strategy is also a great addition. Any new change or feature should trigger an impact assessment before it’s released into production.

Finally, it’s best to remember that the goal isn’t to eliminate variation. Instead, you want to distinguish common-cause variation from assignable-cause variation. A well-controlled algorithm, like any well-controlled process, will allow problems to surface early on.
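Run rules are one way to separate the two kinds of variation: random noise scatters on both sides of the center line, while a sustained shift produces a long run on one side. The sketch below flags a run of eight consecutive points on the same side, in the style of the Western Electric rules. The run length is a parameter, and the data is synthetic.

```python
def longest_run_one_side(values, center):
    """Length of the longest run of consecutive points on one side of center."""
    longest = current = 0
    side = 0  # +1 above center, -1 below, 0 on the line or not started
    for v in values:
        s = (v > center) - (v < center)
        if s != 0 and s == side:
            current += 1
        else:
            # A point on the center line, or a change of side, restarts the run.
            current = 1 if s != 0 else 0
            side = s
        longest = max(longest, current)
    return longest


def has_assignable_signal(values, center, run_length=8):
    """True when a run on one side of center reaches run_length."""
    return longest_run_one_side(values, center) >= run_length


# Synthetic deviations from target, for illustration only.
noisy = [0.2, -0.1, 0.3, -0.4, 0.1, -0.2, 0.2, -0.3, 0.1, -0.1]
shifted = noisy + [0.3, 0.1, 0.4, 0.2, 0.5, 0.1, 0.3, 0.2]

print(has_assignable_signal(noisy, center=0.0))
print(has_assignable_signal(shifted, center=0.0))
```

Note that every point in the shifted series could still sit inside the control limits; the run rule catches the pattern, not the magnitude.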

Conclusion

Automated systems fail for the same reasons physical processes do: bad inputs, flawed logic, uncontrolled variation, and changes that outpace the controls in place. Approaching a broken algorithm with a measurement plan, structured cause-and-effect analysis, and a tested improvement cycle dispels the mystery surrounding the technology.

The algorithm is the process. Apply your methodology and training accordingly.

About the Author

Liam Frady
