Veteran software developer David A. Kruger offered some thoughts on computer security recently at Expensivity and we appreciate the opportunity to republish them here. He starts with “Root Cause Analysis 101”
The classic line “I have a bad feeling about this” is repeated in every Star Wars movie. It’s become a meme for that uneasy feeling that as bad as things are now, they are about to get much worse. That’s an accurate portrayal of how many of us feel about cybersecurity. Our bad feeling has a sound empirical basis. Yearly cybersecurity losses and loss rates continually increase and never decrease despite annual US cybersecurity expenditures in the tens of billions of dollars and tens of millions of skilled cybersecurity man-hours. Cybersecurity’s record of continuously increasing failure should prompt thoughtful observers to ask questions like “Why are cybersecurity losses going up? Why isn’t cybersecurity technology reducing them? Are there things we don’t understand or are overlooking?”
That’s easy to answer: Of course, there are! After spending this much time, money, and brainpower on cybersecurity without managing to decrease losses, much less eliminating them, it’s painfully obvious something isn’t right.
This article explains what we get wrong about cybersecurity, how and why we get it wrong, and how to fix it. Fair warning: it’s a long and bumpy road. There a healthy dose of counterintuitive assertions, cybersecurity heresy, and toes stepped on, but at roads end you’ll know what the true cause of cybersecurity failure is and how to fix it.
Part One – Cybersecurity Technology
The Heart of the Matter
When confronted with a chronic problem, we human beings are prone to err by trying solutions without first asking the right questions. We tend to ask, “How do we stop this now?” and fail to ask, “What’s causing this?” Then we are shocked when our fixes don’t last. This tendency is so common that safety engineers developed a formal analytical method called a root cause analysis to prevent this error. Root cause analysis is designed to find unidentified causes of recurring failure. A root cause analysis starts with an effect, in this context, a failure, and works upstream all the way through the chain of causation until the root cause is found. In complex systems like computers, finding the root cause of failure is critically important because an unidentified root cause makes multiple downstream elements of the system much more prone to fail. You can tell when you’ve found the root cause, because if you fix it, the downstream recurring failures cease.
Identifying the root cause in complex systems can be hard because:
- A single root cause can spawn multiple instances and types of failure because a single root cause can spawn multiple chains of cause and effect. The chains can be long, having many intermediate cause and effect links between the root cause and the failure. The more links in the chain, the longer the “distance” between the root cause and the failure. Long chains branch and intersect with other chains, which makes it even more difficult to identify the root cause.
- Usually, the longer the distance is between an unidentified root cause and the failures it’s causing, the harder the root cause it is to identify. The shorter the distance between an intermediate cause and the failures, the easier the intermediate cause it is to identify. Intermediate causes are obvious, unidentified root causes are not—and that’s why root causes are so often overlooked.
Because of these difficulties, problem solvers can easily fall prey to the symptomatic solution fallacy, a mistaken belief that solving intermediate problems can permanently stop long distance failures. It’s called the “symptomatic” solution fallacy because it’s the engineering equivalent of a doctor believing that a treatment is curative when it only temporarily alleviates symptoms of an undiagnosed chronic disease. For example, a dose of pain medication can temporarily alleviate suffering, but it can’t cure the cancer that’s causing the pain.
To see how root cause analysis aids in finding and fixing unidentified root causes, we’ll review a common real world root cause analysis and then take the lessons learned and apply them to cybersecurity technology and then to cybersecurity policy.
Root Cause Analysis 101
The purpose of automaker safety recalls is to prevent recurrent failures attributable to a previously unidentified root cause. Recently, 700,000 Nissan Rogue SUVs were recalled because:
“In affected vehicles, if water and salt collect in the driver’s side foot well, it may wick up the dash side harness tape and enter the connector. If this occurs, the dash side harness connector may corrode and possibly cause issues such as driver’s power window or power seat inoperative, AWD warning light ON, battery discharge, and/or thermal damage to the connector. In rare cases, a fire could potentially occur, increasing the risk of injury.”
Lesson Learned 1. A root cause analysis, and ultimately the recall, was initiated by the automaker because it observed a pattern of multiple types of recurring failure that appear to be related, in this case multiple types of electrical failures.
Lesson Learned 2. From the perspective of the driver, if your power windows or seats stop working, or your care won’t start because the battery is dead or wiring in the dashboard of your 2014-2016 Nissan Rogue catches fire, it’s apparent that the problem is electrical. The root cause analysis revealed that the closest cause to these electrical failures was obvious, a corroded wiring harness connector.
Now, imagine the automaker had identified the wiring connector as the root cause and declared that replacing it was a permanent fix. It would soon be evident that the automaker had fallen prey to the symptomatic solution fallacy because replacing the connector would not be a permanent solution. The still unidentified and unfixed root cause would cause the replacement connector to corrode again, which, in turn, would cause one or more of the related failures to recur.
Key Point: After a fix has been applied, if related failures continue recurring, it’s evident that an intermediate cause was erroneously identified as the root cause.
Lesson Learned 3. Working the chain of causation backwards, the automaker deduced the cause of corrosion was exposure to moisture and a corrosive. What was the source? They deduced that the wiring harness tape wicked moisture and salt up to the connector, but where did the water and salt come from? They deduced the wiring harness was being wetted as it traversed the footwell.
The potential presence of water and salt in the footwell of an SUV is a known operating condition. A given vehicle may or may not encounter salt and water during its lifetime, but it is a known potential operating condition for all SUVs. The automaker neglected to take this known operating condition into account when selecting the routing and the physical characteristics of the tape used to wrap the wiring harness. Therefore, the root cause of failure is that the automaker neglected to compensate for a known operating condition in its design. Note that this finding is axiomatic; truly unforeseeable root causes are rare.
Key Point: In complex systems, it is axiomatic that recurring failures attributable to a previously unidentified root cause nearly always results from neglecting to compensate for known operating conditions in the design.
Lesson Learned 4. Now that the root cause has been identified, the automaker will conduct a requirements analysis to clarify operating conditions, needs, and goals of the fix, and then redesign to compensate for overlooked operating condition and minimize their and their customers’ risk and expense.
Lesson Learned 5. Since the automaker neglected to compensate for a known operating condition—potential exposure of an SUV to water and salt—in their design, the automaker is responsible legally, financially, and morally, for fixing the affected vehicles and making certain that the overlooked operating condition is compensated for in the design of all future models.
Summary of Root Cause Analysis Lessons Learned:
- Lesson Leaned 1: A pattern of multiple types of recurring related failures indicates the presence of an unidentified root cause.
- Lesson Learned 2: If repeated fixes fail to stop recurring failures, it indicates fixes are being applied to intermediate causes (symptoms), rather than to the root cause.
- Lesson Learned 3: It is axiomatic that neglecting to compensate for a known operating condition in the design is nearly always the root cause.
- Lesson Learned 4: To fix the root cause, a redesign compensating for the overlooked operating condition is required.
- Lesson Learned 5: The designers neglected to compensate for a known operating condition, therefore, they are responsible for fixing existing and new designs.
Next: What’s Wrong with Cybersecurity Technology?
Here are all thirteen segments in the series:
The true cause of cybersecurity failure and how to fix it Hint: The cause and fix are not what you think. David A. Kruger, a member of the Forbes Technology Council, says it’s getting worse: We’re in a hole so stop digging! Get back to root cause analysis.
What’s wrong with cybersecurity technology? Know your enemy: The target isn’t networks, computers, or users; they are pathways to the target —gaining control of data. The challenge: If a cyberdefender scores 1,000,000 and a cyberattacker scores 1, the cyberattacker wins, David Kruger points out.
Ingredients that cybersecurity needs to actually work Software makers continue to produce open data as if we were still living in the 50s, and the Internet had never been invented. Forbes Council’s David Kruger says, the goal should be safety (preventing harm) rather than, as so often now, security (reacting to hacks with new defenses).
Cybersecurity: Put a lid on the risks. We already own the lid. Security specialist David Kruger says, data must be contained when it is in storage and transit and controlled when it is in use. Cyberattackers are not the problem; sloppy methods are. We must solve the problem we created one piece of data or software at a time.
The sweet science of agile software development Effective security, as opposed to partial security, increases costs in the short run but decreases them in the long run. Software veteran: Getting makers to change their priorities to safer products safe rather than the next cool new feature will by no means be easy.
Computer safety expert: Start helping ruin cybercriminals’ lives. Okay, their businesses. Unfortunately, part of the problem is the design of programs, written with the best of intentions… First, we must confront the fact that software makers are not often held responsible for the built-in flaws of their systems.
The cybercriminal isn’t necessarily who you think… Chances are, the “human data collector” is just someone who works for a company that makes money collecting data about you. Did you know that his bosses have paid gazillions in fines for what he and his fellows do? Let’s learn more about what they are up to.
Sometimes, money really is the explanation. Today’s internet is a concentration of power, in terms of information, never before seen in history. The HDCs (human data collectors) treat us as guinea pigs in a thoroughly unethical experiment designed to learn how to manipulate the user most effectively.
How search engine results can be distorted Search providers such as Google are able to increase their ad revenues by distorting the search results delivered to users. Human data collectors (HDCs) have been able to evade responsibility for the preventable harms they cause by blame shifting and transferring risk to users.
How online human data collectors get free from responsibility Cybersecurity expert David A. Kruger talks about the Brave Old World in which you have much less power than Big Tech does. For Big Tech, government fines and other censures are merely a cost of doing business, which makes reform difficult at best.
Cybersecurity: Why a poke in the eye does not work. The current system punishes small businesses for data breaches they could not have prevented. Computer security expert David Kruger says the current system makes as much sense as fining the hit and run victim for not jumping out of the way.
Is your data about yourself too complex for you to manage? That’s the argument human data collectors (HDCs) make for why they should be allowed to collect and own your data. Policymakers should declare that human data is the property of the individual, not of the data collector, computer security expert David Kruger argues.
How software makers will push back against reforms Software makers will grumble but insurers may force their hand. That, however, is NOT the Big Battle… the Big Battle: Wall Street will oppose reforms that restore control to you because the market cap of Big Tech depends on human data collection.