Data Fundamentalism

Michael Roytman    April 26, 2013

A Tale of Two Uncertainties

There are fields where precision is of the utmost importance. In fields of exploration (physics, chemistry, arguably mathematics), we attempt to seek out the truths of the world around us, to get better and better models of what’s going on. In fields of manufacturing (chocolate making, farming, engine casting) precision matters because it produces better products.

What do these fields have in common? From where I sit, behind the terminal and on top of a wall of statistics, the least common denominator is that they have well defined basic data units. Mathematics has axioms and proven theorems, Chemistry has the periodic table, Physics has quarks, chocolate has cocoa beans, and engine casting has alloys and sandmolds.

Information security is nowhere close to those. Uncertainly lurks around every corner, it is inherent in every problem. Because at its core infosec is a game between attackers and defenders, and information is always intentionally imperfect. In this post I will discuss a far more terrifying type of uncertainty – data definitions.

Data Fundamentalism

Data analysis is a language. At it’s most useful it is a way to communicate complex findings to those who don’t have time, skills, or access to the same information that the analyst does. And much like any language it requires static, basic definitions. The first problem is well known: our definitions are uncertain and sometimes late to the game. The second is worse: we don’t understand the source(s) of the uncertainty.

For anyone working on vulnerability management or predictions (excuse the lack of ‘Big Data’ usage here), this second problem is huge.  As @katecrawford writes in her recent HBR article The Hidden Biases in Big Data, “data fundamentalism” has become a pervasive problem. This is the notion that predictive analytics and well-crafted correlations reflect the objective truth. Here’s the relevant sample:

“Former Wired editor-in-chief Chris Anderson embraced this idea in his comment, ‘with enough data, the numbers speak for themselves.’ But can big data really deliver on that promise? Can numbers actually speak for themselves? Sadly, they can’t. Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.”

CVE is NOT the Periodic Table of Elements

I work with vulnerabilities. There are a few places that define vulnerabilities, but CVE is the most universal set of definitions that I have to work with. Yet thinking of CVEs as elements on the periodic table is a grave mistake. Before creating synthetic polymers (read: useful analytics) out of these elements, we need to understand the biases and sources of uncertainty in the definitions themselves.

For example, take a look at this finding from a research team at Concordia University in their 2011 paper Trend Analysis of the CVE for Software Vulnerability Management:

“Our finding shows that the frequency of all vulnerabilities decreased by 28% from 2007 to 2010; also, the percentage of high severity incidents decreased for that period. Over 80% of the total vulnerabilities were exploitable by network access without authentication.”

There are many such papers out there; it is frightening to think they might guide organizational decision-making. This type of analysis misses the boat on what is being analyzed; it takes CVE to be analogous to the Constitution for legal scholars. An increase or decrease in their frequency, or the types that are being published over a time bucket can have wildly varying biases.

Let’s dive into some of them. The aforementioned HBR article also alludes to this: there’s no such thing as raw data. People working in data today need to take a clue from the less quantitative disciplines and take a look at where the data originates and how it’s gathered.

A Brief History of the Time: From @SushiDude to Today

CVE is a dictionary of known infosec vulnerabilities and exposures, and it is intended as a baseline index for assessing the coverage of tools. It is not intended as a baseline index for the state of infosec, as the aforementioned paper takes it.

Let’s start with this: the Wikipedia page is dead wrong. At this year’s RSA, I wanted to delve deeper into the exact process of CVE creation, and sought out @SushiDude, the father of CVE. Here’s the story:

At its’ inception in 1999, CVE was a very different form of data than it is today. Back then, it already had an established advisory board, but vulnerabilities would be sent to the advisory board for review, and the batch would come back after staying in the queue for 3-6 months. This system was clearly recognized as not fast enough to keep up with the rapid increases in vulnerability disclosures, and the candidate (CAN) system was introduced in 2000. A CAN vulnerability would be updated into a CVE vulnerability if accepted by the advisory board, and would stay a CAN if it were rejected. Then, in 2004, the volume of disclosures once again surpassed throughput capacity. This time MITRE dropped the CAN status code, and began evaluating the vulnerabilities internally. This required internal resources to de-duplicate vulnerabilities, write up descriptions, and categorize them. At first, one person did this work. Today, it’s a team, and they’re making their operation more efficient as we speak.

disclosures over time

Looking at the volume of CVEs seems to suggest that steadily increasing CVE disclosures mean ‘the state of security is getting worse, or some such poorly structured inference.

However, this is not a dictionary. This is a company with limited resources, attempting to streamline a process. This is easily seen when looking at the rate of disclosures over time. Note the changes in process cause a reduction in throughput first, then an increase. (Actually, this leads me to believe there was a change in the process in 2011 as well)

cvedisclosuresbyyear

Their objective, from my conversation with them, is to increase throughput. This makes perfect sense – inform the public about as many standardized vulnerabilities, as their resources will allow. However, in this process lies the essential biases inherent in the basic units of vulnerabilities:

1.     Descriptions – Some vulnerabilities are inherently more difficult to describe. Some attack vectors entail chaining of two, three, even five different weaknesses in an application. Analysts can publish five other vulnerabilities in the time it takes to write up a complicated chained one.

2.     Categorization – A CVE is meant to exist independently of the multiple perspectives on that vulnerability. In the submission process, a whitehat might find a vulnerability, a vendor might submit the same one, or a third party may discover it as well, all assessing it differently. The more of such sources, the harder it is for analysts to standardize this information. For another great argument about how the process of CVE categorization influences statistics, see this OSVDB post.

3.     Prioritization – There’s a vast and unexplored sea of vulnerabilities out there, and limited manpower. So how does MITRE decide which of them to look at? The advisory board helps made decisions about which vendors to prioritize, and some vulnerabilities get left in the backlog. A nice phrase employed to describe this sea of backlog is the “php.golf.” The opportunity costs of working on all of those are missing a Windows vulnerability, and so they stay put. In fact, there are a few CVE Numbering Authorities which get (rightfully so) preferential treatment. Also note how low severity vulnerabilities rarely enter the picture unless throughput is at high level (i.e. most of the high and medium severity stuff has been taken care of).

Here’s a little light reading to prove to you this isn’t make believe, from the CVE internal mailing lists:

In addition, disclosures about some software vendors (such as Microsoft) are given higher priority than a disclosure about a php.golf application written by an undergraduate student as a part of his programming class and posted to a blog (“php.golf” has become something of an internal catchphrase for “stuff we don’t care about”).

And if you’re truly interested, dive deep into the link above to see all the inner workings of how a CVE comes to be. Regardless, it’s always good to know exactly what you’re working with.

Leave a Reply

Your email address will not be published. Required fields are marked *