
Machine Learning that's Killer


Finding terrorists who haven’t yet struck is like looking for a needle in a haystack. In fact, it may mean looking for needles where there aren’t any.

Ars Technica put up an interesting but deeply concerning piece yesterday regarding the use of the NSA’s SKYNET program to automatically identify terrorists based on metadata.

Patrick Ball—a data scientist and the executive director at the Human Rights Data Analysis Group—who has previously given expert testimony before war crimes tribunals, described the NSA's methods as "ridiculously optimistic" and "completely bullshit." A flaw in how the NSA trains SKYNET's machine learning algorithm to analyse cellular metadata, Ball told Ars, makes the results scientifically unsound.

Somewhere between 2,500 and 4,000 people have been killed by drone strikes in Pakistan since 2004, and most of them were classified by the US government as "extremists," the Bureau of Investigative Journalism reported. Based on the classification date of "20070108" on one of the SKYNET slide decks (which themselves appear to date from 2011 and 2012), the machine learning program may have been in development as early as 2007. In the years that have followed, thousands of innocent people in Pakistan may have been mislabelled as terrorists by that "scientifically unsound" algorithm, possibly resulting in their untimely demise.

Even the slides the NSA uses to tout the program’s successes falsely identify an Al-Jazeera correspondent as a member of both Al-Qaeda and the Muslim Brotherhood.

The program, the slides tell us, is based on the assumption that the behaviour of terrorists differs significantly from that of ordinary citizens with respect to some of these properties. However, as The Intercept's exposé last year made clear, the highest-rated target according to this machine learning program was Ahmad Zaidan, Al-Jazeera's long-time bureau chief in Islamabad.

The article offers quite a bit of insight into the methods used to classify and identify terrorists, but the short version is that a common machine learning algorithm is trained on a very small number of known terrorists. Then, as metadata records and other indicators of possible terrorist leanings are fed into the system–now from a much larger population, such as the entire population of Pakistan–the program determines with some degree of confidence how likely any given person is to be a terrorist.
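The pipeline just described can be sketched in a few lines. This is a deliberately naive nearest-centroid scorer, not the NSA’s actual model (the slides reportedly describe a random forest over dozens of metadata features), and the feature names are invented for illustration:

```python
# Toy sketch of the pipeline: train on a handful of labelled records,
# then score an entire population. Feature names and the nearest-centroid
# scoring are invented for illustration only.
import random

random.seed(0)

FEATURES = ["night_calls", "sim_swaps", "travel_events"]  # hypothetical

def make_record():
    # A fake metadata record: numeric behavioural features in [0, 1).
    return {f: random.random() for f in FEATURES}

# Step 1: a tiny labelled training set (the article reports only seven).
known_positives = [make_record() for _ in range(7)]

# Step 2: "train" by averaging the known positives' features.
centroid = {f: sum(r[f] for r in known_positives) / 7 for f in FEATURES}

def suspicion(record):
    # Smaller distance to the centroid means a higher suspicion score.
    return -sum((record[f] - centroid[f]) ** 2 for f in FEATURES)

# Step 3: score a large population and rank everyone by suspicion.
population = [make_record() for _ in range(100_000)]
ranked = sorted(population, key=suspicion, reverse=True)
flagged = ranked[:100]  # the top scorers get flagged as likely terrorists
```

Note that nothing in this pipeline ever asks whether the training set is representative; the ranking comes out looking authoritative regardless of how little it was trained on.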

The problem is that the initial sample of known terrorists was extremely small: only seven people. If you’ve ever studied any statistics, you know this is far too small to be an effective sample. A sample that small heavily biases the algorithm toward characteristics that may be unique to those seven people, or only coincidentally shared among them. You need a much larger sample to effectively train such a system. And if you don’t train it well, not only do you risk false negatives (that is, missing people you wanted to catch), you risk false positives as well–designating innocent people as terrorists.
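A quick simulation shows why seven examples is hopeless. Give the model enough irrelevant attributes (the count of 200 below is arbitrary, chosen for illustration) and some of them will be shared by all seven people through pure chance; the algorithm has no way to distinguish those coincidences from real signal:

```python
# With only 7 positive examples and many irrelevant binary attributes,
# some attributes will coincide across all 7 records by pure chance.
import random

random.seed(42)

N_FEATURES = 200   # arbitrary number of irrelevant coin-flip attributes
SAMPLE = 7         # the reported size of SKYNET's positive training set

def coincidences():
    # Draw 7 records of random bits; count attributes identical in all 7.
    rows = [[random.randint(0, 1) for _ in range(N_FEATURES)]
            for _ in range(SAMPLE)]
    return sum(len({row[j] for row in rows}) == 1 for j in range(N_FEATURES))

# Each attribute is all-0 or all-1 across 7 records with probability
# 2 * (1/2)**7 = 1/64, so ~200/64 ≈ 3 spurious "shared traits" are expected.
expected = N_FEATURES * 2 * 0.5 ** SAMPLE
average = sum(coincidences() for _ in range(2000)) / 2000
```

Every one of those spurious shared traits looks, to the algorithm, exactly like a defining characteristic of a terrorist.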

The NSA tries to compensate for this lack of confidence by setting the detection threshold high enough that half of the actual terrorists escape detection, the thought being that anyone who still scores above the cutoff is almost certainly a terrorist:

If 50 percent of the false negatives (actual "terrorists") are allowed to survive, the NSA's false positive rate of 0.18 percent would still mean thousands of innocents misclassified as "terrorists" and potentially killed. Even the NSA's most optimistic result, the 0.008 percent false positive rate, would still result in many innocent people dying. "On the slide with the false positive rates, note the final line that says '+ Anchory Selectors,'" Danezis told Ars. "This is key, and the figures are unreported... if you apply a classifier with a false-positive rate of 0.18 percent to a population of 55 million you are indeed likely to kill thousands of innocent people. _[0.18 percent of 55 million = 99,000]._ If however you apply it to a population where you already expect a very high prevalence of 'terrorism'—because for example they are in the two-hop neighbourhood of a number of people of interest—then the prior goes up and you will kill fewer innocent people."

Besides the obvious objection of how many innocent people it is ever acceptable to kill, this also assumes there are a lot of terrorists to identify. "We know that the 'true terrorist' proportion of the full population is very small," Ball pointed out. "As Cory [Doctorow] says, if this were not true, we would all be dead already. Therefore a small false positive rate will lead to misidentification of lots of people as terrorists."
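The arithmetic in the quote is easy to verify, and extending it with Bayes’ rule makes Ball’s base-rate point concrete. The prevalence below (one actual terrorist per 20,000 people) is not from the article; it is a made-up figure used only to illustrate how a tiny base rate swamps even a small false positive rate:

```python
# Checking the quote's arithmetic, then computing what fraction of the
# people flagged would actually be terrorists under an assumed base rate.
population = 55_000_000

fp_rate = 0.0018                        # the reported 0.18% false positive rate
false_positives = population * fp_rate  # = 99,000, as the quote notes

fp_rate_best = 0.00008                  # the NSA's most optimistic 0.008% rate
false_positives_best = population * fp_rate_best  # = 4,400 people

# Illustrative Bayes calculation. Prevalence is NOT from the article:
# assume 1 in 20,000 people is an actual terrorist, and read the quote's
# "50 percent allowed to survive" as the classifier catching half of them.
prevalence = 1 / 20_000
recall = 0.5

true_pos = prevalence * recall
false_pos = (1 - prevalence) * fp_rate
precision = true_pos / (true_pos + false_pos)  # fraction of flags that are real
# Under these assumptions, only roughly 1-2% of flagged people would be
# actual terrorists; the overwhelming majority would be innocent.
```

Changing the assumed prevalence moves the exact number, but any realistically small base rate yields the same conclusion: almost everyone the classifier flags is innocent.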

Fractional percentage failure rates may sound good–after all, they are good enough for filtering email spam, even though they sometimes fail–but when it comes to bombing people, “good enough” isn’t and should not be good enough. But then, this system, since it operates in secret, is not subject to outside review. It is accountable only to the executive (that is, the President) and has no Congressional or judicial oversight. Given its level of secrecy–we wouldn’t even know about it if not for the Snowden disclosures–it’s unlikely the particulars of how it works are discussed much with Presidential staff.

Obviously, we don’t know if this has resulted in innocent people being killed. There is not enough information available to tell us that. But we do know that drone strikes are frequently used to kill people who have not been identified and have not been suspected of any wrongdoing. It’s been known for years that it is standard US policy that, once a drone strike has occurred, several minutes are allowed to elapse–time enough for emergency responders to arrive on scene–and then the same location is attacked again, killing anyone who has come to the rescue. It’s an abhorrent practice, and one that guarantees innocent civilian deaths. So, if this program does result in dead innocents, it would be a mistake to assume that eliminating it would make a huge difference.

But what should be understood is that machine learning is not infallible. It’s only as good as its implementation and training phase, and the latter is only as good as the seed data allows. Even a system that has been extremely well trained will produce false results occasionally. A problem like spam email is much less complex than trying to fit a set of criteria for what makes someone a terrorist, and yet spam filtering remains imperfect.

I suspect at least part of the problem is coders and analysts having too much confidence in the systems they’ve created–or at least overstating their confidence to non-technical officials who are demanding results. No one wants to hear, “this system gets it right most of the time.” There’s too much liability inherent in such an admission. So you game the numbers to pretend that the likelihood of a mistake is so small as to be virtually impossible. But even fractional percentages of a population of many millions still represent hundreds or thousands of people–people who deserve not to be killed on the mere suspicion, assigned by a flawed computer program, that they are violent terrorists.

I am a big fan of technology. Technological advances over the past several decades have changed and improved our lives dramatically. There are positive and negative aspects, but on the whole I think our digital lives have been beneficial. What will not be beneficial is a future in which our every action is monitored by government forces intent on finding–and punishing–suspected wrongdoing. What is applied to Pakistanis today, to find terrorists, may be applied to Americans tomorrow, to find everything from speeders to drug users to sex workers to gangsters. These tools are not going to be used to help people, but to harm and even kill them.

Today’s technological tools allow an unprecedented level of social and political engagement, but they can be used just as easily as tools of oppression and persecution.