There have been a bunch of articles published recently about the terrible accuracy of facial recognition technology used by the Welsh police. “Terrible accuracy” sounds pretty bad, but to understand if it is, you have to understand accuracy in biometrics. Unfortunately, this is not a simple topic. At the end of the day, use of these systems is a policy question that is far more complex than one of technical accuracy and should be considered in a different paradigm.
Let’s first look at what was reported:
During the UEFA Champions League Final week in Wales last June, when the facial recognition cameras were used for the first time, there were 2,470 alerts of possible matches from the automated system. Of these 2,297 turned out to be false positives and 173 were correctly identified – 92 per cent of matches were incorrect.
So what is happening here? The system looked at the faces of people attending the game and compared them to a database (watch list) of faces provided by UEFA, Interpol and other partner agencies. The computer provided 2,470 possible matches and only 173 were true matches.
Is that good or bad? Well, it’s actually hard to say based on what we know.
Let’s Talk About Accuracy Rates
The false positive rate is FPR = FP / N = FP / (FP + TN), where FP is the number of false positives, TN is the number of true negatives and N = FP + TN is the total number of negatives. That’s super helpful, right?
Let’s say there is a stadium full of innocent people and one violent criminal, and we want to identify the violent criminal. There are two possible kinds of errors we could make.
False Non Match/False Negative
This is the kind of error everyone is worried about. It is when the system looks at a photo of the violent criminal and then looks right at the violent criminal, and fails to identify them. It’s an incorrect, or false, non-match; it should have matched. That’s really bad.
False Match/False Positive
This is when someone gets stopped because they *might* be the criminal we are looking for, and we stop them to take a closer look, but it turns out they are someone else. This is really annoying, but not quite as bad.
In one case, we let a violent criminal go because of the error. In the second case, we are annoying innocent people in our search. They are both bad, but for law enforcement applications, missing the violent criminal is considered to be worse. And these two errors trade off against each other.
That means that as the odds of one kind of error go down, the odds of the other kind go up. You can only pick one to focus on, and when you push that kind of error close to zero, the other kind will always go up. It’s just how it works out. Blame math. So the odds of a false non-match are set as close to zero as the math will allow, and we accept that there will be some false matches.
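To make the trade-off concrete, here is a minimal sketch. It assumes the matcher produces a similarity score and that an alert fires when the score is at or above a threshold. The two bell-curve score distributions and their numbers are made up for illustration; they are not taken from the Welsh system or any real product.

```python
from statistics import NormalDist

# Hypothetical score distributions (assumptions, not measured data):
# "genuine" = scores when the face really is the person on the watch list,
# "impostor" = scores when it is somebody else.
genuine = NormalDist(mu=0.70, sigma=0.10)
impostor = NormalDist(mu=0.40, sigma=0.10)


def error_rates(threshold):
    """Return (false_non_match_rate, false_match_rate) at a given threshold."""
    false_non_match = genuine.cdf(threshold)       # real matches scoring too low
    false_match = 1.0 - impostor.cdf(threshold)    # strangers scoring high enough
    return false_non_match, false_match


for t in (0.40, 0.50, 0.60, 0.70):
    fnm, fm = error_rates(t)
    print(f"threshold={t:.2f}  false non-match={fnm:.4f}  false match={fm:.4f}")
```

Lowering the threshold pushes the false non-match rate toward zero and the false match rate up; raising it does the opposite.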
Conditional Probability and Rare Events
The other thing that is important is the number of total people at the match. Why? Biometric matching with really large data sets is much harder than matching with smaller data sets. You need to have a much more accurate system. Why?
Probability isn’t my favorite thing, conditional probability even less so. When you are talking about rare events, the math just gets weird. (I’m sorry.) Let’s look at a medical example: suppose you get tested for a rare disease (only 1 in 10,000 people have this disease) and the test is correct 99% of the time.
If your test results come back positive, what are your chances of actually having the disease? Surprisingly, there is a less than 1 percent chance that you have the disease.
A test that is 99% accurate is expected to be wrong 1% of the time. So if 100 people take the test, 99 results will be accurate and 1 could be wrong. So if 10,000 people take the test, we would expect about 100 results to be wrong, and nearly all of those will be false positives. But only 1 out of 10,000 people actually get this disease – it’s really rare – so the error rate overwhelms the number of actual cases. Out of roughly 101 positive results, only about 1 belongs to someone who is actually sick.
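The same arithmetic can be written out with Bayes’ rule. This is a small sketch that assumes “correct 99% of the time” means the test catches 99% of sick people and clears 99% of healthy people; the function and parameter names are mine, not from any library.

```python
def chance_of_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: P(disease | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


p = chance_of_disease_given_positive(
    prevalence=1 / 10_000,  # 1 in 10,000 people have the disease
    sensitivity=0.99,       # assumed: 99% of sick people test positive
    specificity=0.99,       # assumed: 99% of healthy people test negative
)
print(f"Chance of disease after a positive test: {p:.2%}")
```

Under those assumptions the result comes out just under 1 percent.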
Biometric Matching in Large Groups
So let’s look at biometric matching with large groups. Say the system has a near 0% chance of a false non-match and a 1% chance of a false match.
So if 100 people go to the game, we would be (nearly) 100% confident that we would identify a violent criminal, should one attend the game, but we would probably stop one person who ended up being a false match. This person would be interviewed as a suspect and let go. That is annoying for that one person, but maybe it’s acceptable if it lets us identify and stop a violent criminal.
[Image: XKCD, “Purity”]
Now, if 10,000 people go to the game, we are still nearly 100% confident that we would find the violent criminal, if one is at the game, but now we have interviewed 100 false matches and let them go. This is starting to be a problem; we are annoying 100 innocent people and we are having to spend a lot of time interviewing potential suspects.
You can see how this turns into a problem at a game that 170,000 people attend, which is the attendance that was reported. With a 99% accurate system, you are going to annoy about 1,700 people.
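Here is the crowd-size arithmetic as a quick sketch, assuming the 1% false match rate from the example applies independently to everyone in the crowd. The function name is just for illustration.

```python
def expected_false_matches(crowd_size, false_match_rate):
    """Expected number of innocent people flagged in a crowd."""
    return crowd_size * false_match_rate


for crowd in (100, 10_000, 170_000):
    flagged = round(expected_false_matches(crowd, 0.01))
    print(f"{crowd:>7,} people -> about {flagged:,} false matches")
```

The false match rate never changes; only the crowd does, and the number of innocent people stopped grows right along with it.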
What does all this mean? Well, from a technical perspective, the system really isn’t as bad as is being reported. With 500,000 images on the watch list and 170,000 people attending the game, they correctly identified 173 people on the watch list and had 2,297 false matches. So the false positive rate is 2297/170000 or 1.35%. That’s not nearly as bad as is being reported, although it’s not great, especially for the 2,297 innocent people who were stopped.
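The headline number and the false positive rate are two different ratios built from the same counts. The sketch below uses only the figures quoted above. The variable names are mine, and the “strict” version simply removes the 173 true matches from the denominator, which barely changes the answer.

```python
alerts = 2_470
false_matches = 2_297
true_matches = 173
attendees = 170_000

# Share of alerts that were wrong: the number the articles lead with.
share_of_alerts_wrong = false_matches / alerts

# False positive rate: wrong alerts as a share of the people scanned.
false_positive_rate = false_matches / attendees

# Stricter version: divide only by the people who were not on the watch list.
strict_false_positive_rate = false_matches / (attendees - true_matches)

print(f"Share of alerts that were wrong: {share_of_alerts_wrong:.1%}")
print(f"False positive rate:             {false_positive_rate:.2%}")
print(f"Strict false positive rate:      {strict_false_positive_rate:.2%}")
```

The first ratio is what makes the system sound terrible; the second is the one that describes how often it wrongly flags an innocent person.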
It’s easy to see that even if you deploy a very accurate system, if you are doing mass surveillance on large numbers of people, you are going to have a lot of false positives and you are going to be stopping and bothering a lot of innocent people.
Is it worth it?
If you had to bother 2,300 innocent people to find Hitler in a crowd at a football game, some people would think it was worth it. But what if you are annoying 2,300 innocent people to identify one person who didn’t pay a parking ticket? Still worth it? I don’t think so – but it’s a policy question, not a technology question.