Ethical AI: Demographic Bias in Biometrics
There is a tremendous amount of misleading and inaccurate reporting on the topic of demographic bias in biometric identification systems, especially regarding facial recognition technology. Part of the problem is that there isn't one thing that is "facial recognition technology". At the core of any system is a matching algorithm. The definitive resource on this topic is the NIST Face Recognition Vendor Test (FRVT) Part 3 Demographic Effects report. Warning: this 82-page report is not an easy read and you really should read parts 1 and 2 first to get the context. The latest NIST test report includes over 200 different algorithms.
Not surprisingly, there is a massive variation in performance between the best and worst performing algorithms. Some of the algorithms tested by NIST had extremely high error rates. But that doesn't really matter because those algorithms would never be used in a commercial system. The best performing algorithms, of which there are many, have essentially “undetectable” false positive demographic differences. So, what does that mean?
Some background: When talking about accuracy in any classification system, there are two kinds of errors a system can make, and the terminology can be confusing. A database of faces is called a gallery. The face image compared against the gallery is called a probe. A false positive is when the system incorrectly matches the probe to an image in the gallery, declaring two faces from different people to be the same person; the rate at which this happens is called the false acceptance rate (FAR). A false negative is when the system fails to match two faces that really are of the same person; the rate at which this happens is called the false rejection rate (FRR).
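To make those definitions concrete, here is a tiny Python sketch (the comparison results are made up, not real data) that computes FAR and FRR from a handful of probe-versus-gallery trials:

```python
# Hypothetical probe-vs-gallery comparison results.
# "same_person" is ground truth; "matched" is the system's decision.
trials = [
    (True,  True),   # same person, system matched them: correct
    (True,  False),  # same person, system did not match them: false negative
    (False, False),  # different people, system did not match them: correct
    (False, True),   # different people, system matched them: false positive
    (False, False),
    (True,  True),
]

genuine  = [t for t in trials if t[0]]       # same-person pairs
impostor = [t for t in trials if not t[0]]   # different-person pairs

# FAR: fraction of different-person pairs the system wrongly matched.
far = sum(1 for _, matched in impostor if matched) / len(impostor)

# FRR: fraction of same-person pairs the system wrongly failed to match.
frr = sum(1 for _, matched in genuine if not matched) / len(genuine)

print(f"FAR = {far:.1%}, FRR = {frr:.1%}")
```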
It's important to know that these two kinds of errors trade off against each other: you can tune an algorithm to reduce one, but the other kind of error will increase. The relationship between the two can be seen in the oddly named Receiver Operating Characteristic (ROC) curve, which is created by plotting the true acceptance rate against the false acceptance rate at various threshold settings. You can also rework the chart to plot the false acceptance rate against the false rejection rate directly. Generally, as the FAR goes up, the FRR goes down. But you can also see that the differences involved are very, very small.
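If it helps to see that trade-off numerically, here is a small sketch that sweeps a decision threshold over synthetic similarity scores (random numbers, not output from any real matcher) and prints the resulting FAR and FRR:

```python
import random

# Synthetic similarity scores: genuine (same-person) pairs tend to score
# higher than impostor (different-person) pairs, but the distributions overlap.
random.seed(0)
genuine_scores  = [random.gauss(0.80, 0.08) for _ in range(10_000)]
impostor_scores = [random.gauss(0.40, 0.10) for _ in range(10_000)]

for threshold in (0.4, 0.5, 0.6, 0.7, 0.8):
    # FAR: impostor pairs scoring at or above the threshold (wrongly accepted).
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    # FRR: genuine pairs scoring below the threshold (wrongly rejected).
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    print(f"threshold={threshold:.1f}  FAR={far:.4%}  FRR={frr:.4%}")

# Raising the threshold pushes FAR toward zero while FRR climbs;
# lowering it does the opposite. That is the trade-off the ROC curve shows.
```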
Biometric matching is probabilistic. The algorithm doesn't tell you whether two faces are the same; it calculates the probability that they are. That means you can never be 100% certain that a match is correct, but you can be 99.999% certain, which is effectively the same thing. The same is true for human vision and human face recognition.
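As a rough illustration of what "probabilistic" means here, the sketch below compares two hypothetical face embeddings (the numbers are invented) and only declares a match when the similarity score clears a chosen threshold:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two fixed-length embeddings (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings standing in for the output of a face-recognition model.
probe_embedding   = [0.12, 0.98, 0.40, 0.33]
gallery_embedding = [0.10, 0.95, 0.42, 0.30]

score = cosine_similarity(probe_embedding, gallery_embedding)

# The threshold encodes how certain the deployment wants to be before
# calling two faces the same person; there is no 100% setting.
THRESHOLD = 0.995

print(f"similarity = {score:.4f}, match = {score >= THRESHOLD}")
```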
Nothing related to biometrics or identity will ever be 100%. The math just doesn't work that way. But the error rates of the best commercial algorithms are approaching zero, and they far outperform people doing the same task. Remember, the best performing algorithms have essentially “undetectable” false positive demographic differences. One of the best algorithms had a false positive rate of 0.003% for black females and 0.001% for white males. The best algorithms also had non-significant false negative differences: 0.49% for black females and 0.85% for white males.
And that brings us back to misleading reporting about demographic bias in face recognition technology. There is only a 0.002 percentage point difference between the false positive rate of the best facial recognition algorithms on black females (0.003%) and white males (0.001%). It can never be 0% because matching is probabilistic, but that's pretty close. You can also look at the same data and say that the false positive rate for black women is three times that of white males. Both statements are true, but three times a tiny number is still a tiny number.
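Here is that same arithmetic spelled out, using the rates quoted above; the per-million count is only for illustration:

```python
# False positive rates reported for one of the best performing algorithms.
fpr_black_female = 0.00003   # 0.003% expressed as a fraction
fpr_white_male   = 0.00001   # 0.001% expressed as a fraction

absolute_difference = fpr_black_female - fpr_white_male   # 0.002 percentage points
relative_ratio      = fpr_black_female / fpr_white_male   # 3x

comparisons = 1_000_000
print(f"absolute difference: {absolute_difference:.3%} "
      f"(~{absolute_difference * comparisons:.0f} extra false positives "
      f"per {comparisons:,} impostor comparisons)")
print(f"relative ratio: {relative_ratio:.0f}x")
```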
At Blink Identity, we are algorithm agnostic but all the algorithms we use fall into the "best performing" category on the NIST tests. But even the NIST test doesn't really tell the whole story. NIST compares algorithms in a controlled environment using a set of "matched pair" faces as the data set. In practice, environmental considerations such as pose angle, lighting and changing expression have a significant impact on accuracy as well. Looking specifically at our own data at Blink Identity, we see no indication of demographic bias. We have essentially (but not literally) a 0% false positive error rate and our false negative rate is influenced by environmental issues far more than the algorithm itself. In other words, if our system fails to match you, it’s most likely because you were looking down at your phone and we couldn’t get an image of your face. No system can match a face if it can’t acquire an image of the face.
There are good reasons to restrict use of biometrics or any AI that could lead to mass surveillance, but it’s because the algorithms work well, not because they don’t. Algorithms should be used to support, not replace, human decision making.