Demographic Bias in Biometric Facial Recognition
Accuracy and demographic bias are complex and difficult topics to discuss. You can read a more detailed discussion in “Face Recognition & Accuracy - It’s Complicated,” but in general there are two types of errors. A false match occurs when the system matches two faces that belong to different people. A false non-match occurs when the system fails to match two images of the same person. A system has demographic bias if its error rates differ across demographic groups.
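To make the two error types and the notion of demographic bias concrete, here is a minimal sketch in Python. The threshold, scores, and group labels are illustrative assumptions, not our production values or data.

```python
# Minimal sketch (not Blink Identity's actual evaluation code): given similarity
# scores for face-pair comparisons, compute the two error rates per demographic
# group. Threshold, scores, and group names are illustrative assumptions.

THRESHOLD = 0.80  # hypothetical similarity threshold

# Each record: (similarity_score, same_person, demographic_group)
comparisons = [
    (0.95, True,  "group_a"),
    (0.40, False, "group_a"),
    (0.88, True,  "group_b"),
    (0.83, False, "group_b"),  # a non-mated pair scoring above threshold -> false match
]

def error_rates(records, threshold):
    """Return (false_match_rate, false_non_match_rate) for a set of comparisons."""
    mated = [s for s, same, _ in records if same]
    non_mated = [s for s, same, _ in records if not same]
    fnmr = sum(s < threshold for s in mated) / len(mated)          # same person, no match
    fmr = sum(s >= threshold for s in non_mated) / len(non_mated)  # different people, match
    return fmr, fnmr

# "Demographic bias" in the sense used above: error rates differ between groups.
for group in sorted({g for _, _, g in comparisons}):
    subset = [r for r in comparisons if r[2] == group]
    fmr, fnmr = error_rates(subset, THRESHOLD)
    print(f"{group}: FMR={fmr:.2%}  FNMR={fnmr:.2%}")
```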
When thinking about accuracy and error rates, it is important to consider what the expected outcome should be. If you divide any data set into groups based on some characteristic, whether skin tone, gender, height, or weight, do you expect to see the same performance across the different categories? Intuitively, that feels like it should be the goal, and some in the biometric community feel that systems should perform identically, in terms of raw similarity scores, across every category. But is that the right expectation? It is important to consider the use case and context. In a security or law enforcement application, an error can result in the loss of liberty for an innocent person, and government and law enforcement databases tend to be very large, which increases the number of errors. In an opt-in application like ticketing, an error means a person has to use a secondary ticketing method, an inconvenience. Biometric matching is probabilistic, and the error rates will never be 0%.
Factors that Influence Accuracy
Algorithm
The National Institute of Standards and Technology (NIST) performs extensive, comprehensive analysis of biometric algorithms for accuracy and demographic bias, including the primary algorithms used by Blink Identity. The full NIST report on demographic effects runs over 1,200 pages with 17 appendices and can be read here: https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf.
NIST found that the most accurate algorithms—and the algorithm that Blink Identity uses is in this category—did not display a significant demographic bias. The seventeen highest performing verification algorithms had similar levels of accuracy for black females and white males: false-negative rates of 0.49 percent or less for black females (equivalent to an error rate of less than 1 in 200) and 0.85 percent or less for white males (equivalent to an error rate of less than 1.7 in 200).
While the algorithm used by Blink Identity does technically exhibit a slight bias in the NIST results, it is not significant in practice. NIST uses a 12,000,000-person gallery to highlight the differences between algorithms and measures bias as differences within the candidate list at positions 2 through 100 and beyond. There is no bias at the rank-one candidate, which is the only result we use.
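To illustrate the distinction between a full candidate list and a rank-one result, the sketch below searches a hypothetical gallery, keeps only the single highest-scoring candidate, and accepts it only if it clears the threshold. The threshold and example scores are illustrative assumptions, not our production configuration.

```python
# Illustrative sketch of rank-one identification (not our production code).
# A search returns a candidate list sorted by similarity; NIST's bias measurements
# look well past rank one (positions 2-100+), while an operational system like
# ours only acts on the single top candidate, and only above the threshold.

THRESHOLD = 0.80  # hypothetical

def identify_rank_one(score_by_identity, threshold=THRESHOLD):
    """score_by_identity: dict mapping gallery identity -> similarity score for one probe."""
    # Full candidate list, best first (what NIST inspects well past rank one).
    candidates = sorted(score_by_identity.items(), key=lambda kv: kv[1], reverse=True)
    top_identity, top_score = candidates[0]
    # Operationally, only the rank-one candidate matters, and only above threshold.
    return top_identity if top_score >= threshold else None

scores = {"alice": 0.91, "bob": 0.55, "carol": 0.48}
print(identify_rank_one(scores))  # -> "alice"
```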
Size of the Gallery (Database)
For government or law enforcement biometric systems, the databases tend to be very large. NIST tests with multiple databases, each containing around twelve million records. The number of errors increases with the size of the database. An algorithm that is 99% accurate (far less accurate than modern commercial algorithms) is expected to give an incorrect result 1% of the time. With a 12-million-record database, that is an expected 120,000 incorrect results; with a 30,000-record database, an expected 300.
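The scaling argument above can be stated as a one-line calculation. This is a simplified sketch that treats the error rate as a fixed per-record probability, as in the example figures above.

```python
# Simplified sketch of the scaling argument above: the expected number of
# incorrect results grows linearly with gallery size for a fixed per-record
# error rate.

def expected_errors(error_rate, gallery_size):
    return error_rate * gallery_size

print(expected_errors(0.01, 12_000_000))  # 120000.0 expected errors at 99% accuracy
print(expected_errors(0.01, 30_000))      # 300.0 expected errors
```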
Lighting
Environmental conditions are a big factor for face recognition systems and something NIST does not test because of the variability of deployments. A sub-optimal lighting situation can quickly turn a well-performing system into a poorly performing one. At Blink Identity, we control the lighting completely: we provide our own non-visible light source, which makes our system generally immune to ambient light issues aside from direct sun and allows us to work in complete darkness. In addition, our light source illuminates just below the surface of the skin, making skin color irrelevant as a factor for matching; all skin tones look the same to our matcher. This makes us largely immune to any bias tied to skin color.
Internal Testing
We perform two types of ongoing analysis around accuracy and demographic bias: live and offline. Our live analysis uses real transit data from our active deployments and demos. On every transit, our system looks for two possible red-flag conditions that would highlight a change in accuracy.
First, we flag any situation where more than one person matches above our threshold, a clear indicator that our thresholds may not be set correctly. In practice, we only see this with twins, which is expected: although we can often distinguish twins from each other, some identical twins will both match above our threshold. Second, we look for any lower-scoring candidates that come near our threshold, an indication that we may be approaching a situation where our thresholds are no longer adequate. In practice, we do not see this occur either; the highest non-matching candidate score is significantly lower than our threshold. This live analysis is done automatically by our back-end systems, and alerts are sent out if either condition ever occurs so we can quickly understand the implications of a “near false match.”
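Here is a simplified sketch of the two red-flag checks described above. The threshold, the margin used to define “near the threshold,” and the function names are illustrative assumptions, not our production values.

```python
# Illustrative sketch of the two live red-flag checks described above
# (not our production back-end code). 'scores' is the list of candidate
# similarity scores returned for a single transit.

THRESHOLD = 0.80    # hypothetical match threshold
NEAR_MARGIN = 0.05  # hypothetical "near the threshold" margin

def red_flags(scores, threshold=THRESHOLD, margin=NEAR_MARGIN):
    above = [s for s in scores if s >= threshold]
    flags = []
    # Check 1: more than one candidate above threshold (e.g. identical twins,
    # or a sign the threshold may be set too low).
    if len(above) > 1:
        flags.append("multiple_matches_above_threshold")
    # Check 2: the best runner-up candidate scores near the threshold.
    runners_up = sorted(scores, reverse=True)[1:]
    if runners_up and runners_up[0] >= threshold - margin:
        flags.append("runner_up_near_threshold")
    return flags

print(red_flags([0.93, 0.41, 0.37]))  # [] - healthy transit
print(red_flags([0.93, 0.78, 0.37]))  # ['runner_up_near_threshold']
print(red_flags([0.93, 0.86]))        # both flags raised
```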
We also perform offline analysis with facial databases representing a larger gallery size. We use a combination of academic and synthetic face databases to test background galleries of 100,000 faces with sample probes from our system. Synthetic faces allow us to generate different distributions of demographics, genders, and ages without violating any real person's privacy.
Since none of the probes should match anyone in these galleries, we have ground truth we can use to verify the results. We confirm that our current system configuration produces no false matches at our threshold in test or synthetic galleries of 100,000 to 150,000 people. We generally find that the background database would need to be 50 to 100 times larger before we would see false matches. This means an actual false match will be exceedingly rare, and we have not seen one with our system in practice.
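A simplified sketch of this offline check is shown below. Because every probe is known not to be in the background gallery, any candidate scoring at or above the threshold is, by construction, a false match. The similarity function here is a toy stand-in and the threshold is an illustrative assumption.

```python
# Illustrative sketch of the offline false-match test described above
# (not our production evaluation harness). Every probe is known NOT to be
# in the background gallery, so any score at or above the threshold is a
# false match by construction.

import random

THRESHOLD = 0.80  # hypothetical

def similarity(probe, gallery_face):
    # Toy stand-in for a real face matcher: returns a similarity in [0, 1).
    random.seed(hash((probe, gallery_face)) & 0xFFFFFFFF)
    return random.random() * 0.7  # toy scores, well below the threshold

def count_false_matches(probes, gallery, threshold=THRESHOLD):
    false_matches = 0
    for probe in probes:
        best = max(similarity(probe, g) for g in gallery)
        if best >= threshold:
            false_matches += 1  # ground truth says no match should exist
    return false_matches

probes = [f"probe_{i}" for i in range(100)]
gallery = [f"synthetic_{i}" for i in range(1_000)]  # scaled down from 100,000+
print(count_false_matches(probes, gallery))  # expect 0 at this threshold
```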
As described, we primarily focus on making false matches extremely rare. In theory this increases the chance of a false non-match, but in practice this rarely shows up operationally. We do occasionally see a true false non-match, but these have always been traced to poor-quality enrollments: someone taking a picture in poor lighting, taking a picture of a picture, or taking a picture with a cracked camera lens. In every case where this has occurred, having the person re-enroll (or having staff assist them) has resulted in perfect match performance afterwards. Our recent work on automated quality measurements (with some human review) has eliminated most of these cases at the expense of some additional pre-event support. The most likely failure situation for our system is not a matching error at all; it is a failure to acquire. This is typically caused by a poorly defined capture volume, or by a new user who does not realize they can keep walking and stops before entering the true capture area. This is usually remediated with proper signage, “red carpets,” and similar cues, and once a user has transited successfully once, they do not have issues after that.
Summary
Demographic bias is a complex topic with many facets that are more or less relevant to different biometric applications. Although bias has sometimes been an issue with large-scale law enforcement biometric systems (10,000,000+ records), those results do not translate to smaller-scale, purely opt-in, voluntary systems that can optimize enrollment and transit images to near-ideal conditions. The NIST analysis of our algorithm, corrected for scale, shows no demographic bias, and Blink Identity’s internal analysis shows no bias as well.
Sources:
https://www.blinkidentity.com/forum/face-recognition-accuracy-its-complicated