Identity Technology – It’s Really Just Math
The field of biometrics is the measurement and analysis of unique physical characteristics, such as fingerprint or voice patterns, especially as a means of verifying or identifying a person. In facial recognition, a feature set called a template is extracted from the image of the face during enrollment and stored in a database in place of the actual image. Templates are very small compared to images, so comparing templates is faster and less computationally intensive than comparing images. A template also does not store the photograph itself, and the original image cannot be directly recovered from it, so working with templates is more secure than working with photographs. Here we’ll discuss the basics of what templates are and the advantages of using them.
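To make the size difference concrete, here is a rough back-of-the-envelope calculation. The numbers are assumptions chosen for illustration, not figures from any particular product: an uncompressed 640 × 480 RGB image with one byte per color value, and a template of 512 32-bit floating-point numbers.

```python
# Assumed sizes for illustration only; real products differ.
IMAGE_WIDTH, IMAGE_HEIGHT, CHANNELS = 640, 480, 3  # uncompressed 8-bit RGB image
TEMPLATE_DIM = 512                                  # assumed template length
BYTES_PER_FLOAT32 = 4


def image_bytes(width: int, height: int, channels: int) -> int:
    """Bytes for an uncompressed image storing one byte per channel value."""
    return width * height * channels


def template_bytes(dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Bytes for a template stored as `dim` 32-bit floats."""
    return dim * bytes_per_value


img = image_bytes(IMAGE_WIDTH, IMAGE_HEIGHT, CHANNELS)
tpl = template_bytes(TEMPLATE_DIM)
print(img, tpl, img // tpl)  # 921600 2048 450
```

Real systems usually store compressed images and may use other template lengths, so the exact ratio varies, but under these assumptions the template is 450 times smaller than the raw image.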
What is a Template?
A template is a numerical representation of a face, typically a fixed-length vector of numbers (often called a feature vector or embedding) produced by a neural network. Templates differ from person to person, while templates made from different photos of the same person are close to one another. During identification, the neural network converts the probe (the face image being searched for) into a template and performs a 1:N comparison, meaning the probe template is compared with every template in the database. A similarity score is computed for each comparison, and the highest-scoring enrollment is returned as the probable identity if its score meets the threshold.
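The 1:N search described above can be sketched in a few lines of Python. The article does not say which similarity measure is used; this sketch assumes cosine similarity, a common choice for comparing templates. The function names, the toy three-number templates, and the threshold value are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Template = List[float]


def cosine_similarity(a: Template, b: Template) -> float:
    """Cosine of the angle between two templates; 1.0 means they point the same way."""
    if len(a) != len(b):
        raise ValueError("templates must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("templates must be non-zero vectors")
    return dot / (norm_a * norm_b)


def identify(
    probe: Template, gallery: Dict[str, Template], threshold: float
) -> Optional[Tuple[str, float]]:
    """1:N search: score the probe against every enrolled template.

    Returns (identity, score) for the best match if its score meets the
    threshold, otherwise None (no confident match).
    """
    best_id: Optional[str] = None
    best_score = -math.inf
    for person_id, enrolled in gallery.items():
        score = cosine_similarity(probe, enrolled)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None


# Toy three-number templates; real templates hold hundreds of values.
gallery = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
probe = [0.85, 0.15, 0.25]
print(identify(probe, gallery, threshold=0.9))
```

Returning `None` when even the best score falls below the threshold reflects the fact that a 1:N search should be able to report "not enrolled" rather than always naming someone.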
As an example, we’ll look at a CNN (convolutional neural network). These networks are loosely modeled on the human brain: each “neuron” receives inputs and, based on its weights and bias, produces an output. Behind the scenes, each neuron computes the scalar (dot) product of its inputs and its weights, adds a bias, and passes the result through an activation function. Here is a visual example that demonstrates the process. The scalar product is computed by taking two vectors of equal dimensionality, multiplying their corresponding components, and summing those products over all n dimensions of the vector space. Neurons are connected to one another in the way their layer defines, where a layer is a collection of neurons that work together at a particular depth in the network. Layer by layer, the process goes like this: first, the network takes in an image of fixed width and height and splits it into its three RGB color channels.
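The scalar product, the single neuron built on it, and the RGB channel split can each be shown in a few lines. This is a plain-Python sketch for illustration; the function names are hypothetical, and real networks use optimized array libraries rather than Python lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
Image = List[List[Pixel]]     # rows of pixels
Plane = List[List[int]]       # a single color channel


def dot(a: List[float], b: List[float]) -> float:
    """Scalar (dot) product: multiply matching components, then sum over all n dimensions."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal dimensionality")
    return sum(x * y for x, y in zip(a, b))


def neuron(inputs: List[float], weights: List[float], bias: float) -> float:
    """Pre-activation output of one neuron: weighted sum of its inputs plus a bias.

    An activation function such as ReLU would be applied to this value next.
    """
    return dot(inputs, weights) + bias


def split_channels(image: Image) -> Tuple[Plane, Plane, Plane]:
    """Split an H x W image of RGB pixels into three H x W single-channel planes."""
    red = [[px[0] for px in row] for row in image]
    green = [[px[1] for px in row] for row in image]
    blue = [[px[2] for px in row] for row in image]
    return red, green, blue


print(dot([1, 2, 3], [4, 5, 6]))         # 1*4 + 2*5 + 3*6 = 32
print(neuron([1, 2, 3], [4, 5, 6], -2))  # 32 + (-2) = 30
r, g, b = split_channels([[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 20, 30)]])
print(r, g, b)
```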
Next, the network convolves the previous layer’s output to form the convolution layer. Convolution is a linear operation that multiplies a small set of weights, called a kernel or filter, with the input: the kernel slides across the image, and at each position the network computes the dot product between the kernel and the patch of input beneath it. Effectively, this performs a matrix multiplication in localized areas of the representation. The output is then passed to a rectified linear activation layer (ReLU). ReLU is a fancy term for a function that returns its input if the input is greater than zero and returns zero otherwise, i.e. f(x) = max(0, x). This layer, trivial as it sounds, is very useful for speeding up the process. Traditional activation functions such as the sigmoid and hyperbolic tangent only approach their limits asymptotically and require exponentials to compute, whereas ReLU is a simple piecewise-linear computation. In the end this equates to an increase in performance. The output is then passed to a pooling layer, which reduces the size of its input while retaining the most valuable information and making the representation more tolerant of small shifts of features within the image.
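The convolution, ReLU, and pooling steps above can be sketched on a tiny single-channel image. This sketch assumes a stride of 1, no padding, and 2 × 2 max pooling (one common pooling choice); the kernel is a hand-picked edge detector rather than learned weights, and all names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2-D convolution, stride 1, single channel.

    Like most CNN libraries, this does not flip the kernel (strictly speaking it
    is cross-correlation). Each output value is the dot product of the kernel
    with the image patch currently under it.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(x: Matrix) -> Matrix:
    """Element-wise max(0, v): positive values pass through, negatives become 0."""
    return [[max(0, v) for v in row] for row in x]


def max_pool_2x2(x: Matrix) -> Matrix:
    """Keep the largest value in each non-overlapping 2x2 block (drops an odd last row/column)."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


# 6x6 toy image: bright left half, dark right half (a vertical edge).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
edge_kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # responds to bright-to-dark edges
features = conv2d(image, edge_kernel)               # 4x4 feature map
pooled = max_pool_2x2(relu(features))               # 2x2 after pooling
print(features)  # [[0, 30, 30, 0], [0, 30, 30, 0], [0, 30, 30, 0], [0, 30, 30, 0]]
print(pooled)    # [[30, 30], [30, 30]]
```

Flipping the kernel's sign to `[-1, 0, 1]` makes every response zero or negative, and ReLU then zeroes the whole map, which is the "return the input only if it is positive" behavior described above.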
Finally, the previous layer’s output is passed to the fully connected layer, where the extracted features are combined, or “vote”, to produce the network’s final output. In a face recognition system, that output is the template, and it is this template that is compared with the database templates to determine whether there is a match above the set threshold. This is a simplified explanation; a complex image requires many repetitions of these layers, but the general idea still holds. The point at which the image effectively becomes template data is after the convolution layer. From then on the representation is abstract, in the sense that it no longer contains information a human can perceive as facial data. Assuming the image is deleted after this process, the resulting data is of no value to anything other than a neural network.
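The last step can be sketched as flattening the pooled feature maps, applying a fully connected layer (one dot product plus bias per output), and scaling the result to unit length to form the template. Normalizing to unit length is an assumption here, though it is a common design in face recognition networks because it lets a plain dot product serve as cosine similarity. The weights, biases, and names below are hypothetical stand-ins for values a trained network would learn.

```python
import math
from typing import List

Vector = List[float]


def flatten(feature_maps: List[List[List[float]]]) -> Vector:
    """Concatenate every value of every feature map into a single vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(x: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """Each output is the dot product of one weight row with x, plus that row's bias."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError("every weight row must match the input length")
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def l2_normalize(v: Vector) -> Vector:
    """Scale v to unit length so templates are compared by direction alone."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def is_match(t1: Vector, t2: Vector, threshold: float) -> bool:
    """For unit-length templates, the dot product equals cosine similarity."""
    return sum(a * b for a, b in zip(t1, t2)) >= threshold


# Toy example: two 2x2 pooled feature maps -> 8 inputs -> 3-number template.
pooled = [[[1, 0], [0, 1]], [[0, 2], [2, 0]]]
weights = [
    [1, 0, 0, 1, 0, 0, 0, 0],  # hypothetical learned weights
    [0, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
biases = [0.0, 0.0, -6.0]
template = l2_normalize(fully_connected(flatten(pooled), weights, biases))
print(template)
print(is_match(template, template, threshold=0.9))  # True: identical templates
```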