How Computers See the World
NOTE: This is a reprint of an article originally published by Alex Kilpatrick in July 2015 on a different blog. We are reposting it on the Blink Identity blog because these issues are important and we want to keep our writing on them in one place.
We've been working on some aspect of computer vision for the better part of ten years, and we have spent a lot of time trying to teach computers how to recognize people. When I examine the cases where the computer misses, I am often mystified as to why it didn't see the obvious. But then I have to remind myself that computers are truly, profoundly stupid.
I've been studying Artificial "Intelligence" for the past 20 years or so, and I am no more in fear of Skynet now than I was 20 years ago. We have made some advances in what I would call "stupid computer tricks" that make computers seem a little less dumb, but those things are still light-years (yes, I mean the distance) away from true intelligence. Don't get me wrong: computers are fantastic tools, and they do tedious, simplistic stuff really well. But intelligence and adaptability? Not so much.
This is especially apparent in computer vision. We don't even think of vision as something we actually do; most people just think of vision as something we are. The world just comes pouring in and we perceive it the way it is. However, spend a little time studying computer vision and you will quickly understand that perception doesn't happen in the eyes at all. It is ALL brain, and super-mind-blowingly complex brain stuff at that. Let's start with a simple example. Look at the image below. If you are under 55 years old, I expect you will pretty quickly see what this is.
Most people will get this. If you don't know what it is, it is a representation of this.
I find this image a truly incredible example of human perception. It is 15 pixels (15 and not 14, because the missing pixel is significant) and only 8 unique colors, yet it represents a very complex concept. To a human with a big ol' human brain, these images are seen in context. To a computer, this is 15 pixels and 8 colors. Nothing more. People tend to forget this when comparing computer and human vision. Humans see contextually, with their brains. Computers do too, but their context is limited to an algorithm of a few thousand lines of code, compared to the incredible complexity of the human brain.
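To make that concrete, here is a minimal sketch in Python (using the Pillow and NumPy libraries) of what a computer actually has to work with when it opens an image: a grid of numbers, nothing more. The file name is a hypothetical stand-in for the tiny image above.

```python
# A minimal sketch: "tiny_icon.png" is a hypothetical stand-in for the
# 15-pixel image above. To the computer, the file decodes to a small
# array of numbers and nothing else.
from PIL import Image
import numpy as np

img = Image.open("tiny_icon.png").convert("RGB")
pixels = np.array(img)

print(pixels.shape)  # e.g. (3, 5, 3): 3 rows x 5 columns x 3 color channels
print(len(np.unique(pixels.reshape(-1, 3), axis=0)))  # number of unique colors
print(pixels)  # the entire "image", as far as the computer is concerned
```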
Let's look at another example that is probably more recognizable. Do you know who this is?
[Scroll down so it's not too easy.]
You will almost certainly recognize it as a human face. You will probably recognize who it is (it is someone very famous). Let's zoom out a little.
Now you will almost certainly recognize the image. It is pixelated for sure, but still very recognizable. This is a 300x300 pixel image, compressed as a JPEG at 5% quality. It is only 3 KB, and yet almost anyone in the western world would recognize it. Humans recognize faces from birth and start mimicking expressions as early as two days old. The part of our brain that recognizes faces is different from the part that recognizes other objects. Faces are really special. However, our cool contextual processors work with things other than faces too.
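As an aside for the curious, here is roughly how a heavily compressed image like the one above can be produced. This is a sketch using the Pillow library; "face.png" is a placeholder name, not an actual file from this post.

```python
# A rough sketch of the compression described above, using Pillow.
# "face.png" is a placeholder name, not a file from this post.
from PIL import Image
import os

img = Image.open("face.png").convert("RGB").resize((300, 300))
img.save("face_q5.jpg", format="JPEG", quality=5)  # very aggressive compression

print(os.path.getsize("face_q5.jpg"), "bytes")  # on the order of a few KB
```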
You probably know what this is:
Again, this is heavily compressed, but it's still easily recognizable to most people. Our brains are just amazing at "filling in the gaps". With something like this, we don't see pixels at all. We see an abstract form that is pretty similar to a very familiar object and bang! we classify it. Some work in neural networks is starting to allow computers to do this kind of thing, but it is still a very pale imitation of human contextual perception. So we have established that humans are very good at pulling out objects from very abstract data. But what do the computers see when they look at these images?
Let's compare these heavily compressed images to the original uncompressed images just to establish a baseline:
I'm sure you will agree that the photo on the right is better than the photo on the left, but these are clearly the same image. However, remember that the computer only sees pixels. It doesn't have your fancy brain to fill in all the extra context. So let's look at the pixels by zooming in on Obama's eyes:
You can clearly see that the compressed image has SO much less information than the uncompressed image, even though they look fairly similar to you when zoomed out. You "see" all that missing information, but that is really your brain filling it in for you. The information just... isn't... there.
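You can even put a rough number on how much is missing. Here is a sketch, reusing the placeholder file names from the earlier snippet, that compares the original and the compressed version pixel by pixel:

```python
# A rough sketch of quantifying the loss: compare the original and the
# heavily compressed version pixel by pixel. File names are placeholders.
from PIL import Image
import numpy as np

original = np.array(
    Image.open("face.png").convert("RGB").resize((300, 300)), dtype=np.int16
)
compressed = np.array(Image.open("face_q5.jpg").convert("RGB"), dtype=np.int16)

diff = np.abs(original - compressed)
print("mean per-channel error:", round(float(diff.mean()), 1))
print("pixels altered:", round(float(np.any(diff > 0, axis=-1).mean() * 100), 1), "%")
```

At this level of compression you would typically find that nearly every pixel has been altered, even though the zoomed-out images look alike. Your brain just papers over the damage.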
Zooming in on the iPhone image shows much the same thing:
We run into this frequently because face matching is a big part of our method for electronic identity verification. A human could do the task easily, but when it needs to be done thousands of times a day, you really want an identity verification API. The key take-away is that when comparing computer vision to human vision, we really are comparing apples to roto-tillers. They are both called "vision", but since vision is all in the brain, we are really comparing computer brains to human brains.
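To be concrete about what such an API is doing (this is a generic sketch, not our actual pipeline): automated face matching commonly reduces each face to a vector of numbers, an "embedding", and declares a match when two vectors are close enough. The embedding values, dimensions, and threshold below are all illustrative assumptions.

```python
# A generic sketch of automated face matching, NOT our actual system.
# Real systems compute embeddings from images with a trained model;
# here the embeddings and the threshold are illustrative placeholders.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two embeddings point in the same direction (max 1.0)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
enrolled = rng.random(128)   # embedding stored at enrollment (placeholder)
candidate = rng.random(128)  # embedding from a new photo (placeholder)

THRESHOLD = 0.8  # illustrative; real thresholds are tuned per system
verdict = "match" if cosine_similarity(enrolled, candidate) >= THRESHOLD else "no match"
print(verdict)
```

I want to close with an example that might take you humans down a peg by removing those context cues you use so well.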
What do you think about this?
I'm assuming you can't make this out at all. You may be able to tell it is a quote because it has the general format of a quote. If so, stop that! That's using your brain. No cheating! But this is what the non-contextual world of muddy pixels looks like to a computer. Just as in the computer's everyday world, you have no context that lets you figure out what this is. But I will fix it for you, so you don't have to suffer.
Next time you want to curse your computer because it doesn't recognize your face, or because Siri thinks you want to "slide a French weasel into a robust enigma", have a little sympathy. They are idiot savants, doing what they can with the very limited brains we have been able to give them.