Human Versus Computer Vision
It wasn’t until I started learning about computer vision that I realized that human vision is just really amazing. Because we grow up doing it naturally, we don’t tend to give much thought to how we see the world. We don’t think of vision as something we do. We walk around and the world is just “out there”. What’s so interesting about that?
How Do Babies See?
Think about the first time a baby opens its eyes. The eyelids stay closed until around 26 weeks to allow the retina to develop, but after that the eyes start to blink in the womb. When mom goes out into bright sunlight, some light can filter through her body and the baby can start to practice seeing. But it’s not until birth that the eyes really start working, and vision continues to develop over the first six months.
Despite that, at birth a baby already prefers face shapes to non-face shapes. And by three months a baby can recognize the face of their primary caregiver as well as an adult can recognize faces. That’s amazing! This image is a representation of what we think babies see during those months. At birth, they haven’t developed color vision yet and they can only focus about twelve inches away. At three months, color vision and the ability to focus are more developed, but it’s not until six months that vision is stable.
We don’t see with our eyes - we see with our brains. Your eyes are just the sensors. Vision is a really complicated, resource-intensive task. About thirty percent of your brain is involved in processing vision, compared to about three percent for hearing. There is a part of your brain, called the fusiform gyrus, that works specifically to recognize faces.
Face recognition is critical to our survival - humans are social, and being able to recognize each other is an important skill. We are always looking around and scanning for faces - we are so good at seeing faces that we sometimes see them when they aren’t really there. This is called pareidolia. It used to be considered a sign of mental illness, but we now know it’s a pretty normal thing. It’s just your brain, which is always looking for patterns, finding a face pattern where there isn’t one.
Some people are ‘super recognizers’ of faces and some people are just the opposite. Prosopagnosia or ‘face blindness’ is a cognitive disorder where people cannot recognize familiar faces even though they can see other objects. In extreme cases, they cannot recognize even their own face.
We See in Context
When we see, our brain is using the context and experience of our entire lives to help, which is one of the reasons we can see so much better than computers. Think about when you are driving a car. You can only see the back of the car in front of you, but your brain “sees” the entire car and allocates space for it. Even though you can’t actually see the entire car, you behave as if you can. Most optical illusions play on the things your brain “can see” that aren’t really there.
I’ve talked about this example before but I think it’s worth repeating. Look at the image below. It represents something you are probably familiar with. If you don’t know what it is, take a moment and see if you can figure it out.
I’m going to leave some white space so I don’t give it away too quickly.
Even though the first image is just 15 pixels and 7 colors, most people will figure this out. That’s amazing! A computer cannot do that. If you take a photograph of someone you know and tear it in half, you will probably still be able to recognize the person, but a computer will struggle. On the other hand, we are only really good at recognizing the faces of people we are already familiar with. We aren’t nearly as good as we think we are at recognizing strangers. Computers are significantly better than humans in this respect. Imagine you are the person at the entrance to a bar who is checking IDs. You look at the driver’s license and then you look at the person - over and over again, and usually in a location with poor lighting. It’s a boring, repetitive task, and humans struggle to maintain the focus required. Computers compare millions of faces very quickly without ever getting bored or tired.
So How Do Computers See?
In the old days, we talked about eigenfaces. This was an approach that tried to see an image holistically instead of pixel by pixel. The basic idea was to express a particular face as a “sum” of notional faces developed through a machine learning process, so that a face ends up expressed essentially as a list of numbers. Faces were then compared by their similarity in that vector space, not by the visual similarity that humans use.
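To make that a little more concrete, here is a minimal sketch of the eigenfaces idea using PCA. The dataset (scikit-learn’s Olivetti faces), the number of components, and the specific image indices are my own illustrative choices, not part of any particular system.

```python
# Minimal eigenfaces sketch: PCA turns each face into a short vector of weights,
# and faces are compared by distance in that vector space.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()   # 400 grayscale faces, 64x64 pixels, flattened to 4096 values
X = faces.data

# Learn the "notional faces" (principal components). 50 is an arbitrary choice.
pca = PCA(n_components=50, whiten=True).fit(X)

# Express two images of the same person, plus one of a different person, as weight vectors.
# In the Olivetti set, images 0 and 1 are the same subject; image 100 is someone else.
same_a, same_b, other = pca.transform(X[[0, 1, 100]])

# "Similarity" is just distance between weight vectors, not anything visual.
print("same person     :", np.linalg.norm(same_a - same_b))
print("different person:", np.linalg.norm(same_a - other))
```

The point of the sketch is that the comparison at the end never looks at pixels at all - it only compares two lists of numbers.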
Modern face recognition systems use Neural Networks, which are part of Machine Learning and Artificial Intelligence and what is now being called “Deep Learning”, which is awesome because I needed more jargon in my life. The network learns to perform a task by analyzing training examples that have been hand-labeled in advance. To teach a computer to recognize a hot dog, for example, you would feed it thousands of labeled images of hot dogs and of other objects that are not hot dogs, and the computer will compare them and eventually learn how to identify a hot dog. This is called training the network. This is a simplistic explanation, but the key thing to know here is that the selection of training examples and how they are labeled is critical. One criticism of modern face recognition algorithms is that the face images used to train them were predominantly young white males, and so the algorithms identify young white males better than anyone else. We have never once had this problem with our system, but it is important for anyone training neural networks to be mindful of their training data.
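As a rough illustration of what “training the network” means, here is a toy sketch in PyTorch. The random tensors stand in for flattened, hand-labeled “hot dog” and “not hot dog” images, and the model size and training schedule are made-up illustrative values; a real face recognition network is far larger and trained on real labeled photos.

```python
# Toy training sketch: show the network labeled examples, measure how wrong it is,
# and nudge its weights to be a little less wrong. Repeat many times.
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins for hand-labeled training examples: 1 = hot dog, 0 = not a hot dog.
images = torch.randn(200, 64)
labels = torch.randint(0, 2, (200,)).float()

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):
    optimizer.zero_grad()
    predictions = model(images).squeeze(1)
    loss = loss_fn(predictions, labels)   # how wrong the network currently is
    loss.backward()                       # which direction to nudge each weight
    optimizer.step()                      # nudge the weights a small amount

print("final training loss:", loss.item())
```

Notice that the network only ever learns from the examples it is shown - which is exactly why the choice and labeling of training data matters so much.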
Neural networks are loosely modeled on the human brain, or at least how we think the brain might work. But fundamentally, our understanding of human brains is very shallow, as is our understanding of why neural networks work. We can measure how effective they are, but they can never explain why they make a specific decision.
We also can’t correct neural networks the way we correct humans. For example, I can train someone to recognize a ripe fruit by looking at it. I can ask them what they saw and why, and I can correct them until they are trained to expert level. During that process they are continually re-training their own internal neural network. Since a neural network can never explain its decision, our only choice is to try throwing more data at it or to use a different type of network. Compared to human learning, it is a very inefficient process.
Of course, most of the time you never think about why you recognize an object or a familiar face - it just happens effortlessly and seems so easy that it hardly feels like you are doing anything at all. The next time you recognize a family member in a blurry picture, take a moment and think about how amazing that ability is.