A funny lesson in humility
If we want to have more intuitive conversations with robots, the robots will have to learn to understand what we are saying. So far this has been treated almost exclusively as a natural language processing (NLP) problem, and some of the largest research groups in the world, from Google to Baidu, have made considerable progress in this domain (the most advanced NLP tech from both of these companies is built into our Mitra robot, by the way).
But at Invento, we learned something from the thousands of human interactions that Mitra has on a daily basis: most communication is actually non-verbal!
People communicate more with their facial expressions and body language than they do with their words. Moreover, the same spoken sentence can have dramatically different meanings depending on the body language accompanying it. So, naturally, we wanted to teach Mitra to recognise body language.
As an AI Engineer at Invento, I took on this exciting project with the goal of making Mitra able to learn more about human body language.
This entire project grew out of a mini AI program that I wrote in my first week at Invento: Namaste Recognition! Yes, Indians love doing Namaste, and since Mitra is the first (and most advanced) Indian humanoid robot, it would be a shame if it couldn’t respond to a polite Namaste:
This Namaste Recognition program is built on top of PoseNet, a neural network developed at Google to detect the positions of human body parts in a 2D image.
Once you have the predicted locations of individual joints, you can write simple logic to see if a person is doing the Namaste pose or not. I also set up a minimum confidence level and other settings to ensure that the robot only responds when it is over 90% sure that the person is actually doing Namaste and not doing other similar-looking things like clapping.
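To make the "simple logic" idea concrete, here is a minimal sketch of what such a check could look like. It is not the code running on Mitra: the `Keypoint` class, the `is_namaste` function and the `max_wrist_gap` threshold are hypothetical names and values chosen for illustration, and applying the 90% figure to each individual keypoint score is an assumption. The keypoint names follow the ones PoseNet reports (`leftWrist`, `rightShoulder`, and so on).

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Keypoint:
    """One predicted joint: pixel position plus the model's confidence."""
    x: float
    y: float
    score: float


def is_namaste(pose, min_score=0.9, max_wrist_gap=0.35):
    """Rough heuristic: both wrists close together, in front of the chest.

    `pose` maps keypoint names to Keypoint objects. `max_wrist_gap` is the
    largest allowed wrist distance as a fraction of the shoulder width
    (an illustrative value, not a tuned one).
    """
    needed = ("leftWrist", "rightWrist", "leftShoulder", "rightShoulder")
    # Refuse to answer unless every joint we rely on is confidently detected.
    if any(name not in pose or pose[name].score < min_score for name in needed):
        return False

    lw, rw, ls, rs = (pose[name] for name in needed)
    shoulder_width = hypot(ls.x - rs.x, ls.y - rs.y)
    if shoulder_width == 0:
        return False

    # Normalise by shoulder width so the rule works at any distance from the camera.
    wrist_gap = hypot(lw.x - rw.x, lw.y - rw.y) / shoulder_width

    # The joined hands should sit horizontally between the two shoulders.
    left_edge, right_edge = sorted((ls.x, rs.x))
    hands_centre_x = (lw.x + rw.x) / 2
    centred = left_edge <= hands_centre_x <= right_edge

    return wrist_gap <= max_wrist_gap and centred
```

A rule like this cannot tell Namaste from clapping on a single frame. One possible extra setting (again an assumption) is to require the pose to hold for several consecutive frames before the robot responds.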
After the success of this experiment, Balaji, our CEO, asked me to build more gesture recognition skills for Mitra.
Now, AI research is often tricky because you have a lot of different approaches to choose from. It also takes time to explore any approach (and debug problems) before crossing it off the list and moving to the next one.
Let’s say you have the following image of me holding open my palm:
Now, the objective for the AI program is to accurately say that I am holding out an open palm (for saying hi or whatever).
There are several ways to do this:
The first is to train a simple image classification convolutional neural network (ConvNet), which looks at the whole image and gives a single answer, i.e. which hand gesture I am making.
ConvNets have done very well on huge complex datasets like ImageNet, so I believed that it should be straightforward to train a network to do so.
Boy was I wrong. For a whole week, despite my best attempts, I saw constant overfitting (the model could learn brilliantly from my training examples, but did not perform well in the real world).
This is because I did not have enough varied data. Even after augmenting the dataset to 10x its original volume through a variety of image processing methods, I still did not see any appreciable increase in real-world accuracy.
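For readers who have not seen augmentation before, here is a rough sketch of the idea. The specific transforms (flip, brightness change, small shift) are my illustrative picks, not necessarily the ones I used at the time, and the function names are hypothetical. To keep the example dependency-free, an image is just a nested list of grayscale pixel values; in practice you would use an image library.

```python
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def adjust_brightness(img, factor):
    """Scale every pixel by `factor`, clamping to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def shift(img, dx, fill=0):
    """Shift the image horizontally by dx pixels, padding with `fill`.

    Assumes abs(dx) is smaller than the image width.
    """
    width = len(img[0])
    if dx >= 0:
        return [[fill] * dx + row[:width - dx] for row in img]
    return [row[-dx:] + [fill] * (-dx) for row in img]


def augment(img, copies=9, seed=0):
    """Return the original image plus `copies` randomly perturbed variants."""
    rng = random.Random(seed)
    variants = [img]
    for _ in range(copies):
        out = hflip(img) if rng.random() < 0.5 else img
        out = adjust_brightness(out, rng.uniform(0.6, 1.4))
        out = shift(out, rng.randint(-2, 2))
        variants.append(out)
    return variants
```

With `copies=9`, every photo becomes ten training images, which gives the 10x volume. Note that every variant still shows the same hand against the same background, which is one plausible reason why the extra volume did not translate into real-world accuracy.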
The reason is that the hand usually makes up less than 5% of the whole image, and even within that small region, the positions of the fingers often differ only slightly between gestures.
Couple this with the limited training data I had, and you can imagine how it wasn’t so easy.
So I did some review of what other researchers in the community had done earlier, and I came across other approaches.
Chief among these is Hand3D: you detect only the hand, separate it from the rest of the image, and then run a sophisticated neural network similar to PoseNet, to get the positions of fingers and joints.
This was the result:
As you can see, it looks pretty cool and gives some valuable information. But here’s the downside: it is slow and sometimes buggy. Moreover, it only gives you the 3D positions of the joints – to convert them into gestures, you need to add another classifier to the pipeline, which makes three separate models (one to find the hand, one to locate the joints, and one to name the gesture) for a single task! Needless to say, this is extremely inefficient.
Moreover, with these “simple” (heh heh) approaches, you can only recognise one gesture at a time. What if multiple people are in the frame, or one person is doing a different gesture with each hand?
Another approach I considered, but which I later decided not to try, was Attention Residual Networks:
These are, again, single networks that learn to focus on specific parts of the image and ignore the noise. But before jumping into this exciting high-level technique, I wanted to rule out the simpler approaches first.
This is an important lesson!
Artificial Intelligence and Computer Vision are currently very hot research fields. Most of the work we see happening in this domain is coming out of research groups who are constantly looking for innovative ways to solve the same problem better and faster than before.
But the job of an AI engineer is different from the job of an AI researcher – an engineer is supposed to ignore “novelty” and focus, first and foremost, on simple, cost-effective solutions that other engineers can maintain.
Finally, I decided to give regular object detection a shot for gesture recognition. This approach has not been widely tried before, for two reasons:
- There are no publicly available datasets, especially none with Indian gestures like Namaste or slightly less polite gestures like the middle finger (which I wanted Mitra to be able to recognise in case it ever meets an angry customer!)
- Even if you get a dataset, it will very likely not be labelled with bounding boxes.
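To show what "labelled with bounding boxes" means in practice, here is a small sketch of one common label format. It assumes the darknet-style YOLO convention, where each object is one text line of the form `class_id x_center y_center width height`, with all four numbers expressed as fractions of the image size. The function name `to_yolo_label` and the class numbering are hypothetical.

```python
def to_yolo_label(class_id, box, image_w, image_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.

    Assumes the darknet text format: class id followed by the box centre
    and size, each normalised to the 0..1 range.
    """
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_w
    y_center = (y_min + y_max) / 2 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: class 0 (say, Namaste) drawn from (100, 50) to (300, 250)
# on a 400x500 image.
print(to_yolo_label(0, (100, 50, 300, 250), 400, 500))
# prints "0 0.500000 0.300000 0.500000 0.400000"
```

Every gesture in every image needs one such line, drawn by hand in a labelling tool, which is why an unlabelled dataset is only half a dataset for this approach.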
I was confident that it would be worth more to build my own dataset for this task than to dive into relatively untested technologies like Attention Residual Networks as described above.
So I made my own dataset of around 2,000 images. Oh boy was it a long day of work!
However, once I had created my own dataset, the results were worth it! I trained a YOLOv2 network for 10,000 iterations. The project was a success.
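An object detector returns one box per gesture it finds, so several people (or two hands doing different things) can be handled in a single pass. Below is a minimal sketch of how the raw detections might be cleaned up before the robot reacts. It is an illustration rather than Mitra's actual code: the `Detection` class, the `filter_detections` function and both thresholds are assumptions, and the overlap removal is a plain, class-agnostic form of non-maximum suppression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str    # e.g. "namaste" or "open_palm"
    score: float  # detector confidence in the 0..1 range
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, min_score=0.5, max_iou=0.45):
    """Keep confident detections, dropping boxes that overlap a stronger one.

    Overlap is checked across all labels, on the assumption that one hand
    shows only one gesture at a time.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            break  # sorted by score, so everything after this is weaker
        if all(iou(det.box, k.box) <= max_iou for k in kept):
            kept.append(det)
    return kept
```

Given two heavily overlapping Namaste boxes and a separate open palm elsewhere in the frame, this keeps the stronger Namaste box and the open palm, so the robot sees two distinct gestures.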