If you are reading this, I’m assuming that you’ve read “part 1” of the series and already have the context. If not, suffice it to say that the article below was written a while ago, during my tenure as Senior Research Scientist at the Tandemlaunch startup foundry in Montreal, Canada, and that without their kind permission you would not be seeing the text below here on the Cinteraction blog. Hope you find it interesting!
On Deep Learning and The Future of Computer Vision vol. 2 (written in October 2019)
This one was hard to name.
Somewhere in the process of toying with several topics that would make for a good column contribution, I realized that since the August 2018 article titled “On Deep Learning and the Future of Computer Vision”, quite a few ground-breaking things have happened that relate to, and even validate, what I wrote there. Once that worm was in my system, it was hard to get excited about the other topics I considered, and they simply got pushed back. So here we are, going into the summer with an end-of-spring-season edition, borrowing from that great sitcom legacy of “recap” episodes.
In all seriousness though, while I was aware of the speed at which computer vision and AI research was progressing and growing, what I saw at the Neural Information Processing Systems conference (NeurIPS)1, the NVIDIA GPU Technology Conference (GTC)2 and the International Conference on Robotics and Automation (ICRA)3 still managed to surprise me.
I should at this point say that I feel truly blessed that both NeurIPS and ICRA took place right here at the Palais des congrès in Montreal within the last 9 months, enabling all of us locals to experience them within the metropolitan environment we call home. Add to that the fact that Montreal’s own Yoshua Bengio has, together with Geoffrey Hinton and Yann LeCun, been awarded the Turing Award “for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing” and see if you can blame me for wanting to write about nothing other than deep learning and robots, which is essentially what I wrote about in my first article for this column.
Looking back, there were only two people I mentioned by name in the text (although I mentioned FacebookAI, which is pretty much synonymous with Yann LeCun): Geoffrey Hinton and Josh Tenenbaum. While I remain a “crystal-dyslexic computer vision enthusiast”, it fills me with no small amount of joy to be able to report that in addition to the great recognition of Geoff Hinton and Yann, Josh has since been named R&D Magazine’s 2018 Innovator of the Year. The missing Turing laureate, Yoshua Bengio, is a giant in the deep learning domain. Even more so for his work on the Montreal Declaration for responsible AI development. However, Yoshua’s research focus has mainly been on natural language processing technology, so he did not find his place in “vol. 1” of this article. Just as well, more for the future.
Back to the technology, the limitations of deep learning and what the last three quarters have taught us. The basics have not changed: we are still working on creating a well-structured description of the world and developing ways to make our AI systems run on more devices (on the edge), rather than on modern-day equivalents of the Harvard Mark I and IBM mainframes residing in the cloud. And boy are we moving!
On the structured-description front, I discussed Hinton’s capsule networks, which are inspired both by the way we handle objects in computer graphics and by insights from neuroscience, as well as the efforts by FacebookAI and Josh Tenenbaum’s research team at MIT CSAIL, which have focused on representing the visual world with programs (probabilistic ones, in the latter case).
I still think that Hinton is very much worth listening to and, if you haven’t already, take a look at his recent Google I/O talk. He muses that capsule nets might be “the thing he ends up not doing”, but gives a nod to transformer nets, which share with capsule nets the basic concept of controlling the flow of information through the network. The bidirectional transformers (BERT)4 have already led to amazing improvements in natural language understanding by producing better contextual representations for the domain. So, the basic ideas created to help us understand the visual world better are now helping us create better representations for understanding text. This is something that I think we will see increasingly in the future.
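To make “contextual representations” a bit more concrete, here is a minimal sketch of pulling per-token contextual embeddings out of a pretrained BERT model. The Hugging Face transformers library and the bert-base-uncased checkpoint are my choices for illustration; the article itself does not prescribe any particular toolkit.

```python
# Minimal sketch: contextual token representations from a pretrained BERT.
# Assumes the Hugging Face "transformers" library and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The robot picked up the red block."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)word token, each conditioned on the whole sentence;
# these are the contextual representations discussed above.
contextual_embeddings = outputs.last_hidden_state
print(contextual_embeddings.shape)  # [1, number_of_tokens, 768]
```

The same word (say, “block”) gets a different vector in a different sentence, which is exactly what makes these representations “contextual”.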
In my August article I pointed out that the ImageNet Large Scale Visual Recognition Challenge classes, which allowed us to train our first deep learning models for vision, were designed to represent WordNet synsets (sets of cognitive synonyms), WordNet itself being the result of efforts to build structured descriptions from text. Subsequent initiatives, such as Common Objects in Context, have enabled us to do more challenging things, such as generate reasonable descriptions of images. The important thing here was not the achievement in and of itself, but the successful integration of two traditionally very separate areas of AI – computer vision and natural language understanding. There was no well-structured description to speak of (or see with) as of yet, but it allowed us to bridge the gap caused by the different sets of research skills you needed to deal with text and with visual data.
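The ImageNet–WordNet link is quite literal: every ImageNet class ID is a WordNet noun synset offset with an “n” prefix. Below is a small sketch of resolving such an ID with NLTK’s WordNet interface; the library choice and the specific example ID are mine, used only to illustrate the mapping.

```python
# Sketch of the ImageNet <-> WordNet mapping: an ImageNet class ID such as
# "n02084071" is just a WordNet 3.0 noun synset offset, zero-padded to eight
# digits and prefixed with the part of speech. Requires NLTK with the
# WordNet corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def wnid_to_synset(wnid: str):
    """ImageNet-style ID (e.g. 'n02084071') -> WordNet synset."""
    return wn.synset_from_pos_and_offset(wnid[0], int(wnid[1:]))

def synset_to_wnid(synset) -> str:
    """WordNet synset -> ImageNet-style ID."""
    return f"{synset.pos()}{synset.offset():08d}"

syn = wnid_to_synset("n02084071")             # example ID used for illustration
print(syn.name(), "-", syn.definition())
print(synset_to_wnid(wn.synset("dog.n.01")))  # and back to an ImageNet-style ID
```

Walking up the hypernym chain of a synset (syn.hypernyms()) is what exposes the taxonomy-like, structured side of those class labels.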
So, you could generate a reasonable description of visual data without a real understanding of what is going on, but what proved more elusive is answering specific questions about the things you see. Ergo Visual Question Answering (VQA), the pinnacle of which are the latest extensions of FacebookAI’s work5 and the MIT CSAIL (Josh Tenenbaum) stuff6.
Welcome to the age of neuro-symbolic models, which learn visual concepts and words jointly and can do so in an unsupervised way, as is the case with MIT CSAIL’s fresh-off-the-press neuro-symbolic concept learner. Just in the last 9 months, significant breakthroughs have been made in the domain of structured descriptions by those who cross the now obsolete boundary between natural language processing and computer vision. I did not expect it to progress at this speed, but I stick to my original recommendation: these are the groups to pay attention to. If the last three quarters have taught us anything, it’s that I am hardly the only one out there who thinks so.
Since the last time I wrote on the subject, edge-focused deep learning accelerators have moved from the realm of press releases and circuits integrated within mobile phones to the palms of our hands. In my case, quite literally, as I could not resist getting my hands on a Jetson Nano right there at the Silicon Valley GTC in March. And since I got something that can achieve 64 frames per second inference with MobileNet-v27 for $99, I have no regrets. Google8 and Intel9 are not lagging behind and have made their own edge inference accelerators available as USB sticks with competitive pricing. And all these vendors are offering software to make standard deep learning models more suitable to run on these devices with limited, albeit impressive, computational capabilities. AI has not so much edged as leapt towards the edge in the last few months.
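That vendor software boils down to a common recipe: take a standard pretrained network and compress or recompile it for the target accelerator. As a hedged illustration (and not the specific toolchain behind the Jetson Nano benchmark above, which relies on NVIDIA’s own tools), here is what post-training quantization looks like with TensorFlow Lite.

```python
# Sketch of the "make a standard model edge-friendly" step: post-training
# quantization with TensorFlow Lite. Each vendor ships its own equivalent
# (TensorRT for Jetson, the Edge TPU compiler for Coral, OpenVINO for
# Movidius); this is just one illustrative workflow, not the tool used for
# the benchmark quoted above.
import tensorflow as tf

# Start from a stock ImageNet-pretrained MobileNetV2.
model = tf.keras.applications.MobileNetV2(weights="imagenet")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Default optimization enables weight quantization, shrinking the model and
# speeding up inference on constrained hardware at a small accuracy cost.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("mobilenet_v2_quant.tflite", "wb") as f:
    f.write(tflite_model)
```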
The only bad news that I have to report concerns our microbot of the future, the one relying on neuromorphic hardware. Development in the domain of neuromorphic computing seems to be following a more conventional path. But never fear, groups such as that of Arijit Raychowdhury10 at Georgia Tech are already prototyping low-power AI accelerators focused on microbot (swarm) applications. And although ICRA was a much smaller affair than NeurIPS or GTC, with a much more “same old stuff” feel, I never tired of watching the Bitcraze Crazyflie11 swarm take off, fly around the enclosure and land to recharge, before doing it all again. The 5-person company has been around since 2011 and targets researchers while following a strict open source policy, allowing it to capitalize on improvements by teams around the world. The drones can follow trajectories autonomously (though they need an anchor-based system to determine their position) and are due to get peer-to-peer communication capability in the next few months. Again, the drone is very affordable: you can get it for $200 or so. While we are waiting for the neuromorphic systems to deliver on their promise, we have all the things we need to deploy AI and CV on the edge, and the edge might be moving under its own power while hanging out with a swarm of its friends.
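For anyone curious what that open source policy means in practice, Bitcraze also maintains a Python client library (cflib) for scripting the drones. The snippet below is a minimal sketch of commanding a single Crazyflie to hover and move; the radio URI and the presence of a positioning aid (such as the anchor system mentioned above, or a Flow deck) are assumptions for illustration, not a recipe for the swarm demo I watched.

```python
# Minimal sketch: fly a single Crazyflie with Bitcraze's open-source cflib.
# Assumes a Crazyradio dongle, a drone at the URI below, and some positioning
# aid (Flow deck or anchors) so the MotionCommander can hold a stable hover.
import time

import cflib.crtp
from cflib.crazyflie import Crazyflie
from cflib.crazyflie.syncCrazyflie import SyncCrazyflie
from cflib.positioning.motion_commander import MotionCommander

URI = "radio://0/80/2M/E7E7E7E7E7"  # example address, adjust for your setup

cflib.crtp.init_drivers()  # initialize the available radio drivers

with SyncCrazyflie(URI, cf=Crazyflie(rw_cache="./cache")) as scf:
    # MotionCommander takes off on enter and lands on exit.
    with MotionCommander(scf, default_height=0.5) as mc:
        time.sleep(2)       # hover for a couple of seconds
        mc.forward(0.3)     # move 0.3 m forward
        mc.turn_left(180)   # rotate in place
        mc.forward(0.3)     # and come back
```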
References:
1 https://nips.cc/
2 https://www.nvidia.com/en-us/gtc
3 https://www.icra2019.org
4 Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
5 Vedantam, R., Desai, K., Lee, S., Rohrbach, M., Batra, D., & Parikh, D. (2019). Probabilistic neural-symbolic models for interpretable visual question answering. arXiv preprint arXiv:1902.07864.
6 Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., & Wu, J. (2019). The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584.
7 https://developer.nvidia.com/embedded/jetson-nano-dl-inference-benchmarks
8 https://coral.withgoogle.com/
9 https://software.intel.com/en-us/movidius-ncs
10 Cao, N., Chang, M., & Raychowdhury, A. (2019, February). 14.1 A 65nm 1.1-to-9.1 TOPS/W hybrid-digital-mixed-signal computing platform for accelerating model-based and model-free swarm robotics. In 2019 IEEE International Solid-State Circuits Conference (ISSCC) (pp. 222-224). IEEE.
11 https://www.bitcraze.io/