Humans can easily observe events and anticipate what is likely to happen next, but this kind of predictive behavior has long been difficult for AI. Well, not anymore. Researchers at Google have proposed VideoBERT, a self-supervised system that is able to make such predictions from unlabeled videos.
“Speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision.”, wrote Google researchers in a blog post.
VideoBERT builds on Google's BERT to learn representations from video. Notably, BERT (Bidirectional Encoder Representations from Transformers) is the cutting-edge model used by Google for natural-language applications.
The researchers converted image frames into visual tokens of 1.5-second duration and combined them with word tokens derived from the automatic speech recognition output. These visual tokens are concatenated with the word tokens into a single sequence, and the VideoBERT model is trained to fill in tokens that have been masked out.
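The token-level setup described above can be sketched in a few lines. Everything here is illustrative: the token strings, the `[>]` separator between the text and video streams, and the masked positions are stand-ins for what the actual model vocabulary and training sampler would produce, not the real VideoBERT implementation.

```python
# Hypothetical tokens: word tokens come from ASR transcripts, visual tokens
# from 1.5-second video clips quantized into a discrete "visual vocabulary".
word_tokens = ["put", "the", "batter", "in", "the", "oven"]
visual_tokens = ["VIS_012", "VIS_480", "VIS_311"]  # one id per 1.5 s clip

# Both streams are concatenated into one sequence, mirroring BERT's
# sentence-pair format, with special marker tokens in between.
sequence = ["[CLS]"] + word_tokens + ["[>]"] + visual_tokens + ["[SEP]"]

# BERT-style pre-training hides some tokens and trains the model to
# reconstruct them from the surrounding context. In training the masked
# positions are sampled randomly; here two are fixed for illustration,
# one word token and one visual token.
mask_positions = {3, 9}
masked = ["[MASK]" if i in mask_positions else t
          for i, t in enumerate(sequence)]

print(masked)
# The model's objective would be to recover "batter" and "VIS_480"
# from the unmasked text and video context.
```

Because text and video tokens share one sequence, the same masked-prediction objective lets the model learn correspondences across the two modalities, which is what enables the zero-shot predictions described below.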
The blog explains that the researchers trained VideoBERT on over one million instructional videos covering cooking, gardening, and vehicle repair. The researchers also verified VideoBERT's outputs to evaluate the model's accuracy.
According to the researchers, VideoBERT was able to predict that a bowl of flour and cocoa powder, once baked in an oven, may turn into a brownie or cupcake. The blog post also notes that VideoBERT often misses fine-grained visual information, such as smaller objects and subtle motions.
“Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation.”, concluded the researchers.