Multimodal video captioning systems use video frames and speech to generate natural language descriptions of videos. Such systems are stepping stones toward the long-term goal of developing multimodal conversational systems that communicate effortlessly with users while simultaneously being aware of their surroundings via multimodal input streams.
Unlike video understanding tasks, where the primary challenge lies in processing and understanding multimodal input videos, multimodal video captioning adds the further challenge of generating grounded captions. The most common approach to this task is to jointly train an encoder-decoder network on manually annotated data. However, annotating grounded captions for videos is labor-intensive and often impractical, so large-scale manually annotated datasets are scarce. Previous work such as VideoBERT and CoMVT pre-trains models on unlabelled videos using automatic speech recognition (ASR) transcripts, but only the video encoder transfers to downstream tasks; the decoder is not pre-trained.
In "End-to-End Generative Pre-training for Multimodal Video Captioning", to be presented at CVPR 2022, researchers present MV-GPT, a novel pre-training framework for multimodal video captioning. MV-GPT jointly trains a multimodal video encoder and a sentence decoder from uncaptioned videos, using a future utterance as the target text and a novel bidirectional generation objective.
The experiments show that MV-GPT transfers effectively to multimodal video captioning and achieves state-of-the-art performance on various benchmarks. In addition, the multimodal video encoder is competitive across multiple video understanding tasks, including VideoQA, text-to-video retrieval, and action recognition.
Future utterances as an additional text signal
In multimodal video captioning, each training video clip is typically associated with two texts: (1) a speech transcript that is aligned with the clip as part of the multimodal input stream, and (2) a target caption, which is often manually annotated. The encoder learns to fuse information from the transcript with the visual content, and the target caption is used to train the decoder for generation. However, for unlabelled videos, each video clip comes with only an ASR transcript and no manually annotated target caption. Moreover, the same text cannot be used as both the encoder input and the decoder target, since that would make target generation trivial.
MV-GPT circumvents this difficulty by using a future utterance as an additional text signal, enabling the encoder and decoder to be pre-trained jointly. However, simply training a model to generate future utterances is not ideal, because a future utterance is often not grounded in the input content.
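The pairing scheme described above can be sketched as follows. This is a minimal illustration, not code from the MV-GPT release: the function name and the simple one-step sampling of the "future" utterance are assumptions made for clarity (the actual work samples utterances at different times from the same video).

```python
# Hypothetical sketch: pair each clip's aligned transcript (encoder input)
# with an utterance sampled from a later point in the video (decoder target).
def build_pretraining_pairs(utterances, horizon=1):
    """utterances: list of (start_time, text) tuples in temporal order.

    Returns encoder/decoder text pairs for pre-training on uncaptioned video.
    Using a *future* utterance as the target avoids the trivial case where
    the decoder target is identical to the encoder input.
    """
    pairs = []
    for i in range(len(utterances) - horizon):
        current = utterances[i][1]           # transcript aligned with the clip
        future = utterances[i + horizon][1]  # future utterance as target text
        pairs.append({"transcript": current, "target": future})
    return pairs


# Example usage on a toy cooking video
utts = [(0.0, "crack the eggs"), (5.0, "whisk them"), (9.0, "pour into the pan")]
pairs = build_pretraining_pairs(utts)
```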
Bidirectional generation loss
The problem of generating ungrounded text is mitigated by formulating a bidirectional generation loss that includes both forward and backward generation. Forward generation produces future utterances given the visual frames and their corresponding transcripts, which lets the model learn to fuse the visual content with the aligned transcript. Backward generation takes the video's visual frames and a future utterance and trains the model to generate a transcript that is more grounded in the visual content. The bidirectional generation loss in MV-GPT thus pre-trains both the encoder and the decoder to handle text with a strong visual component.
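The structure of this loss can be sketched with a toy model. Everything here is an assumption for illustration: the `ToyCaptioner` architecture (mean-pooled fusion, a small GRU decoder) is a stand-in, not MV-GPT's actual transformer encoder-decoder; only the forward/backward loss pattern reflects the idea described above.

```python
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    """Hypothetical stand-in for a multimodal encoder + text decoder."""
    def __init__(self, vocab_size=100, frame_dim=16, hidden=32):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)
        self.text_emb = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, input_text, target_text):
        # Fuse visual frames with the input text (mean-pooled for brevity)
        ctx = self.frame_proj(frames).mean(1) + self.text_emb(input_text).mean(1)
        # Teacher-forced decoding: feed target[:-1], predict target[1:]
        dec_in = self.text_emb(target_text[:, :-1])
        h, _ = self.decoder(dec_in, ctx.unsqueeze(0))
        return self.out(h)

def bidirectional_generation_loss(model, frames, transcript, future_utt):
    ce = nn.CrossEntropyLoss()
    # Forward generation: (frames, aligned transcript) -> future utterance
    logits_f = model(frames, transcript, future_utt)
    loss_f = ce(logits_f.flatten(0, 1), future_utt[:, 1:].flatten())
    # Backward generation: (frames, future utterance) -> aligned transcript
    logits_b = model(frames, future_utt, transcript)
    loss_b = ce(logits_b.flatten(0, 1), transcript[:, 1:].flatten())
    return loss_f + loss_b
```

Because both directions route the target text through the decoder conditioned on the visual frames, the encoder and the decoder are both pre-trained, unlike encoder-only pre-training schemes.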
Results on multimodal video captioning
MV-GPT is compared against existing pre-training losses on YouCook2, using the same model architecture and standard evaluation metrics. Although every pre-training technique improves captioning performance, pre-training the decoder jointly with the encoder proves essential for the best results.
When a model pre-trained with MV-GPT is applied to four captioning benchmarks: YouCook2, MSR-VTT, ViTT, and ActivityNet-Captions, it achieves state-of-the-art performance on all four with significant margins, showing relative improvements of over 12 percent on the Meteor metric across all four benchmarks.
The researchers present MV-GPT, a new generative pre-training framework for multimodal video captioning. Its bidirectional generation objective jointly pre-trains a multimodal encoder and a caption decoder using utterances sampled at different times from uncaptioned videos. The pre-trained model achieves state-of-the-art performance on multiple video captioning benchmarks and on other video understanding tasks, including VideoQA, video retrieval, and action classification.
This article is written as a summary by Marktechpost staff based on the paper 'End-to-End Generative Pre-training for Multimodal Video Captioning'. All credit for this research goes to the researchers on this project.