In computer vision, visual object tracking is an important but difficult research problem, and it has made significant strides in recent years with the help of convolutional neural networks. Recently, researchers from NEC Laboratories America, PARC (Palo Alto Research Center), Amazon, and Stanford University worked together on the problem of realistically modifying scene text in videos. The researchers named their framework STRIVE (Scene Text Replacement In VidEos). The main goal of this research is to generate custom content for marketing and promotional purposes.
Several attempts have been made to automate text replacement in still photos using deep style transfer. One approach to video text replacement would be to train an image-based text style transfer module on individual frames and add temporal consistency requirements to the network loss. Instead, the research group used a different strategy. They first extract text regions of interest (ROIs) and train a Spatio-Temporal Transformer Network (STTN) to frontalize the ROIs and make them consistent over time. They then scan the video for a reference frame with good text quality, scored for text clarity, size, and shape.
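The reference-frame selection step can be sketched as a simple scoring problem. The function names, score ranges, and weights below are hypothetical illustrations, not taken from the STRIVE implementation:

```python
# Illustrative sketch of reference-frame selection as described above.
# All names and weights are hypothetical, not from the STRIVE code.

def score_frame(clarity, size, shape, weights=(0.5, 0.3, 0.2)):
    """Combine per-frame text quality scores into one ranking value."""
    w_clarity, w_size, w_shape = weights
    return w_clarity * clarity + w_size * size + w_shape * shape

def pick_reference(frames):
    """Return the index of the frame with the best combined text score.

    `frames` is a list of (clarity, size, shape) tuples in [0, 1].
    """
    scores = [score_frame(*f) for f in frames]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: the second frame has the sharpest, largest text overall.
frames = [(0.4, 0.9, 0.5), (0.9, 0.8, 0.7), (0.6, 0.2, 0.9)]
best = pick_reference(frames)
```

Once a reference frame is chosen, the edited text only needs to be produced once and can then be propagated back to the other frames via the frontalization transform.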
Similarly, researchers from Japan's University of Tsukuba and Waseda University have presented a unified framework that enables a variety of remastering processes for digitally converted films. The method is implemented with a fully convolutional network. The researchers chose temporal convolutions over recurrent models for video processing because they can take information from multiple input frames into account at the same time.
There have been many approaches to recognizing scene text in videos, but most only recognize scene text in individual frames to cope with low resolution or complicated image backgrounds. Only a few approaches took spatial and temporal context into account, i.e., not only recognizing text in individual frames but also considering context information across many frames.
- Researchers from Barcelona proposed a real-time video text detection system based on MSERs (maximally stable extremal regions), without evaluation on a benchmark dataset.
- A group of scholars from China offered a multi-strategy tracking system, although hand-crafted rules were used to select the best match from tracking-by-detection, spatio-temporal context learning, and linear prediction.
- The Chinese research team extended the previous method, using dynamic programming to find the globally optimal match in a unified framework.
- A research team from California developed a network-flow-based technique.
- Researchers from Singapore used network flow to generate text lines in single images.
Technology behind it
The majority of video text recognition algorithms have two steps: the first recognizes text in individual frames or key frames, and the second tracks the proposals over the short and long term. Low-contrast and low-resolution footage, text with multiple orientations and scales, and random motion hinder both steps. Many established approaches to text recognition in images can also be used to recognize text in video frames, and temporal context information is useful for video text recognition.
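The second step, linking per-frame detections into tracks, is often done by matching boxes between consecutive frames. The sketch below shows a minimal tracking-by-detection linking step via intersection-over-union (IoU); the box format, greedy matching, and threshold are illustrative choices, not taken from any specific paper:

```python
# Minimal sketch of linking text detections across frames by IoU overlap.
# Box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_detections(prev_boxes, cur_boxes, thresh=0.5):
    """Greedily match each current-frame box to an unused previous-frame box."""
    links = {}  # current index -> previous index
    for i, cur in enumerate(cur_boxes):
        best_j, best_iou = None, thresh
        for j, prev in enumerate(prev_boxes):
            if j in links.values():
                continue  # each previous box joins at most one track
            overlap = iou(prev, cur)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            links[i] = best_j
    return links
```

Boxes left unmatched start new tracks; tracks with no match for several frames are terminated. Real systems replace the greedy loop with Hungarian matching or the network-flow formulations mentioned above.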
The analysis of spatio-temporal data requires considering temporal and spatial relationships together. Accounting for both the temporal and spatial dimensions significantly complicates the analysis for two main reasons:
1) Changes in the spatial and non-spatial properties of spatiotemporal objects over time, both continuous and discrete, and
2) The influence of collocated neighboring spatio-temporal objects on one another.
STTN is a network that can produce both spatially and temporally dense pixel correspondences. It is a transformer that uses a multi-scale patch-based attention module to search for coherent content across all frames in both the spatial and temporal dimensions. The module collects patches at different scales from all video frames to cover the many changes in appearance caused by complex video motion. A multi-head transformer then computes similarities among spatial patches across the different scales simultaneously. According to Amazon and the team, this could be the first attempt at deep video text replacement. A stronger research focus is needed in this area, and we can expect further research contributions in the foreseeable future.
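At its core, the patch-based attention described above is scaled dot-product attention over patch embeddings gathered from several frames. The toy NumPy sketch below shows that core computation only; the shapes, the single-scale simplification, and the single attention head are our assumptions, not the STTN implementation:

```python
import numpy as np

# Toy illustration of attention over patch embeddings pooled from several
# frames. Shapes and the single-head simplification are assumptions.

def attention(queries, keys, values):
    """Scaled dot-product attention over flattened patch embeddings."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Nq, Nk) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
    return weights @ values                         # weighted patch mixture

rng = np.random.default_rng(0)
patches = rng.normal(size=(12, 8))  # 12 patches from several frames, dim 8
out = attention(patches, patches, patches)
```

In the multi-scale, multi-head version, each head attends over patches extracted at a different scale, so fine detail and coarse motion are matched at the same time.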