In the modern era, individuals are flooded with information. Developments in storage technology have enabled vast amounts of information, from sources such as entertainment media and historical archives, to be preserved cheaply and at scale. Consequently, there is a pressing need for approaches that can search through this content efficiently.
In this thesis, we leverage developments in deep learning and various data sources to advance the capabilities of retrieval systems in understanding the interactions between text, audio and video content. We do this through three core contributions.
First, we focus on text-video retrieval with natural language queries, collecting and benchmarking a high-quality dataset with detailed, well-localised descriptions and corresponding long videos. We showcase the advantages of using multiple experts, e.g. object and audio classification models, to improve text-video retrieval performance, as illustrated in the sketch below.
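To make the multi-expert idea concrete, the following is a minimal sketch, assuming a PyTorch implementation, of fusing per-expert video features with a text query for retrieval. The expert names, feature dimensions and gating scheme are illustrative assumptions, not the exact architecture used in the thesis.

```python
# Hypothetical sketch: text-gated fusion of multiple pretrained "expert" features
# (e.g. object and audio classifiers) into a single text-video similarity score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExpertFusion(nn.Module):
    def __init__(self, expert_dims, text_dim=768, joint_dim=256):
        super().__init__()
        # One projection per expert into a shared joint space.
        self.video_proj = nn.ModuleDict(
            {name: nn.Linear(d, joint_dim) for name, d in expert_dims.items()}
        )
        # The text query produces one embedding per expert plus mixture weights.
        self.text_proj = nn.ModuleDict(
            {name: nn.Linear(text_dim, joint_dim) for name in expert_dims}
        )
        self.text_gate = nn.Linear(text_dim, len(expert_dims))

    def forward(self, video_feats, text_feat):
        # video_feats: dict of expert name -> (batch, dim) video features
        # text_feat:   (batch, text_dim) pooled text query embedding
        weights = F.softmax(self.text_gate(text_feat), dim=-1)   # (B, n_experts)
        sims = []
        for i, name in enumerate(self.video_proj):
            v = F.normalize(self.video_proj[name](video_feats[name]), dim=-1)
            t = F.normalize(self.text_proj[name](text_feat), dim=-1)
            sims.append(weights[:, i] * (v * t).sum(-1))         # per-expert similarity
        return torch.stack(sims, dim=-1).sum(-1)                 # weighted sum -> score

# Example usage with two illustrative experts.
model = MultiExpertFusion({"object": 2048, "audio": 512})
score = model(
    {"object": torch.randn(4, 2048), "audio": torch.randn(4, 512)},
    torch.randn(4, 768),
)
```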
Second, we propose new benchmarks for semantic text-audio retrieval using free-form text. We employ state-of-the-art multimodal text-video retrieval models for this task and investigate how useful visual support is for finding the correct audio file. Additionally, we propose a large free-form text-audio dataset to aid the training of large text-audio models. Lastly, we investigate whether text-audio retrieval models understand the temporal ordering of sound events. We then propose a new contrastive loss term to guide the model to focus on temporal cues (see the sketch following this paragraph).
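As an illustration of this kind of objective, the following is a hedged sketch of a text-audio contrastive loss with an additional term that discourages matching captions whose sound events are given in the wrong order. The exact formulation in the thesis may differ; this only illustrates the idea of treating temporally shuffled captions as hard negatives.

```python
# Assumption: inputs are L2-normalised embeddings; the margin, temperature and
# weighting are illustrative values, not the thesis's actual hyperparameters.
import torch
import torch.nn.functional as F

def contrastive_with_temporal_term(audio_emb, text_emb, shuffled_text_emb,
                                   temperature=0.07, lam=0.5, margin=0.2):
    # audio_emb, text_emb, shuffled_text_emb: (B, D) embeddings, where
    # shuffled_text_emb encodes the same captions with event order permuted.
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Standard symmetric InfoNCE over audio->text and text->audio directions.
    loss_nce = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Temporal term: the true caption should score higher than its
    # event-reordered counterpart (margin ranking on matched pairs).
    pos = (audio_emb * text_emb).sum(-1)                      # (B,)
    neg = (audio_emb * shuffled_text_emb).sum(-1)             # (B,)
    loss_temporal = F.relu(neg - pos + margin).mean()

    return loss_nce + lam * loss_temporal
```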
Finally, we employ the world knowledge of Large Language Models (LLMs) to leverage large text-video datasets for text-audio understanding. We show that LLMs are capable of proposing plausible descriptions for video soundtracks, starting from the visual-based descriptions of the video content. This is important, as it can be used to scale up current text-audio retrieval datasets.
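To illustrate the idea, the following is a minimal sketch of prompting an LLM to turn a visual video caption into a plausible soundtrack description. The prompt wording, the `query_llm` helper and the example output are hypothetical assumptions; any instruction-following LLM client could be substituted.

```python
# Hypothetical prompt template for converting visual captions into audio captions.
PROMPT_TEMPLATE = (
    "The following sentence describes what is visible in a video:\n"
    "\"{caption}\"\n"
    "Describe what the soundtrack of this video most plausibly sounds like, "
    "in one sentence, mentioning only audible events."
)

def caption_to_audio_description(caption: str, query_llm) -> str:
    # query_llm: a callable that sends a prompt string to an LLM and returns
    # its text response (implementation left to the chosen LLM client).
    return query_llm(PROMPT_TEMPLATE.format(caption=caption))

# Example (hypothetical):
#   visual caption: "A man chops vegetables on a wooden board in a busy kitchen."
#   audio caption:  "Rhythmic knife chops on wood with background kitchen chatter."
```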