91,149 research outputs found

    Discrete language models for video retrieval

    Get PDF
    Finding relevant video content is important for producers of television news, documentanes and commercials. As digital video collections become more widely available, content-based video retrieval tools will likely grow in importance for an even wider group of users. In this thesis we investigate language modelling approaches, that have been the focus of recent attention within the text information retrieval community, for the video search task. Language models are smoothed discrete generative probability distributions generally of text and provide a neat information retrieval formalism that we believe is equally applicable to traditional visual features as to text. We propose to model colour, edge and texture histogrambased features directly with discrete language models and this approach is compatible with further traditional visual feature representations. We provide a comprehensive and robust empirical study of smoothing methods, hierarchical semantic and physical structures, and fusion methods for this language modelling approach to video retrieval. The advantage of our approach is that it provides a consistent, effective and relatively efficient model for video retrieval

    TRECVID 2004 experiments in Dublin City University

    Get PDF
    In this paper, we describe our experiments for TRECVID 2004 for the Search task. In the interactive search task, we developed two versions of a video search/browse system based on the Físchlár Digital Video System: one with text- and image-based searching (System A); the other with only image (System B). These two systems produced eight interactive runs. In addition we submitted ten fully automatic supplemental runs and two manual runs. A.1, Submitted Runs: • DCUTREC13a_{1,3,5,7} for System A, four interactive runs based on text and image evidence. • DCUTREC13b_{2,4,6,8} for System B, also four interactive runs based on image evidence alone. • DCUTV2004_9, a manual run based on filtering faces from an underlying text search engine for certain queries. • DCUTV2004_10, a manual run based on manually generated queries processed automatically. • DCU_AUTOLM{1,2,3,4,5,6,7}, seven fully automatic runs based on language models operating over ASR text transcripts and visual features. • DCUauto_{01,02,03}, three fully automatic runs based on exploring the benefits of multiple sources of text evidence and automatic query expansion. A.2, In the interactive experiment it was confirmed that text and image based retrieval outperforms an image-only system. In the fully automatic runs, DCUauto_{01,02,03}, it was found that integrating ASR, CC and OCR text into the text ranking outperforms using ASR text alone. Furthermore, applying automatic query expansion to the initial results of ASR, CC, OCR text further increases performance (MAP), though not at high rank positions. For the language model-based fully automatic runs, DCU_AUTOLM{1,2,3,4,5,6,7}, we found that interpolated language models perform marginally better than other tested language models and that combining image and textual (ASR) evidence was found to marginally increase performance (MAP) over textual models alone. For our two manual runs we found that employing a face filter disimproved MAP when compared to employing textual evidence alone and that manually generated textual queries improved MAP over fully automatic runs, though the improvement was marginal. A.3, Our conclusions from our fully automatic text based runs suggest that integrating ASR, CC and OCR text into the retrieval mechanism boost retrieval performance over ASR alone. In addition, a text-only Language Modelling approach such as DCU_AUTOLM1 will outperform our best conventional text search system. From our interactive runs we conclude that textual evidence is an important lever for locating relevant content quickly, but that image evidence, if used by experienced users can aid retrieval performance. A.4, We learned that incorporating multiple text sources improves over ASR alone and that an LM approach which integrates shot text, neighbouring shots and entire video contents provides even better retrieval performance. These findings will influence how we integrate textual evidence into future Video IR systems. It was also found that a system based on image evidence alone can perform reasonably and given good query images can aid retrieval performance

    Zero-shot Audio Topic Reranking using Large Language Models

    Full text link
    The Multimodal Video Search by Examples (MVSE) project investigates using video clips as the query term for information retrieval, rather than the more traditional text query. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element for this process is highly rapid, flexible, search to support large archives, which in MVSE is facilitated by representing video attributes by embeddings. This work aims to mitigate any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models are investigated as these are applicable to any video archive audio content. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking can achieve improved retrieval ranking without the need for any task-specific training data

    Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising

    Full text link
    The advancement of the communication technology and the popularity of the smart phones foster the booming of video ads. Baidu, as one of the leading search engine companies in the world, receives billions of search queries per day. How to pair the video ads with the user search is the core task of Baidu video advertising. Due to the modality gap, the query-to-video retrieval is much more challenging than traditional query-to-document retrieval and image-to-image search. Traditionally, the query-to-video retrieval is tackled by the query-to-title retrieval, which is not reliable when the quality of tiles are not high. With the rapid progress achieved in computer vision and natural language processing in recent years, content-based search methods becomes promising for the query-to-video retrieval. Benefited from pretraining on large-scale datasets, some visionBERT methods based on cross-modal attention have achieved excellent performance in many vision-language tasks not only in academia but also in industry. Nevertheless, the expensive computation cost of cross-modal attention makes it impractical for large-scale search in industrial applications. In this work, we present a tree-based combo-attention network (TCAN) which has been recently launched in Baidu's dynamic video advertising platform. It provides a practical solution to deploy the heavy cross-modal attention for the large-scale query-to-video search. After launching tree-based combo-attention network, click-through rate gets improved by 2.29\% and conversion rate get improved by 2.63\%.Comment: This revision is based on a manuscript submitted in October 2020, to ICDE 2021. We thank the Program Committee for their valuable comment

    A Database Approach for Modeling and Querying Video Data

    Get PDF
    Indexing video data is essential for providing content based access. In this paper, we consider how database technology can offer an integrated framework for modeling and querying video data. As many concerns in video (e.g., modeling and querying) are also found in databases, databases provide an interesting angle to attack many of the problems. From a video applications perspective, database systems provide a nice basis for future video systems. More generally, database research will provide solutions to many video issues even if these are partial or fragmented. From a database perspective, video applications provide beautiful challenges. Next generation database systems will need to provide support for multimedia data (e.g., image, video, audio). These data types require new techniques for their management (i.e., storing, modeling, querying, etc.). Hence new solutions are significant. This paper develops a data model and a rule-based query language for video content based indexing and retrieval. The data model is designed around the object and constraint paradigms. A video sequence is split into a set of fragments. Each fragment can be analyzed to extract the information (symbolic descriptions) of interest that can be put into a database. This database can then be searched to find information of interest. Two types of information are considered: (1) the entities (objects) of interest in the domain of a video sequence, (2) video frames which contain these entities. To represent these information, our data model allows facts as well as objects and constraints. We present a declarative, rule-based, constraint query language that can be used to infer relationships about information represented in the model. The language has a clear declarative and operational semantics. This work is a major revision and a consolidation of [12, 13].This is an extended version of the article in: 15th International Conference on Data Engineering, Sydney, Australia, 1999

    Scenario-Based Query Processing for Video-Surveillance Archives

    Get PDF
    Cataloged from PDF version of article.Automated video surveillance has emerged as a trendy application domain in recent years, and accessing the semantic content of surveillance video has become a challenging research area. The results of a considerable amount of research dealing with automated access to video surveillance have appeared in the literature; however, significant semantic gaps in event models and content-based access to surveillance video remain. In this paper, we propose a scenario-based query-processing system for video surveillance archives. In our system, a scenario is specified as a sequence of event predicates that can be enriched with object-based low-level features and directional predicates. We introduce an inverted tracking scheme, which effectively tracks the moving objects and enables view-based addressing of the scene. Our query-processing system also supports inverse querying and view-based querying, for after-the-fact activity analysis. We propose a specific surveillance query language to express the supported query types in a scenario-based manner. We also present a visual query-specification interface devised to facilitate the query-specification process. We have conducted performance experiments to show that our query-processing technique has a high expressive power and satisfactory retrieval accuracy in video surveillance. (C) 2009 Elsevier Ltd. All rights reserved

    Multi modal multi-semantic image retrieval

    Get PDF
    PhDThe rapid growth in the volume of visual information, e.g. image, and video can overwhelm users’ ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted into in order to attempt to extract knowledge from these images, enhancing the retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared that supports multi-semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain specific image collection, e.g. sports, and is able to disambiguate and assign high level semantics to ‘unannotated’ images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, have been deployed in the ‘Bag of Visual Words’ model (BVW) as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon the use of an unstructured visual word and upon a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, through exploiting local conceptual structures and their relationships. The key contributions of this framework in using local features for image representation include: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm which takes term weight and spatial locations of keypoints into account. Consequently, the semantic information is preserved. Second a technique is used to detect the domain specific ‘non-informative visual words’ which are ineffective at representing the content of visual data and degrade its categorisation ability. Third, a method to combine an ontology model with xi a visual word model to resolve synonym (visual heterogeneity) and polysemy problems, is proposed. The experimental results show that this approach can discover semantically meaningful visual content descriptions and recognise specific events, e.g., sports events, depicted in images efficiently. Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image, as a cue to predict the meaning of an image, by transforming this textual information into a structured annotation for an image e.g. using XML, RDF, OWL or MPEG-7. Although, text and image are distinct types of information representation and modality, there are some strong, invariant, implicit, connections between images and any accompanying text information. Semantic analysis of image captions can be used by image retrieval systems to retrieve selected images more precisely. To do this, a Natural Language Processing (NLP) is exploited firstly in order to extract concepts from image captions. Next, an ontology-based knowledge model is deployed in order to resolve natural language ambiguities. To deal with the accompanying text information, two methods to extract knowledge from textual information have been proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of LSI in relation to a domain-specific ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) of metadata. The use of the ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and leads to narrowing of the semantic gap between lower level machinederived and higher level human-understandable conceptualisation

    Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content

    Get PDF
    The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual language video (Dutchlanguage television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and 10 thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS-feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, with a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks
    corecore