
    Video retrieval based on deep convolutional neural network

    Recently, with the enormous growth of online videos, fast video retrieval has received increasing attention. As an extension of image hashing techniques, traditional video hashing methods mainly depend on hand-crafted features and transform the real-valued features into binary hash codes. Because videos provide far more diverse and complex visual information than images, extracting features from videos is much more challenging. High-level semantic features are therefore needed to represent videos, rather than low-level hand-crafted features. In this paper, a deep convolutional neural network is proposed to extract high-level semantic features, and a binary hash function is integrated into this framework to achieve end-to-end optimization. In particular, our approach combines a triplet loss, which preserves the relative similarity and difference of videos, with a classification loss as the optimization objective. Experiments on two public datasets demonstrate the superiority of the proposed method over other state-of-the-art video retrieval methods.
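
    The abstract does not describe the authors' implementation; purely as an illustration, the sketch below shows how a triplet loss and a classification loss can be combined over tanh-squashed features that are binarised into hash codes at retrieval time. The backbone features, code length, class count, and loss weighting are assumptions, not the paper's configuration.

```python
# Minimal sketch (PyTorch): triplet loss + classification loss over features
# squashed towards binary hash codes. All sizes and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoHashNet(nn.Module):
    def __init__(self, feat_dim=2048, code_len=64, num_classes=51):
        super().__init__()
        self.hash_layer = nn.Linear(feat_dim, code_len)   # real-valued codes
        self.classifier = nn.Linear(code_len, num_classes)

    def forward(self, pooled_video_feats):
        # pooled_video_feats: (batch, feat_dim) pooled CNN features per video
        h = torch.tanh(self.hash_layer(pooled_video_feats))  # in (-1, 1)
        return h, self.classifier(h)

def hashing_loss(anchor, positive, negative, logits, labels, margin=0.5, alpha=0.1):
    # Triplet term preserves relative similarity; classification term adds
    # semantic supervision; alpha is an assumed weighting factor.
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return triplet + alpha * F.cross_entropy(logits, labels)

# Usage sketch with random tensors standing in for pooled video features.
net = VideoHashNet()
a, logits_a = net(torch.randn(8, 2048))
p, _ = net(torch.randn(8, 2048))
n, _ = net(torch.randn(8, 2048))
loss = hashing_loss(a, p, n, logits_a, torch.randint(0, 51, (8,)))
loss.backward()
codes = torch.sign(a)  # binary hash codes used for retrieval
```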

    An integrated semantic-based approach in concept based video retrieval

    Multimedia content has been growing quickly, and video retrieval is regarded as one of the most prominent issues in multimedia research. In order to retrieve a desirable video, users express their needs in terms of queries, which can concern objects, motion, texture, color, audio, and so on. Low-level representations of video differ from the higher-level concepts that a user associates with video, so queries based on semantics are more realistic and tangible for the end user. Comprehending the semantics of a query has opened a new perspective on video retrieval and on bridging the semantic gap. The problem, however, is that video must be manually annotated in order to support queries expressed in terms of semantic concepts. Annotating the semantic concepts that appear in video shots is a challenging and time-consuming task, and it is not possible to provide annotations for every concept in the real world. In this study, an integrated semantic-based approach to similarity computation is proposed to enhance retrieval effectiveness in concept-based video retrieval. The proposed method integrates knowledge-based and corpus-based semantic word similarity measures in order to retrieve video shots for concepts whose annotations are not available to the system. The TRECVID 2005 dataset is used for evaluation, and the results of the proposed method are compared against the individual knowledge-based and corpus-based semantic word similarity measures used in previous studies in the same domain. The superiority of the integrated similarity method is shown and evaluated in terms of Mean Average Precision (MAP).
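
    As a toy illustration of integrating the two families of measures, the sketch below combines a knowledge-based and a corpus-based word similarity score by a weighted average and ranks annotated concepts for an unannotated query concept. The fusion rule, weight, and example scores are assumptions; the paper's actual measures and combination scheme may differ.

```python
# Minimal sketch: fusing knowledge-based and corpus-based word similarity
# scores to map an unannotated query concept onto annotated concepts.
def integrated_similarity(kb_sim, corpus_sim, weight=0.5):
    """Linearly combine two similarity scores assumed to lie in [0, 1]."""
    return weight * kb_sim + (1.0 - weight) * corpus_sim

def rank_annotated_concepts(query, annotated, kb, corpus, weight=0.5):
    """Rank concepts that do have annotations by their integrated similarity
    to a query concept that does not; kb and corpus map (query, concept)
    pairs to precomputed similarity scores."""
    scored = [(c, integrated_similarity(kb[(query, c)], corpus[(query, c)], weight))
              for c in annotated]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Illustrative, made-up scores: retrieve shots for "truck" via its closest
# annotated concepts.
kb = {("truck", "car"): 0.80, ("truck", "road"): 0.45, ("truck", "boat"): 0.30}
corpus = {("truck", "car"): 0.65, ("truck", "road"): 0.70, ("truck", "boat"): 0.20}
print(rank_annotated_concepts("truck", ["car", "road", "boat"], kb, corpus))
```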

    An examination of automatic video retrieval technology on access to the contents of an historical video archive

    Purpose – This paper aims to provide an initial understanding of the constraints that historical video collections pose to video retrieval technology and the potential that online access offers to both archive and users. Design/methodology/approach – A small and unique collection of videos on customs and folklore was used as a case study. Multiple methods were employed to investigate the effectiveness of the technology and the modality of user access. Automatic keyframe extraction was tested on the visual content, while the audio stream was used for automatic classification of speech and music clips. User access (search vs browse) was assessed in a controlled user evaluation. A focus group and a survey provided insight into the actual use of the analogue archive. The results of these multiple studies were then compared and integrated (triangulation). Findings – The amateur material challenged automatic techniques for video and audio indexing, suggesting that the technology must be tested against the material before deciding on a digitisation strategy. Two user interaction modalities, browsing vs searching, were tested in a user evaluation. Results show that users preferred searching, but browsing becomes essential when the search engine fails to match query and indexed words. Browsing was also valued for serendipitous discovery; however, the organisation of the archive was judged cryptic and therefore of limited use. This indicates that the categorisation of an online archive should be designed for users who might not understand the current classification. The focus group and the survey showed clearly the advantage of online access even when the quality of the video surrogate is poor. The evidence gathered suggests that the creation of a digital version of a video archive requires a rethinking of the collection in terms of the new medium: a new archive should be specially designed to exploit the potential that the digital medium offers. Similarly, users' needs have to be considered before designing the digital library interface, as these needs are likely to differ from those imagined. Originality/value – This paper is the first attempt to understand the advantages offered, and the limitations imposed, by video retrieval technology for small video archives like those often found in special collections.

    Domain-Agnostic Multi-Modal Video Retrieval

    The rapid proliferation of multimedia content has necessitated the development of efficient video retrieval systems. Multi-modal video retrieval is a non-trivial task involving the retrieval of relevant information across different modalities, such as text, audio, and video. Traditional approaches to multi-modal retrieval often rely on domain-specific techniques and models, limiting their generalizability across domains. This thesis aims to develop a domain-agnostic approach to multi-modal video retrieval, enabling effective retrieval irrespective of the specific domain or data modality. The research explores techniques such as transfer learning, where pre-trained models from different domains are fine-tuned using domain-agnostic strategies. Additionally, attention mechanisms and fusion techniques are investigated to leverage cross-modal interactions and capture relevant information from diverse modalities. An important aspect of the research is finding robust methods for audio-video integration, as both modalities individually provide retrieval cues for the text query. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the text and audio-video features. The proposed approach is quantitatively evaluated on video benchmark datasets such as MSR-VTT and YouCook2. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in certain scenarios, with a notable 6% improvement in the R@5 and R@10 metrics in the best-performing cases. Qualitative evaluations further illustrate the utility of audio, especially in instances where there is a direct word match between text and audio, exemplified by queries like "A man is calling his colleagues" aligning with video audio containing the word "colleague". In essence, the findings of this research pave the way for a versatile and integrated solution for multi-modal retrieval, with potential applications spanning a wide range of domains.
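
    A common way to increase the mutual information between text and fused audio-video features is a symmetric InfoNCE-style contrastive loss; the sketch below illustrates that general idea. The simple additive fusion, embedding dimensions, and temperature are illustrative assumptions and not the architecture developed in the thesis.

```python
# Minimal sketch (PyTorch): symmetric contrastive loss pulling text embeddings
# towards the fused audio-video embedding of the matching clip.
import torch
import torch.nn.functional as F

def fuse_audio_video(audio_emb, video_emb):
    # Deliberately simple fusion; attention-based fusion is one alternative.
    return F.normalize(audio_emb + video_emb, dim=-1)

def contrastive_loss(text_emb, av_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    logits = text_emb @ av_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))
    # Matching text/clip pairs sit on the diagonal; score retrieval both ways.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Random tensors stand in for encoded text, audio, and video.
text = torch.randn(16, 512)
audio = torch.randn(16, 512)
video = torch.randn(16, 512)
loss = contrastive_loss(text, fuse_audio_video(audio, video))
```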

    Crowd-based Semantic Event Detection and Video Annotation for Sports Videos

    Recent developments in sports analytics have heightened interest in collecting data on the behavior of individuals and of the entire team in sports events. Rather than using dedicated sensors to record the data, the detection of semantic events reflecting a team's behavior and the subsequent annotation of video data is nowadays mostly performed by paid experts. In this paper, we present an approach to generating such annotations by leveraging the wisdom of the crowd. We present the CrowdSport application, which collects data for soccer games: it presents crowd workers with short video snippets of soccer matches and allows them to annotate these snippets with event information. The various annotations collected from the crowd are then automatically disambiguated and integrated into a coherent data set. To improve the quality of the data entered, we have implemented a rating system that assigns each worker a trustworthiness score denoting the confidence placed in newly entered data. Using the DBSCAN clustering algorithm and the confidence score, the integration ensures that the generated event labels are of high quality, despite the heterogeneity of the participating workers. These annotations finally serve as the basis for a video retrieval system that allows users to search for video sequences on the basis of a graphical specification of team behavior or the motion of individual players. Our evaluations of the crowd-based semantic event detection and video annotation using the Microworkers platform have shown the effectiveness of the approach and have led to results that are in most cases close to the ground truth and can successfully be used for various retrieval tasks.
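
    To illustrate the general mechanism of trust-weighted DBSCAN integration described above, the sketch below clusters crowd-sourced timestamps for one event type and consolidates each cluster into a single event, using worker trustworthiness as sample weights. The eps, min_samples, and example values are assumptions rather than the paper's parameters.

```python
# Minimal sketch: consolidating crowd annotations of one event type ("goal")
# with DBSCAN, weighting each annotation by its worker's trustworthiness.
import numpy as np
from sklearn.cluster import DBSCAN

# (timestamp_in_seconds, worker_trust) pairs
annotations = np.array([
    [612.0, 0.9], [613.5, 0.8], [611.2, 0.7],   # agree on one goal
    [1790.0, 0.9], [1792.5, 0.6],               # agree on another
    [300.0, 0.2],                               # isolated low-trust annotation
])
times = annotations[:, :1]
trust = annotations[:, 1]

# Annotations within 5 seconds of each other are treated as the same event.
# With min_samples=1 and fractional trust weights, a point only forms a
# cluster if the trust mass in its neighbourhood reaches 1, so the isolated
# low-trust annotation stays labelled as noise (-1).
labels = DBSCAN(eps=5.0, min_samples=1).fit(times, sample_weight=trust).labels_

events = []
for label in set(labels) - {-1}:
    member = labels == label
    # Trust-weighted average timestamp as the consolidated event time.
    events.append(np.average(times[member, 0], weights=trust[member]))
print(sorted(events))
```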

    Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

    We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space that can embed either modality. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of the multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset. Comment: Accepted for presentation at ICCV. Project page: https://mwray.github.io/FGA
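
    As a rough illustration of per-PoS embedding spaces feeding an integrated space, the sketch below builds separate text/video projection heads for verbs and nouns and fuses their outputs for retrieval. The two-PoS split, dimensions, and concatenation-based fusion are assumptions, not the paper's exact architecture or losses.

```python
# Minimal sketch (PyTorch): one embedding space per PoS tag, fused into an
# integrated space where cross-modal action retrieval is performed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoSEmbedding(nn.Module):
    """One shared text/video space for a single PoS tag (e.g. verbs only)."""
    def __init__(self, text_dim=300, video_dim=1024, emb_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, emb_dim)
        self.video_proj = nn.Linear(video_dim, emb_dim)

    def forward(self, text_feat, video_feat):
        return (F.normalize(self.text_proj(text_feat), dim=-1),
                F.normalize(self.video_proj(video_feat), dim=-1))

class MultiPoSRetrieval(nn.Module):
    def __init__(self, pos_tags=("verb", "noun"), emb_dim=256):
        super().__init__()
        self.spaces = nn.ModuleDict({p: PoSEmbedding(emb_dim=emb_dim) for p in pos_tags})
        # Integrated space built on top of the concatenated PoS embeddings.
        self.fuse_text = nn.Linear(emb_dim * len(pos_tags), emb_dim)
        self.fuse_video = nn.Linear(emb_dim * len(pos_tags), emb_dim)

    def forward(self, pos_text_feats, video_feat):
        # pos_text_feats: dict mapping PoS tag -> pooled word features for that tag
        t_parts, v_parts = [], []
        for tag, space in self.spaces.items():
            t, v = space(pos_text_feats[tag], video_feat)
            t_parts.append(t)
            v_parts.append(v)
        text_joint = F.normalize(self.fuse_text(torch.cat(t_parts, dim=-1)), dim=-1)
        video_joint = F.normalize(self.fuse_video(torch.cat(v_parts, dim=-1)), dim=-1)
        return text_joint, video_joint

model = MultiPoSRetrieval()
text_feats = {"verb": torch.randn(4, 300), "noun": torch.randn(4, 300)}
t, v = model(text_feats, torch.randn(4, 1024))
scores = t @ v.t()   # cross-modal similarities used for action retrieval
```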