    Gesture retrieval and its application to the study of multimodal communication

    Comprehending communication depends on analyzing the different modalities of conversation, including audio, visual, and others. This is a natural process for humans, but in digital libraries, where preservation and dissemination of digital information are crucial, it is a complex task. A rich conversational model, encompassing all modalities and their co-occurrences, is required to effectively analyze and interact with digital information. Currently, the analysis of co-speech gestures in videos is done through manual annotation by linguistic experts based on textual searches. However, this approach is limited and does not fully utilize the visual modality of gestures. This paper proposes a visual gesture retrieval method using a deep learning architecture to extend current research in this area. The method is based on body keypoints and uses an attention mechanism to focus on specific groups of keypoints. Experiments were conducted on a subset of the NewsScape dataset, which presents challenges such as multiple people, camera perspective changes, and occlusions. A user study was conducted to assess the usability of the results, establishing a baseline for future gesture retrieval methods in real-world video collections. The results demonstrate the high potential of the proposed method in multimodal communication research and highlight the significance of visual gesture retrieval in enhancing interaction with video content. The integration of visual similarity search for gestures into the open-source multimedia retrieval stack vitrivr can greatly contribute to the field of computational linguistics. This research advances the understanding of the role of the visual modality in co-speech gestures and highlights the need for further development in this area.
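
    The abstract mentions attention over groups of body keypoints but not the exact architecture; the following is a minimal PyTorch sketch of that general idea, with hypothetical keypoint groups, indices, and dimensions that are not taken from the paper.

```python
# Minimal sketch of group-wise attention over body keypoints (PyTorch).
# The group definitions, feature dimensions, and pooling are illustrative
# assumptions, not the architecture used in the paper.
import torch
import torch.nn as nn

KEYPOINT_GROUPS = {            # hypothetical grouping of 2D pose keypoints
    "left_hand":  [9, 10],     # indices are placeholders
    "right_hand": [11, 12],
    "torso":      [0, 1, 2, 3],
}

class GroupAttention(nn.Module):
    def __init__(self, in_dim=2, hidden_dim=64):
        super().__init__()
        self.encode = nn.Linear(in_dim, hidden_dim)   # per-keypoint embedding
        self.score = nn.Linear(hidden_dim, 1)         # one attention logit per group

    def forward(self, keypoints):                     # keypoints: (batch, num_kp, 2)
        feats = torch.relu(self.encode(keypoints))    # (batch, num_kp, hidden)
        group_feats, logits = [], []
        for idx in KEYPOINT_GROUPS.values():
            g = feats[:, idx, :].mean(dim=1)          # mean-pool keypoints in the group
            group_feats.append(g)
            logits.append(self.score(g))
        group_feats = torch.stack(group_feats, dim=1)              # (batch, groups, hidden)
        weights = torch.softmax(torch.cat(logits, dim=1), dim=1)   # (batch, groups)
        return (weights.unsqueeze(-1) * group_feats).sum(dim=1)    # weighted gesture descriptor

# Usage: descriptor = GroupAttention()(torch.randn(8, 13, 2))
```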

    Gesture Similarity Learning and Retrieval in Large-Scale Real-world Video Collections

    Analyzing and understanding gestures plays a key role in our comprehension of communication. Investigating the co-occurrence of gestures and speech is currently a labor-intensive task in linguistics. Although advances in natural language processing have led to various contributions in this field, computer vision tools and methods are not prominently used to aid researchers in analyzing hand and body gestures. In this thesis, we present different contributions tailored to tackle the challenges of real-world gesture retrieval, an under-explored field in computer vision. The methods aim to systematically answer the questions of 'when' a gesture was performed and 'who' performed it in a video. Along the way, we develop different components to address various challenges in these videos, such as the presence of multiple persons in the scene, heavily occluded hand gestures, and abrupt gesture cuts due to changes of camera angle. In contrast to the majority of existing methods developed for gesture recognition, our proposed methods do not rely on the depth modality or sensor signals, which are available in some datasets to aid the identification of gestures. Our vision-based methods are built upon best practices in learning representations of complicated actions using deep neural networks. We have conducted a comprehensive analysis to choose the architectures and configurations for extracting discriminative spatio-temporal features. These features enable the retrieval pipeline to find 'similar' hand gestures. We have additionally explored the notion of similarity in the context of hand gestures through field studies and experiments. Finally, we conduct exhaustive experiments on different benchmarks and, to the best of the author's knowledge, run the largest gesture retrieval evaluations on real-world news footage, the NewsScape dataset, a collection of more than 400,000 videos with numerous scenes that are challenging for a retrieval method. The results, assessed by experts from the linguistics domain, suggest the high potential of our proposed method in interdisciplinary research and studies.
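
    As an illustration of the retrieval step described above (finding 'similar' gestures from extracted spatio-temporal features), a plain cosine-similarity lookup over precomputed clip descriptors might look as follows; the descriptor dimensionality and toy data are assumptions, not the thesis pipeline.

```python
# Illustrative cosine-similarity retrieval over precomputed gesture descriptors.
# Descriptor dimensionality and the toy index are assumptions for the sketch.
import numpy as np

def build_index(descriptors):
    """L2-normalise a (num_clips, dim) matrix so dot products equal cosine similarity."""
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    return descriptors / np.clip(norms, 1e-12, None)

def retrieve(index, query, k=5):
    """Return indices and scores of the k clips most similar to the query descriptor."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(1000, 512)).astype(np.float32)   # stand-in for extracted features
    index = build_index(clips)
    print(retrieve(index, clips[42], k=3))                    # clip 42 should rank first
```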

    Multi-modal Video Retrieval in Virtual Reality with vitrivr-VR

    In multimedia search, appropriate user interfaces (UIs) are essential to enable effective specification of the user's information needs and the user-friendly presentation of search results. vitrivr-VR addresses these challenges and provides a novel Virtual Reality-based UI on top of the multimedia retrieval system vitrivr. In this paper, we present the version of vitrivr-VR participating in the Video Browser Showdown (VBS) 2022. We describe our visual-text co-embedding feature and new query interfaces, namely text entry, pose queries, and temporal queries.
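
    The visual-text co-embedding idea can be illustrated by scoring video segments against a text query in a shared embedding space; the sketch below abstracts the embedding model away, and the function names and dimensions are assumptions rather than vitrivr-VR internals.

```python
# Sketch of scoring video segments against a text query in a shared embedding
# space. The embedding model is abstracted away; names and dimensions are
# illustrative assumptions, not vitrivr-VR's implementation.
import numpy as np

def score_segments(text_embedding, frame_embeddings, segment_bounds):
    """frame_embeddings: (num_frames, dim); segment_bounds: list of (start, end) frame ranges.
    Returns one score per segment: the best frame-level cosine similarity inside it."""
    t = text_embedding / np.linalg.norm(text_embedding)
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = f @ t
    return [float(sims[s:e].max()) for s, e in segment_bounds]

rng = np.random.default_rng(1)
scores = score_segments(rng.normal(size=64),            # toy query embedding
                        rng.normal(size=(300, 64)),     # toy frame embeddings
                        [(0, 100), (100, 200), (200, 300)])
print(scores)
```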

    Interactive Multimodal Lifelog Retrieval with vitrivr at LSC 2021

    The Lifelog Search Challenge (LSC) is an annual benchmarking competition for interactive multimedia retrieval systems, where participating systems compete in finding events based on textual descriptions containing hints about structured, semi-structured, and/or unstructured data. In this paper, we present the multimedia retrieval system vitrivr, a long-time LSC participant, with a focus on new functionality. Specifically, we introduce an image stabilisation module, added prior to feature extraction to reduce the image degradation caused by lifelogger movements, and discuss how geodata is used during query formulation, query execution, and result presentation.
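
    The abstract does not detail the stabilisation module; purely as an illustration of the concept, a generic, textbook frame-to-frame stabilisation step (estimate the dominant motion between consecutive frames with OpenCV and compensate for it) is sketched below, and it is not necessarily what vitrivr's module does.

```python
# Generic sketch of a stabilisation pre-processing step: estimate the dominant
# motion between consecutive frames and warp the current frame to compensate.
# This is a textbook approach, not necessarily the module used in vitrivr.
import cv2

def stabilise_pair(prev_frame, curr_frame):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=30)
    if pts is None:
        return curr_frame                                    # nothing to track, keep frame as-is
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_old = pts[status.flatten() == 1]
    good_new = new_pts[status.flatten() == 1]
    if len(good_old) < 4:
        return curr_frame
    m, _ = cv2.estimateAffinePartial2D(good_new, good_old)   # map current frame back onto previous
    if m is None:
        return curr_frame
    h, w = curr_frame.shape[:2]
    return cv2.warpAffine(curr_frame, m, (w, h))
```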

    Multi-modal Interactive Video Retrieval with Temporal Queries

    This paper presents the version of vitrivr participating in the Video Browser Showdown (VBS) 2022. vitrivr already supports a wide range of query modalities, such as color and semantic sketches, OCR, ASR, and text embedding. In this paper, we briefly introduce the system, then describe our new approach to queries specifying temporal context, ideas for color-based sketches in a competitive retrieval setting, and a novel approach to pose-based queries.
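
    As a rough illustration of queries with temporal context, the sketch below joins the result lists of two ordered sub-queries into pairs that occur in the same video within a maximum gap; the pairing logic and the additive scoring are assumptions, not vitrivr's actual fusion strategy.

```python
# Sketch of scoring a temporal query made of two ordered sub-queries:
# a candidate is a pair of segments from the same video where the second
# match starts after the first, within a maximum gap. Summing the scores
# is an illustrative assumption, not vitrivr's fusion strategy.
from dataclasses import dataclass

@dataclass
class Hit:
    video_id: str
    start: float   # seconds
    end: float
    score: float

def temporal_join(hits_a, hits_b, max_gap=30.0):
    """Combine results of two sub-queries into ordered pairs within max_gap seconds."""
    results = []
    for a in hits_a:
        for b in hits_b:
            if a.video_id == b.video_id and 0.0 <= b.start - a.end <= max_gap:
                results.append((a, b, a.score + b.score))
    return sorted(results, key=lambda r: -r[2])

pairs = temporal_join(
    [Hit("v1", 10, 15, 0.8), Hit("v2", 5, 9, 0.6)],
    [Hit("v1", 20, 25, 0.7), Hit("v2", 200, 210, 0.9)],
)
print([(a.video_id, s) for a, _, s in pairs])   # only the v1 pair satisfies the gap
```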

    On the User-centric Comparative Remote Evaluation of Interactive Video Search Systems

    In the research of video retrieval systems, comparative assessments during dedicated retrieval competitions provide priceless insights into the performance of individual systems. The scope and depth of such evaluations are unfortunately hard to improve, due to the limitations imposed by the set-up costs, logistics, and organizational complexity of large events. We show that this easily impairs the statistical significance of the collected results and the reproducibility of the competition outcomes. In this article, we present a methodology for remote comparative evaluations of content-based video retrieval systems, demonstrate that such evaluations scale up to sizes that reliably produce statistically robust results, and propose additional measures that increase the replicability of the experiment. The proposed remote evaluation methodology forms a major contribution toward open science in interactive retrieval benchmarks. At the same time, the detailed evaluation reports form an interesting source of new observations about many subtle, previously inaccessible aspects of video retrieval.
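
    To make the statistical-robustness point concrete, a percentile-bootstrap comparison of two systems' per-task scores could look like the sketch below; the toy data and the choice of statistic are illustrative assumptions, not the article's actual analysis.

```python
# Illustrative bootstrap test for comparing two interactive retrieval systems
# on per-task scores. The data and the choice of statistic are assumptions;
# the article's actual analysis may differ.
import numpy as np

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b), paired by task."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample tasks with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_diff_ci([72, 85, 60, 90, 78, 66], [70, 80, 55, 88, 77, 64])
print(f"95% CI for the score difference: [{lo:.1f}, {hi:.1f}]")  # excludes 0 -> likely a real gap
```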

    Intraretinal hyper-reflective foci are almost universally present and co-localize with intraretinal fluid in diabetic macular edema

    Purpose: In diabetic macular edema (DME), hyper-reflective foci (HRF) have been linked to disease severity and progression. Using an automated approach, we aimed to investigate the baseline distribution of HRF in DME and their co-localization with cystoid intraretinal fluid (IRF).
    Methods: Baseline spectral-domain optical coherence tomography (SD-OCT) volume scans (N = 1527) from phase III clinical trials YOSEMITE (NCT03622580) and RHINE (NCT03622593) were segmented using a deep-learning–based algorithm (developed using B-scans from BOULEVARD, NCT02699450) to detect HRF. The HRF count and volume were assessed. HRF distributions were analyzed in relation to best-corrected visual acuity (BCVA), central subfield thickness (CST), and IRF volume in quartiles, and Diabetic Retinopathy Severity Scores (DRSS) in groups. Co-localization of HRF with IRF was calculated in the central 3-mm diameter using the en face projection.
    Results: HRF were present in most patients (up to 99.7%). Median (interquartile range [IQR]) HRF volume within the 3-mm diameter Early Treatment Diabetic Retinopathy Study ring was 1964.3 (3325.2) pL, and the median count was 64.0 (IQR = 96.0). Median HRF volumes were greater with decreasing BCVA (nominal P = 0.0109), and increasing CST (nominal P < 0.0001), IRF (nominal P < 0.0001), and DRSS up to very severe nonproliferative diabetic retinopathy (nominal P < 0.0001). HRF co-localized with IRF in the en face projection.
    Conclusions: Using automated HRF segmentation of full SD-OCT volumes, we observed that HRF are a ubiquitous feature in DME and exhibit relationships with BCVA, CST, IRF, and DRSS, supporting a potential link to disease severity. The spatial distribution of HRF closely followed that of IRF.
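
    The en face co-localization of HRF with IRF can be illustrated by projecting 3D segmentation masks along the depth axis and measuring the overlap of their footprints inside a central circle; the sketch below uses toy masks and an assumed overlap metric, not the study's exact pipeline.

```python
# Sketch of the en face co-localization idea: project 3D HRF and IRF masks
# along the depth axis and measure how much of the HRF footprint overlaps IRF
# inside a central disc. Mask shapes, spacing, and the overlap metric are
# illustrative assumptions, not the study's exact pipeline.
import numpy as np

def en_face(mask_3d):
    """Collapse a (depth, rows, cols) binary mask to a 2D footprint."""
    return mask_3d.any(axis=0)

def central_disc(shape, radius_px):
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    return (rows - cy) ** 2 + (cols - cx) ** 2 <= radius_px ** 2

def colocalization_fraction(hrf_mask, irf_mask, radius_px):
    """Fraction of the central en face HRF footprint that also lies on IRF."""
    hrf, irf = en_face(hrf_mask), en_face(irf_mask)
    disc = central_disc(hrf.shape, radius_px)
    hrf_in_disc = hrf & disc
    if not hrf_in_disc.any():
        return 0.0
    return float((hrf_in_disc & irf).sum() / hrf_in_disc.sum())

rng = np.random.default_rng(2)
hrf = rng.random((496, 256, 256)) > 0.999     # sparse toy "foci"
irf = rng.random((496, 256, 256)) > 0.99      # toy "fluid"
print(colocalization_fraction(hrf, irf, radius_px=100))
```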