42 research outputs found

    Identification, synchronisation and composition of user-generated videos

    Cotutelle between Universitat Politècnica de Catalunya and Queen Mary University of London. The increasing availability of smartphones makes it easy for people to capture videos of their experience when attending events such as concerts, sports competitions and public rallies. Smartphones are equipped with inertial sensors, which can be beneficial for event understanding. The captured User-Generated Videos (UGVs) are made available on media-sharing websites. Searching and mining UGVs of the same event are challenging due to inconsistent tags or incorrect timestamps. A UGV recorded from a fixed location contains monotonic content and unintentional camera motions, which may make it less interesting to play back. In this thesis, we propose the following identification, synchronisation and video composition frameworks for UGVs. We propose a framework for the automatic identification and synchronisation of unedited multi-camera UGVs within a database. The proposed framework analyses the sound to match and cluster UGVs that capture the same spatio-temporal event, and estimates their relative time shifts to temporally align them. We design a novel descriptor derived from the pairwise matching of audio chroma features of UGVs. The descriptor facilitates the definition of a classification threshold for automatic query-by-example event identification. We contribute a database of 263 multi-camera UGVs of 48 real-world events. We evaluate the proposed framework on this database and compare it with state-of-the-art methods. Experimental results show the effectiveness of the proposed approach in the presence of audio degradations (channel noise, ambient noise, reverberations). Moreover, we present an automatic audio- and visual-based camera selection framework for composing an uninterrupted recording from synchronised multi-camera UGVs of the same event. We design an automatic audio-based cut-point selection method that provides a common reference for audio and video segmentation. To filter out low-quality video segments, spatial and spatio-temporal quality assessments are computed. The framework combines segments of UGVs using a rank-based camera selection strategy that considers visual quality scores and view diversity. The proposed framework is validated on a dataset of 13 events (93 UGVs) through subjective tests and compared with state-of-the-art methods. Suitable cut-point selection, specific visual quality assessments and rank-based camera selection contribute to the superiority of the proposed framework over existing methods. Finally, we contribute a method for camera motion detection using the gyroscope for UGVs captured with smartphones, and design a gyro-based quality score for video composition. The gyroscope measures the angular velocity of the smartphone, which can be used for camera motion analysis. We evaluate the proposed camera motion detection method on a dataset of 24 multi-modal UGVs captured by us, and compare it with existing visual and inertial sensor-based methods. By designing a gyro-based score to quantify the goodness of multi-camera UGVs, we develop a gyro-based video composition framework. The gyro-based score substitutes for the spatial and spatio-temporal scores and reduces the computational complexity.
We contribute a multi-modal dataset of 3 events (12 UGVs), which is used to validate the proposed gyro-based video composition framework.
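    The audio-based synchronisation step can be illustrated with a minimal sketch: extract chroma features from the audio track of each UGV and estimate the relative time shift by cross-correlation. The use of librosa, the collapsing of the chromagram to a single 1-D signal and the file names are simplifying assumptions for illustration, not the thesis's pairwise chroma-matching descriptor.

```python
# Minimal sketch: estimate the relative time shift between two UGVs from their
# audio chroma features (simplified; not the thesis's exact descriptor or matching).
import numpy as np
import librosa

def chroma_features(path, sr=22050, hop=512):
    """Load a UGV's (extracted) audio track and compute frame-level chroma features."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)   # shape (12, T)

def estimate_time_shift(chroma_a, chroma_b, sr=22050, hop=512):
    """Cross-correlate the (zero-mean) average chroma energy of two recordings and
    return the lag, in seconds, at which recording B best aligns with recording A."""
    a = chroma_a.mean(axis=0) - chroma_a.mean()
    b = chroma_b.mean(axis=0) - chroma_b.mean()
    corr = np.correlate(a, b, mode="full")
    lag_frames = corr.argmax() - (len(b) - 1)
    return lag_frames * hop / sr

# A positive shift means recording B started 'shift' seconds after recording A.
# shift = estimate_time_shift(chroma_features("ugv_a.wav"), chroma_features("ugv_b.wav"))
```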

    Multimodal Focused Interaction Dataset

    Recording of daily life experiences from a first-person perspective has become more prevalent with the increasing availability of wearable cameras and sensors. This dataset was captured during development of a system for automatic detection of social interactions in such data streams, and in particular focused interactions, in which co-present individuals, having mutual focus of attention, interact by establishing face-to-face engagement and direct conversation. Existing public datasets for social interaction captured from a first-person perspective tend to be limited in terms of duration, number of people appearing, continuity and variability of the recording. We contribute the Focused Interaction Dataset, which includes video acquired using a shoulder-mounted GoPro Hero 4 camera, as well as inertial sensor data, GPS data, and output from a voice activity detector. The dataset contains 377 minutes (including 566,000 video frames) of continuous multimodal recording captured during 19 sessions, with 17 conversational partners in 18 different indoor/outdoor locations. The sessions include periods in which the camera wearer is engaged in focused interactions, in unfocused interactions, and in no interaction. Annotations are provided for all focused and unfocused interactions for the complete duration of the dataset. Anonymised IDs for 13 people involved in the focused interactions are also provided. In addition to the development of social interaction analysis, the dataset may be useful for applications such as activity detection, understanding of personal locations of interest, and person association.

    Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video

    Focused interaction occurs when co-present individuals, having mutual focus of attention, interact by establishing face-to-face engagement and direct conversation. Face-to-face engagement is often not maintained throughout the entirety of a focused interaction. In this paper, we present an online method for the automatic classification of unconstrained egocentric (first-person perspective) videos into segments having no focused interaction, focused interaction when the camera wearer is stationary, and focused interaction when the camera wearer is moving. We extract features from both the audio and video data streams and perform temporal segmentation using support vector machines with linear and non-linear kernels. We provide empirical evidence that the fusion of visual face track scores, the camera motion profile and audio voice activity scores is an effective combination for focused interaction classification.
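    A minimal sketch of the classification step is shown below, assuming per-frame feature vectors (face track score, camera motion magnitude, voice activity score) have already been extracted; the random placeholder data, feature layout and smoothing window are assumptions, not the paper's pipeline.

```python
# Sketch: classify frames into {no interaction, focused & stationary, focused & moving}
# from fused audio-visual features, then smooth the per-frame labels temporally.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical per-frame features: [face_track_score, camera_motion, voice_activity]
X_train = np.random.rand(500, 3)           # placeholder training features
y_train = np.random.randint(0, 3, 500)     # 0: none, 1: focused-stationary, 2: focused-moving

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # non-linear kernel
clf.fit(X_train, y_train)

frame_labels = clf.predict(np.random.rand(100, 3))               # placeholder test frames

# Majority vote over a sliding window to turn noisy frame labels into segments
window = 25
smoothed = np.array([
    np.bincount(frame_labels[max(0, i - window): i + window + 1]).argmax()
    for i in range(len(frame_labels))
])
```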

    FetReg2021: A Challenge on Placental Vessel Segmentation and Registration in Fetoscopy

    Fetoscopy laser photocoagulation is a widely adopted procedure for treating Twin-to-Twin Transfusion Syndrome (TTTS). The procedure involves photocoagulation of pathological anastomoses to regulate blood exchange between the twins. The procedure is particularly challenging due to the limited field of view, poor manoeuvrability of the fetoscope, poor visibility, and variability in illumination. These challenges may lead to increased surgery time and incomplete ablation. Computer-assisted intervention (CAI) can provide surgeons with decision support and context awareness by identifying key structures in the scene and expanding the fetoscopic field of view through video mosaicking. Research in this domain has been hampered by the lack of high-quality data to design, develop and test CAI algorithms. Through the Fetoscopic Placental Vessel Segmentation and Registration (FetReg2021) challenge, which was organized as part of the MICCAI2021 Endoscopic Vision challenge, we released the first large-scale multicentre TTTS dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms. For this challenge, we released a dataset of 2060 images, pixel-annotated for the vessel, tool, fetus and background classes, from 18 in-vivo TTTS fetoscopy procedures and 18 short video clips. Seven teams participated in this challenge and their model performance was assessed on an unseen test dataset of 658 pixel-annotated images from 6 fetoscopic procedures and 6 short clips. The challenge provided an opportunity for creating generalized solutions for fetoscopic scene understanding and mosaicking. In this paper, we present the findings of the FetReg2021 challenge alongside a detailed literature review of CAI in TTTS fetoscopy. Through this challenge, its analysis and the release of multi-centre fetoscopic data, we provide a benchmark for future research in this field.
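    As an illustration of how segmentation results on such pixel-annotated data are commonly scored, the sketch below computes per-class intersection-over-union for the four classes mentioned (background, vessel, tool, fetus); the class indices and the choice of mean IoU are assumptions here, not the official FetReg2021 metric definition.

```python
# Sketch: per-class IoU for a 4-class fetoscopic segmentation mask
# (background, vessel, tool, fetus). Class indices are assumed.
import numpy as np

def per_class_iou(pred, gt, num_classes=4):
    """pred, gt: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return ious

# Mean IoU over the classes that actually occur in the ground truth:
# miou = np.nanmean(per_class_iou(pred_mask, gt_mask))
```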

    Multimodal Egocentric Analysis of Focused Interactions

    Continuous detection of social interactions from wearable sensor data streams has a range of potential applications in domains including health and social care, security, and assistive technology. We contribute an annotated, multimodal data set capturing such interactions using video, audio, GPS, and inertial sensing. We present methods for automatic detection and temporal segmentation of focused interactions using support vector machines and recurrent neural networks, with features extracted from both audio and video streams. Focused interaction occurs when co-present individuals, having mutual focus of attention, interact by first establishing face-to-face engagement and direct conversation. We describe an evaluation protocol, including framewise, extended framewise, and event-based measures, and provide empirical evidence that the fusion of visual face track scores with audio voice activity scores provides an effective combination. The methods, contributed data set, and protocol together provide a benchmark for future research on this problem.
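    The framewise measure from the evaluation protocol can be sketched as standard per-frame precision, recall and F1 over binary interaction labels; the extended framewise and event-based measures add tolerance windows and event-level matching that are not shown in this simplified example.

```python
# Sketch: framewise precision/recall/F1 for binary focused-interaction labels.
import numpy as np

def framewise_scores(pred, gt):
    """pred, gt: per-frame binary labels (1 = focused interaction present)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```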

    Detector-Free Dense Feature Matching for Fetoscopic Mosaicking

    Fetoscopic Laser Photocoagulation (FLP) is used to treat twin-to-twin transfusion syndrome; however, this procedure is hindered by the difficulty of visualizing the intraoperative surgical environment due to the limited surgical field-of-view, unusual placenta position, limited maneuverability of the fetoscope and poor visibility caused by fluid turbidity and occlusions. Fetoscopic video mosaicking can create an expanded field-of-view (FOV) image of the fetoscopic intraoperative environment, which may support the surgeons in localizing the vascular anastomoses during the FLP procedure. However, existing classical video mosaicking methods tend to perform poorly on in vivo fetoscopic videos. We propose the use of a transformer-based, detector-free local feature matching method as a dense feature matching technique for creating reliable mosaics with minimal drifting error. Using the publicly available fetoscopy placenta dataset, we experimentally show the robustness of the proposed method over the state-of-the-art vessel-based fetoscopic mosaicking method.
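    A minimal sketch of the mosaicking idea follows: dense correspondences from a detector-free matcher are used to fit pairwise homographies, which are then chained so that every frame is mapped into the first frame's coordinates. Kornia's LoFTR model and its 'outdoor' weights are used here purely as an illustrative stand-in for the paper's matcher, and the RANSAC threshold is an assumption.

```python
# Sketch: detector-free dense matching (kornia LoFTR as a stand-in) + homography chaining.
import cv2
import numpy as np
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="outdoor").eval()   # illustrative weights, not fetoscopy-specific

def to_tensor(gray_img):
    """HxW uint8 grayscale image -> 1x1xHxW float tensor in [0, 1]."""
    return torch.from_numpy(gray_img)[None, None].float() / 255.0

def pairwise_homography(img_a, img_b):
    """Estimate the homography that maps frame B into frame A's coordinates."""
    with torch.no_grad():
        out = matcher({"image0": to_tensor(img_a), "image1": to_tensor(img_b)})
    pts_a = out["keypoints0"].cpu().numpy()
    pts_b = out["keypoints1"].cpu().numpy()
    H, _ = cv2.findHomography(pts_b, pts_a, cv2.RANSAC, 3.0)
    return H

def chain_homographies(frames):
    """Accumulate pairwise transforms so that every frame maps into frame 0."""
    Hs = [np.eye(3)]
    for prev, curr in zip(frames[:-1], frames[1:]):
        Hs.append(Hs[-1] @ pairwise_homography(prev, curr))
    return Hs
```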

    PseudoSegRT: efficient pseudo-labelling for intraoperative OCT segmentation

    PURPOSE: Robotic ophthalmic microsurgery has significant potential to help improve the success of challenging procedures and overcome the physical limitations of the surgeon. Intraoperative optical coherence tomography (iOCT) has been reported for the visualisation of ophthalmic surgical manoeuvres, where deep learning methods can be used for real-time tissue segmentation and surgical tool tracking. However, many of these methods rely heavily on labelled datasets, and producing annotated segmentation datasets is a time-consuming and tedious task. METHODS: To address this challenge, we propose a robust and efficient semi-supervised method for boundary segmentation in retinal OCT to guide a robotic surgical system. The proposed method uses U-Net as the base model and implements a pseudo-labelling strategy which combines the labelled data with unlabelled OCT scans during training. After training, the model is optimised and accelerated with TensorRT. RESULTS: Compared with fully supervised learning, the pseudo-labelling method can improve the generalisability of the model and shows better performance on unseen data from a different distribution using only 2% of the labelled training samples. The accelerated GPU inference takes less than 1 millisecond per frame with FP16 precision. CONCLUSION: Our approach demonstrates the potential of pseudo-labelling strategies in real-time OCT segmentation tasks to guide robotic systems. Furthermore, the accelerated GPU inference of our network is highly promising for segmenting OCT images and guiding the position of a surgical tool (e.g. a needle) for sub-retinal injections.
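    A minimal sketch of the pseudo-labelling idea is given below: the partially trained model predicts on unlabelled OCT scans, confident predictions are kept as pseudo-labels, and training continues on the union of labelled and pseudo-labelled data. The confidence threshold, the scan-level filtering rule and the data shapes are placeholders, not the paper's settings.

```python
# Sketch of pseudo-labelling: recycle confident predictions on unlabelled OCT scans
# as training targets (threshold and filtering rule are assumptions).
import torch

def generate_pseudo_labels(model, unlabelled_loader, threshold=0.9, device="cuda"):
    """Return (image, pseudo_mask) pairs whose mean per-pixel confidence is high enough."""
    model.eval()
    pseudo_set = []
    with torch.no_grad():
        for images in unlabelled_loader:                 # images: (B, 1, H, W) OCT B-scans
            probs = torch.softmax(model(images.to(device)), dim=1)
            conf, labels = probs.max(dim=1)              # per-pixel confidence and class
            keep = conf.mean(dim=(1, 2)) > threshold     # keep only confident scans
            for img, mask, k in zip(images, labels.cpu(), keep.cpu()):
                if k:
                    pseudo_set.append((img, mask))
    return pseudo_set

# Training alternates between fitting the segmentation network on the labelled set
# and extending that set with generate_pseudo_labels() output from unlabelled scans.
```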

    Placental vessel-guided hybrid framework for fetoscopic mosaicking

    Fetoscopic laser photocoagulation is used to treat twin-to-twin transfusion syndrome; however, this procedure is hindered by the difficulty of visualising the intraoperative surgical environment due to the limited surgical field-of-view, unusual placenta position, limited manoeuvrability of the fetoscope and poor visibility caused by fluid turbidity and occlusions. Fetoscopic video mosaicking can create an expanded field-of-view image of the fetoscopic intraoperative environment, which could support the surgeons in localising the vascular anastomoses during the fetoscopic procedure. However, classical handcrafted feature matching methods fail on in vivo fetoscopic videos, and an existing state-of-the-art fetoscopic mosaicking method relies on vessel presence and fails when vessels are not present in the view. We propose a vessel-guided hybrid fetoscopic mosaicking framework that mutually benefits from a placental vessel-based registration and a deep learning-based dense matching method to optimise the overall performance. A selection mechanism based on vessel appearance consistency and photometric error minimisation is implemented for choosing the best pairwise transformation. Using the extended fetoscopy placenta dataset, we experimentally show the robustness of the proposed framework over the state-of-the-art methods, even in vessel-free, low-textured, or low-illumination non-planar fetoscopic views.
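    The selection mechanism can be sketched as follows: for each frame pair, whichever candidate transformation (vessel-based or dense-matching-based) yields the lower photometric error after warping is retained. The error measure below and the handling of missing candidates are simplified assumptions, not the paper's exact consistency checks.

```python
# Sketch: pick between a vessel-based and a dense-matching homography by photometric error.
import cv2
import numpy as np

def photometric_error(img_a, img_b, H):
    """Mean absolute intensity difference between frame A and frame B warped by H."""
    h, w = img_a.shape[:2]
    warped = cv2.warpPerspective(img_b, H, (w, h))
    valid = warped > 0                              # crude mask for empty warp regions
    if not valid.any():
        return np.inf
    return np.abs(img_a.astype(np.float32) - warped.astype(np.float32))[valid].mean()

def select_transform(img_a, img_b, H_vessel, H_dense):
    """Return the pairwise transform with the lower photometric error."""
    candidates = [H for H in (H_vessel, H_dense) if H is not None]
    return min(candidates, key=lambda H: photometric_error(img_a, img_b, H))
```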

    Intra-Domain Adaptation for Robust Visual Guidance in Intratympanic Injections

    Intratympanic steroid injections are commonly used for the treatment of ear diseases. During this treatment, an expert Ear, Nose & Throat (ENT) clinician delivers the drug while viewing through a large microscope that provides a close-up view of the anatomical landmarks of the middle ear. A steady hand and a swift response to any patient movement are required to avoid improper placement of the needle. To assist the clinician during this treatment, a fluidic soft robot that can steer inside a lumen to provide steady guidance for drug delivery has been proposed (Lindenroth et al., 2021). For robust visual guidance, stable segmentation of anatomical landmarks (the tympanic membrane, malleus and umbo) is required. In this work, we perform intra-domain adaptation to learn a generalized model that provides stable and consistent segmentation on unseen patients and phantom ear data.

    Intrinsic Force Sensing for Motion Estimation in a Parallel, Fluidic Soft Robot for Endoluminal Interventions

    Determining the externally-induced motion of a soft robot in minimally-invasive procedures is highly challenging and commonly demands specific tools and dedicated sensors. Intrinsic force sensing, paired with a model describing the robot's compliance, offers an alternative pathway which relies heavily on knowledge of the characteristic mechanical behaviour of the investigated system. In this work, we apply quasi-static intrinsic force sensing to a miniature, parallel soft robot designed for endoluminal ear interventions. We characterize the soft robot's nonlinear mechanical behaviour and devise methods for inferring the forces applied to the actuators of the robot from fluid pressure and volume information of the working fluid. We demonstrate that it is possible to detect the presence of an external contact acting on the soft robot's actuators, infer the applied reaction force with an accuracy of 28.1 mN, and extrapolate from individual actuator force sensing to determining forces acting on the combined parallel soft robot when it is deployed in a lumen, which can be achieved with an accuracy of 75.45 mN for external forces and 0.47 Nmm for external torques. The intrinsically-sensed external forces can be employed to estimate the induced motion of the soft robot in response to these forces, with an accuracy of 0.11 mm in translation and 2.47 in rotational deflection. The derived methodologies could enable designs for more perceptive endoscopic systems and pave the way for developing sensing and control strategies in endoluminal and transluminal soft robots.
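    A deliberately simplified quasi-static sketch of the intrinsic force sensing idea is given below: the deviation of the measured pressure from a calibrated free-space pressure-volume curve, scaled by an assumed effective actuator area, is read as the external force on that actuator. The calibration curve, effective area and contact threshold are all placeholders; the paper characterises the robot's nonlinear behaviour rather than using this linearised shortcut.

```python
# Simplified quasi-static sketch of intrinsic force sensing from pressure and volume.
import numpy as np

# Hypothetical calibration: pressure recorded while driving the actuator through its
# volume range with no external contact (placeholder values and curve shape).
cal_volume = np.linspace(0.0, 1.0, 50)       # working-fluid volume, ml
cal_pressure = 20.0 * cal_volume**1.5        # free-space pressure, kPa

EFFECTIVE_AREA_MM2 = 12.0                    # assumed effective actuator area, mm^2

def external_force_mN(pressure_kpa, volume_ml):
    """Estimate the external force on one actuator from fluid pressure and volume."""
    p_free = np.interp(volume_ml, cal_volume, cal_pressure)  # expected contact-free pressure
    delta_p = pressure_kpa - p_free                           # pressure excess due to contact
    return delta_p * EFFECTIVE_AREA_MM2                       # kPa * mm^2 = mN

# A contact on the actuator is flagged when |external_force_mN(p, v)| exceeds a threshold.
```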