7 research outputs found

    Audio Fingerprinting for Multi-Device Self-Localization

    This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/K007491/1

    Monocular visual traffic surveillance: a review

    To facilitate the monitoring and management of modern transportation systems, monocular visual traffic surveillance systems have been widely adopted for speed measurement, accident detection, and accident prediction. Thanks to recent innovations in computer vision and deep learning research, the performance of visual traffic surveillance systems has improved significantly. Despite this success, however, there is a lack of survey papers that systematically review these new methods. We therefore conduct a systematic review of relevant studies to fill this gap and provide guidance for future work. The paper is structured along the visual information processing pipeline, which includes object detection, object tracking, and camera calibration. We also cover important applications of visual traffic surveillance systems, such as speed measurement, behavior learning, and accident detection and prediction. Finally, future research directions for visual traffic surveillance systems are outlined.
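    As a toy illustration of how the camera-calibration step enables the speed-measurement application mentioned above: once a ground-plane homography is known, vehicle speed can be read off from tracked image positions. The homography values below are purely illustrative (a hypothetical calibration), not taken from any surveyed system.

```python
import numpy as np

# Hypothetical 3x3 homography mapping image pixels (u, v) to road-plane
# coordinates in metres; the entries here are illustrative only.
H = np.array([[0.02, 0.0,   -5.0],
              [0.0,  0.05, -10.0],
              [0.0,  0.001,  1.0]])

def to_road_plane(pt, H):
    """Project an image point (u, v) onto the road plane via the homography."""
    u, v = pt
    x, y, w = H @ np.array([u, v, 1.0])
    return np.array([x / w, y / w])

def speed_kmh(track, fps, H):
    """Estimate speed from a per-frame list of image positions of one vehicle."""
    pts = np.array([to_road_plane(p, H) for p in track])
    dists = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # metres per frame
    return dists.mean() * fps * 3.6                       # m/s -> km/h
```

    With the illustrative calibration above, a vehicle moving 10 pixels per frame at 30 fps maps to a road-plane speed of a few metres per second; the accuracy of any real system depends entirely on the quality of the calibration.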

    Precise and Robust Visual SLAM with Inertial Sensors and Deep Learning

    Endowing robots with the sense of perception stands out as the most important component in achieving fully autonomous machines: once machines can perceive the world, they can interact with it. In this respect, Simultaneous Localization and Mapping (SLAM) comprises the set of techniques that allow a robot to estimate its position and reconstruct a map of its environment at the same time, using only its on-board sensors. SLAM is the key building block for machine perception and is already present in technologies and applications such as autonomous driving, virtual and augmented reality, and service robots. Increasing the robustness of SLAM would broaden its use and applications, making machines safer and requiring less human intervention.

    In this thesis we combine inertial (IMU) and visual sensors to increase the robustness of SLAM against fast motion, brief occlusions, and low-texture environments. We first propose two fast techniques for inertial-sensor initialization with low scale error, which allow the IMU to be used as early as 2 seconds after system start-up. One of these initializations has been integrated into a new visual-inertial SLAM system, named ORB-SLAM3, which is the main contribution of this thesis. It is the most complete open-source visual-inertial SLAM system to date, supporting monocular and stereo cameras, pinhole and fisheye models, and multi-map operation. ORB-SLAM3 is built on a Maximum-a-Posteriori formulation, both for initialization and for refinement and visual-inertial bundle adjustment, and it exploits short-, mid-, and long-term data association. All of this makes ORB-SLAM3 the most accurate visual-inertial SLAM system, as our results on public benchmarks demonstrate.

    We have also explored deep learning techniques to improve the robustness of SLAM. We first propose DynaSLAM II, a stereo SLAM system for dynamic environments. Dynamic objects are segmented by a neural network, and their points and measurements are efficiently included in the bundle-adjustment optimization. This makes it possible to estimate and track moving objects while simultaneously improving the estimate of the camera trajectory. Second, we develop a direct monocular SLAM system based on depth predictions from neural networks. We jointly optimize the depth-prediction residuals and the multi-view photometric residuals, which yields a monocular system capable of estimating metric scale. It does not suffer from scale drift, and it is more robust and several times more accurate than classical monocular systems.
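    The scale-recovery idea behind the depth-prediction residuals can be illustrated with a deliberately simplified toy: given up-to-scale SLAM depths and metric depth predictions from a network, a one-parameter least-squares fit recovers the missing scale. The data below is synthetic and the closed-form fit stands in for the full joint optimization described in the thesis.

```python
import numpy as np

# Synthetic data: SLAM depths are correct only up to an unknown scale;
# the network predicts metric depths with small noise.
rng = np.random.default_rng(0)
true_scale = 2.5
slam_depth = rng.uniform(1.0, 10.0, 100)                  # arbitrary-scale depths
net_depth = true_scale * slam_depth + rng.normal(0, 0.05, 100)

# Least-squares scale: argmin_s sum_i (s * d_slam_i - d_net_i)^2
# has the closed form s = <d_slam, d_net> / <d_slam, d_slam>.
s = (slam_depth @ net_depth) / (slam_depth @ slam_depth)
```

    In the actual system this alignment is not a separate step; the depth residuals enter the same optimization as the photometric terms, which is what removes scale drift.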

    Visual Geo-Localization and Location-Aware Image Understanding

    Geo-localization is the problem of discovering the location where an image or video was captured. Recently, large-scale geo-localization methods, which are devised for ground-level imagery and employ techniques similar to image matching, have attracted much interest. In these methods, given a reference dataset composed of geo-tagged images, the problem is to estimate the geo-location of a query by finding its matching reference images. In this dissertation, we address three questions central to geo-spatial analysis of ground-level imagery: 1) How can we geo-localize images and videos captured at unknown locations? 2) How can we refine the geo-location of already geo-tagged data? 3) How can we utilize the extracted geo-tags? We present a new framework for geo-locating an image using a novel multiple-nearest-neighbor feature matching method based on Generalized Minimum Clique Graphs (GMCP). First, we extract local features (e.g., SIFT) from the query image and retrieve a number of nearest neighbors for each query feature from the reference dataset. Next, we apply our GMCP-based feature matching to select a single nearest neighbor for each query feature such that all matches are globally consistent. Our approach to feature matching is based on the proposition that the first nearest neighbors are not necessarily the best choices for finding correspondences in image matching. The proposed method therefore considers multiple reference nearest neighbors as potential matches and selects the correct ones by enforcing consistency among their global features (e.g., GIST) using GMCP. Our evaluations on a new dataset of 102k Street View images show that the proposed method outperforms the state of the art by 10 percent. Geo-localization of images can be extended to geo-localization of video: we have developed a novel method for estimating the geo-spatial trajectory of a moving camera with unknown intrinsic parameters at city scale.
    The proposed method is based on a three-step process: 1) individual geo-localization of video frames using Street View images to obtain the likelihood of the location (latitude and longitude) given the current observation, 2) Bayesian tracking to estimate the frame location and the video's temporal evolution using previous state probabilities and the current likelihood, and 3) a novel Minimum Spanning Tree-based trajectory reconstruction to eliminate trajectory loops and noisy estimations. Thus far, we have assumed that reliable geo-tags for reference imagery are available through crowdsourcing. However, crowdsourced images are well known to suffer from the acute shortcoming of inaccurate geo-tags. We have developed the first method for refinement of GPS tags, which automatically discovers the subset of corrupted geo-tags and refines them. We employ Random Walks to discover the uncontaminated subset of location estimations, and we robustify the Random Walks with a novel adaptive damping factor that conforms to the level of noise in the input. In location-aware image understanding, we are interested in improving image analysis by putting it in the right geo-spatial context. This approach is of particular importance as the majority of cameras and mobile devices are now equipped with GPS chips; developing techniques that can leverage the geo-tags of images to improve the performance of traditional computer vision tasks is therefore of particular interest. We have developed a location-aware multimodal approach that incorporates business directories, textual information, and web images to identify businesses in a geo-tagged query image.
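    A minimal sketch of the Random Walks idea used for geo-tag refinement: on a similarity graph over location estimates, a damped random walk (here with a fixed damping factor rather than the adaptive one proposed in the dissertation) concentrates probability mass on the mutually consistent subset, leaving corrupted geo-tags with low scores. The graph below is synthetic.

```python
import numpy as np

def random_walk_scores(W, damping=0.85, iters=100):
    """Damped random walk (PageRank-style power iteration) over a
    similarity matrix W; well-connected, mutually consistent nodes
    accumulate high stationary probability."""
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    n = len(W)
    r = np.full(n, 1.0 / n)                   # start from the uniform distribution
    for _ in range(iters):
        r = (1 - damping) / n + damping * (r @ P)
    return r

# Four mutually consistent location estimates (nodes 0-3) and one outlier
# (node 4) that is only weakly similar to the rest.
W = np.ones((5, 5))
W[4, :4] = W[:4, 4] = 0.1
scores = random_walk_scores(W)                # node 4 receives the lowest score
```

    The adaptive damping factor described above would replace the fixed `damping=0.85` with a value tuned to the estimated noise level of the input geo-tags.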

    Scalable methods for single and multi camera trajectory forecasting

    Predicting the future trajectory of objects in video is a critical task within computer vision with numerous application domains. For example, reliable anticipation of pedestrian trajectory is imperative for the operation of intelligent vehicles and can significantly enhance the functionality of advanced driver assistance systems. Trajectory forecasting can also enable more accurate tracking of objects in video, particularly if the objects are not always visible, such as during occlusion or entering a blind spot in a non-overlapping multicamera network. However, due to the considerable human labour required to manually annotate data amenable to trajectory forecasting, the scale and variety of existing datasets used to study the problem is limited. In this thesis, we propose a set of strategies for pedestrian trajectory forecasting. We address the lack of training data by introducing a scalable machine annotation scheme that enables models to be trained using a large Single-Camera Trajectory Forecasting (SCTF) dataset without human annotation. Using newly collected datasets annotated using our proposed methods, we develop two models for SCTF. The first model, Dynamic Trajectory Predictor (DTP), forecasts pedestrian trajectory from on board a moving vehicle up to one second into the future. DTP is trained using both human and machine-annotated data and anticipates dynamic motion that linear models do not capture. Our second model, Spatio-Temporal Encoder-Decoder (STED), predicts full object bounding boxes in addition to trajectory. STED combines visual and temporal features to model both object-motion and ego-motion. In addition to our SCTF contributions, we also introduce a new task: Multi-Camera Trajectory Forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. 
    Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target multi-camera tracking. To facilitate research in this new area, we collect a unique dataset of multi-camera pedestrian trajectories from a network of 15 synchronized cameras. We also develop a semi-automated annotation method to accurately label this large dataset, which contains 600 hours of video footage. We introduce an MCTF framework that simultaneously uses all estimated relative object locations from several camera viewpoints and predicts the object's future location in all possible camera viewpoints. Our framework follows a Which-When-Where approach that predicts in which camera(s) the object will appear, and when and where within those camera views it will appear. Experimental results demonstrate the effectiveness of our MCTF model, which outperforms existing SCTF approaches adapted to the MCTF framework.
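    The linear baseline that DTP is said to improve on can be sketched in a few lines: a constant-velocity model simply extrapolates the mean past displacement. This is a generic baseline for context, not code from the thesis.

```python
import numpy as np

def constant_velocity_forecast(track, horizon):
    """Linear baseline: extrapolate the mean frame-to-frame displacement.

    track   : (N, 2) array-like of past (x, y) positions, oldest first.
    horizon : number of future frames to predict.
    Returns an (horizon, 2) array of future positions.
    """
    track = np.asarray(track, float)
    v = np.diff(track, axis=0).mean(axis=0)        # mean velocity per frame
    steps = np.arange(1, horizon + 1)[:, None]
    return track[-1] + steps * v
```

    Such a model captures steady walking well but, as noted above, fails on dynamic motion (stopping, turning, accelerating), which is what learned models like DTP and STED are designed to anticipate.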

    Mutual Segmentation of Objects of Interest in Multispectral Stereo Image Sequences

    The automated video surveillance systems currently deployed around the world are still quite far, in terms of capabilities, from the ones that have inspired countless science fiction works over the past few years. One of the reasons behind this lag in development is the lack of low-level tools that allow raw image data to be processed directly in the field. This preprocessing is used to reduce the amount of information transferred to centralized servers that have to interpret the captured visual content for further use. The identification of objects of interest in raw images based on motion is an example of a preprocessing step that might be required by a large system. However, in a surveillance context, the preprocessing method can seldom rely on an appearance or shape model to recognize these objects, since their exact nature cannot be known in advance. This complicates the elaboration of low-level image processing methods. In this thesis, we present different methods that detect and segment objects of interest from video sequences in a fully unsupervised fashion. We first explore monocular video segmentation approaches based on background subtraction. These approaches are based on the idea that the background of an observed scene can be modeled over time, and that any drastic variation in appearance that is not predicted by the model actually reveals the presence of an intruding object. The main challenge that must be met by background subtraction methods is that their model should be able to adapt to dynamic changes in scene conditions. The designed methods must also remain sensitive to the emergence of new objects of interest despite this increased robustness to predictable dynamic scene behaviors. We propose two methods that introduce different modeling techniques to improve background appearance description in an illumination-invariant way, and that analyze local background persistence to improve the detection of temporarily stationary objects. We also introduce new feedback mechanisms used to adjust the hyperparameters of our methods based on the observed dynamics of the scene and the quality of the generated output.
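    The core background-subtraction idea described above can be sketched with the simplest possible model, an exponential running average per pixel. Real methods, including those proposed in this thesis, use far richer models and feedback mechanisms; this is only a schematic of the principle.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running-average background model: slowly absorb
    each new frame so the model adapts to gradual scene changes."""
    return (1 - alpha) * bg + alpha * frame.astype(float)

def foreground_mask(bg, frame, thresh=25):
    """Pixels deviating strongly from the background model are flagged
    as belonging to an intruding object."""
    return np.abs(frame.astype(float) - bg) > thresh
```

    The tension the thesis addresses is visible even here: a large `alpha` adapts quickly to illumination changes but absorbs temporarily stationary objects into the background, while a small `alpha` does the opposite. This is the motivation for the local-persistence analysis and feedback mechanisms mentioned above.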

    Blind Retrospective Motion Correction of MR Images

    Patient motion during an MRI examination can severely degrade image quality. A displacement of the imaged object by only a few millimetres is enough to produce motion artifacts and render the scan unusable for medical diagnosis. Although several techniques have been developed over the past 20 years, motion correction is still an unsolved problem. We propose a new retrospective motion correction algorithm that improves the quality of 3D MR images. The method can correct both rigid and non-rigid body motion. The most important aspect of our algorithms is that no information about the motion trajectory, e.g. from cameras, is needed to perform the correction. Our methods use the raw data of standard MRI sequences and require no changes to the scan procedure. We use graphics processors (GPUs) to accelerate the motion correction: for rigid body motion only a few seconds are required, and for non-rigid motion only a few minutes. Our rigid-body motion correction algorithm is based on minimizing a cost function that estimates the objective quality of the corrected image. The main idea is to find, through optimization, a motion trajectory that yields the smallest value of the cost function. We use the entropy of the image gradients as the image quality measure. To correct non-rigid body motion, we extend our mathematical model of motion effects, approximating non-rigid motion as multiple local rigid-body motions. To correct such motion, we develop a new annealing-based optimization scheme. During optimization we alternate between two steps: minimizing the cost function with respect to the image and with respect to the motion parameters. We performed several simulations as well as in vivo experiments on humans; both yielded substantial improvements in image quality.
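    The gradient-entropy criterion at the heart of the rigid-body correction can be sketched directly: sharp, artifact-free images concentrate their gradient magnitude in few pixels and thus have low entropy, so the optimizer searches for the motion trajectory that minimizes this cost. The snippet below is an illustrative version of the metric, not the thesis implementation.

```python
import numpy as np

def gradient_entropy(img, eps=1e-12):
    """Entropy of the normalized image-gradient magnitudes.

    Sharp images concentrate gradient energy at edges (few large values,
    low entropy); motion-corrupted or noisy images spread it over many
    pixels (high entropy), so lower is better as a quality cost.
    """
    gy, gx = np.gradient(img.astype(float))
    g = np.sqrt(gx**2 + gy**2).ravel()
    p = g / (g.sum() + eps)                  # normalize to a distribution
    p = p[p > eps]                           # drop zero-gradient pixels
    return -(p * np.log(p)).sum()
```

    In the full algorithm this scalar is evaluated on the image reconstructed from the raw k-space data under a candidate motion trajectory, and the trajectory parameters are optimized to drive it down.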