
    A Stereo Vision Framework for 3-D Underwater Mosaicking


    Smart environment monitoring through micro unmanned aerial vehicles

    In recent years, improvements in small-scale Unmanned Aerial Vehicles (UAVs) in terms of flight time, automatic control, and remote transmission have promoted the development of a wide range of practical applications. In aerial video surveillance, monitoring broad areas still presents many challenges because several tasks, including mosaicking, change detection, and object detection, must be performed in real time. In this thesis work, a small-scale UAV-based vision system for maintaining regular surveillance over target areas is proposed. The system works in two modes. The first mode monitors an area of interest over several flights. During the first flight, it creates an incremental geo-referenced mosaic of the area of interest and classifies all known elements (e.g., persons) found on the ground using a previously trained, improved Faster R-CNN architecture. In subsequent reconnaissance flights, the system searches for changes (e.g., the disappearance of persons) in the mosaic using an algorithm based on histogram equalization and RGB Local Binary Patterns (RGB-LBP); if changes are found, the mosaic is updated. The second mode performs real-time classification with the same improved Faster R-CNN model, which is useful for time-critical operations. Thanks to several design features, the system runs in real time and performs mosaicking and change detection at low altitude, thus allowing even small objects to be classified. The proposed system was tested on the whole set of challenging video sequences in the UAV Mosaicking and Change Detection (UMCD) dataset and on other public datasets. Evaluation with well-known performance metrics has shown remarkable results in mosaic creation and updating, as well as in change detection and object detection.
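    The change test described above can be sketched roughly as follows, assuming OpenCV and scikit-image; the uniform-LBP settings, the per-channel equalization, the chi-square comparison, and the threshold are illustrative choices rather than the thesis' exact parameters:

```python
# A minimal sketch of a histogram-equalization + RGB-LBP change test between
# corresponding image blocks (e.g., mosaic vs. new frame); all parameters
# are illustrative assumptions.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def rgb_lbp_histogram(img_bgr, P=8, R=1.0):
    """Concatenate uniform-LBP histograms computed per color channel."""
    hists = []
    for c in range(3):
        chan = cv2.equalizeHist(img_bgr[:, :, c])  # per-channel equalization
        lbp = local_binary_pattern(chan, P, R, method="uniform")
        h, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)

def block_changed(block_old, block_new, thresh=0.25):
    """Flag a block as changed when its RGB-LBP histograms diverge
    (chi-square distance above an illustrative threshold)."""
    h1, h2 = rgb_lbp_histogram(block_old), rgb_lbp_histogram(block_new)
    chi2 = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-10))
    return chi2 > thresh
```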

    Multiperspective mosaics and layered representation for scene visualization

    This thesis documents the efforts made to implement multiperspective mosaicking for under-vehicle and roadside video sequences. For the under-vehicle sequences, the goal is to create a large, high-resolution mosaic that may be used to quickly inspect the entire scene shot by a camera making a single pass underneath the vehicle. Several constraints are placed on the video data to support the assumption that the entire scene in the sequence lies on a single plane; a single mosaic therefore represents a single video sequence. Phase correlation is used to perform motion analysis in this case. For roadside video sequences, the scene is assumed to be composed of several planar layers rather than a single plane, and layer extraction techniques are implemented to perform this decomposition. Instead of phase correlation, the Lucas-Kanade motion tracking algorithm is used to create dense motion maps. Using these motion maps, spatial support for each layer is determined from a pre-initialized layer model. By separating the pixels in the scene into motion-specific layers, each element in the scene can be sampled correctly during multiperspective mosaicking. It is also possible to fill in many gaps in the mosaics caused by occlusions, creating more complete representations of the objects of interest. The results are several mosaics, each representing a single planar layer of the scene.
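    Phase correlation, the motion-analysis step used for the under-vehicle sequences, recovers a translation from the normalized cross-power spectrum of two frames. A minimal NumPy sketch, assuming same-size grayscale frames and integer-pixel shifts:

```python
# Phase correlation: the translation between two frames shows up as an
# impulse in the inverse FFT of the normalized cross-power spectrum.
import numpy as np

def phase_correlation(f1, f2):
    """Estimate the integer translation (dx, dy) between two frames."""
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + 1e-12           # keep only the phase
    corr = np.fft.ifft2(cross).real          # impulse at the displacement
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap large indices around to negative shifts
    if dy > f1.shape[0] // 2:
        dy -= f1.shape[0]
    if dx > f1.shape[1] // 2:
        dx -= f1.shape[1]
    return dx, dy
```

    Subpixel refinement (e.g., fitting a peak around the maximum) and windowing to reduce edge effects are common extensions, but are omitted here for brevity.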

    Weighted and filtered mutual information: A metric for the automated creation of panoramas from views of complex scenes

    This algorithm contributes a novel approach to image registration and panorama creation: it forgoes any scene knowledge, requiring only modest scene overlap and an acceptable amount of entropy within each overlapping view. The weighted and filtered mutual information (WFMI) algorithm has been developed for multiple stationary, color, surveillance video camera views and relies on color gradients for feature correspondence. It is a novel extension of well-established maximization of mutual information (MMI) algorithms. Where MMI algorithms are typically applied to high-altitude photography and medical imaging (scenes with relatively simple shapes and affine relationships between views), the WFMI algorithm has been designed for scenes with occluded objects and significant parallax variation between non-affine related views. Despite these typically non-affine surveillance scenarios, searching the affine space for a homography is a practical assumption that provides computational efficiency and accurate results, even with complex scene views. The WFMI algorithm can perfectly register affine views, performs exceptionally well with near-affine related views, and, for complex scene views well beyond affine constraints, provides an accurate estimate of the overlap regions between the views. The WFMI algorithm uses simple calculations (vector-field color gradients, Laplacian filtering, and feature histograms) to generate the WFMI metric and find the optimal affine relationship. The algorithm is unique among typical MMI and modern registration algorithms because it avoids almost all a priori knowledge and calculations while still providing an accurate or useful estimate for realistic scenes. With mutual information weighting and the Laplacian filtering operation, the WFMI algorithm overcomes the failures of typical MMI algorithms in scenes where complex or occluded shapes do not produce sufficiently large peaks in the mutual information maps to determine the overlap region. The work has so far been applied to individual video frames; future work could easily extend the algorithm to use motion information or temporal frame registration to handle scenes with smaller overlap regions, lower entropy, or even more significant parallax and occlusion variation between views.
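    For context, the plain mutual-information core that WFMI extends can be sketched as below; this estimates Shannon mutual information from a joint histogram of two overlapping views and deliberately omits the gradient weighting and Laplacian filtering that distinguish WFMI:

```python
# Shannon mutual information I(A;B) of two overlapping views, estimated
# from a joint histogram; bin count is an illustrative choice.
import numpy as np

def mutual_information(a, b, bins=64):
    """MI of two equally sized image regions (e.g., candidate overlaps)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                       # joint probability
    px = pxy.sum(axis=1, keepdims=True)             # marginal of A
    py = pxy.sum(axis=0, keepdims=True)             # marginal of B
    nz = pxy > 0                                    # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

    An MMI-style registration would evaluate this metric over candidate affine parameters and keep the maximizer; WFMI replaces the raw intensities with filtered color-gradient features and weights the resulting maps.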

    Design, Implementation and Evaluation of Hardware Vision Systems Dedicated to Real-Time Face Recognition

    Human face recognition is an active area of research spanning several disciplines such as image processing, pattern recognition, and computer vision. Most research has concentrated on algorithms for segmentation, feature extraction, and recognition of human faces, which are generally realized as software implementations on standard computers. However, many applications of human face recognition, such as human-computer interfaces, model-based video coding, and security control (Kobayashi, 2001; Yeh & Lee, 1999), need to run at high speed and in real time, for example, to let people pass through customs quickly while ensuring security. In recent years, our laboratory has focused on face processing and obtained interesting results concerning face tracking and recognition by implementing original dedicated hardware systems. Our aim is to implement on embedded systems efficient models of unconstrained face tracking and identity verification in arbitrary scenes. The main goal of these systems is to provide efficient, robust algorithms that require only moderate computation in order to (1) achieve high success rates in face tracking and identity verification and (2) cope with strict real-time constraints. The goal of this chapter is to describe three different hardware platforms dedicated to face recognition, each of which has been designed, implemented, and evaluated in our laboratory.

    Video Processing with Additional Information

    Cameras are frequently deployed along with many additional sensors on aerial and ground-based platforms, and many video datasets have metadata containing measurements from inertial sensors, GPS units, etc. The development of better video processing algorithms that use this additional information therefore attains special significance. We first describe an intensity-based algorithm for stabilizing low-resolution, low-quality aerial videos. The primary contribution is the idea of minimizing the discrepancy in the intensity of selected pixels between two images. This is an application of inverse compositional alignment to registering images of low resolution and quality, for which minimizing the intensity difference over salient pixels with high gradients results in faster and better convergence than using all the pixels. Secondly, we describe a feature-based method for stabilizing aerial videos and segmenting small moving objects. We use the coherency of background motion to jointly track features through the sequence, which enables accurate tracking of large numbers of features in the presence of repetitive texture, a lack of well-conditioned feature windows, etc. We incorporate the segmentation problem within the joint feature tracking framework and propose the first combined joint-tracking and segmentation algorithm. The proposed approach enables highly accurate tracking and a segmentation of feature tracks that is used in a MAP-MRF framework to obtain dense pixelwise labeling of the scene. We demonstrate competitive moving object detection on challenging video sequences of the VIVID dataset containing moving vehicles and humans that are small enough to make background subtraction approaches fail. Structure from Motion (SfM) has matured to a stage where the emphasis is on developing fast, scalable, and robust algorithms for large reconstruction problems, and the availability of additional sensors such as inertial units and GPS alongside video cameras motivates the development of SfM algorithms that leverage these measurements. In the third part, we study the benefits of a specific form of additional information: the vertical direction (gravity) and the height of the camera, both of which can be conveniently measured using inertial sensors, combined with a monocular video sequence, for 3D urban modeling. We show that in the presence of this information the SfM equations can be rewritten in a bilinear form, which allows us to derive a fast, robust, and scalable SfM algorithm for large-scale applications. The proposed SfM algorithm is experimentally demonstrated to have favorable properties compared to the sparse bundle adjustment algorithm: it converges in many cases to solutions with lower error than state-of-the-art implementations of bundle adjustment, and for large reconstruction problems it takes less time to reach its solution. We also present SfM results using our algorithm on the Google StreetView research dataset and several other datasets.
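    The first idea, registering frames by minimizing intensity differences over a sparse set of high-gradient pixels, can be illustrated with a translation-only Gauss-Newton sketch; the thesis uses the full inverse compositional framework over a richer warp, so this is a simplified, assumption-laden version (the salient-pixel fraction and iteration limits are arbitrary):

```python
# Registering `cur` to `ref` by minimizing intensity residuals only at
# high-gradient ("salient") pixels; translation-only Gauss-Newton with the
# Jacobian/Hessian precomputed from `ref`, as in inverse compositional
# alignment. Valid for small shifts; a sketch, not the thesis' method.
import numpy as np
from scipy import ndimage

def align_translation(ref, cur, frac=0.05, iters=20):
    gy, gx = np.gradient(ref)                    # image gradients of ref
    mag = np.hypot(gx, gy)
    ys, xs = np.where(mag >= np.quantile(mag, 1 - frac))  # salient pixels
    J = np.stack([gx[ys, xs], gy[ys, xs]], axis=1)        # (N, 2) Jacobian
    H = J.T @ J                                  # fixed Hessian approximation
    p = np.zeros(2)                              # (tx, ty)
    for _ in range(iters):
        # sample cur at x + p via a resampling shift
        warped = ndimage.shift(cur, (-p[1], -p[0]), order=1)
        r = warped[ys, xs] - ref[ys, xs]         # residuals at salient pixels
        dp = np.linalg.solve(H, J.T @ r)
        p -= dp
        if np.linalg.norm(dp) < 1e-3:            # converged
            break
    return p
```

    Restricting the residual to the top few percent of gradient pixels is what makes each iteration cheap while keeping the normal equations well conditioned.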

    Large-area visually augmented navigation for autonomous underwater vehicles

    Submitted to the Joint Program in Applied Ocean Science & Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, June 2005. This thesis describes a vision-based, large-area, simultaneous localization and mapping (SLAM) algorithm that respects the low-overlap imagery constraints typical of autonomous underwater vehicles (AUVs) while exploiting the inertial sensor information that is routinely available on such platforms. We adopt a systems-level approach exploiting the complementary aspects of inertial sensing and visual perception from a calibrated, pose-instrumented platform. This systems-level strategy yields a robust solution to underwater imaging that overcomes many of the unique challenges of a marine environment (e.g., unstructured terrain, low-overlap imagery, a moving light source). Our large-area SLAM algorithm recursively incorporates relative-pose constraints using a view-based representation that exploits exact sparsity in the Gaussian canonical form. This sparsity allows for efficient O(n) update complexity in the number of images composing the view-based map by utilizing recent multilevel relaxation techniques. We show that our algorithmic formulation is inherently sparse, unlike other feature-based canonical SLAM algorithms, which impose sparseness via pruning approximations. In particular, we investigate the sparsification methodology employed by sparse extended information filters (SEIFs) and offer new insight as to why, and how, its approximation can lead to inconsistencies in the estimated state errors. Lastly, we present a novel algorithm for efficiently extracting from the information matrix consistent marginal covariances useful for data association. In summary, this thesis advances the state of the art in underwater visual navigation by demonstrating end-to-end automatic processing of the largest visually navigated dataset to date, collected during a survey of the RMS Titanic (a path length of over 3 km and 3100 m² of mapped area). This accomplishment embodies the summed contributions of this thesis to several current SLAM research issues, including scalability, 6-degree-of-freedom motion, unstructured environments, and visual perception. This work was funded in part by the CenSSIS ERC of the National Science Foundation under grant EEC-9986821, in part by the Woods Hole Oceanographic Institution through a grant from the Penzance Foundation, and in part by an NDSEG Fellowship awarded through the Department of Defense.
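    The key sparsity property, that a relative-pose constraint in the Gaussian canonical (information) form touches only the blocks of the two poses it links, can be illustrated with a toy one-dimensional example; the scalar measurement model and the noise values here are illustrative, not the thesis' formulation:

```python
# Toy view-based SLAM in information (canonical) form: fusing a relative
# measurement z = x_j - x_i + noise adds H^T H / sigma^2 to the information
# matrix, filling only the (i,i), (i,j), (j,i), (j,j) entries, so the
# matrix stays exactly sparse as the pose graph grows. 1-D poses for brevity.
import numpy as np

def add_relative_pose_constraint(Lambda, eta, i, j, z, sigma2):
    """Fuse z = x_j - x_i into (Lambda, eta); H is -1 at i and +1 at j."""
    n = Lambda.shape[0]
    H = np.zeros((1, n))
    H[0, i], H[0, j] = -1.0, 1.0
    Lambda += H.T @ H / sigma2            # touches only four entries
    eta += (H.T * (z / sigma2)).ravel()
    return Lambda, eta

# three poses, a strong prior anchoring pose 0, two odometry/camera links
L, e = np.zeros((3, 3)), np.zeros(3)
L[0, 0] = 1e6                             # anchor the first pose at 0
L, e = add_relative_pose_constraint(L, e, 0, 1, z=1.0, sigma2=0.01)
L, e = add_relative_pose_constraint(L, e, 1, 2, z=1.0, sigma2=0.01)
print(np.linalg.solve(L, e))              # mean state: approx [0, 1, 2]
```

    Feature-based canonical SLAM filters lose this exactness because marginalizing features densifies the information matrix, which is why SEIF-style approaches must prune links approximately.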

    Analysis of affine motion-compensated prediction and its application in aerial video coding

    Motion-compensated prediction is used in video coding standards like High Efficiency Video Coding (HEVC) as one key element of data compression. Commonly, a purely translational motion model is employed. In order to also cover non-translational motion types like rotation or scaling (zoom), as contained in aerial video sequences captured from unmanned aerial vehicles, an affine motion model can be applied. In this work, a model for affine motion-compensated prediction in video coding is derived by extending a model of purely translational motion-compensated prediction. Using rate-distortion theory and the displacement estimation error caused by inaccurate affine motion parameter estimation, the minimum bit rate required for encoding the prediction error is determined. In this model, the affine transformation parameters are assumed to be affected by statistically independent estimation errors, each following a zero-mean Gaussian probability density function (pdf). The joint pdf of the estimation errors is derived and transformed into the pdf of the location-dependent displacement estimation error in the image, which in turn is related to the minimum bit rate required for encoding the prediction error. Analogous to the derivation for the fully affine motion model, a four-parameter simplified affine model is investigated. It is of particular interest since such a model is considered for the upcoming video coding standard Versatile Video Coding (VVC), the successor to HEVC. As the simplified affine motion model is able to describe most motions contained in aerial surveillance videos, its application in video coding is justified. Both models provide valuable information about the minimum bit rate for encoding the prediction error as a function of the affine estimation accuracy. Although the bit rate in motion-compensated prediction can be considerably reduced by using a motion model able to describe the motion types occurring in the scene, the total video bit rate may remain quite high, depending on the motion estimation accuracy. Thus, using the example of aerial surveillance sequences, a codec-independent, region-of-interest (ROI) based aerial video coding system is proposed that exploits the characteristics of such sequences. Assuming the captured scene to be planar, one frame can be projected into another using global motion compensation, so only newly emerging areas have to be encoded. At the decoder, all new areas are registered into a so-called mosaic, from which reconstructed frames are extracted and concatenated as a video sequence. To also preserve moving objects in the reconstructed video, local motion is detected and encoded in addition to the new areas. The proposed general ROI coding system was evaluated at very low and low bit rates between 100 and 5000 kbit/s for aerial sequences of HD resolution. It is able to reduce the bit rate by 90% compared to common HEVC coding of similar quality. Subjective tests confirm that the overall image quality of the ROI coding system exceeds that of a common HEVC encoder, especially at very low bit rates below 1 Mbit/s. To prevent discontinuities introduced by inaccurate global motion estimation, as may be caused by radial lens distortion, a fully automatic in-loop radial distortion compensation is proposed. For this purpose, an unknown radial distortion compensation parameter that is constant over a group of frames is estimated jointly with the global motion and optimized to minimize the distortion of the projections of the frames in the mosaic. With this approach, global motion compensation was improved by 0.27 dB, and discontinuities in the frames extracted from the mosaic are diminished. As an additional benefit, the generation of long-term mosaics becomes possible, constructed from more than 1500 aerial frames with unknown radial lens distortion and without any calibration or manual lens distortion compensation.
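    For reference, the four-parameter simplified affine model discussed above can be written as follows; this is the standard form combining rotation, uniform scaling, and translation, with generic parameter names rather than the thesis' own notation:

```latex
% Four-parameter simplified affine motion model: a = s*cos(theta) and
% b = s*sin(theta) jointly encode zoom s and rotation theta, while
% (c, d) is the translation vector (two parameters).
\begin{aligned}
x' &= a\,x - b\,y + c,\\
y' &= b\,x + a\,y + d.
\end{aligned}
```

    The fully affine model adds two more degrees of freedom (independent scaling and shear), which is why the error analysis in the thesis treats six versus four independently disturbed parameters.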