
    Object detection and tracking in video image

    Capturing high-quality, high-resolution images has become easy thanks to rapid improvements in capture devices, which are now both cheaper and technologically superior. A video is a collection of sequential images separated by a constant time interval, so it can provide more information about an object as the scene changes over time. Processing videos manually is practically impossible, so an automated system is needed. This thesis makes one such attempt to track objects in videos. Many algorithms and technologies have been developed to automate monitoring of objects in a video file; object detection and tracking remains one of the challenging tasks in computer vision. Video analysis involves three basic steps: detection of objects of interest among moving objects, tracking of those objects in consecutive frames, and analysis of the resulting tracks to understand their behavior. Simple object detection compares a static background frame with the current video frame at the pixel level, and existing methods in this domain first try to detect the object of interest in the video frames. One of the main difficulties in object tracking is choosing suitable features and models for recognizing and tracking the object of interest in a video; common features for categorizing visual objects are intensity, shape, color and feature points. In this thesis we study mean shift tracking based on the color PDF, optical flow tracking based on intensity and motion, and SIFT tracking based on scale-invariant local feature points. Preliminary experimental results show that the adopted methods are able to track targets under translation, rotation, partial occlusion and deformation.
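    As an illustration of the colour-density tracking mentioned above, the following is a minimal mean shift tracking sketch with OpenCV, assuming a hypothetical video file "input.mp4" and a hand-picked initial window; it shows the general technique, not the thesis's own pipeline.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")        # hypothetical input video
ok, frame = cap.read()
x, y, w, h = 300, 200, 80, 120             # assumed initial target window

# Build a hue histogram of the target region (the colour PDF).
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Shift the window to the local mode of the back-projected colour density.
    _, track_window = cv2.meanShift(back_proj, track_window, term)
    print(track_window)
cap.release()
```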

    Target detection, tracking, and localization using multi-spectral image fusion and RF Doppler differentials

    It is critical for defense and security applications to have a high probability of detection and a low false alarm rate while operating over a wide variety of conditions. Sensor fusion, which is the process of combining data from two or more sensors, has been utilized to improve the performance of a system by exploiting the strengths of each sensor. This dissertation presents algorithms to fuse multi-sensor data that improve system performance by increasing detection rates, lowering false alarms, and improving track performance. Furthermore, this dissertation presents a framework for comparing algorithm error for image registration, a critical pre-processing step for multi-spectral image fusion. First, I present an algorithm to improve detection and tracking performance for moving targets in a cluttered urban environment by fusing foreground maps from multi-spectral imagery. Most research in image fusion considers visible and long-wave infrared bands; I examine these bands along with near infrared and mid-wave infrared. To localize and track a particular target of interest, I present an algorithm to fuse output from the multi-spectral image tracker with a constellation of RF sensors measuring a specific cellular emanation. The fusion algorithm matches the Doppler differential from the RF sensors with the theoretical Doppler differential of the video tracker output by selecting the sensor pair that minimizes the absolute difference or root-mean-square difference. Finally, a framework to quantify shift-estimation error for both area- and feature-based algorithms is presented. By exploiting synthetically generated visible and long-wave infrared imagery, error metrics are computed and compared for a number of area- and feature-based shift estimation algorithms. A number of key results are presented in this dissertation. The multi-spectral image tracker improves the location accuracy of the algorithm while improving the detection rate and lowering false alarms for most spectral bands. All 12 moving targets were tracked through the video sequence with only one lost track, which was later recovered. Targets from the multi-spectral tracking algorithm were correctly associated with their corresponding cellular emanation for all targets at lower measurement uncertainty using the root-mean-square difference, while also having a high confidence ratio for selecting the true target from background targets. For the area-based algorithms and the synthetic airfield image pair, the DFT and ECC algorithms produce sub-pixel shift-estimation error in regions such as shadows and high-contrast painted-line regions. The edge orientation feature descriptors increase the number of sub-field estimates while improving the shift-estimation error compared to the Lowe descriptor.
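    As a rough numerical sketch of the pairing step described above (with hypothetical array names and no real sensor geometry; not the dissertation's code), one could pick the RF sensor pair whose measured Doppler differential best matches the differential predicted from the video track:

```python
import numpy as np

def best_sensor_pair(measured_diff, predicted_diff, use_rms=True):
    """Select the sensor pair minimising the difference between the
    measured RF Doppler differential and the differential predicted
    from the video tracker output.

    measured_diff  : dict mapping (sensor_i, sensor_j) -> array of
                     measured Doppler differentials over time (Hz)
    predicted_diff : dict with the same keys, holding the theoretical
                     differentials computed from the tracked target's
                     position and velocity (hypothetical inputs).
    """
    best_pair, best_err = None, np.inf
    for pair, meas in measured_diff.items():
        pred = predicted_diff[pair]
        if use_rms:
            err = np.sqrt(np.mean((meas - pred) ** 2))   # root-mean-square difference
        else:
            err = np.mean(np.abs(meas - pred))           # mean absolute difference
        if err < best_err:
            best_pair, best_err = pair, err
    return best_pair, best_err
```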

    Video foreground extraction for mobile camera platforms

    Foreground object detection is a fundamental task in computer vision with many applications in areas such as object tracking, event identification, and behavior analysis. Most conventional foreground object detection methods work only in stable illumination environments using fixed cameras. In real-world applications, however, the algorithm often needs to operate under challenging conditions: drastic lighting changes, object shape complexity, moving cameras, low frame capture rates, and low resolution images. This thesis presents four novel approaches for foreground object detection on real-world datasets using cameras deployed on moving vehicles. The first problem addresses passenger detection and tracking for public transport buses, investigating the problems of changing illumination conditions and low frame capture rates. Our approach integrates a stable SIFT (Scale Invariant Feature Transform) background seat modelling method with a human shape model into a weighted Bayesian framework to detect passengers. To deal with the problem of tracking multiple targets, we employ the Reversible Jump Markov Chain Monte Carlo tracking algorithm. Using an SVM classifier, appearance transformation models capture changes in the appearance of foreground objects across two consecutive frames under low frame rate conditions. In the second problem, we present a system for pedestrian detection in scenes captured by a mobile bus surveillance system. It integrates scene localization, foreground-background separation, and pedestrian detection modules into a unified detection framework. The scene localization module performs a two-stage clustering of the video data: in the first stage, SIFT homography is applied to cluster frames in terms of their structural similarity, and the second stage further clusters these aligned frames according to consistency in illumination, producing clusters of images grouped by viewpoint and lighting. A kernel density estimation (KDE) technique for colour and gradient is then used to construct background models for each image cluster, which are in turn used to detect candidate foreground pixels. Finally, pedestrians are detected using a hierarchical template matching approach. In addition to the second problem, we present three direct pedestrian detection methods that extend the HOG (Histogram of Oriented Gradients) technique (Dalal and Triggs, 2005) and provide a comparative evaluation of these approaches; a baseline sketch of the standard HOG detector follows this abstract. The three approaches are: a) a new histogram feature formed by the weighted sum of both the gradient magnitude and the filter responses from a set of elongated Gaussian filters (Leung and Malik, 2001) corresponding to the quantised orientation, which we refer to as the Histogram of Oriented Gradient Banks (HOGB) approach; b) the codebook-based HOG feature with the branch-and-bound (efficient subwindow search) algorithm (Lampert et al., 2008); and c) the codebook-based HOGB approach. In the third problem, a unified framework that combines 3D and 2D background modelling is proposed to detect scene changes using a camera mounted on a moving vehicle. The 3D scene is first reconstructed from a set of videos taken at different times; the 3D background modelling identifies inconsistent scene structures as foreground objects. For the 2D approach, foreground objects are detected using a spatio-temporal MRF algorithm. Finally, the 3D and 2D results are combined using morphological operations. The significance of this research is that it provides basic frameworks for automatic large-scale mobile surveillance applications and facilitates many higher-level applications such as object tracking and behaviour analysis.
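    For context, the Dalal-Triggs HOG pedestrian detector that the HOGB variants above extend is available in OpenCV; a minimal baseline sketch, assuming a hypothetical image file "frame.png" and not reproducing the thesis's codebook or HOGB extensions, looks like:

```python
import cv2

# Standard HOG pedestrian detector with the default linear SVM weights.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("frame.png")                  # hypothetical bus-camera frame
rects, weights = hog.detectMultiScale(img, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.png", img)
```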

    AnchorNet: A Weakly Supervised Network to Learn Geometry-sensitive Features For Semantic Matching

    Despite significant progress of deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HOG. We argue that the strong invariance properties that are key to the success of recent deep architectures on the classification task make them unfit for dense correspondence tasks, unless a large amount of supervision is used. In this work, we propose a deep network, termed AnchorNet, that produces image representations well-suited for semantic matching. It relies on a set of filters whose responses are geometrically consistent across different object instances, even in the presence of strong intra-class, scale, or viewpoint variations. Trained only with weak image-level labels, the final representation successfully captures information about the object structure and improves results of state-of-the-art semantic matching methods such as the deformable spatial pyramid or proposal flow methods. We show positive results on the cross-instance matching task, where different instances of the same object category are matched, as well as on a new cross-category semantic matching task aligning pairs of instances each from a different object class. Comment: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
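    As a loose illustration of how geometry-sensitive feature maps of this kind can be used for dense semantic matching (a generic nearest-neighbour scheme over hypothetical per-location descriptors, not the paper's method):

```python
import numpy as np

def dense_match(feat_a, feat_b):
    """Match each spatial cell of feat_a to its most similar cell in feat_b.

    feat_a, feat_b : arrays of shape (H, W, C) holding per-location
                     descriptors, e.g. taken from a CNN layer.
    Returns an (H, W, 2) array of matched (row, col) positions in feat_b.
    """
    Ha, Wa, C = feat_a.shape
    Hb, Wb, _ = feat_b.shape
    a = feat_a.reshape(-1, C)
    b = feat_b.reshape(-1, C)
    # L2-normalise so the dot product becomes cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                       # (Ha*Wa, Hb*Wb) similarity matrix
    idx = sim.argmax(axis=1)            # best match per source location
    rows, cols = np.divmod(idx, Wb)
    return np.stack([rows, cols], axis=1).reshape(Ha, Wa, 2)
```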

    Feature extraction using MPEG-CDVS and Deep Learning with application to robotic navigation and image classification

    The main contributions of this thesis are the evaluation of the MPEG Compact Descriptor for Visual Search in the context of indoor robotic navigation and the introduction of a new method for training Convolutional Neural Networks with applications to object classification. The choice of image descriptor in a visual navigation system is not straightforward. Visual descriptors must be distinctive enough to allow for correct localisation while still offering low matching complexity and short descriptor size for real-time applications. The MPEG Compact Descriptor for Visual Search is a low-complexity image descriptor that offers several levels of compromise between descriptor distinctiveness and size. In this work, we describe how these trade-offs can be used for efficient loop detection in a typical indoor environment. We first describe a probabilistic approach to loop detection based on the standard's suggested similarity metric. We then evaluate the performance of CDVS compression modes in terms of matching speed, feature extraction, and storage requirements and compare them with the state-of-the-art SIFT descriptor for five different types of indoor floors. In the second part of this thesis we focus on the new paradigm in machine learning and computer vision called deep learning. Under this paradigm, visual features are no longer extracted using fine-grained, highly engineered feature extractors, but rather using a Convolutional Neural Network (CNN) that extracts hierarchical features learned directly from data, at the cost of long training periods. In this context, we propose a method for speeding up the training of CNNs by exploiting the spatial scaling property of convolutions. This is done by first training a CNN with smaller kernel resolutions for a few epochs, then rescaling its kernels to the target's original dimensions and continuing training at full resolution. We show that the overall training time of a target CNN architecture can be reduced by exploiting this property during the early stages of learning. Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures, showing savings in training time of nearly 20% while test set accuracy is preserved.
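    A minimal sketch of the kernel-rescaling idea described above, using PyTorch interpolation on convolution weights; the layer shapes are hypothetical, bilinear resampling is an assumption, and the thesis's schedule for when to rescale is not reproduced.

```python
import torch
import torch.nn.functional as F

def rescale_conv_weight(weight, target_k):
    """Spatially rescale a conv kernel tensor (out_ch, in_ch, k, k) to
    (out_ch, in_ch, target_k, target_k), so a network pre-trained with
    small kernels can continue training at the full kernel resolution."""
    return F.interpolate(weight, size=(target_k, target_k),
                         mode="bilinear", align_corners=False)

# Example: grow a 3x3 kernel trained for a few epochs into a 5x5 kernel.
small = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1)
large = torch.nn.Conv2d(64, 128, kernel_size=5, padding=2)
with torch.no_grad():
    large.weight.copy_(rescale_conv_weight(small.weight, 5))
    large.bias.copy_(small.bias)
# Training would then continue with the `large` layer at full resolution.
```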

    Video Shot Boundary Detection using the Scale Invariant Feature Transform and RGB Color Channels

    Segmentation of a video sequence by detecting shot changes is essential for video analysis, indexing and retrieval. In this context, a shot boundary detection algorithm based on the scale invariant feature transform (SIFT) is proposed in this paper. The first step of our method consists of a top-down search scheme that detects the locations of transitions by comparing the ratio of matched features extracted via SIFT for every RGB channel of the video frames. This first pass provides the locations of the boundaries. Secondly, a moving average calculation is performed to determine the type of transition. The proposed method can detect both gradual transitions and abrupt changes without requiring any prior training on the video content. Experiments conducted on a multi-type video database show that the algorithm achieves good performance.
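    A simplified sketch of the per-channel SIFT match ratio underlying the detection step, assuming OpenCV with SIFT available; the paper's top-down search and the moving-average transition classification are omitted.

```python
import cv2

sift = cv2.SIFT_create()
bf = cv2.BFMatcher()

def match_ratio(frame_a, frame_b):
    """Ratio of good SIFT matches between two frames, averaged over
    the three colour channels; a sharp drop between consecutive
    frames suggests a shot boundary."""
    ratios = []
    for c in range(3):
        kp1, des1 = sift.detectAndCompute(frame_a[:, :, c], None)
        kp2, des2 = sift.detectAndCompute(frame_b[:, :, c], None)
        if des1 is None or des2 is None or len(kp1) == 0:
            ratios.append(0.0)
            continue
        matches = bf.knnMatch(des1, des2, k=2)
        # Lowe ratio test keeps only distinctive matches.
        good = [m[0] for m in matches
                if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
        ratios.append(len(good) / len(kp1))
    return sum(ratios) / 3.0
```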

    Real Time Stereo Cameras System Calibration Tool and Attitude and Pose Computation with Low Cost Cameras

    Engineering of autonomous systems has many strands. The area in which this work falls, artificial vision, has become one of great interest in multiple contexts, with a focus on robotics. This work seeks to address and overcome some real difficulties encountered when developing technologies with artificial vision systems, namely the calibration process and the real-time pose computation of robots. Initially, it aims to perform the real-time intrinsic (3.2.1) and extrinsic (3.3) calibration of stereo camera systems needed for the main goal of this work: the real-time computation of the pose (position and orientation) of an active coloured target with stereo vision systems. Designed to be intuitive, easy to use and able to run in real-time applications, this work was developed for use either with low-cost, easy-to-acquire stereo vision systems or with more complex, high-resolution ones, in order to compute all the parameters inherent to such a system, namely the intrinsic values of each of the cameras and the extrinsic matrices relating both cameras. The work is oriented towards underwater environments, which are very dynamic and computationally more complex due to particularities such as light reflections. The available calibration information, whether generated by this tool or loaded from other tools, allows, in a simple way, the calibration of an environment colourspace and of the detection parameters of a specific target with active visual markers (4.1.1), useful within unstructured environments. With a calibrated system and environment, it is possible to detect and compute, in real time, the pose of a target of interest; the combination of position and orientation, or attitude, is referred to as the pose of an object. For performance analysis and quality of the information obtained, these tools are compared with other existing ones.
    Engineering of autonomous systems operates in several strands. One of them, artificial vision, on which this work rests, has become one of the areas of greatest interest in multiple contexts and focuses within robotics. Thus, this work seeks to address and overcome some difficulties encountered when developing technologies based on artificial vision. Initially, it proposes to provide tools to perform the necessary real-time intrinsic (3.2.1) and extrinsic (3.3) calibrations of stereo vision systems in order to reach the main goal: a tool for computing the position and orientation of an active coloured target using stereo vision systems. Designed to be intuitive, easy to use and capable of operating in real time, these tools were developed with a view to their integration both with low-cost, easy-to-acquire cameras and with more complex, higher-resolution cameras. They perform the calibration of the parameters inherent to the stereo vision system, such as the intrinsics of each camera and the extrinsic matrices relating both cameras. This work was oriented towards use in underwater settings, where the environments are highly dynamic and computationally more complex due to particularities such as light reflections and poor visibility. The available calibration information, whether generated by the provided tools or obtained from others, can be loaded to perform a simple calibration of the colour space and of the detection parameters of a specific target with active coloured markers (4.1.1); these markers are useful in unstructured environments. To analyse the performance and the quality of the information obtained, the calibration and pose (position and orientation) computation tools are compared with other existing ones.
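    For reference, a minimal intrinsic and extrinsic stereo calibration sketch with OpenCV and a chessboard pattern; the board size and image file names are assumptions, and this is a generic illustration rather than the tool described above.

```python
import glob
import cv2
import numpy as np

board = (9, 6)                                   # assumed inner-corner count
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_pts, left_pts, right_pts = [], [], []
for fl, fr in zip(sorted(glob.glob("left_*.png")),
                  sorted(glob.glob("right_*.png"))):   # hypothetical file names
    gl = cv2.imread(fl, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(fr, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, board)
    okr, cr = cv2.findChessboardCorners(gr, board)
    if okl and okr:
        obj_pts.append(objp)
        left_pts.append(cl)
        right_pts.append(cr)

# Intrinsics of each camera, then the extrinsic rotation R and translation T
# relating the right camera to the left one.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, gl.shape[::-1], None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, gr.shape[::-1], None, None)
_, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
```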

    Human activity detection and action recognition in videos using convolutional neural networks

    Human activity recognition from video scenes has become a significant area of research in computer vision applications. Action recognition is one of the most challenging problems in video analysis and finds applications in human-computer interaction, anomalous activity detection, crowd monitoring and patient monitoring. Several approaches have been presented for human activity recognition using machine learning techniques. The main aim of this work is to detect and track human activity, and to classify actions, for two publicly available video databases. A novel approach to feature extraction from video sequences is used, combining the Scale Invariant Feature Transform and optical flow computation, with shape, gradient and orientation features also incorporated for robust feature formulation. Tracking of human activity in the video is implemented using a Gaussian Mixture Model, and a Convolutional Neural Network based classification approach is used for training and testing on the databases. The activity recognition performance is evaluated on two public datasets, the Weizmann dataset and the Kungliga Tekniska Högskolan (KTH) dataset, with action recognition accuracies of 98.43% and 94.96%, respectively. Experimental and comparative studies show that the proposed approach outperforms state-of-the-art techniques.
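    As a small illustration of the motion and Gaussian-mixture building blocks mentioned above (generic OpenCV calls on a hypothetical video path; the CNN classifier and the combined SIFT/flow descriptor are not shown):

```python
import cv2

cap = cv2.VideoCapture("activity.mp4")        # hypothetical input video
mog = cv2.createBackgroundSubtractorMOG2()    # Gaussian Mixture Model background
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Foreground mask from the GMM model localises the moving person.
    fg_mask = mog.apply(frame)
    # Dense optical flow provides motion magnitude and orientation features.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    prev_gray = gray
cap.release()
```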

    Object recognition using multi-view imaging

    Single-view imaging data has been used in most previous research in computer vision and image understanding, and many techniques have been developed for it. Recently, with the fast development and dropping cost of cameras, it has become possible to use many more views for image processing tasks. This thesis considers how to use such multiple images for target object recognition. In this context, we present two algorithms for object recognition based on scale-invariant feature points. The first is a single-view object recognition method (SOR), which operates on single images and uses a chirality constraint to reduce the recognition errors that arise when only a small number of feature points are matched. The procedure is extended in the second, multi-view object recognition algorithm (MOR), which operates on a multi-view image sequence and, by tracking feature points using a dynamic programming method in the plenoptic domain subject to the epipolar constraint, is able to fuse feature point matches from all the available images, resulting in more robust recognition. We evaluated these algorithms on a number of datasets of real images capturing both indoor and outdoor scenes. We demonstrate that MOR performs better than SOR, particularly for noisy and low-resolution images, and that, combined with segmentation techniques, it can also recognize objects that are partially occluded.
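    A minimal sketch of SIFT matching with an epipolar-geometry filter of the kind such multi-view matching relies on; the image file names are hypothetical, and the chirality test and plenoptic-domain tracking are not shown.

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical views
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe ratio test on nearest-neighbour matches.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m[0] for m in matches
        if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Enforce the epipolar constraint: keep only matches consistent with a
# fundamental matrix estimated by RANSAC.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print(f"{len(inliers)} epipolar-consistent matches out of {len(good)}")
```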