Object detection and tracking in video image
Capturing high-quality images of good size has become easy thanks to rapid improvements in capture devices, which offer superior technology at lower cost. A video is a collection of sequential images taken at a constant time interval, so it can provide more information about an object as the scene changes over time. Manually handling videos is therefore impractical, and an automated system is needed to process them. This thesis is one such attempt to track objects in videos. Many algorithms and technologies have been developed to automate the monitoring of objects in a video file. Object detection and tracking is one of the challenging tasks in computer vision. Video analysis involves three basic steps: detecting objects of interest among moving objects, tracking those objects across consecutive frames, and analyzing the resulting tracks to understand their behavior. Simple object detection compares a static background frame with the current video frame at the pixel level. Existing methods in this domain first try to detect the object of interest in the video frames. One of the main difficulties in object tracking, among many others, is choosing suitable features and models for recognizing and tracking the object of interest in a video. Common features used to characterize visual objects are intensity, shape, color and feature points. In this thesis, we study mean-shift tracking based on the color pdf, optical-flow tracking based on intensity and motion, and SIFT tracking based on scale-invariant local feature points. Preliminary experimental results show that the adopted methods can track targets under translation, rotation, partial occlusion and deformation.
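The color-pdf mean-shift tracker mentioned above follows a well-known recipe that can be sketched with OpenCV: build a hue histogram of the object region, back-project it onto each frame, and let mean shift climb the resulting density. The video path and initial bounding box below are placeholders, not values from the thesis.

```python
import cv2

# Hypothetical input video and initial bounding box for the object of interest.
cap = cv2.VideoCapture("input.mp4")
x, y, w, h = 300, 200, 80, 80  # placeholder ROI

ok, frame = cap.read()
roi = frame[y:y + h, x:x + w]

# Model the object's color pdf as a hue histogram (mean shift tracks its mode).
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-project the histogram: each pixel gets the likelihood of belonging
    # to the object's color pdf; mean shift then climbs this density.
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    _, track_window = cv2.meanShift(back_proj, track_window, term)
    x, y, w, h = track_window
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```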
Visual system identification: learning physical parameters and latent spaces from pixels
In this thesis, we develop machine learning systems that are able to leverage the knowledge
of equations of motion (scene-specific or scene-agnostic) to perform object discovery,
physical parameter estimation, position and velocity estimation, camera pose
estimation, and learn structured latent spaces that satisfy physical dynamics rules.
These systems are unsupervised, learning from unlabelled videos, and use as inductive
biases the general equations of motion followed by objects of interest in the scene.
This is an important task, as in many complex real-world environments ground-truth
states are not available, although there is physical knowledge of the underlying system.
Our goals with this approach, i.e. integration of physics knowledge with unsupervised
learning models, are to improve vision-based prediction, enable new forms of control,
increase data-efficiency and provide model interpretability, all of which are key areas
of interest in machine learning. With the above goals in mind, we start by asking the
following question: given a scene in which the objects’ motions are known up to some
physical parameters (e.g. a ball bouncing off the floor with unknown restitution coefficient),
how do we build a model that uses such knowledge to discover the objects in the
scene and estimate these physical parameters?
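For a concrete instance of this question, the bouncing-ball example can be made precise: under ideal bounces, consecutive apex heights decay geometrically with the squared restitution coefficient, so the parameter can be recovered from tracked heights by a linear fit in log space. The following sketch uses synthetic data; it illustrates the estimation problem, not the thesis's method.

```python
import numpy as np

# A falling ball rebounds with speed e * v (e = restitution coefficient), so
# consecutive apex heights satisfy h_{k+1} = e**2 * h_k. Given apex heights
# tracked from video (synthetic here), e is recovered by least squares on
# log h_k, which is linear in k: log h_k = log h_0 + 2k log e.
e_true = 0.8
h = 1.0 * (e_true ** 2) ** np.arange(5)           # synthetic apex heights
h_noisy = h * np.exp(np.random.normal(0, 0.01, h.size))

k = np.arange(h.size)
slope, _ = np.polyfit(k, np.log(h_noisy), 1)      # slope = 2 * log e
e_est = np.exp(slope / 2)
print(f"estimated restitution: {e_est:.3f} (true {e_true})")
```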
Our first model, PAIG (Physics-as-Inverse-Graphics), approaches this problem from a
vision-as-inverse-graphics perspective, describing the visual scene as a composition of
objects defined by their location and appearance, which are rendered onto the frame in
a graphics manner. This is a known approach in the unsupervised learning literature,
where the fundamental problem then becomes that of derendering, that is, inferring and
discovering these locations and appearances for each object. In PAIG we introduce a
key rendering component, the Coordinate-Consistent Decoder, which enables the integration
of the known equations of motion with an inverse-graphics autoencoder architecture
(trainable end-to-end), to perform simultaneous object discovery and physical
parameter estimation. Although trained on simple simulated 2D scenes, we show that
knowledge of the physical equations of motion of the objects in the scene can be used
to greatly improve future prediction and provide physical scene interpretability.
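The core mechanism, setting aside the rendering components, can be illustrated with a differentiable simulation: unroll the known equations of motion with a learnable physical parameter and fit it to observed positions by gradient descent. This is a minimal sketch of that idea, not the actual PAIG implementation; in the full model the positions would come from the inverse-graphics encoder rather than being given.

```python
import torch

dt, g = 0.05, 9.81
e_true = 0.7  # "unknown" restitution coefficient to recover


def simulate(e, steps=120):
    # Unroll the known equations of motion for a ball bouncing off the floor.
    y = torch.tensor(1.0)  # initial height
    v = torch.tensor(0.0)  # initial velocity
    ys = []
    for _ in range(steps):
        v = v - g * dt
        y = y + v * dt
        bounce = y < 0
        v = torch.where(bounce, -e * v, v)  # reflect velocity, scaled by e
        y = torch.where(bounce, -y, y)      # reflect position above the floor
        ys.append(y)
    return torch.stack(ys)


with torch.no_grad():
    target = simulate(torch.tensor(e_true))  # stand-in for observed positions

# Fit the physical parameter by gradient descent through the simulator.
e = torch.tensor(0.3, requires_grad=True)
opt = torch.optim.Adam([e], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = ((simulate(e) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(f"recovered e = {e.item():.3f} (true {e_true})")
```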
Our second model, V-SysId, tackles the limitations shown by the PAIG architecture,
namely the training difficulty, the restriction to simulated 2D scenes, and the need for
noiseless scenes without distractors. Here, we approach the problem from first principles
by asking the question: are neural networks a necessary component to solve this
problem? Can we use simpler ideas from classical computer vision instead? With V-
SysId, we approach the problem of object discovery and physical parameter estimation
from a keypoint extraction, tracking and selection perspective, composed of 3 separate
stages: proposal keypoint extraction and tracking, 3D equation fitting and camera pose
estimation from 2D trajectories, and entropy-based trajectory selection. Since all the
stages use lightweight algorithms and optimisers, V-SysId is able to perform joint object
discovery, physical parameter and camera pose estimation from even a single video,
drastically improving data-efficiency. Additionally, because it does not use a
rendering/derendering approach, it can be used in real 3D scenes with many distractor
objects. We show that this approach enables a number of interesting applications, such as
vision-based robot end-effector localisation and remote breath rate measurement.
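A much-simplified sketch of the fitting-and-selection idea (2D only, no camera pose, and with the entropy criterion replaced by a plain residual score): fit the known equation of motion to each candidate keypoint trajectory by least squares and keep the trajectory that physics explains best.

```python
import numpy as np

# The model here is free fall, y(t) = y0 + v0*t - 0.5*g*t^2, so fitting is a
# linear least-squares problem in the unknowns (y0, v0, g).
def fit_free_fall(t, y):
    A = np.stack([np.ones_like(t), t, -0.5 * t ** 2], axis=1)
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ params
    return params, np.mean(residual ** 2)


t = np.linspace(0, 1, 50)
# One trajectory follows the physics; the other is a distractor.
ball = 2.0 + 1.0 * t - 0.5 * 9.81 * t ** 2 + np.random.normal(0, 0.01, t.size)
distractor = np.sin(8 * t) + np.random.normal(0, 0.01, t.size)

scored = [fit_free_fall(t, traj) for traj in (ball, distractor)]
best = min(range(len(scored)), key=lambda i: scored[i][1])
print(f"selected trajectory {best}, g ~ {scored[best][0][2]:.2f}")
```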
Finally, we move into the area of structured recurrent variational models from vision,
where we are motivated by the following observation: in existing models, applying a
force in the direction from a start point to an end point (in latent space) does not
result in movement from the start point towards the end point, even in the simplest
unconstrained environments. This means that the latent space learned by these models
does not follow Newton’s law, where the acceleration vector has the same direction
as the force vector (in point-mass systems), and prevents the use of PID controllers,
which are the simplest and most well understood type of controller. We solve this problem
by building inductive biases from Newtonian physics into the latent variable model,
which we call NewtonianVAE. Crucially, Newtonian correctness in the latent space brings
about the ability to perform proportional (or PID) control, as opposed to the more computationally
expensive model predictive control (MPC). PID controllers are ubiquitous
in industrial applications, but had thus far lacked integration with unsupervised vision
models. We show that the NewtonianVAE learns physically correct latent spaces in simulated
2D and 3D control systems, which can be used to perform goal-based discovery
and control in imitation learning, and path following via Dynamic Motion Primitives.
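A minimal sketch of why Newtonian latent correctness enables proportional control: if the latent space obeys point-mass dynamics, an action proportional to the latent goal error acts like a force towards the goal. The encoder and environment below are placeholders for a trained NewtonianVAE and its control setting.

```python
import torch

def pid_step(z, z_prev, z_goal, kp=1.0, kd=0.5, dt=0.1):
    # If the latent space is Newtonian, kp * (z_goal - z) behaves like a force
    # pointing at the goal; the derivative term damps the latent velocity.
    vel = (z - z_prev) / dt
    return kp * (z_goal - z) - kd * vel

# Hypothetical control loop (`env` and `encoder` are placeholders):
# z_goal = encoder(goal_image)
# z_prev = z = encoder(obs)
# for _ in range(100):
#     u = pid_step(z, z_prev, z_goal)
#     obs = env.step(u)
#     z_prev, z = z, encoder(obs)
```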
Feature extraction using MPEG-CDVS and Deep Learning with application to robotic navigation and image classification
The main contributions of this thesis are the evaluation of MPEG Compact Descriptors for Visual Search (CDVS) in the context of indoor robotic navigation and the introduction of a new method for training Convolutional Neural Networks, with applications to
object classification.
The choice of image descriptor for a visual navigation system is not straightforward. Visual descriptors must be distinctive enough to allow for correct localisation while still offering low matching complexity and a short descriptor size for real-time applications. MPEG Compact Descriptors for Visual Search (CDVS) is a low-complexity image descriptor that offers several levels of compromise between descriptor distinctiveness and size. In this work, we describe how these trade-offs can be used for efficient loop detection in a typical indoor environment. We first describe a probabilistic approach to loop detection based on the standard's suggested similarity metric. We then evaluate the performance of the CDVS compression modes in terms of matching speed, feature-extraction time, and storage requirements, and compare them with the state-of-the-art SIFT descriptor for five different types of indoor floors.
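The loop-detection logic can be sketched independently of the descriptor itself (CDVS extraction is not available in standard Python libraries, so descriptors are abstract vectors here): a loop is hypothesised when the current frame matches a sufficiently old frame above a similarity threshold. The probabilistic approach in the thesis would turn these scores into a loop-closure likelihood; the plain thresholding below is a simplification.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def detect_loop(history, current, min_gap=30, threshold=0.9):
    # Exclude the most recent frames so that trivial self-matches are ignored.
    candidates = history[:len(history) - min_gap]
    sims = [cosine(d, current) for d in candidates]
    if sims and max(sims) > threshold:
        return int(np.argmax(sims))  # index of the loop-closing past frame
    return None
```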
In the second part of this thesis we focus on the new paradigm in machine learning and computer vision called Deep Learning. Under this paradigm, visual features are no longer extracted using fine-grained, highly engineered feature extractors, but rather using a Convolutional Neural Network (CNN) that extracts hierarchical features learned directly from data, at the cost of long training periods.
In this context, we propose a method for speeding up the training of Convolutional Neural Networks (CNNs) by exploiting the spatial scaling property of convolutions. This is done by first training a CNN with kernels of smaller resolution for a few epochs, then rescaling its kernels to the target's original dimensions and continuing training at full resolution. We show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of convolutions during the early stages of learning. Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures, showing savings in training time of nearly 20% while test-set accuracy is preserved.
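The rescaling step can be sketched in PyTorch: the weights of a convolutional layer pre-trained with small kernels are spatially interpolated up to the target kernel size, after which training continues at full resolution. Layer sizes here are illustrative, and the paper's rescaling schedule is not reproduced.

```python
import torch
import torch.nn.functional as F

# A layer pre-trained with small (3x3) kernels for a few epochs.
small = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
# ... train `small` for a few epochs ...

# The target layer at full (5x5) kernel resolution.
target = torch.nn.Conv2d(3, 64, kernel_size=5, padding=2)
with torch.no_grad():
    # Conv weights have shape (out, in, kH, kW); bilinear interpolation over
    # the last two dims implements the spatial scaling of convolutions.
    target.weight.copy_(
        F.interpolate(small.weight, size=(5, 5), mode="bilinear",
                      align_corners=False)
    )
    target.bias.copy_(small.bias)
# ... continue training `target` at full kernel resolution ...
```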
Future Urban Scenes Generation Through Vehicles Synthesis
In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e. generating a synthetic representation of an object undergoing a geometric roto-translation in 3D space. Our model can easily be conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures starting from the same input in a multi-modal fashion. We show visually and quantitatively the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.
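A minimal sketch of per-object pose conditioning (illustrative layer sizes, not the paper's architecture): an encoded object crop is concatenated with the flattened 3D rotation and translation describing its future pose, and a decoder produces the novel view.

```python
import torch
import torch.nn as nn

class NovelViewGenerator(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(feat_dim),
        )
        # 9 rotation entries + 3 translation entries condition the decoder.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 12, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, crop, rotation, translation):
        z = self.encoder(crop)
        pose = torch.cat([rotation.flatten(1), translation], dim=1)
        return self.decoder(torch.cat([z, pose], dim=1))

gen = NovelViewGenerator()
out = gen(torch.rand(1, 3, 64, 64), torch.eye(3).unsqueeze(0), torch.zeros(1, 3))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```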
The e-Bike Motor Assembly: Towards Advanced Robotic Manipulation for Flexible Manufacturing
Robotic manipulation is currently undergoing a profound paradigm shift due to
the increasing needs for flexible manufacturing systems, and at the same time,
because of the advances in enabling technologies such as sensing, learning,
optimization, and hardware. This calls for robots that can observe and reason
about their workspace, and that are skillful enough to complete various
assembly processes in weakly structured settings. Moreover, it remains a great
challenge to enable operators to teach robots on-site, while managing the
inherent complexity of perception, control, motion planning and reaction to
unexpected situations. Motivated by real-world industrial applications, this
paper demonstrates the potential of such a paradigm shift in robotics on the
industrial case of an e-Bike motor assembly. The paper presents a concept for
teaching and programming adaptive robots on-site and demonstrates their
potential for the named applications. The framework includes: (i) a method to
teach perception systems on-site in a self-supervised manner, (ii) a general
representation of object-centric motion skills and force-sensitive assembly
skills, both learned from demonstration, (iii) a sequencing approach that
exploits a human-designed plan to perform complex tasks, and (iv) a system
solution for adapting and optimizing skills online. The aforementioned
components are interfaced through a four-layer software architecture that makes
our framework a tangible industrial technology. To demonstrate the generality
of the proposed framework, we provide, in addition to the motivating e-Bike
motor assembly, a further case study on dense box packing for logistics
automation.
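Component (iii) can be sketched as a simple sequencer that executes a human-designed plan as an ordered list of parameterised skills, with retries before escalating to the operator. The skill names and API below are hypothetical, not the paper's actual software interfaces.

```python
class Skill:
    def __init__(self, name, action):
        self.name, self.action = name, action

    def execute(self, context):
        print(f"executing skill: {self.name}")
        return self.action(context)  # returns True on success


def run_plan(plan, context, max_retries=2):
    # Execute the human-designed plan in order, retrying each skill before
    # aborting so the operator can intervene.
    for skill in plan:
        for attempt in range(max_retries + 1):
            if skill.execute(context):
                break
            print(f"  retry {attempt + 1} for {skill.name}")
        else:
            return False  # skill kept failing: escalate to the operator
    return True


plan = [
    Skill("locate_rotor", lambda ctx: True),
    Skill("grasp_rotor", lambda ctx: True),
    Skill("force_sensitive_insert", lambda ctx: True),
]
print("plan succeeded:", run_plan(plan, context={}))
```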
Knowledge Transfer for Human Trajectory Prediction
The human trajectory prediction task aims to forecast people's future movements given their past states, an important topic in several application domains such as socially-aware robots, intelligent tracking systems and self-driving cars.
The goal of this work is to transfer knowledge from scenes already seen in the training dataset to the test set.
To do so, we create a module that, for each human in the scene, extracts local features from a patch around the agent's current position.
To test its effectiveness, the proposed module was paired with both the SAR and Goal-SAR architectures.
To this end, the feature descriptors produced by our approach are concatenated with the features of the lightweight attention-based recurrent backbone, which acts solely on past observed positions, in both of the aforementioned architectures.
We conducted extensive experiments, training the model on the SDD dataset and testing it on the ETH dataset, to show the performance of our approach compared to the baseline.
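A minimal sketch of such a patch-feature module (layer sizes illustrative): a small CNN encodes a local crop of the scene around the agent's current position, and the resulting descriptor is concatenated with the backbone's features.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, out_dim=32, patch=16):
        super().__init__()
        self.patch = patch
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim),
        )

    def forward(self, scene, pos):
        # Crop a (patch x patch) window centred on the agent's pixel position.
        x, y = int(pos[0]), int(pos[1])
        half = self.patch // 2
        crop = scene[:, :, y - half:y + half, x - half:x + half]
        return self.net(crop)

encoder = PatchEncoder()
scene = torch.rand(1, 3, 128, 128)                # rasterised scene image
local = encoder(scene, pos=(64, 64))              # (1, 32) local descriptor
backbone_feat = torch.rand(1, 64)                 # stand-in for recurrent state
fused = torch.cat([backbone_feat, local], dim=1)  # fed to the prediction head
print(fused.shape)  # torch.Size([1, 96])
```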
- …