Robust density modelling using the Student's t-distribution for human action recognition
The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is highly sensitive to outliers. The Gaussian distribution is also often used as a base component of graphical models for recognising human actions in videos (hidden Markov models and others), and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM which uses mixtures of t-distributions as observation probabilities and show, through experiments on two well-known datasets (Weizmann, MuHAVi), a remarkable improvement in classification accuracy. © 2011 IEEE
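As a rough, self-contained illustration of the robustness argument above (not the paper's HMM, whose observation model mixes multiple t-components), the following sketch fits both a Gaussian and a Student's t-distribution to one-dimensional data contaminated with outliers; all values are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Clean 1-D feature samples plus a few gross outliers,
# mimicking noisy features extracted from video.
clean = rng.normal(loc=0.0, scale=1.0, size=200)
outliers = rng.normal(loc=12.0, scale=0.5, size=10)
data = np.concatenate([clean, outliers])

# Gaussian fit: mean and std are dragged towards the outliers.
mu, sigma = stats.norm.fit(data)

# Student's t fit: the heavy tails absorb the outliers, so the
# location/scale estimates stay close to the clean distribution.
df, loc, scale = stats.t.fit(data)

print(f"Gaussian fit:    mu={mu:.2f}, sigma={sigma:.2f}")
print(f"Student's t fit: df={df:.1f}, loc={loc:.2f}, scale={scale:.2f}")
```

The heavy tails of the t-distribution let it explain the outliers without dragging its location and scale estimates away from the clean data, which is the property the paper exploits in its HMM observation probabilities.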
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are now actively applied to broadcasting and multimedia processing. Research spans a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years these efforts have aimed at improving the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies for media creation, processing, editing, and scenario generation are important areas of research in multimedia processing and engineering. This book collects topics across advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including computer vision, speech/sound/text processing, and content analysis/information mining
Local features for view matching across independently moving cameras
Moving platforms, such as wearable and robotic cameras, need to recognise the same place observed from different viewpoints in order to collaboratively reconstruct a 3D scene and to support augmented reality or autonomous navigation. However, matching views is challenging for independently moving cameras that directly interact with each other, as severe geometric and photometric differences, such as viewpoint, scale, and illumination changes, can considerably decrease the matching performance. This thesis proposes novel, compact, local features that can cope with scale and viewpoint variations. We extract and describe an image patch at different scales of an image pyramid by comparing intensity values between learnt pixel pairs (binary tests), and employ a cross-scale distance when matching these features. We capture, at multiple scales, the temporal changes of a 3D point, as observed in the image sequence of a camera, by tracking local binary descriptors. After validating the feature-point trajectories through 3D reconstruction, we reduce, for each scale, the sequence of binary features to a compact, fixed-length descriptor that identifies the most frequent and the most stable binary tests over time. We then propose XC-PR, a cross-camera place recognition approach that stores locally, for each uncalibrated camera, spatio-temporal descriptors, extracted at a single scale, in a tree that is selectively updated as the camera moves. Cameras exchange descriptors selected from previous frames within an adaptive temporal window and with the highest number of local features corresponding to the descriptors. The other camera locally searches and matches the received descriptors to identify and geometrically validate a previously seen place. Experiments on different scenarios show the improved matching accuracy of the joint multi-scale extraction and temporal reduction, through comparisons with different temporal reduction strategies, with a cross-camera matching strategy based on Bag of Binary Words, and through the application to several binary descriptors. We also show that XC-PR achieves similar accuracy but is faster, on average, than a baseline consisting of an incremental list of spatio-temporal descriptors. Moreover, XC-PR achieves accuracy similar to a frame-based Bag of Binary Words approach adapted to our setting, while avoiding matching features that cannot be informative, e.g. for 3D reconstruction
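The thesis' descriptors build on binary tests, i.e. intensity comparisons between pixel pairs inside a patch. A minimal sketch of this BRIEF-style mechanism and its Hamming-distance matching follows; the pair layout here is random rather than learnt, and the cross-scale distance and temporal reduction of the thesis are not modelled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pixel-pair layout: each binary test compares the
# intensities of two pixels inside a patch (BRIEF-style).
PATCH = 32          # patch side in pixels
N_TESTS = 256       # descriptor length in bits
pairs_a = rng.integers(0, PATCH, size=(N_TESTS, 2))
pairs_b = rng.integers(0, PATCH, size=(N_TESTS, 2))

def binary_descriptor(patch: np.ndarray) -> np.ndarray:
    """Describe a patch by N_TESTS intensity comparisons (binary tests)."""
    a = patch[pairs_a[:, 0], pairs_a[:, 1]]
    b = patch[pairs_b[:, 0], pairs_b[:, 1]]
    return (a < b).astype(np.uint8)

def hamming(d1: np.ndarray, d2: np.ndarray) -> int:
    """Hamming distance between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))

# Toy usage: the same patch, slightly brightened, still matches closely,
# since intensity comparisons are largely invariant to additive shifts.
patch = rng.integers(0, 256, size=(PATCH, PATCH)).astype(np.float32)
d_ref = binary_descriptor(patch)
d_bright = binary_descriptor(np.clip(patch + 10, 0, 255))
print("distance to brightened copy:", hamming(d_ref, d_bright))  # small
```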
Signal Processing on Textured Meshes
In this thesis we extend signal processing techniques originally formulated in the context of image processing to techniques that can be applied to signals on arbitrary triangle meshes.
We develop methods for the two most common representations of signals on triangle meshes: signals sampled at the vertices of a finely tessellated mesh, and signals mapped to a coarsely tessellated mesh through texture maps.
Our first contribution is the combination of Lagrangian Integration and the Finite Element Method in the formulation of two signal processing tasks: Shock Filters for texture and geometry sharpening, and Optical Flow for texture registration.
Our second contribution is the formulation of Gradient-Domain processing within the texture atlas. We define a function space that handles chart discontinuities, and linear operators that capture the metric distortion introduced by the parameterization.
Our third contribution is the construction of a spatiotemporal atlas parameterization for evolving meshes. Our method introduces localized remeshing operations and a compact parameterization that improves geometry and texture video compression. We show temporally coherent signal processing using partial correspondences
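As a small illustration of processing a signal sampled at mesh vertices, the sketch below diffuses a per-vertex signal with the uniform (umbrella) graph Laplacian; the thesis' actual operators are FEM-based (e.g. cotangent weights), so this is only a structural stand-in:

```python
import numpy as np

def umbrella_laplacian_smooth(signal, faces, n_vertices, lam=0.5, iters=20):
    """Diffuse a per-vertex signal with the uniform (umbrella) graph
    Laplacian; a stand-in for the FEM (cotangent) operators a real
    mesh-processing pipeline would use."""
    # Build vertex adjacency from triangle faces.
    neighbors = [set() for _ in range(n_vertices)]
    for i, j, k in faces:
        neighbors[i].update((j, k))
        neighbors[j].update((i, k))
        neighbors[k].update((i, j))
    s = np.asarray(signal, dtype=float).copy()
    for _ in range(iters):
        avg = np.array([s[list(nb)].mean() if nb else s[v]
                        for v, nb in enumerate(neighbors)])
        s += lam * (avg - s)  # explicit diffusion step
    return s

# Toy usage: a noisy signal on a tetrahedron's four vertices.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
noisy = [1.0, -0.5, 0.8, -0.2]
print(umbrella_laplacian_smooth(noisy, faces, n_vertices=4))
```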
Machine learning based small bowel video capsule endoscopy analysis: Challenges and opportunities
Video capsule endoscopy (VCE) is a revolutionary technology for the early diagnosis of gastric disorders. However, owing to the high redundancy and subtle manifestation of anomalies among thousands of frames, the manual interpretation of VCE videos requires considerable patience, focus, and time. The automatic analysis of these videos using computational methods is a challenge, as the capsule moves in an uncontrolled way and captures frames indiscriminately. Several machine learning (ML) methods, including recent deep convolutional neural network approaches, have been adopted after evaluating their potential to improve VCE analysis. However, the clinical impact of these methods is yet to be investigated. This survey aimed to highlight the gaps between existing ML-based research methodologies and clinically significant rules recently established by gastroenterologists for VCE. A framework for interpreting raw frames into contextually relevant frame-level findings and subsequently merging these findings with meta-data to obtain a disease-level diagnosis was formulated. Frame-level findings can be more intelligible for discriminative learning when organized in a taxonomical hierarchy. The proposed taxonomical hierarchy, which is formulated based on pathological and visual similarities, may yield better classification metrics by setting inference classes at a higher level than training classes. Mapping from the frame level to the disease level was structured in the form of a graph based on clinical relevance, inspired by the recent international consensus developed by domain experts. Furthermore, existing methods for VCE summarization, classification, segmentation, detection, and localization were critically evaluated and compared based on aspects deemed significant by clinicians. Numerous studies pertain to single-anomaly detection instead of taking a pragmatic approach suited to a clinical setting. The challenges and opportunities associated with VCE analysis were delineated. A focus on maximizing the discriminative power of features corresponding to various subtle lesions and anomalies may help cope with the diverse and mimicking nature of different VCE frames. Large multicenter datasets must be created to cope with data sparsity, bias, and class imbalance. Explainability, reliability, traceability, and transparency are important for an ML-based diagnostic system in VCE. Existing ethical and legal bindings narrow the scope of possibilities where ML can potentially be leveraged in healthcare. Despite these limitations, ML-based video capsule endoscopy will revolutionize clinical practice, aiding clinicians in rapid and accurate diagnosis
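One concrete consequence of the proposed taxonomical hierarchy is that leaf-level (training) scores can be aggregated into a decision at a coarser, clinically meaningful level. The sketch below illustrates this with an entirely hypothetical two-level taxonomy and made-up class names and probabilities:

```python
import numpy as np

# Hypothetical two-level taxonomy: fine-grained training classes
# grouped under coarser, clinically meaningful parents.
TAXONOMY = {
    "vascular":     ["angiectasia", "varices"],
    "inflammatory": ["erosion", "ulcer", "erythema"],
    "normal":       ["normal_mucosa"],
}
LEAVES = [leaf for kids in TAXONOMY.values() for leaf in kids]

def infer_parent(leaf_probs: np.ndarray) -> str:
    """Aggregate leaf-level softmax scores to a parent-level decision:
    a parent's score is the sum of its children's probabilities."""
    scores = {parent: sum(leaf_probs[LEAVES.index(c)] for c in kids)
              for parent, kids in TAXONOMY.items()}
    return max(scores, key=scores.get)

# A frame where the model hesitates between 'erosion' and 'ulcer':
# ambiguous at leaf level, but confidently 'inflammatory' one level up.
probs = np.array([0.05, 0.05, 0.35, 0.40, 0.05, 0.10])
print(infer_parent(probs))  # -> inflammatory
```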
Inverse problem theory in shape and action modeling
In this thesis we consider shape and action modeling problems from the perspective of inverse problem theory. Inverse problem theory provides a mathematical framework for solving model parameter estimation problems. Inverse problems are typically ill-posed, which makes their solution challenging. Regularization theory and Bayesian statistical methods, both developed in the context of inverse problem theory, provide suitable tools for dealing with ill-posed problems.
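As a minimal numerical illustration of why regularization helps with ill-posedness (not an example from the thesis), the sketch below solves a noisy linear inverse problem with and without a Tikhonov penalty, using the closed form x = (AᵀA + λI)⁻¹Aᵀb:

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned forward operator A (nearly collinear columns):
# naive least squares amplifies noise, Tikhonov regularization tames it.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x_true = np.array([1.0, 2.0])
b = A @ x_true + 1e-3 * rng.standard_normal(2)

# Unregularized solution: x = argmin ||Ax - b||^2
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# Tikhonov-regularized solution:
#   x = argmin ||Ax - b||^2 + lam * ||x||^2 = (A^T A + lam I)^-1 A^T b
lam = 1e-2
x_tik = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)

print("least squares:", x_ls)   # wildly off due to ill-conditioning
print("Tikhonov:     ", x_tik)  # a stable, plausible solution
```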
Regarding the application of inverse problem theory to shape and action modeling, we first discuss the problem of saliency prediction, considering a model proposed by the coherence theory of attention. According to coherence theory, salient regions emerge via proto-objects, which we model using harmonic functions (thin membranes). We also discuss the modeling of the 3D scene, as it is fundamental for extracting suitable scene features, which guide the generation of proto-objects.
The next application we consider is the problem of image fusion. In this context, we propose a variational image fusion framework based on confidence-driven total variation regularization, and we consider its application to the problem of depth image fusion, which is an important step in the dense 3D scene reconstruction pipeline.
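A hedged, one-dimensional sketch of the idea, not the thesis' formulation: fuse several noisy depth profiles by minimizing a confidence-weighted data term plus a smoothed total-variation penalty via gradient descent; all parameter values are illustrative:

```python
import numpy as np

def fuse_tv(depths, confidences, lam=0.5, eps=1e-3, step=0.1, iters=500):
    """Fuse several noisy 1-D depth profiles into one, by gradient descent on
        E(u) = sum_k c_k * (u - d_k)^2 + lam * TV_eps(u),
    a smoothed total-variation energy with per-pixel confidences c_k."""
    d = np.asarray(depths, dtype=float)       # (K, N) depth hypotheses
    c = np.asarray(confidences, dtype=float)  # (K, N) confidences
    u = (c * d).sum(axis=0) / c.sum(axis=0)   # confidence-weighted init
    for _ in range(iters):
        data_grad = 2.0 * (c * (u - d)).sum(axis=0)
        du = np.diff(u)
        tv_term = du / np.sqrt(du**2 + eps)   # derivative of smoothed |du|
        tv_grad = np.zeros_like(u)
        tv_grad[:-1] -= tv_term
        tv_grad[1:] += tv_term
        u -= step * (data_grad + lam * tv_grad)
    return u

# Toy usage: two noisy observations of a depth step edge, one more trusted.
x = np.linspace(0, 1, 100)
truth = np.where(x < 0.5, 1.0, 2.0)
rng = np.random.default_rng(0)
obs = [truth + 0.1 * rng.standard_normal(100) for _ in range(2)]
conf = [np.full(100, 1.0), np.full(100, 0.2)]
fused = fuse_tv(obs, conf)  # denoised step with a preserved edge
```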
The third problem we address concerns action modeling, in particular the recognition of human actions from 3D data. Here, we employ a Bayesian non-parametric model to capture the idiosyncratic motions of the different body parts. Recognition is achieved by comparing the motion behaviors of the subject to a dictionary of behaviors for each action, learned from examples collected from other subjects.
Next, we consider the 3D modeling of articulated objects from images taken from the web, with application to the 3D modeling of animals. By decomposing the full object into rigid components and by considering different aspects of these components, we model the object up through this hierarchy in order to obtain a 3D model of the entire object. Single-view 3D modeling, as well as model registration, is performed based on regularization methods.
The last problem we consider is the modeling of 3D specular (non-Lambertian) surfaces from a single image. To solve this challenging problem, we propose a Bayesian non-parametric model for estimating the normal field of the surface from its appearance, by identifying the material of the surface. After computing an initial model of the surface, we regularize its normal field, considering also a photo-consistency constraint, in order to estimate the final shape of the surface.
Finally, we conclude this thesis by summarizing the most significant results and by suggesting future directions for the application of inverse problem theory to challenging computer vision problems, such as the ones encountered in this work
Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics
This paper proposes a method to enhance video object detection in indoor environments for robotics. Concretely, it exploits knowledge of the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, to propose regions of interest in which to find objects, and recursive Bayesian filtering, to integrate observations over time. The proposal is evaluated on six virtual indoor environments, accounting for the detection of nine object classes over a total of ∼7k frames. Results show that our proposal improves the recall and the F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction of the object categorization entropy (58.8%) when compared to a two-stage video object detection method used as a baseline, at the cost of small time overheads (120 ms) and precision loss (0.92)
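The two ingredients named above, homography-based region propagation and recursive Bayesian integration of class evidence, can be sketched in a few lines; the homography, boxes, and detector scores below are all hypothetical:

```python
import numpy as np

def propagate_box(H, box):
    """Map a bounding box from frame t to frame t+1 with a planar
    homography H (3x3), by warping its corners and taking their extent."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1], [x2, y1, 1], [x2, y2, 1], [x1, y2, 1]],
                       dtype=float).T
    warped = H @ corners
    warped = warped[:2] / warped[2]              # dehomogenise
    return (*warped.min(axis=1), *warped.max(axis=1))

def bayes_update(prior, likelihood):
    """Recursive Bayesian update of per-class probabilities for a
    tracked object, fusing new detector scores with the prior."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Toy usage: a small camera translation and two detector readings.
H = np.array([[1, 0, 5], [0, 1, 2], [0, 0, 1]], dtype=float)
print(propagate_box(H, (10, 10, 50, 40)))        # shifted region of interest

belief = np.full(3, 1 / 3)                       # uniform over 3 classes
for scores in ([0.6, 0.3, 0.1], [0.7, 0.2, 0.1]):
    belief = bayes_update(belief, np.array(scores))
print(belief)                                     # entropy shrinks over time
```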
Handbook of Digital Face Manipulation and Detection
This open access book provides the first comprehensive collection of studies dealing with the hot topic of digital face manipulation, such as DeepFakes, Face Morphing, or Reenactment. It combines the research fields of biometrics and media forensics, including contributions from academia and industry. Appealing to a broad readership, introductory chapters provide a comprehensive overview of the topic and address readers wishing to gain a brief overview of the state of the art. Subsequent chapters, which delve deeper into various research challenges, are oriented towards advanced readers. Moreover, the book provides a good starting point for young researchers as well as a reference guide pointing to further literature. Hence, the primary readership comprises academic institutions and industry currently involved in digital face manipulation and detection. The book could easily be used as a recommended text for courses in image processing, machine learning, media forensics, biometrics, and the general security area
Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions
This dissertation addresses the problem of learning video representations, defined here as transforming a video so that its essential structure is made more visible or accessible for action recognition and quantification. In the literature, a video can be represented by a set of images, by modeling motion or temporal dynamics, and by a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize and assess actions, namely (1) image-set models via aggregating subset features given by regularizing normalized CNNs, (2) image-set models via inter-frame principal recovery and sparsely coding residual actions, (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection with a motion model added, (4) spatiotemporal models, 3D graphs and 3D CNNs, that model time as a space dimension, and (5) supervised hashing by jointly learning embedding and quantization. State-of-the-art performances are achieved for tasks such as quantifying facial pain and human diving. Primary conclusions of this dissertation are as follows: (i) an image set can capture facial actions that are about collective representation; (ii) sparse and low-rank representations can untangle expression, identity and pose cues, and can be learned via an image-set model and also a linear model; (iii) norm is related to recognizability; similarity metrics and loss functions matter; (iv) combining the MIL-based boosting tracker with the Particle Filter motion model induces a good trade-off between appearance similarity and motion consistency; (v) segmenting an object locally makes it amenable to assigning shape priors, and it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (vi) representing videos as 3D graphs works locally in both space and time, and 3D CNNs work effectively when input with temporally meaningful clips; (vii) rich labeled images or videos help to learn better hash functions, after learning binary embedded codes, than random projections. In addition, models proposed for videos can be adapted to other sequential images, such as volumetric medical images, which are not included in this dissertation
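As a hedged aside on item (vii), the random-projection baseline that learned hash functions are compared against can be sketched as follows; bit length, dimensions, and data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-projection (LSH-style) hashing: the sign of random hyperplane
# projections gives a binary code whose Hamming distance approximates
# angular similarity. Learned hash functions replace R with an embedding
# optimized on labeled data.
def random_hash(X, n_bits=16, R=None):
    if R is None:
        R = rng.standard_normal((X.shape[1], n_bits))
    return (X @ R > 0).astype(np.uint8), R

# Toy usage: near-duplicate feature vectors get nearby codes.
x = rng.standard_normal(64)
codes, R = random_hash(np.stack([x,
                                 x + 0.05 * rng.standard_normal(64),
                                 rng.standard_normal(64)]))
print("near pair:", np.count_nonzero(codes[0] != codes[1]))  # few bits differ
print("far pair: ", np.count_nonzero(codes[0] != codes[2]))  # many bits differ
```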