On Deep Machine Learning Methods for Anomaly Detection within Computer Vision
This thesis concerns deep learning approaches for anomaly detection in images. Anomaly detection addresses how to find any kind of pattern that differs from the regularities found in normal data, and it is receiving increasing attention in deep learning research. This is due in part to its wide set of potential applications, ranging from automated CCTV surveillance to quality control across a range of industries. We introduce three original methods for anomaly detection applicable to two specific deployment scenarios. In the first, we detect anomalous activity in potentially crowded scenes through imagery captured via CCTV or other video recording devices. In the second, we segment defects in textures and demonstrate use cases representative of automated quality inspection on industrial production lines. In the context of detecting anomalous activity in scenes, we take an existing state-of-the-art method and introduce several enhancements, including the use of a region proposal network for region extraction and a more information-preserving feature preprocessing strategy. This results in a simpler method that is significantly faster and suitable for real-time application. In addition, the increased efficiency facilitates building higher-dimensional models capable of improved anomaly detection performance, which we demonstrate on the pedestrian-based UCSD Ped2 dataset. In the context of texture defect detection, we introduce a method based on the idea of texture restoration that surpasses all state-of-the-art methods on the texture classes of the challenging MVTecAD dataset. In the same context, we additionally introduce a method that utilises transformer networks for future pixel and feature prediction. This novel method performs competitive anomaly detection on most of the challenging MVTecAD texture classes and illustrates both the promise and the limitations of state-of-the-art deep learning transformers for the task of texture anomaly detection.
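Restoration-based defect segmentation of the kind described above reduces to a simple scoring rule: restore the input with a model trained only on normal textures, then treat the per-pixel restoration error as an anomaly map. A minimal sketch (the `restore` callable stands in for any trained restoration network; the threshold value is an illustrative assumption):

```python
import numpy as np

def anomaly_map(image, restore, threshold=0.1):
    """Score each pixel by how poorly a restoration model reproduces it.

    `restore` is any callable mapping an image to its "normal"
    reconstruction; defective regions are assumed to restore badly.
    """
    restored = restore(image)
    errors = (image - restored) ** 2          # per-pixel squared error
    mask = errors > threshold                 # binary defect segmentation
    score = float(errors.max())               # image-level anomaly score
    return errors, mask, score

# Toy example: the "model" memorises a flat texture, so the single
# injected defect pixel produces a large restoration error.
clean = np.zeros((4, 4))
defective = clean.copy()
defective[2, 2] = 1.0                         # injected defect
errors, mask, score = anomaly_map(defective, restore=lambda x: clean)
```

Thresholding the error map gives the segmentation, while the maximum error gives an image-level score for detection benchmarks such as MVTecAD.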
Low-Shot Learning for the Semantic Segmentation of Remote Sensing Imagery
Deep-learning frameworks have made remarkable progress thanks to the creation of large annotated datasets such as ImageNet, which has over one million training images. Although this works well for color (RGB) imagery, labeled datasets for other sensor modalities (e.g., multispectral and hyperspectral) are minuscule in comparison. This is because annotated datasets are expensive and labor-intensive to build, and since repeating this effort for each type of sensor would be impractical, current state-of-the-art approaches in computer vision are not ideal for remote sensing problems. The shortage of annotated remote sensing imagery beyond the visual spectrum has forced researchers to embrace unsupervised feature-extraction frameworks. These features are learned on a per-image basis, so they tend not to generalize well across other datasets. In this dissertation, we propose three new strategies for learning feature-extraction frameworks with only a small quantity of annotated image data: 1) self-taught feature learning, 2) domain adaptation with synthetic imagery, and 3) semi-supervised classification. "Self-taught" feature learning frameworks are trained with large quantities of unlabeled imagery, and these networks then extract spatial-spectral features from annotated data for supervised classification. Synthetic remote sensing imagery can be used to bootstrap a deep convolutional neural network, which we can then fine-tune with real imagery. Semi-supervised classifiers prevent overfitting by jointly optimizing the supervised classification task alongside one or more unsupervised learning tasks (e.g., reconstruction). Although obtaining large quantities of annotated image data would be ideal, our work shows that we can make do with less cost-prohibitive methods that are more practical for the end user.
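The semi-supervised strategy above amounts to a joint objective: supervised cross-entropy on the few labeled samples plus an unsupervised reconstruction term that regularises the shared features. A minimal numpy sketch of that combined loss (the weighting `alpha` and all names are illustrative, not the dissertation's actual formulation):

```python
import numpy as np

def semi_supervised_loss(logits, labels, reconstruction, inputs, alpha=0.5):
    """Joint objective: supervised cross-entropy on the labeled batch
    plus an unsupervised reconstruction term on every sample.

    The reconstruction task discourages the shared features from
    overfitting the small labeled set.
    """
    # Supervised term: numerically stable softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Unsupervised term: mean squared reconstruction error.
    mse = ((reconstruction - inputs) ** 2).mean()
    return ce + alpha * mse

# Toy batch: 2 labeled samples, 3 classes, perfect reconstruction,
# so the loss reduces to the (small) cross-entropy term.
logits = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
x = np.ones((2, 8))
loss = semi_supervised_loss(logits, labels, x, x)
```

In practice both terms are backpropagated through a shared encoder; the sketch only shows how the two losses combine.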
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
We present a new method for synthesizing high-resolution photo-realistic
images from semantic label maps using conditional generative adversarial
networks (conditional GANs). Conditional GANs have enabled a variety of
applications, but the results are often limited to low-resolution and still far
from realistic. In this work, we generate 2048x1024 visually appealing results
with a novel adversarial loss, as well as new multi-scale generator and
discriminator architectures. Furthermore, we extend our framework to
interactive visual manipulation with two additional features. First, we
incorporate object instance segmentation information, which enables object
manipulations such as removing/adding objects and changing the object category.
Second, we propose a method to generate diverse results given the same input,
allowing users to edit the object appearance interactively. Human opinion
studies demonstrate that our method significantly outperforms existing methods,
advancing both the quality and the resolution of deep image synthesis and
editing.

Comment: v2: CVPR camera ready, adding more results for the edge-to-photo example.
Physics-Informed Computer Vision: A Review and Perspectives
Incorporating physical information into machine learning frameworks is opening up and transforming many application domains. Here the learning process is augmented through the induction of fundamental knowledge and governing physical laws. In this work we explore their utility for computer vision tasks in interpreting and understanding visual data. We present a systematic literature review of formulations and approaches to computer vision tasks guided by physical laws. We begin by decomposing the popular computer vision pipeline into a taxonomy of stages and investigate approaches to incorporating governing physical equations at each stage. Existing approaches in each task are analyzed with regard to which governing physical processes are modeled, how they are formulated, and how they are incorporated, i.e. by modifying data (observation bias), modifying networks (inductive bias), or modifying losses (learning bias). The taxonomy offers a unified view of the application of physics-informed capabilities, highlighting where physics-informed learning has been conducted and where the gaps and opportunities are. Finally, we highlight open problems and challenges to inform future research. While still in its early days, the study of physics-informed computer vision holds promise for developing better computer vision models that improve physical plausibility, accuracy, data efficiency, and generalization in increasingly realistic applications.
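Of the three incorporation routes the review distinguishes, the learning-bias route (modifying losses) is the most compact to illustrate: the data-fitting loss is augmented with a penalty on the residual of a governing equation. A hedged sketch for free fall, d²y/dt² = −g, with the residual estimated by finite differences (the weighting `lam` and all names are illustrative):

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def physics_informed_loss(y_pred, y_obs, t, lam=0.1):
    """Learning-bias formulation: data loss plus a penalty on the
    residual of the governing ODE d^2y/dt^2 = -g, estimated with a
    central finite difference on the predicted trajectory."""
    data_loss = ((y_pred - y_obs) ** 2).mean()
    dt = t[1] - t[0]
    # Central second difference approximates the acceleration.
    accel = (y_pred[2:] - 2.0 * y_pred[1:-1] + y_pred[:-2]) / dt**2
    physics_residual = ((accel + G) ** 2).mean()
    return data_loss + lam * physics_residual

# A trajectory that exactly satisfies free fall has zero residual,
# so a perfect prediction yields (numerically) zero total loss.
t = np.linspace(0.0, 1.0, 11)
y_true = 10.0 - 0.5 * G * t**2
loss = physics_informed_loss(y_true, y_true, t)
```

Observation bias and inductive bias would instead modify the training data and the network architecture, respectively; only the loss term changes here.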
Deep Learning in EEG: Advance of the Last Ten-Year Critical Period
Deep learning has achieved excellent performance in a wide range of domains, especially in speech recognition and computer vision. Relatively less work has been done for EEG, yet significant progress has been attained over the last decade. Given the lack of a comprehensive survey with wide topical coverage of deep learning in EEG, we attempt to summarize recent progress to provide an overview, as well as perspectives for future developments. We first briefly discuss artifact removal for EEG signals and then introduce the deep learning models that have been utilized in EEG processing and classification. Subsequently, the applications of deep learning in EEG are reviewed by categorizing them into groups such as brain-computer interfaces, disease detection, and emotion recognition. This is followed by a discussion in which the pros and cons of deep learning are presented, and future directions and challenges for deep learning in EEG are proposed. We hope that this paper can serve as a summary of past work on deep learning in EEG and as a starting point for further developments and achievements in EEG studies based on deep learning.
Conditional generative modeling for images, 3D animations, and video
Generative modeling for computer vision has shown immense progress in the last few years, revolutionizing the way we perceive, understand, and manipulate visual data. This rapidly evolving field has witnessed advancements in image generation, 3D animation, and video prediction that unlock diverse applications across multiple fields including entertainment, design, healthcare, and education. As the demand for sophisticated computer vision systems continues to grow, this dissertation attempts to drive innovation in the field by exploring novel formulations of conditional generative models, and innovative applications in images, 3D animations, and video.
Our research focuses on architectures that offer reversible transformations of noise and visual data, and the application of encoder-decoder architectures for generative tasks and 3D content manipulation. In all instances, we incorporate conditional information to enhance the synthesis of visual data, improving the efficiency of the generation process as well as the generated content.
Prior successful generative techniques that are reversible between noise and data include normalizing flows and denoising diffusion models. The continuous variant of normalizing flows is powered by Neural Ordinary Differential Equations (Neural ODEs) and has shown some success in modeling the real image distribution. However, these models often involve a huge number of parameters and long training times. Denoising diffusion models have recently gained huge popularity for their generalization capabilities, especially in text-to-image applications.
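The Neural ODEs mentioned above define a state as the solution of dx/dt = f(x, t), where f is a learned network and sampling means integrating the ODE numerically. A minimal fixed-step Euler sketch, with a toy linear function standing in for the trained network (all names are illustrative):

```python
import numpy as np

def odeint_euler(f, x0, t0, t1, steps=100):
    """Integrate dx/dt = f(x, t) with fixed-step Euler.

    In a Neural ODE, f would be a trained neural network; here it is
    a placeholder callable. Real implementations use adaptive solvers
    and the adjoint method for memory-efficient backpropagation.
    """
    x, t = np.asarray(x0, dtype=float), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * f(x, t)   # one Euler step
        t += dt
    return x

# Toy dynamics dx/dt = -x decays toward zero; exact solution x0 * e^{-t}.
x1 = odeint_euler(lambda x, t: -x, x0=[1.0], t0=0.0, t1=1.0, steps=1000)
```

With enough steps the numerical solution tracks the analytic decay closely, which is the property continuous normalizing flows rely on for invertibility.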
In this dissertation, we introduce the use of Neural ODEs to model video dynamics using an encoder-decoder architecture, demonstrating their ability to predict future video frames despite being trained solely to reconstruct current frames. In our next contribution, we propose a conditional variant of continuous normalizing flows that enables higher-resolution image generation based on lower-resolution input. This allows us to achieve comparable image quality to regular normalizing flows, while significantly reducing the number of parameters and training time.
Our next contribution focuses on a flexible encoder-decoder architecture for accurate estimation and editing of full 3D human pose. We present a comprehensive pipeline that takes human images as input, automatically aligns a user-specified 3D human/non-human character with the pose of the human, and facilitates pose editing based on partial input information.
We then proceed to use denoising diffusion models for image and video generation. Regular diffusion models use a Gaussian process to add noise to clean images. In our next contribution, we derive the relevant mathematical details for denoising diffusion models that use non-isotropic Gaussian processes, present non-isotropic noise formulations, and show that the quality of generated images is comparable with the original formulation. In our final contribution, we devise a novel framework building on denoising diffusion models that is capable of solving all three video tasks of prediction, generation, and interpolation. We perform ablation studies using this framework and show state-of-the-art results on multiple datasets.
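The forward (noising) process referenced above admits a closed form: under the standard DDPM parameterisation, x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε with ε Gaussian. Giving ε a per-dimension variance makes the noise non-isotropic. A sketch of this closed-form sampling (a diagonal covariance is an illustrative assumption, not necessarily the dissertation's exact construction):

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, sigma_diag, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    Standard DDPM uses sigma_diag = 1 everywhere (isotropic noise);
    the non-isotropic variant simply scales the noise per dimension.
    """
    eps = rng.standard_normal(x0.shape) * np.sqrt(sigma_diag)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
x0 = np.zeros(100_000)
# Dimension-wise noise scale: first half quiet, second half loud.
sigma = np.concatenate([np.full(50_000, 0.25), np.full(50_000, 4.0)])
xt = forward_diffuse(x0, alpha_bar_t=0.5, sigma_diag=sigma, rng=rng)
```

The sampled variance in each half is (1−ᾱ_t) times the chosen per-dimension scale, which is exactly the knob a non-isotropic formulation exposes.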
Our contributions are published articles at peer-reviewed venues. Overall, our research aims to make a meaningful contribution to the pursuit of more efficient and flexible generative models, with the potential to shape the future of computer vision.