236 research outputs found

    Face Image and Video Analysis in Biometrics and Health Applications

    Computer Vision (CV) enables computers and systems to derive meaningful information from acquired visual inputs, such as images and videos, and make decisions based on the extracted information. Its goal is to acquire, process, analyze, and understand the information by developing a theoretical and algorithmic model. Biometrics are distinctive and measurable human characteristics used to label or describe individuals by combining computer vision with knowledge of human physiology (e.g., face, iris, fingerprint) and behavior (e.g., gait, gaze, voice). The face is one of the most informative biometric traits. Many studies have investigated the human face from the perspectives of various disciplines, ranging from computer vision and deep learning to neuroscience and biometrics. In this work, we analyze face characteristics from digital images and videos in the areas of morphing attack and defense, and autism diagnosis. For face morphing attack generation, we proposed a transformer-based generative adversarial network to generate more visually realistic morphing attacks by combining different losses, such as face matching distance, facial landmark-based loss, perceptual loss and pixel-wise mean square error. In the face morphing attack detection study, we designed a fusion-based few-shot learning (FSL) method to learn discriminative features from face images for few-shot morphing attack detection (FS-MAD), and extended the current binary detection into multiclass classification, namely, few-shot morphing attack fingerprinting (FS-MAF). In the autism diagnosis study, we developed a discriminative few-shot learning method to analyze hour-long video data and explored the fusion of facial dynamics for facial trait classification of autism spectrum disorder (ASD) in three severity levels. The results show outstanding performance of the proposed fusion-based few-shot framework on the dataset.
We further explored the possibility of performing face micro-expression spotting and feature analysis on autism video data to classify ASD and control groups. The results indicate the effectiveness of subtle facial expression changes for autism diagnosis.

    Time-slice analysis of dyadic human activity

    Recognizing human activities from video data is routinely leveraged for surveillance and human-computer interaction applications. The main focus has been classifying videos into one of k action classes from fully observed videos. However, intelligent systems must make decisions under uncertainty and based on incomplete information. This need motivates us to introduce the problem of analysing the uncertainty associated with human activities and to move to a new level of generality in the action analysis problem. We also present the problem of time-slice activity recognition, which aims to explore human activity at a small temporal granularity. Time-slice recognition is able to infer human behaviours from a short temporal window. It has been shown that temporal slice analysis is helpful for motion characterization and for video content representation in general. These studies motivate us to consider time-slices for analysing the uncertainty associated with human activities.
We report to what degree of certainty each activity is occurring throughout the video, from definitely not occurring to definitely occurring. In this research, we propose three frameworks for time-slice analysis of dyadic human activity under uncertainty. i) We present a new family of spatio-temporal descriptors which are optimized for early prediction with time-slice action annotations. Our predictive spatio-temporal interest point (Predict-STIP) representation is based on the intuition of temporal contingency between time-slices. ii) We exploit state-of-the-art techniques to extract interest points in order to represent time-slices. We also present an accumulative uncertainty to depict the uncertainty associated with partially observed videos for the task of early activity recognition. iii) We use Convolutional Neural Network-based unary and pairwise relations between human body joints in each time-slice. The unary term captures the local appearance of the joints, while the pairwise term captures the local contextual relations between the parts. We extract these features from each frame in a time-slice and examine different temporal aggregations to generate a descriptor for the whole time-slice. Furthermore, we create a novel dataset which is annotated at multiple short temporal windows, allowing the modelling of the inherent uncertainty in time-slice activity recognition. All three methods have been evaluated on the TAP dataset. Experimental results demonstrate the effectiveness of our framework in the analysis of dyadic activities under uncertainty.
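The abstract mentions aggregating per-frame features into one descriptor for the whole time-slice but does not say which aggregations were examined. The following is a minimal sketch of two common choices (mean and max pooling over frames), assuming each frame is already described by a fixed-length feature vector. The function name `aggregate_time_slice` and the pooling modes are illustrative assumptions, not the thesis's stated method.

```python
from typing import List, Sequence


def aggregate_time_slice(
    frame_features: Sequence[Sequence[float]], mode: str = "mean"
) -> List[float]:
    """Pool per-frame feature vectors into one time-slice descriptor.

    Hypothetical helper: `mode` selects element-wise mean or max pooling
    across the frames of a single time-slice.
    """
    if not frame_features:
        raise ValueError("a time-slice needs at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must have the same feature length")

    # Transpose so each tuple holds one feature dimension across all frames.
    columns = list(zip(*frame_features))
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown aggregation mode: {mode}")
```

Mean pooling smooths over the slice, whereas max pooling keeps the strongest response per feature; which works better for short temporal windows is an empirical question.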

    Predictive Coding Theories of Cortical Function

    Predictive coding is a unifying framework for understanding perception, action and neocortical organization. In predictive coding, different areas of the neocortex implement a hierarchical generative model of the world that is learned from sensory inputs. Cortical circuits are hypothesized to perform Bayesian inference based on this generative model. Specifically, the Rao-Ballard hierarchical predictive coding model assumes that the top-down feedback connections from higher to lower order cortical areas convey predictions of lower-level activities. The bottom-up, feedforward connections in turn convey the errors between top-down predictions and actual activities. These errors are used to correct current estimates of the state of the world and generate new predictions. Through the objective of minimizing prediction errors, predictive coding provides a functional explanation for a wide range of neural responses and many aspects of brain organization.
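The predict, compare and correct loop described above can be illustrated with a heavily simplified sketch: a single level with a fixed linear generative model, no priors and no weight learning. This is written in the spirit of the error-correction idea only and is not the full Rao-Ballard hierarchy. The names `predict` and `infer_state`, the step size and the iteration count are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def predict(weights: Matrix, state: Vector) -> Vector:
    """Top-down prediction of the lower-level activity: weights @ state."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def infer_state(
    weights: Matrix, observation: Vector, steps: int = 200, rate: float = 0.1
) -> Vector:
    """Estimate the hidden state by repeatedly reducing the prediction error."""
    state = [0.0] * len(weights[0])
    for _ in range(steps):
        # Bottom-up signal: mismatch between the input and the prediction.
        prediction = predict(weights, state)
        error = [o - p for o, p in zip(observation, prediction)]
        # Correct the estimate by passing the error back through the
        # transposed weights (gradient descent on the squared error).
        for j in range(len(state)):
            state[j] += rate * sum(
                weights[i][j] * error[i] for i in range(len(error))
            )
    return state
```

With a diagonal generative model such as `[[1.0, 0.0], [0.0, 2.0]]` and the observation `[3.0, 4.0]`, the estimate settles near `[3.0, 2.0]`, the state whose prediction matches the input.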

    Vision-based human action recognition using machine learning techniques

    The focus of this thesis is on automatic recognition of human actions in videos. Human action recognition is defined as automatic understanding of what actions are performed by a human in a video. This is a difficult problem due to many challenges including, but not limited to, variations in human shape and motion, occlusion, cluttered background, moving cameras, illumination conditions, and viewpoint variations. To start with, the most popular and prominent state-of-the-art techniques are reviewed, evaluated, compared, and presented. Based on the literature review, these techniques are categorized into handcrafted feature-based and deep learning-based approaches. The proposed action recognition framework is then based on these handcrafted and deep learning-based techniques, which are adopted throughout the thesis by embedding novel algorithms for action recognition in both the handcrafted and deep learning domains. First, a new method based on the handcrafted approach is presented. This method addresses one of the major challenges, known as "viewpoint variations", by presenting a novel feature descriptor for multiview human action recognition. This descriptor employs region-based features extracted from the human silhouette. The proposed approach is quite simple and achieves state-of-the-art results without compromising the efficiency of the recognition process, which shows its suitability for real-time applications. Second, two innovative methods based on the deep learning approach are presented, to go beyond the limitations of the handcrafted approach. The first method is based on transfer learning, using a pre-trained deep learning model as a source architecture to solve the problem of human action recognition. It is experimentally confirmed that a deep Convolutional Neural Network model already trained on a large-scale annotated dataset is transferable to an action recognition task with a limited training dataset.
The comparative analysis also confirms its superior performance over handcrafted feature-based methods in terms of accuracy on the same datasets. The second method is based on an unsupervised deep learning approach. This method employs Deep Belief Networks (DBNs) with restricted Boltzmann machines for action recognition in unconstrained videos. The proposed method automatically extracts a suitable feature representation without any prior knowledge using an unsupervised deep learning model. The effectiveness of the proposed method is confirmed by high recognition results on the challenging UCF sports dataset. Finally, the thesis concludes with important discussions and research directions in the area of human action recognition.

    A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision

    Higher dimensional data such as video and 3D are the leading edge of multimedia retrieval and computer vision research. In this survey, we give a comprehensive overview and key insights into the state of the art of higher dimensional features from deep learning and also traditional approaches. Current approaches frequently use 3D information from the sensor or use 3D in modeling and understanding the 3D world. With the growth of prevalent application areas such as 3D games, self-driving automobiles, health monitoring and sports activity training, a wide variety of new sensors have allowed researchers to develop feature description models beyond 2D. Although higher dimensional data enhance the performance of methods on numerous tasks, they can also introduce new challenges and problems. The higher dimensionality of the data often leads to more complicated structures, which present additional problems both in extracting meaningful content and in adapting it for current machine learning algorithms. Due to the major importance of the evaluation process, we also present an overview of the current datasets and benchmarks. Moreover, based on more than 330 papers from this study, we present the major challenges and future directions.

    Combining visual recognition and computational linguistics : linguistic knowledge for visual recognition and natural language descriptions of visual content

    Extensive efforts are being made to improve visual recognition and semantic understanding of language. However, surprisingly little has been done to exploit the mutual benefits of combining both fields. In this thesis we show how the different fields of research can profit from each other. First, we scale recognition to 200 unseen object classes and show how to extract robust semantic relatedness from linguistic resources. Our novel approach extends zero-shot to few-shot recognition and exploits unlabeled data by adopting label propagation for transfer learning. Second, we capture the high variability but low availability of composite activity videos by extracting the essential information from text descriptions. For this we recorded and annotated a corpus for fine-grained activity recognition. We show improvements in a supervised case, but we are also able to recognize unseen composite activities. Third, we present a corpus of videos and aligned descriptions. We use it for grounding activity descriptions and for learning how to automatically generate natural language descriptions for a video. We show that our proposed approach is also applicable to image description and that it outperforms baselines and related work. In summary, this thesis presents a novel approach for automatic video description and shows the benefits of extracting linguistic knowledge for object and activity recognition as well as the advantage of visual recognition for understanding activity descriptions.

    Segmentation of experience and episodic memory across species

    How continuous ongoing perceptual experience is processed by the brain and mind to form unique episodes in memory is a key scientific question. Recent work in Psychology and Neuroscience has proposed that humans perceptually segment continuous ongoing experience into meaningful units, which allows the successful formation of episodic memories. Despite accumulating work demonstrating that non-human animals also display a capability of episodic-'like' memory, whether non-human animals segment continuous ongoing experience into 'meaningful' episodic units is a question that has not been fully explored. Hence, the main goal of the research in this thesis is to address whether a comparable segmentation process (or processes) of continuous ongoing experience occurs for non-human animals in their formation of episodic-like memory, as it does for humans in their formation of episodic memory. Chapter 2 argues that, similarly to humans, rats can use top-down-like prediction-error processing in segmenting for subsequent memory to guide behaviour in an episodic-like spontaneous object recognition task. Chapter 3 suggests that mice readily incorporate conspecific-contextual information using episodic-like memory processing, indicating that conspecifics can act as a segmentation cue for non-human animals. Chapter 4 highlights that humans and rodents may similarly segment continuous ongoing experience during turns made around spatial boundaries. Chapter 5 argues that individual place cells can represent content of an episodic nature, with the theoretical implications of this being discussed in relation to episodic memory. Thus, the results presented in this thesis, as well as a re-interpretation of previous literature, argue in favour of non-humans segmenting their experience for episodic-like memory. Finally, the evidence is evaluated in the context of whether episodic-like memory in non-human animals is simply episodic memory as experienced in humans.

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.


    Over the years, many computer vision models, some inspired by human behavior, have been developed for various applications. However, only a handful of them are popular and widely used. Why? There are two major factors: 1) most of these models do not have an efficient numerical algorithm and hence are computationally very expensive; 2) many models, being too generic, cannot capitalize on problem-specific prior information and thus demand rigorous hyper-parameter tuning. In this dissertation, we design fast and efficient algorithms that leverage application-specific priors to solve unsupervised and weakly-supervised problems. Specifically, we focus on developing algorithms to impose structured priors, model priors and label priors during the inference and/or learning of vision models. In many applications, it is known a priori that a signal is smooth and continuous in space. The first part of this work focuses on improving unsupervised learning mechanisms by explicitly imposing these structured priors in an optimization framework using different regularization schemes. This led to the development of fast algorithms for robust recovery of signals from compressed measurements, image denoising and data clustering. Moreover, by employing a re-descending robust penalty on the structured regularization terms and applying duality, we reduce our clustering formulation to the optimization of a single continuous objective. This enabled the integration of clustering processes in an end-to-end feature learning pipeline. In the second part of our work, we exploit inherent properties of established models to develop efficient solvers for SDP, GAN, and semantic segmentation. We consider models for several different problem classes. a) Certain non-convex models in computer vision (e.g., BQP) are popularly solved using convex SDPs after lifting to a high-dimensional space. However, this computationally expensive approach limits these methods to small matrices.
A fast and approximate algorithm is developed that directly solves the original non-convex formulation using biconvex relaxations and known rank information. b) Widely popular adversarial networks are difficult to train as they suffer from instability issues. This is because optimizing adversarial networks corresponds to finding a saddle point of a loss function. We propose a simple prediction method that enables faster training of various adversarial networks using larger learning rates without any instability problems. c) Semantic segmentation models must learn long-distance contextual information while retaining high spatial resolution at the output. Existing models achieve this at the cost of computationally expensive and memory-intensive training and inference. We designed a stacked U-Nets model which can repeatedly process top-down and bottom-up features. Our smallest model exceeds ResNet-101 performance on PASCAL VOC 2012 by 4.5% IoU with approximately 7× fewer parameters. Next, we address the problem of learning heterogeneous concepts from internet videos using mined label tags. Given a large number of videos, each with multiple concepts and labels, the idea is to teach machines to automatically learn these concepts by leveraging weak labels. We formulate this as a co-clustering problem and develop a novel Bayesian non-parametric weakly supervised Indian buffet process model which additionally enforces the paired label prior between concepts. In the final part of this work we consider an inverse approach: learning data priors from a given model. Specifically, we develop a numerically efficient algorithm for estimating the log-likelihood of data samples from GANs. The approximate log-likelihood function is used for outlier detection and data augmentation for training classifiers.
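The abstract does not spell out the form of the prediction method for saddle-point training. As a hedged illustration only, the sketch below assumes a lookahead (extrapolation) step: after the minimizing variable is updated, the maximizing variable is updated against an extrapolated copy `u_new + (u_new - u)` rather than the stale value. The toy objective L(u, v) = u·v, the function names `saddle_step` and `distance_after`, and the step size are all assumptions chosen for illustration.

```python
from typing import Tuple


def saddle_step(
    u: float, v: float, rate: float, use_prediction: bool
) -> Tuple[float, float]:
    """One update on the toy saddle problem min_u max_v L(u, v) = u * v."""
    # Gradient descent on u: dL/du = v.
    u_new = u - rate * v
    # Assumed prediction step: extrapolate u along its latest change.
    # Without prediction this falls back to simultaneous descent-ascent.
    u_seen = u_new + (u_new - u) if use_prediction else u
    # Gradient ascent on v: dL/dv = u, evaluated at the value v "sees".
    v_new = v + rate * u_seen
    return u_new, v_new


def distance_after(steps: int, rate: float, use_prediction: bool) -> float:
    """Distance from the saddle point (0, 0) after `steps` updates."""
    u, v = 1.0, 1.0
    for _ in range(steps):
        u, v = saddle_step(u, v, rate, use_prediction)
    return (u * u + v * v) ** 0.5
```

On this bilinear toy problem, plain simultaneous updates spiral away from the saddle point, while the extrapolated variant contracts toward it for step sizes below 1. Whether the same behaviour carries over to real adversarial networks is the empirical claim of the thesis, not something this sketch shows.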
