    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information have not been applied together to solve such a complex task, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework to study the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions, together with a trained classifier, to recognise the identity of objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification so that difficult classification tasks can be decomposed into simpler and smaller tasks. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature-level and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that further steps towards effectively combining multi-modal classification information with semantic knowledge generate the best results.
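
    As an illustration of the feature-level versus decision-level fusion contrasted above, here is a minimal sketch on synthetic data; the feature dimensions, the logistic-regression classifiers, and the equal fusion weights are assumptions for illustration, not the PCFS/SFFS-based framework of the thesis.

```python
# Minimal sketch of feature-level vs. decision-level audio-video fusion.
# All dimensions, classifiers and weights are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
audio_feats = rng.normal(size=(n, 12))    # stand-in audio descriptors
video_feats = rng.normal(size=(n, 20))    # stand-in colour/shape/texture descriptors
labels = rng.integers(0, 2, size=n)       # toy activity/environment label

# Feature-level fusion: concatenate modalities, train one classifier.
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([audio_feats, video_feats]), labels)

# Decision-level fusion: per-modality classifiers, posteriors combined afterwards.
clf_audio = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
clf_video = LogisticRegression(max_iter=1000).fit(video_feats, labels)
fused_posterior = (0.5 * clf_audio.predict_proba(audio_feats)
                   + 0.5 * clf_video.predict_proba(video_feats))
decision_level_pred = fused_posterior.argmax(axis=1)
```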

    Multiple classifiers in biometrics. part 1: Fundamentals and review

    We provide an introduction to Multiple Classifier Systems (MCS), including basic nomenclature and describing key elements: classifier dependencies, types of classifier outputs, aggregation procedures, architecture, and types of methods. This introduction complements other existing overviews of MCS, as here we also review the most prevalent theoretical framework for MCS and discuss theoretical developments related to MCS. The introduction to MCS is then followed by a review of the application of MCS to the particular field of multimodal biometric person authentication over the last 25 years, as a prototypical area in which MCS has resulted in important achievements. This review includes general descriptions of successful MCS methods and architectures in order to facilitate their export to other information fusion problems. Based on the theory and framework introduced here, in the companion paper we then develop in more technical detail recent trends and developments in MCS from multimodal biometrics that incorporate context information in an adaptive way. These new MCS architectures exploit input quality measures and pattern-specific particularities that move apart from general population statistics, resulting in robust multimodal biometric systems. As in the present paper, methods in the companion paper are introduced in a general way so they can be applied to other information fusion problems as well. Finally, also in the companion paper, we discuss open challenges in biometrics and the role of MCS in advancing them. This work was funded by projects CogniMetrics (TEC2015-70627-R) from MINECO/FEDER and RiskTrack (JUST-2015-JCOO-AG-1). Part of this work was conducted during a research visit of J.F. to Prof. Ludmila Kuncheva at Bangor University (UK) with STSM funding from COST CA16101 (MULTI-FORESEE).
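
    As a rough illustration of two of the aggregation procedures such reviews cover, the following sketch applies the sum (mean) rule and majority voting to toy classifier posteriors; the values are invented and the snippet is not code from the paper.

```python
# Two classic MCS aggregation procedures on toy posteriors: the sum (mean)
# rule on class posteriors and majority voting on hard labels.
import numpy as np

# posteriors[i] holds classifier i's class posteriors, shape (n_samples, n_classes).
posteriors = [
    np.array([[0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.6, 0.4], [0.2, 0.8]]),
    np.array([[0.5, 0.5], [0.3, 0.7]]),
]

sum_rule_labels = np.mean(posteriors, axis=0).argmax(axis=1)

votes = np.array([p.argmax(axis=1) for p in posteriors])   # hard label per classifier
majority_labels = np.apply_along_axis(
    lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
```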

    Speech data analysis for semantic indexing of video of simulated medical crises.

    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and what his or her emotion was. Only audio information is extracted and analyzed because the quality of the image recording is low and the visual environment is static for most parts. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing consists of first extracting the audio component from the video recording and then extracting various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system detects speaker change points for the purpose of segmenting the audio stream. We propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to provide users with the capability to browse the video and retrieve shots that identify "who spoke, when, and the speaker's emotion" for further analysis. For this component, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector with fixed dimensions. The first is based on soft bag-of-words (BoW) feature representations. In particular, we define three types of BoW that are based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW and is based on the Fisher Vector (FV). FV uses the Fisher Kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning and assumes that a large collection of labeled training data is available. Within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and naive Bayes. The second framework is based on semi-supervised learning, where only a limited number of labeled training samples is available; here we use an approach based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation videos and provides the user with a visual summary of the different speech segments. It also allows the user to browse videos and retrieve scenes that answer semantic queries such as: who spoke and when? who interrupted whom? and what was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video, for example: for how long did each person speak? what is the longest uninterrupted speech segment? and is there an unusually large number of pauses within the speech segments of a given speaker?
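
    The fixed-length mapping described above can be sketched roughly as follows; the codebook size, the exponential soft-assignment rule, and the MFCC-like frame dimension are assumptions for illustration, not the dissertation's exact crisp/fuzzy/possibilistic definitions.

```python
# Rough sketch: mapping a variable-length sequence of per-frame audio features
# to a fixed-length bag-of-words vector with crisp or soft assignments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
training_frames = rng.normal(size=(1000, 13))        # stand-in 13-dim MFCC frames
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(training_frames)

def bow(frames, soft=False, beta=1.0):
    """Histogram of codeword assignments, normalised to unit sum."""
    d = np.linalg.norm(frames[:, None, :] - codebook.cluster_centers_[None], axis=2)
    if soft:
        w = np.exp(-beta * d)                         # soft membership per codeword
        w /= w.sum(axis=1, keepdims=True)
        h = w.sum(axis=0)
    else:
        h = np.bincount(d.argmin(axis=1), minlength=32).astype(float)
    return h / h.sum()

segment = rng.normal(size=(87, 13))                   # one arbitrary-length segment
fixed_vec = bow(segment, soft=True)                   # fixed 32-dim representation
```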

    Classification and fusion methods for multimodal biometric authentication.

    Ouyang, Hua. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 81-89). Abstracts in English and Chinese.
    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- Biometric Authentication --- p.1
    Chapter 1.2 --- Multimodal Biometric Authentication --- p.2
    Chapter 1.2.1 --- Combination of Different Biometric Traits --- p.3
    Chapter 1.2.2 --- Multimodal Fusion --- p.5
    Chapter 1.3 --- Audio-Visual Bi-modal Authentication --- p.6
    Chapter 1.4 --- Focus of This Research --- p.7
    Chapter 1.5 --- Organization of This Thesis --- p.8
    Chapter 2 --- Audio-Visual Bi-modal Authentication --- p.10
    Chapter 2.1 --- Audio-visual Authentication System --- p.10
    Chapter 2.1.1 --- Why Audio and Mouth? --- p.10
    Chapter 2.1.2 --- System Overview --- p.11
    Chapter 2.2 --- XM2VTS Database --- p.12
    Chapter 2.3 --- Visual Feature Extraction --- p.14
    Chapter 2.3.1 --- Locating the Mouth --- p.14
    Chapter 2.3.2 --- Averaged Mouth Images --- p.17
    Chapter 2.3.3 --- Averaged Optical Flow Images --- p.21
    Chapter 2.4 --- Audio Features --- p.23
    Chapter 2.5 --- Video Stream Classification --- p.23
    Chapter 2.6 --- Audio Stream Classification --- p.25
    Chapter 2.7 --- Simple Fusion --- p.26
    Chapter 3 --- Weighted Sum Rules for Multi-modal Fusion --- p.27
    Chapter 3.1 --- Measurement-Level Fusion --- p.27
    Chapter 3.2 --- Product Rule and Sum Rule --- p.28
    Chapter 3.2.1 --- Product Rule --- p.28
    Chapter 3.2.2 --- Naive Sum Rule (NS) --- p.29
    Chapter 3.2.3 --- Linear Weighted Sum Rule (WS) --- p.30
    Chapter 3.3 --- Optimal Weights Selection for WS --- p.31
    Chapter 3.3.1 --- Independent Case --- p.31
    Chapter 3.3.2 --- Identical Case --- p.33
    Chapter 3.4 --- Confidence Measure Based Fusion Weights --- p.35
    Chapter 4 --- Regularized k-Nearest Neighbor Classifier --- p.39
    Chapter 4.1 --- Motivations --- p.39
    Chapter 4.1.1 --- Conventional k-NN Classifier --- p.39
    Chapter 4.1.2 --- Bayesian Formulation of kNN --- p.40
    Chapter 4.1.3 --- Pitfalls and Drawbacks of kNN Classifiers --- p.41
    Chapter 4.1.4 --- Metric Learning Methods --- p.43
    Chapter 4.2 --- Regularized k-Nearest Neighbor Classifier --- p.46
    Chapter 4.2.1 --- Metric or Not Metric? --- p.46
    Chapter 4.2.2 --- Proposed Classifier: RkNN --- p.47
    Chapter 4.2.3 --- Hyperkernels and Hyper-RKHS --- p.49
    Chapter 4.2.4 --- Convex Optimization of RkNN --- p.52
    Chapter 4.2.5 --- Hyper kernel Construction --- p.53
    Chapter 4.2.6 --- Speeding up RkNN --- p.56
    Chapter 4.3 --- Experimental Evaluation --- p.57
    Chapter 4.3.1 --- Synthetic Data Sets --- p.57
    Chapter 4.3.2 --- Benchmark Data Sets --- p.64
    Chapter 5 --- Audio-Visual Authentication Experiments --- p.68
    Chapter 5.1 --- Effectiveness of Visual Features --- p.68
    Chapter 5.2 --- Performance of Simple Sum Rule --- p.71
    Chapter 5.3 --- Performances of Individual Modalities --- p.73
    Chapter 5.4 --- Identification Tasks Using Confidence-based Weighted Sum Rule --- p.74
    Chapter 5.4.1 --- Effectiveness of WS_M_C Rule --- p.75
    Chapter 5.4.2 --- WS_M_C v.s. WS_M --- p.76
    Chapter 5.5 --- Speaker Identification Using RkNN --- p.77
    Chapter 6 --- Conclusions and Future Work --- p.78
    Chapter 6.1 --- Conclusions --- p.78
    Chapter 6.2 --- Important Follow-up Works --- p.80
    Bibliography --- p.81
    Chapter A --- Proof of Proposition 3.1 --- p.90
    Chapter B --- Proof of Proposition 3.2 --- p.9
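
    The sum and weighted sum rules listed in Chapter 3 can be illustrated with a minimal sketch; the scores, weights, and acceptance threshold below are toy values, not taken from the thesis.

```python
# Toy sketch of the naive sum rule and a linear weighted sum rule for
# measurement-level fusion of audio and visual match scores.
import numpy as np

audio_scores = np.array([0.62, 0.15, 0.80])    # illustrative match scores per claim
visual_scores = np.array([0.55, 0.40, 0.70])

naive_sum = 0.5 * (audio_scores + visual_scores)

w_audio, w_visual = 0.7, 0.3                    # weights would be tuned or derived
weighted_sum = w_audio * audio_scores + w_visual * visual_scores
accept = weighted_sum > 0.5                     # threshold is illustrative only
```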

    Sparse Classifier Fusion for Speaker Verification

    Deception Detection in Videos

    We present a system for covert, automated deception detection in real-life courtroom trial videos. We study the importance of different modalities, namely vision, audio and text, for this task. On the vision side, our system uses classifiers trained on low-level video features which predict human micro-expressions. We show that predictions of high-level micro-expressions can be used as features for deception prediction. Surprisingly, IDT (Improved Dense Trajectory) features, which have been widely used for action recognition, are also very good at predicting deception in videos. We fuse the scores of classifiers trained on IDT features and on high-level micro-expressions to improve performance. MFCC (Mel-frequency Cepstral Coefficient) features from the audio domain also provide a significant boost in performance, while information from transcripts is not very beneficial for our system. Using various classifiers, our automated system obtains an AUC of 0.877 (10-fold cross-validation) when evaluated on subjects who were not part of the training set. Even though state-of-the-art methods use human annotations of micro-expressions for deception detection, our fully automated approach outperforms them by 5%. When combined with human annotations of micro-expressions, our AUC improves to 0.922. We also present results of a user study analyzing how well average humans perform on this task, which modalities they use for deception detection, and how they perform if only one modality is accessible. Our project page can be found at https://doubaibai.github.io/DARE/. Comment: AAAI 2018.
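
    In the spirit of the score-level fusion and 10-fold evaluation described above, here is a hedged sketch on synthetic data; the feature dimensions, the SVM classifiers, and the equal-weight score averaging are assumptions, not the authors' exact pipeline.

```python
# Sketch: average per-modality classifier scores (e.g. one classifier on
# IDT-style visual features, one on MFCC audio features) and report
# cross-validated AUC. Data here are synthetic placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_vis, X_aud = rng.normal(size=(120, 50)), rng.normal(size=(120, 13))
y = np.repeat([0, 1], 60)                       # truthful / deceptive labels (toy)

aucs = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X_vis, y):
    s_vis = SVC(probability=True).fit(X_vis[tr], y[tr]).predict_proba(X_vis[te])[:, 1]
    s_aud = SVC(probability=True).fit(X_aud[tr], y[tr]).predict_proba(X_aud[te])[:, 1]
    aucs.append(roc_auc_score(y[te], 0.5 * (s_vis + s_aud)))   # late score fusion
print(np.mean(aucs))
```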

    Classification with class-independent quality information for biometric verification

    Biometric identity verification systems frequently face the challenge of non-controlled conditions of data acquisition. Under such conditions, biometric signals may suffer from quality degradation due to extraneous, identity-independent factors. It has been demonstrated in numerous reports that degradation of biometric signal quality is a frequent cause of significant deterioration of classification performance, also in multiple-classifier, multimodal systems, which systematically outperform their single-classifier counterparts. Seeking to improve the robustness of classifiers to degraded data quality, researchers started to introduce measures of signal quality into the classification process. In the existing approaches, the role of class-independent quality information is governed by intuitive rather than mathematical notions, resulting in a clearly drawn distinction between single-classifier, multiple-classifier and multimodal approaches. The application of quality measures in a multiple-classifier system has received far more attention, with a dominant intuitive notion that a classifier that has data of higher quality at its disposal ought to be more credible than a classifier that operates on noisy signals. In the case of single-classifier systems, quality-based selection of models, classifiers or thresholds has been proposed. In both cases, quality measures have the function of meta-information which supervises but does not intervene in the actual classifier or classifiers employed to assign class labels to modality-specific and class-selective features. In this thesis we argue that in fact the very same mechanism governs the use of quality measures in single- and multi-classifier systems alike, and we present a quantitative rather than intuitive perspective on the role of quality measures in classification. We observe that, for a given set of classification features and their fixed marginal distributions, the class separation in the joint feature space changes with the statistical dependencies observed between the individual features. The same effect applies to a feature space in which some of the features are class-independent. Consequently, we demonstrate that the class separation can be improved by augmenting the feature space with class-independent quality information, provided that it exhibits statistical dependencies with the class-selective features. We discuss how to construct classifier-quality measure ensembles in which the dependence between classification scores and the quality features helps decrease classification errors below those obtained using the classification scores alone. We propose Q-stack, a novel theoretical framework for improving classification with class-independent quality measures, based on the concept of classifier stacking. In the Q-stack scheme, a classifier ensemble is used in which the first classifier layer is made of the baseline unimodal classifiers, and the second, stacked classifier operates on features composed of the normalized similarity scores and the relevant quality measures. We present Q-stack as a generalized framework of classification with quality information, and we argue that previously proposed methods of classification with quality measures are its special cases. Further in this thesis, we address the problem of estimating the probability of single classification errors. We propose to employ the subjective Bayesian interpretation of single event probability as credence in the correctness of single classification decisions. We propose to apply the credence-based error predictor as a functional extension of the proposed Q-stack framework, where a Bayesian stacked classifier is employed. As such, the proposed method of credence estimation and error prediction inherits the benefit of seamless incorporation of quality information in the process of credence estimation. We propose a set of objective evaluation criteria for credence estimates, and we discuss how the proposed method can be applied together with an appropriate repair strategy to reduce classification errors to a desired target level. Finally, we demonstrate the application of Q-stack and its functional extension to single error prediction on the task of biometric identity verification using face and fingerprint modalities, and their multimodal combinations, using a real biometric database. We show that the use of the classification and error prediction methods proposed in this thesis allows for a systematic reduction of the error rates below those of the baseline classifiers.
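
    A minimal sketch of the Q-stack idea as described above: a stacked classifier sees the baseline similarity score side by side with a class-independent quality measure. The toy generative model (class separation degrading with quality) and the logistic-regression classifiers are assumptions for illustration.

```python
# Sketch of quality-augmented stacking: the stacked classifier is trained on
# [score, quality] so it can exploit score-quality dependencies; the baseline
# uses the score alone. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
labels = np.repeat([0, 1], n // 2)                   # impostor / genuine claims
quality = rng.uniform(0.2, 1.0, size=n)              # class-independent quality measure
# Toy model: genuine-impostor separation shrinks as quality drops.
score = labels * quality + rng.normal(scale=0.4, size=n)

baseline = LogisticRegression().fit(score.reshape(-1, 1), labels)
q_stack = LogisticRegression().fit(np.column_stack([score, quality]), labels)

print(baseline.score(score.reshape(-1, 1), labels),
      q_stack.score(np.column_stack([score, quality]), labels))
```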

    Multiple classifiers in biometrics. Part 2: Trends and challenges

    The present paper is Part 2 in this series of two papers. In Part 1 we provided an introduction to Multiple Classifier Systems (MCS) with a focus on the fundamentals: basic nomenclature, key elements, architecture, main methods, and prevalent theory and framework. Part 1 then overviewed the application of MCS to the particular field of multimodal biometric person authentication over the last 25 years, as a prototypical area in which MCS has resulted in important achievements. Here in Part 2 we present in more technical detail recent trends and developments in MCS coming from multimodal biometrics that incorporate context information in an adaptive way. These new MCS architectures exploit input quality measures and pattern-specific particularities that move apart from general population statistics, resulting in robust multimodal biometric systems. As in Part 1, methods here are described in a general way so they can be applied to other information fusion problems as well. Finally, we also discuss open challenges in biometrics in which MCS can play a key role. This work was funded by projects CogniMetrics (TEC2015-70627-R) from MINECO/FEDER and RiskTrack (JUST-2015-JCOO-AG-1). Part of this work was conducted during a research visit of J.F. to Prof. Ludmila Kuncheva at Bangor University (UK) with STSM funding from COST CA16101 (MULTI-FORESEE).
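
    One simple instance of quality-adaptive fusion of the kind surveyed here, sketched under assumptions: the per-sample weighting scheme and all score and quality values below are illustrative, not the architectures developed in the paper.

```python
# Toy sketch of an adaptive sum rule whose per-sample weights come from
# input quality measures for each modality.
import numpy as np

face_score = np.array([0.80, 0.30, 0.65])      # illustrative match scores
voice_score = np.array([0.55, 0.45, 0.20])
face_quality = np.array([0.90, 0.20, 0.70])    # e.g. image sharpness, in [0, 1]
voice_quality = np.array([0.60, 0.80, 0.30])   # e.g. SNR-derived, in [0, 1]

w_face = face_quality / (face_quality + voice_quality)   # per-sample weights
fused = w_face * face_score + (1.0 - w_face) * voice_score
```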

    Multi-system Biometric Authentication: Optimal Fusion and User-Specific Information

    Verifying a person's identity claim by combining multiple biometric systems (fusion) is a promising solution to identity theft and automatic access control. This thesis contributes to the state of the art in multimodal biometric fusion by improving the understanding of fusion and by enhancing fusion performance using information specific to a user. One problem to deal with in score-level fusion is how to combine system outputs of different types. Two statistically sound representations of scores are probability and log-likelihood ratio (LLR). While they are equivalent in theory, LLR is much more useful in practice because its distribution can be approximated by a Gaussian distribution, which makes it useful for analyzing the problem of fusion. Furthermore, its score statistics (mean and covariance) conditioned on the claimed user identity can be better exploited. Our first contribution is to estimate the fusion performance given the class-conditional score statistics and a particular fusion operator/classifier. Thanks to the score statistics, we can predict fusion performance with reasonable accuracy, identify conditions which favor a particular fusion operator, study the joint phenomenon of combining system outputs with different degrees of strength and correlation, and possibly correct the adverse effect of bias (due to the score-level mismatch between training and test sets) on fusion. While in practice the class-conditional Gaussian assumption is not always true, the estimated performance is found to be acceptable. Our second contribution is to exploit user-specific prior knowledge by limiting the class-conditional Gaussian assumption to each user. We exploit this hypothesis in two strategies. In the first strategy, we combine a user-specific fusion classifier with a user-independent fusion classifier by means of two LLR scores, which are then weighted to obtain a single output. We show that combining both user-specific and user-independent LLR outputs always results in better performance than using the better of the two alone. In the second strategy, we propose a statistic called the user-specific F-ratio, which measures the discriminative power of a given user based on the Gaussian assumption. Although similar class-separability measures exist, e.g., the Fisher ratio for a two-class problem and the d-prime statistic, the F-ratio is more suitable because it is related to the Equal Error Rate in closed form. The F-ratio is used in the following applications: a user-specific score normalization procedure, a user-specific criterion to rank users, and a user-specific fusion operator that selectively considers a subset of systems for fusion. The resulting fusion operator leads to statistically significantly better performance than state-of-the-art fusion approaches. Even though the applications are different, the proposed methods share the following common advantages. Firstly, they are robust to deviations from the Gaussian assumption. Secondly, they are robust to small numbers of training samples thanks to Bayesian adaptation. Finally, they consider both client and impostor information simultaneously.
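
    As a small worked sketch of the F-ratio statistic and its closed-form link to the Equal Error Rate under the class-conditional Gaussian assumption discussed above (the scores are synthetic and the formula is stated here as it is commonly used in this line of work, so treat it as an assumption rather than the thesis's exact derivation):

```python
# Sketch: F-ratio = (mu_gen - mu_imp) / (sigma_gen + sigma_imp). Under the
# class-conditional Gaussian assumption, the EER equals Phi(-F-ratio), which
# can be written as 0.5 * erfc(F-ratio / sqrt(2)).
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
genuine = rng.normal(loc=2.0, scale=1.0, size=1000)    # toy genuine (client) LLR scores
impostor = rng.normal(loc=-1.0, scale=1.2, size=1000)  # toy impostor LLR scores

f_ratio = (genuine.mean() - impostor.mean()) / (genuine.std() + impostor.std())
eer_gaussian = 0.5 * erfc(f_ratio / sqrt(2))           # closed-form EER under the model
print(f"F-ratio = {f_ratio:.2f}, Gaussian-model EER = {eer_gaussian:.3%}")
```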