8,955 research outputs found

    New Strategies for Single-channel Speech Separation


    Robust subspace learning for static and dynamic affect and behaviour modelling

    Machine analysis of human affect and behavior in naturalistic contexts has attracted growing attention over the last decade from disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict, simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally and socially competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present here machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior.
Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates annotated frame-by-frame in terms of real-valued conflict intensity and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structures of affective and behavioral displays. We present here a novel dynamic behavior analysis framework which models temporal dynamics explicitly, based on the natural assumption that continuous-time annotations of smoothly varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect as well as multi-object tracking from detection validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted based on small training sets of person(s)-specific observations.

    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data that is being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, the methods not only need to be effective with the underlying data, but also have to be scalable to handle larger data collections. This thesis focuses on developing scalable and effective methods targeted towards different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that utilizes GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is better utilized for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech ("utterances"). This thesis focuses on the text-independent setting, for which three new recognition frameworks were developed. We propose a kernelized Renyi distance based similarity scoring for speaker recognition.
While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
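
The abstract above notes that kriging is analogous to Gaussian process regression. The following is a minimal CPU-only sketch of that analogy in one dimension, not the thesis's GPU framework. The squared-exponential kernel, the zero-mean prior, the unit prior variance, and the names `rbf_kernel` and `gp_interpolate` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D locations."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_interpolate(x_obs, y_obs, x_grid, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_grid.

    With a known covariance function this corresponds to simple kriging.
    """
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_grid, x_obs, length_scale)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = K_s @ alpha                              # predicted values on the grid
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v ** 2, axis=0)              # prior variance is 1 for this kernel
    return mean, var

# Scattered observations interpolated onto a regular grid (synthetic data).
rng = np.random.default_rng(0)
x_obs = np.sort(rng.uniform(0.0, 10.0, size=15))
y_obs = np.sin(x_obs)
x_grid = np.linspace(0.0, 10.0, 50)
mean, var = gp_interpolate(x_obs, y_obs, x_grid, length_scale=1.5)
```

The Cholesky factorization costs cubic time in the number of observations, which is one reason the abstract describes kriging as computationally expensive at scale.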

    Advanced Biometrics with Deep Learning

    Biometrics such as fingerprint, iris, face, hand print, hand vein, speech and gait recognition have become commonplace as a means of identity management in various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic and handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm to unify preprocessing, feature extraction, and recognition, based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality, namely face biometrics, medical electronic signals (EEG and ECG), voice print, and others

    Outlier Detection for Shape Model Fitting

    Medical image analysis applications often benefit from having a statistical shape model in the background. Statistical shape models are generative models which can generate shapes from the same family and assign a likelihood to the generated shape. In an analysis-by-synthesis approach to medical image analysis, the target shape to be segmented, registered or completed must first be reconstructed by the statistical shape model. Shape models accomplish this by acting either as regression models, used to obtain the reconstruction, or as regularizers, used to limit the space of possible reconstructions. However, the accuracy of these models is not guaranteed for targets that lie outside the modeled distribution of the statistical shape model. Targets with pathologies are an example of out-of-distribution data. The target shape to be reconstructed has deformations caused by pathologies that do not exist in the healthy data used to build the model. Added and missing regions may lead to false correspondences, which act as outliers and influence the reconstruction result. Robust fitting is necessary to decrease the influence of outliers on the fitting solution, but often comes at the cost of decreased accuracy in the inlier region. Robust techniques often presuppose knowledge of outlier characteristics to build a robust cost function or knowledge of the correct regressed function to filter the outliers. This thesis proposes strategies to obtain the outliers and reconstruction simultaneously without previous knowledge about either. The assumptions are that a statistical shape model that represents the healthy variations of the target organ is available, and that some landmarks on the model reference that annotate locations with correspondence to the target exist. The first strategy uses an EM-like algorithm to obtain the sampling posterior. This is a global reconstruction approach that requires classical noise assumptions on the outlier distribution.
The second strategy uses Bayesian optimization to infer the closed-form predictive posterior distribution and estimate a label map of the outliers. The underlying regression model is a Gaussian Process Morphable Model (GPMM). To make the reconstruction obtained through Bayesian optimization robust, a novel acquisition function is proposed. The acquisition function uses the posterior and predictive posterior distributions to avoid choosing outliers as next query points. The algorithms give as outputs a label map and a posterior distribution that can be used to choose the most likely reconstruction. To obtain the label map, the first strategy uses Bayesian classification to separate inliers and outliers, while the second strategy annotates all query points as inliers and unused model vertices as outliers. The proposed solutions are compared to the literature, evaluated through their sensitivity and breakdown points, and tested on publicly available datasets and in-house clinical examples. The thesis contributes to shape model fitting to pathological targets by showing that:
- performing accurate inlier reconstruction and outlier detection is possible without case-specific manual thresholds or input label maps, through the use of outlier detection.
- outlier detection makes the algorithms agnostic to pathology type, i.e. the algorithms are suitable for both sparse and grouped outliers which appear as holes and bumps, the severity of which influences the results.
- using the GPMM-based sequential Bayesian optimization approach, the closed-form predictive posterior distribution can be obtained despite the presence of outliers, because the Gaussian noise assumption is valid for the query points.
- using sequential Bayesian optimization instead of traditional optimization for shape model fitting brings forth several advantages that had not been previously explored. Fitting can be driven by different reconstruction goals such as speed, location-dependent accuracy, or robustness.
- defining pathologies as outliers opens the door for general pathology segmentation solutions for medical data. Segmentation algorithms do not need to be dependent on imaging modality, target pathology type, or training datasets for pathology labeling.
The thesis highlights the importance of outlier-based definitions of pathologies in medical data that are independent of pathology type and imaging modality. Developing such standards would not only simplify the comparison of different pathology segmentation algorithms on unlabeled datasets, but also push forward standard algorithms that are able to deal with general pathologies instead of data-driven definitions of pathologies. This comes with theoretical as well as clinical advantages. Practical applications are shown on shape reconstruction and labeling tasks. Publicly available challenge datasets are used: one for cranium implant reconstruction, one for kidney tumor detection, and one for liver shape reconstruction. Further clinical applications are shown on in-house examples of a femur and mandible with artifacts and missing parts. The results focus on shape modeling but can be extended in future work to include intensity information and inner volume pathologies.

    Second generation sparse models

    Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a learned dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many applications. The success of these models is largely attributed to two critical features: the use of sparsity as a robust mechanism for regularizing the linear coefficients that represent the data, and the flexibility provided by overcomplete dictionaries that are learned from the data. These features are controlled by two critical hyper-parameters: the desired sparsity of the coefficients, and the size of the dictionaries to be learned. However, in the absence of theoretical guidelines for selecting these parameters, applications based on sparse models often require hand-tuning and cross-validation for each application and each data set. This can be both inefficient and ineffective. On the other hand, there are multiple scenarios in which imposing additional constraints on the produced representations, including the sparse codes and the dictionary itself, can result in further improvements. This thesis is about improving and/or extending current sparse models by addressing the two issues discussed above, providing the elements for a new generation of more powerful and flexible sparse models. First, we seek to gain a better understanding of sparse models as data modeling tools, so that critical parameters can be selected automatically, efficiently, and in a principled way. Second, we explore new sparse modeling formulations for effectively exploiting the prior information present in different scenarios. In order to achieve these goals, we combine ideas and tools from information theory, statistics, machine learning, and optimization theory. The theoretical contributions are complemented with applications in audio, image and video processing
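
To make the two hyper-parameters named in the abstract above concrete, here is a minimal sketch of sparse coding over a fixed overcomplete dictionary using iterative soft thresholding (ISTA). It is a standard textbook solver chosen for illustration, not the thesis's method. The random dictionary, the synthetic signal, and the names `ista`, `lam` and `n_atoms` are assumptions; `lam` plays the role of the sparsity level and `n_atoms` the dictionary size.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (shrinks entries toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - D a||^2 + lam * ||a||_1 over a."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)             # gradient of the quadratic data term
        a = soft_threshold(a - step * grad, step * lam)
    return a

def objective(D, y, a, lam):
    return 0.5 * np.sum((y - D @ a) ** 2) + lam * np.sum(np.abs(a))

# Synthetic example: a signal built from 3 atoms of a random dictionary.
rng = np.random.default_rng(0)
n_features, n_atoms = 20, 50                 # overcomplete: more atoms than features
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
a_true = np.zeros(n_atoms)
a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = D @ a_true
a_hat = ista(D, y, lam=0.05)
```

Larger values of `lam` drive more coefficients to exactly zero; choosing it, together with `n_atoms`, is the hand-tuning step the abstract describes.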

    Dynamic behavior analysis via structured rank minimization

    Human behavior and affect are inherently dynamic phenomena involving temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior (e.g., continuous rather than discrete annotations of dimensional affect) is the output of a linear time-invariant system when behavioral cues act as the input. Here we study the learning of such a dynamical system under real-world conditions, namely in the presence of noisy behavioral cue descriptors and possibly unreliable annotations, by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and hence provide evidence of the robustness and effectiveness of the proposed approach
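
The link between linear time-invariant systems and rank that motivates structured rank minimization can be illustrated with a small numerical sketch. It shows only the classical fact that the Hankel matrix of an impulse response has rank equal to the order of a minimal system, and that a single gross outlier raises that rank; it is not the method proposed in the abstract above. The system matrices `A`, `B`, `C` and the function names are hypothetical.

```python
import numpy as np

def impulse_response(A, B, C, n_steps):
    """Markov parameters h[k] = C A^k B of a discrete-time LTI system."""
    h = np.empty(n_steps)
    x = B.copy()
    for k in range(n_steps):
        h[k] = (C @ x).item()
        x = A @ x
    return h

def hankel_matrix(h, n_rows):
    """Hankel matrix with H[i, j] = h[i + j]."""
    n_cols = len(h) - n_rows + 1
    return np.array([h[i:i + n_cols] for i in range(n_rows)])

# A stable second-order system (values chosen arbitrarily for illustration).
A = np.array([[0.9, 0.2], [-0.2, 0.7]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])

h = impulse_response(A, B, C, 20)
H = hankel_matrix(h, 10)
rank_clean = np.linalg.matrix_rank(H, tol=1e-8)   # equals the system order

# One grossly corrupted sample destroys the low-rank structure.
h_corrupt = h.copy()
h_corrupt[5] += 5.0
rank_corrupt = np.linalg.matrix_rank(hankel_matrix(h_corrupt, 10), tol=1e-8)

print(rank_clean, rank_corrupt)
```

Recovering a low-rank, Hankel-structured matrix from corrupted samples like `h_corrupt` is the kind of problem that robust structured rank minimization targets.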