12 research outputs found

    Non-linear System Identification with Composite Relevance Vector Machines

    Get PDF
    Nonlinear system identification based on relevance vector machines (RVMs) has been traditionally addressed by stacking the input and/or output regressors and then performing standard RVM regression. This letter introduces a full family of composite kernels in order to integrate the input and output information in the mapping function efficiently and hence generalize the standard approach. An improved trade-off between accuracy and sparsity is obtained in several benchmark problems. Also, the RVM yields confidence intervals for the predictions, and it is less sensitive to free parameter selectionPublicad

    Increasing Accuracy Performance through Optimal Feature Extraction Algorithms

    Get PDF
    This research developed models and techniques to improve the three key modules of popular recognition systems: preprocessing, feature extraction, and classification. Improvements were made in four key areas: processing speed, algorithm complexity, storage space, and accuracy. The focus was on the application areas of the face, traffic sign, and speaker recognition. In the preprocessing module of facial and traffic sign recognition, improvements were made through the utilization of grayscaling and anisotropic diffusion. In the feature extraction module, improvements were made in two different ways; first, through the use of mixed transforms and second through a convolutional neural network (CNN) that best fits specific datasets. The mixed transform system consists of various combinations of the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT), which have a reliable track record for image feature extraction. In terms of the proposed CNN, a neuroevolution system was used to determine the characteristics and layout of a CNN to best extract image features for particular datasets. In the speaker recognition system, the improvement to the feature extraction module comprised of a quantized spectral covariance matrix and a two-dimensional Principal Component Analysis (2DPCA) function. In the classification module, enhancements were made in visual recognition through the use of two neural networks: the multilayer sigmoid and convolutional neural network. Results show that the proposed improvements in the three modules led to an increase in accuracy as well as reduced algorithmic complexity, with corresponding reductions in storage space and processing time

    Speaker characterization using adult and children’s speech

    Get PDF
    Speech signals contain important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of these types of characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. Many such applications depend on reliable systems using short speech segments without regard to the spoken text (text-independent). All these applications are also applicable using children’s speech. This research aims to develop accurate methods and tools to identify different characteristics of the speakers. Our experiments cover speaker recognition, gender recognition, age-group classification, and accent identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. The main focus of this research is on detecting these characteristics from children’s speech, which is previously reported as a more challenging subject compared to adult. Furthermore, the impact of different frequency bands on the performances of several recognition systems is studied, and the performance obtained using children’s speech is compared with the corresponding results from experiments using adults’ speech. Speaker characterization is performed by fitting a probability density function to acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMM) are applied. Due to lack of data, parametric model adaptation methods have been applied to adapt the universal background model (UBM) to the char acteristics of utterances. An effective approach involves adapting the UBM to speech signals using the Maximum-A-Posteriori (MAP) scheme. Then, the Gaussian means of the adapted GMM are concatenated to form a Gaussian mean super-vector for a given utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics. While effective, Gaussian mean super-vectors are of a high dimensionality resulting in high computational cost and difficulty in obtaining a robust model in the context of limited data. In the field of speaker recognition, recent advances using the i-vector framework have increased the classification accuracy. This framework, which provides a compact representation of an utterance in the form of a low dimensional feature vector, applies a simple factor analysis on GMM means

    Métodos discriminativos para la optimización de modelos en la Verificación del Hablante

    Get PDF
    La creciente necesidad de sistemas de autenticación seguros ha motivado el interés de algoritmos efectivos de Verificación de Hablante (VH). Dicha necesidad de algoritmos de alto rendimiento, capaces de obtener tasas de error bajas, ha abierto varias ramas de investigación. En este trabajo proponemos investigar, desde un punto de vista discriminativo, un conjunto de metodologías para mejorar el desempeño del estado del arte de los sistemas de VH. En un primer enfoque investigamos la optimización de los hiper-parámetros para explícitamente considerar el compromiso entre los errores de falsa aceptación y falso rechazo. El objetivo de la optimización se puede lograr maximizando el área bajo la curva conocida como ROC (Receiver Operating Characteristic) por sus siglas en inglés. Creemos que esta optimización de los parámetros no debe de estar limitada solo a un punto de operación y una estrategia más robusta es optimizar los parámetros para incrementar el área bajo la curva, AUC (Area Under the Curve por sus siglas en inglés) de modo que todos los puntos sean maximizados. Estudiaremos cómo optimizar los parámetros utilizando la representación matemática del área bajo la curva ROC basada en la estadística de Wilcoxon Mann Whitney (WMW) y el cálculo adecuado empleando el algoritmo de descendente probabilístico generalizado. Además, analizamos el efecto y mejoras en métricas como la curva detection error tradeoff (DET), el error conocido como Equal Error Rate (EER) y el valor mínimo de la función de detección de costo, minimum value of the detection cost function (minDCF) todos ellos por sue siglas en inglés. En un segundo enfoque, investigamos la señal de voz como una combinación de atributos que contienen información del hablante, del canal y el ruido. Los sistemas de verificación convencionales entrenan modelos únicos genéricos para todos los casos, y manejan las variaciones de estos atributos ya sea usando análisis de factores o no considerando esas variaciones de manera explícita. Proponemos una nueva metodología para particionar el espacio de los datos de acuerdo a estas carcterísticas y entrenar modelos por separado para cada partición. Las particiones se pueden obtener de acuerdo a cada atributo. En esta investigación mostraremos como entrenar efectivamente los modelos de manera discriminativa para maximizar la separación entre ellos. Además, el diseño de algoritimos robustos a las condiciones de ruido juegan un papel clave que permite a los sistemas de VH operar en condiciones reales. Proponemos extender nuestras metodologías para mitigar los efectos del ruido en esas condiciones. Para nuestro primer enfoque, en una situación donde el ruido se encuentre presente, el punto de operación puede no ser solo un punto, o puede existir un corrimiento de forma impredecible. Mostraremos como nuestra metodología de maximización del área bajo la curva ROC es más robusta que la usada por clasificadores convencionales incluso cuando el ruido no está explícitamente considerado. Además, podemos encontrar ruido a diferentes relación señal a ruido (SNR) que puede degradar el desempeño del sistema. Así, es factible considerar una descomposición eficiente de las señales de voz que tome en cuenta los diferentes atributos como son SNR, el ruido y el tipo de canal. Consideramos que en lugar de abordar el problema con un modelo unificado, una descomposición en particiones del espacio de características basado en atributos especiales puede proporcionar mejores resultados. Esos atributos pueden representar diferentes canales y condiciones de ruido. Hemos analizado el potencial de estas metodologías que permiten mejorar el desempeño del estado del arte de los sistemas reduciendo el error, y por otra parte controlar los puntos de operación y mitigar los efectos del ruido

    Dysarthric speech analysis and automatic recognition using phase based representations

    Get PDF
    Dysarthria is a neurological speech impairment which usually results in the loss of motor speech control due to muscular atrophy and poor coordination of articulators. Dysarthric speech is more difficult to model with machine learning algorithms, due to inconsistencies in the acoustic signal and to limited amounts of training data. This study reports a new approach for the analysis and representation of dysarthric speech, and applies it to improve ASR performance. The Zeros of Z-Transform (ZZT) are investigated for dysarthric vowel segments. It shows evidence of a phase-based acoustic phenomenon that is responsible for the way the distribution of zero patterns relate to speech intelligibility. It is investigated whether such phase-based artefacts can be systematically exploited to understand their association with intelligibility. A metric based on the phase slope deviation (PSD) is introduced that are observed in the unwrapped phase spectrum of dysarthric vowel segments. The metric compares the differences between the slopes of dysarthric vowels and typical vowels. The PSD shows a strong and nearly linear correspondence with the intelligibility of the speaker, and it is shown to hold for two separate databases of dysarthric speakers. A systematic procedure for correcting the underlying phase deviations results in a significant improvement in ASR performance for speakers with severe and moderate dysarthria. In addition, information encoded in the phase component of the Fourier transform of dysarthric speech is exploited in the group delay spectrum. Its properties are found to represent disordered speech more effectively than the magnitude spectrum. Dysarthric ASR performance was significantly improved using phase-based cepstral features in comparison to the conventional MFCCs. A combined approach utilising the benefits of PSD corrections and phase-based features was found to surpass all the previous performance on the UASPEECH database of dysarthric speech

    Single channel audio separation using deep neural networks and matrix factorizations

    Get PDF
    PhD ThesisSource Separation has become a significant research topic in the signal processing community and the machine learning area. Due to numerous applications, such as automatic speech recognition and speech communication, separation of target speech from the mixed signal is of great importance. In many practical applications, speech separation from a single recorder is most desirable from an application standpoint. In this thesis, two novel approaches have been proposed to address this single channel audio separation problem. This thesis first reviews traditional approaches for single channel source separation, and later elicits a generic approach, which is more capable of feature learning, i.e. deep graphical models. In the first part of this thesis, a novel approach based on matrix factorization and hierarchical model has been proposed. In this work, an artificial stereo mixture is formulated to provide extra information. In addition, a hybrid framework that combines the generalized Expectation-Maximization algorithm with a multiplicative update rule is proposed to optimize the parameters of a matrix factorization based approach to approximatively separate the mixture. Furthermore, a hierarchical model based on an extreme learning machine is developed to check the validity of the approximately separated sources followed by an energy minimization method to further improve the quality of the separated sources by generating a time-frequency mask. Various experiments have been conducted and the obtained results have shown that the proposed approach outperforms conventional approaches not only in reduction of computational complexity, but also the separation performance. In the second part, a deep neural network based ensemble system is proposed. In this work, the complementary property of different features are fully explored by ‘wide’ and ‘forward’ ensemble system. In addition, instead of using the features learned from the output layer, the features learned from the penultimate layer are investigated. The final embedded features are classified with an extreme learning machine to generate a binary mask to separate a mixed signal. The experiment focuses on speech in the presence of music and the obtained results demonstrated that the proposed ensemble system has the ability to explore the complementary property of various features thoroughly under various conditions with promising separation performance

    REPRESENTATION LEARNING FOR ACTION RECOGNITION

    Get PDF
    The objective of this research work is to develop discriminative representations for human actions. The motivation stems from the fact that there are many issues encountered while capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations in the same action while maintaining discrimination with other actions is a challenging task. In literature, actions have been represented either using either low-level or high-level features. Low-level features describe the motion and appearance in small spatio-temporal volumes extracted from a video. Due to the limited space-time volume used for extracting low-level features, they are not able to account for viewpoint and actor variations or variable length actions. On the other hand, high-level features handle variations in actors, viewpoints, and duration but the resulting representation is often high-dimensional which introduces the curse of dimensionality. In this thesis, we propose new representations for describing actions by combining the advantages of both low-level and high-level features. Specifically, we investigate various linear and non-linear decomposition techniques to extract meaningful attributes in both high-level and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build action-specific dictionaries. Each dictionary retains only the discriminative information for a particular action and hence reduces inter-action similarity. Then, a sparsity-based classification method is proposed to classify the low-rank representation of clips obtained using these dictionaries. We show that this representation based on dictionary learning improves the classification performance across actions. Also, a few of the actions consist of rapid body deformations that hinder the extraction of local features from body movements. Hence, we propose to use a dictionary which is trained on convolutional neural network (CNN) features of the human body in various poses to reliably identify actors from the background. Particularly, we demonstrate the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation. In the first two approaches, sparsity-based representation is developed to improve discriminability using class-specific dictionaries that utilize action labels. However, developing an unsupervised representation of actions is more beneficial as it can be used to both recognize similar actions and localize actions. We propose to exploit inter-action similarity to train a universal attribute model (UAM) in order to learn action attributes (common and distinct) implicitly across all the actions. Using maximum aposteriori (MAP) adaptation, a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains redundant attributes of all other actions, we use factor analysis to extract a novel lowvi dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips that contributes to action recognition without the help of any classifiers. It is observed during our experiments that action-vector cannot effectively discriminate between actions which are visually similar to each other. Hence, we subject action-vectors to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complimentary information across action-vectors using different local features followed by discriminative embedding provides the best classification performance. Further, we explore non-linear embedding of action-vectors using Siamese networks especially for fine-grained action recognition. A visualization of the hidden layer output in Siamese networks shows its ability to effectively separate visually similar actions. This leads to better classification performance than linear embedding on fine-grained action recognition. All of the above approaches are presented on large unconstrained datasets with hundreds of examples per action. However, actions in surveillance videos like snatch thefts are difficult to model because of the diverse variety of scenarios in which they occur and very few labeled examples. Hence, we propose to utilize the universal attribute model (UAM) trained on large action datasets to represent such actions. Specifically, we show that there are similarities between certain actions in the large datasets with snatch thefts which help in extracting a representation for snatch thefts using the attributes from the UAM. This representation is shown to be effective in distinguishing snatch thefts from regular actions with high accuracy.In summary, this thesis proposes both supervised and unsupervised approaches for representing actions which provide better discrimination than existing representations. The first approach presents a dictionary learning based sparse representation for effective discrimination of actions. Also, we propose a sparse representation for the human body based on dictionaries in order to recognize actions with rapid body deformations. In the next approach, a low-dimensional representation called action-vector for unsupervised action recognition is presented. Further, linear and non-linear embedding of action-vectors is proposed for addressing inter-action similarity and fine-grained action recognition, respectively. Finally, we propose a representation for locating snatch thefts among thousands of regular interactions in surveillance videos
    corecore