    An auditory saliency pooling-based LSTM model for speech intelligibility classification

    Speech intelligibility is a crucial element in oral communication that can be influenced by multiple factors, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking our previous work, a SIC system based on an attentional long short-term memory (LSTM) network, as a starting point, we deal with the problem of inadequate learning of the attention weights due to training data scarcity. To overcome this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling, where the WP weights are not automatically learned during the training process of the network but are obtained from an external source of information: Kalinli's auditory saliency model. In this way, we intend to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-Speech dataset, which comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli's saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level. The work leading to these results has been supported by the Spanish Ministry of Economy, Industry and Competitiveness through the TEC2017-84395-P (MINECO) and TEC2017-84593-C2-1-R (MINECO) projects (AEI/FEDER, UE), and by the Universidad Carlos III de Madrid under Strategic Action 2018/00071/001.
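
    As a rough illustration of the saliency pooling idea described above (weighted pooling whose weights come from an external auditory saliency model rather than from learned attention), consider the following minimal PyTorch sketch. All names, shapes, and layer sizes are assumptions for illustration, not the authors' implementation, and the per-frame saliency weights are taken as a precomputed input.

        import torch
        import torch.nn as nn

        class SaliencyPoolingLSTM(nn.Module):
            def __init__(self, n_feats, hidden, n_classes):
                super().__init__()
                self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
                self.clf = nn.Linear(hidden, n_classes)

            def forward(self, x, saliency):
                # x: (batch, frames, n_feats); saliency: (batch, frames),
                # per-frame weights from an external model (e.g., Kalinli's)
                h, _ = self.lstm(x)                                # per-frame hidden states
                w = saliency / saliency.sum(dim=1, keepdim=True)   # normalize to pooling weights
                pooled = (h * w.unsqueeze(-1)).sum(dim=1)          # weighted pooling, nothing learned here
                return self.clf(pooled)                            # utterance-level intelligibility logits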

    INTERPRETABILITY FOR ARTIFICIAL INTELLIGENCE IN SPEAKER RECOGNITION TASKS

    In the future, the DOD will more frequently incorporate AI into tasks with high consequences and, in turn, be scrutinized for mistakes. So far, much AI development has occurred in the private sector, where the consequences of error are lower than in defense. The DOD faces different incentives for the interpretability of AI and needs AI to aid decision-makers instead of replacing them. My thesis project implements techniques to improve trust in a speaker recognition task by focusing on meaningful and theoretically sound feature extraction and model simplicity. My main result is expected and elicits action from the DOD: I find that a convolutional neural network (CNN) model performs substantially better than a multilayer perceptron and that logistic regression cannot discern speaker identity. Thus, the DOD needs to focus on research to develop interpretable models for complex tasks. The other result I find is surprising and motivates interpretability: when I construct features using mel frequency cepstral coefficients (MFCCs) on a human speech signal with an improperly long window, my CNN achieves an accuracy of 92%. Ill-defined MFCCs are theoretically meaningless; however, I find that they help predict speaker identity. Further confounding this result, I find that when I construct the MFCCs using the suggested 30 ms window, the model's accuracy falls to 72%. Future research should explore disentangled CNN-based models and the concept of an MFCC as the windowing time grows. Civilian, Department of the Navy. Approved for public release; distribution is unlimited.
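
    The window-length contrast at the heart of this result is easy to reproduce. The sketch below, using librosa, computes MFCCs once with a roughly 30 ms window (where frames are quasi-stationary and the coefficients have their usual short-time interpretation) and once with an improperly long window; the specific sample rate, FFT sizes, and example audio are assumptions for demonstration, not the thesis configuration.

        import librosa

        y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

        # ~30 ms analysis window: 512 samples / 16000 Hz = 32 ms
        mfcc_30ms = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                         n_fft=512, hop_length=256)

        # Improperly long window (~512 ms): each frame spans many phonemes,
        # so the cepstral coefficients lose their short-time meaning
        mfcc_long = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                         n_fft=8192, hop_length=4096)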

    Image quality assessment using two-dimensional complex mel-cepstrum

    Assessment of visual quality plays a crucial role in the modeling, implementation, and optimization of image- and video-processing applications. Image quality assessment (IQA) techniques basically extract features from the images to generate objective scores. Feature-based IQA methods generally consist of two complementary phases: (1) feature extraction and (2) feature pooling. For feature extraction in the IQA framework, various algorithms have been used, and recently the two-dimensional (2-D) mel-cepstrum (2-DMC) feature extraction scheme has provided promising results in a feature-based IQA framework. However, the 2-DMC feature extraction scheme completely loses image-phase information, which may contain high-frequency characteristics and important structural components of the image. In this work, the "2-D complex mel-cepstrum" is proposed for feature extraction in an IQA framework. The method integrates Fourier transform phase information into the 2-DMC, which was shown to be an efficient feature extraction scheme for assessment of image quality. Support vector regression is used for feature pooling, providing a mapping between the proposed features and the subjective scores. Experimental results show that the proposed technique obtains promising results for the IQA problem by making use of the image-phase information. © 2016 SPIE and IS&T
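
    To make the phase-retention point concrete, here is a toy 2-D complex-cepstrum computation in NumPy. It keeps the Fourier phase via the complex logarithm (with a naive per-axis phase unwrap), which is exactly the information an ordinary magnitude-only cepstrum discards; the mel-scale frequency warping of the actual 2-DMC scheme is omitted, so this is an assumption-laden sketch, not the paper's feature extractor.

        import numpy as np

        def complex_cepstrum_2d(img, eps=1e-8):
            F = np.fft.fft2(img.astype(float))
            # complex log: log|F| + j * unwrapped phase
            log_mag = np.log(np.abs(F) + eps)
            phase = np.unwrap(np.unwrap(np.angle(F), axis=0), axis=1)
            return np.real(np.fft.ifft2(log_mag + 1j * phase))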

    Structure Learning in Audio

    Automated generation of movie tributes

    This thesis' purpose is to generate a movie tribute in the form of a videoclip for a given movie and a coherent music segment. A tribute is considered to be a video containing meaningful clips from the movie playing along with a cohesive music piece. In this work, we collect the clips by summarizing the movie subtitles with a generic summarization algorithm. It is important that the artifact is coherent and fluid, hence the need to balance the selection of important content against the selection of content that is in harmony with the music. To achieve this, clips are filtered so as to ensure that only those that contain the same emotion as the music appear in the final video. This is done by extracting vectors of emotion-related audio features from the scenes the clips belong to and from the music, and then comparing them with a distance measure. Finally, the filtered clips fill the music length in chronological order. Results were positive: in human evaluation, the produced tributes obtained average scores of 7, on a scale from 0 to 10, on content selection and emotional coherence criteria.
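
    The emotion-matching step described above reduces to a distance test between feature vectors. A minimal sketch under assumptions follows: the feature extractor, the distance metric (Euclidean here), and the threshold are placeholders for illustration, not the thesis pipeline.

        import numpy as np

        def filter_clips(clip_feats, music_feat, threshold):
            # clip_feats: per-scene emotion-related feature vectors;
            # music_feat: the same features extracted from the music
            keep = []
            for i, f in enumerate(clip_feats):
                if np.linalg.norm(f - music_feat) <= threshold:  # same emotion, roughly
                    keep.append(i)
            return keep  # indices of retained clips, later placed chronologically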

    Recent Advances in Signal Processing

    Signal processing is a critical task in the majority of new technological inventions and challenges, across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward both students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories address, in order, image processing, speech processing, communication systems, time-series analysis, and educational packages. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.

    On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

    Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions, or physiological issues. In this work, we deal with the problem of automatic prediction of the speech intelligibility level in this latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. In the first, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. In the second, we explore two different strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at decision level (late fusion) and at utterance level (Weighted-Pooling, WP, fusion). The proposed models are evaluated with the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model the modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by the LSTM-based architectures, with the system combining the WP fusion strategy and Attention-Pooling achieving the best results. The work leading to these results has been partly supported by the Spanish Government-MinECo under Projects TEC2017-84395-P and TEC2017-84593-C2-1-R.
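
    For intuition on the utterance-level (WP) fusion with attention pooling, here is a minimal PyTorch sketch: two LSTM streams (log-mel and modulation spectrograms) are each attention-pooled into an utterance embedding, and the embeddings are concatenated before classification. Layer sizes, the shared scorer, and the concatenation choice are illustrative assumptions, not the paper's exact architecture.

        import torch
        import torch.nn as nn

        class WPFusionLSTM(nn.Module):
            def __init__(self, d_mel, d_mod, hidden, n_classes):
                super().__init__()
                self.lstm_mel = nn.LSTM(d_mel, hidden, batch_first=True)
                self.lstm_mod = nn.LSTM(d_mod, hidden, batch_first=True)
                self.att = nn.Linear(hidden, 1)        # attention scorer for pooling
                self.clf = nn.Linear(2 * hidden, n_classes)

            def pool(self, h):
                w = torch.softmax(self.att(h), dim=1)  # attention-pooling weights over frames
                return (w * h).sum(dim=1)              # utterance-level embedding

            def forward(self, mel, mod):
                h_mel, _ = self.lstm_mel(mel)          # (batch, frames, hidden)
                h_mod, _ = self.lstm_mod(mod)
                # WP fusion: combine the two pooled utterance embeddings
                fused = torch.cat([self.pool(h_mel), self.pool(h_mod)], dim=-1)
                return self.clf(fused)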

    Towards user-friendly audio creation
