38 research outputs found
Wavelet transforms for non-uniform speech recognition
An algorithm for nonuniform speech segmentation and its application in speech recognition systems is presented. A method based on the Modulated Gaussian Wavelet Transform based Speech Analyser (MGWTSA) and a subsequent parametrization block transforms a uniformly sampled signal into a set of nonuniformly spaced frames, whose information is then fed into a speech recognition system. The algorithm places a frame to characterize the signal only where necessary, reducing the number of frames per signal as much as possible without an appreciable reduction in the recognition rate of the system.
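As a rough illustration of the idea, the sketch below places frames only where the wavelet energy of the signal changes quickly. It is a minimal approximation that uses PyWavelets' complex Morlet wavelet in place of the modulated Gaussian wavelet; the actual MGWTSA scales, thresholds and parametrization are not given in the abstract, so every value here is an illustrative assumption.

```python
import numpy as np
import pywt

# Minimal sketch: place analysis frames only where the wavelet energy of the
# signal changes quickly, so stationary stretches get few frames. A complex
# Morlet wavelet stands in for the modulated Gaussian wavelet; the scales and
# threshold factor are illustrative assumptions.
def nonuniform_frame_centres(signal, fs, scales=None, factor=1.5):
    """Return sample indices at which to centre analysis frames."""
    if scales is None:
        scales = np.geomspace(4, 64, 16)
    coefs, _ = pywt.cwt(signal, scales, "cmor1.5-1.0")
    energy = np.abs(coefs).sum(axis=0)              # wavelet energy per sample
    delta = np.abs(np.diff(energy, prepend=energy[0]))
    candidates = np.flatnonzero(delta > factor * delta.mean())
    min_gap = int(0.01 * fs)                        # at most one frame per 10 ms
    kept, last = [], -min_gap
    for c in candidates:
        if c - last >= min_gap:
            kept.append(c)
            last = c
    return np.array(kept)
```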
Audio segmentation-by-classification approach based on factor analysis in broadcast news domain
This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates for within-class variability by using class-dependent factor loading matrices, and obtains scores by computing the log-likelihood ratio between the class model and a non-class model over fixed-length windows. These scores are then smoothed by different back-end systems to yield longer contiguous segments of the same class. Unlike previous solutions, our proposal does not make use of class-specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audio from TV shows into five acoustic classes: speech, music, speech with music, speech with noise, and others. The technique is compared to a hierarchical system with specific acoustic features, achieving a significant error reduction.
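The scheme can be sketched as follows: log-likelihood ratio scores over fixed-length windows, followed by back-end smoothing. The paper's factor-analysis compensation with class-dependent loading matrices is replaced here by generic per-class models (any object with a `score_samples` method, e.g. a fitted scikit-learn `GaussianMixture`), so this is an approximation of the scheme rather than the paper's system.

```python
from collections import Counter

# Illustrative sketch: log-likelihood ratio scores over fixed-length windows,
# then majority-vote smoothing. Per-class models stand in for the paper's
# factor-analysis compensation.
def llr_segmentation(features, class_models, nonclass_models, win=100):
    """features: (T, D) array of frame-level acoustic features."""
    labels = []
    for start in range(0, len(features), win):
        chunk = features[start:start + win]
        llr = {name: class_models[name].score_samples(chunk).mean()
                     - nonclass_models[name].score_samples(chunk).mean()
               for name in class_models}
        labels.append(max(llr, key=llr.get))        # best-scoring class wins
    return labels

def smooth(labels, k=5):
    """Back-end smoothing: majority vote over a sliding window of k windows."""
    half = k // 2
    return [Counter(labels[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]
```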
Unsupervised adaptation of deep speech activity detection models to unseen domains
Speech Activity Detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data differ from training data, due to domain shift. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both issues, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely, data from Apollo space missions coming from the Fearless Steps Challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method yields a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can improve the results even further, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system.
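The Deep CORAL objective referenced above minimises the distance between the second-order statistics of source and target features. A minimal PyTorch sketch, assuming mini-batches of pooled feature vectors from the same layer of the SAD network:

```python
import torch

# Minimal sketch of the Deep CORAL loss: align the covariances of source and
# target feature batches. `source` and `target` are (batch, dim) activations.
def coral_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    d = source.size(1)
    def cov(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return (xm.t() @ xm) / (x.size(0) - 1)
    # Squared Frobenius norm of the covariance gap, scaled as in Deep CORAL
    return ((cov(source) - cov(target)) ** 2).sum() / (4.0 * d * d)
```

In practice this term would be added to the supervised SAD loss on source data, weighted by a trade-off factor, with unlabelled target batches driving the covariance alignment.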
Text-to-Pictogram Summarization for Augmentative and Alternative Communication
Many people suffer from language disorders that affect their communicative capabilities. Augmentative and alternative communication devices assist the learning process through graphical representations of common words. In this article, we present a complete text-to-pictogram system able to simplify complex texts and ease their comprehension with pictograms.
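A toy sketch of the final lookup stage such a system might use; the `PICTOGRAMS` dictionary and its identifiers are invented for illustration and are not the article's actual resources.

```python
# Toy sketch of the final lookup stage: map simplified, lemmatised content
# words to pictogram identifiers. The dictionary and IDs are invented for
# illustration only.
PICTOGRAMS = {"dog": 2739, "eat": 6456, "house": 2317}

def text_to_pictograms(lemmas):
    """lemmas: content words of the simplified text, in order."""
    return [(word, PICTOGRAMS.get(word)) for word in lemmas]

print(text_to_pictograms(["dog", "eat", "house"]))
# [('dog', 2739), ('eat', 6456), ('house', 2317)]
```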
Progressive loss functions for speech enhancement with deep neural networks
The progressive paradigm is a promising strategy to optimize network performance for speech enhancement purposes. Recent works have shown different strategies to improve the accuracy of speech enhancement solutions based on this mechanism. This paper studies progressive speech enhancement using convolutional and residual neural network architectures and explores two criteria for loss function optimization: weighted and uniform progressive. This work carries out the evaluation on simulated and real speech samples with reverberation and added noise, using the REVERB and VoiceHome datasets. Experimental results vary across the loss function optimization criteria and the network architectures, and show that the progressive design strengthens the model and increases its robustness to distortions caused by reverberation and noise.
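The two loss criteria can be sketched as follows, assuming a network that emits one progressively enhanced estimate per stage; the linear weight ramp for the weighted variant is an assumption, since the abstract does not specify the exact weighting scheme.

```python
import torch.nn.functional as F

# Sketch of the two progressive criteria. The linear weight ramp for the
# weighted variant is an assumption, not the paper's exact scheme.
def progressive_loss(stage_outputs, target, weighted=True):
    """stage_outputs: list of tensors, one per enhancement stage."""
    n = len(stage_outputs)
    if weighted:
        weights = [(i + 1) / n for i in range(n)]   # later stages count more
    else:
        weights = [1.0] * n                          # uniform progressive
    total = sum(weights)
    return sum(w / total * F.mse_loss(out, target)
               for w, out in zip(weights, stage_outputs))
```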
ASLP-MULAN: Audio speech and language processing for multimedia analytics
Our intention is to generate the right mixture of audio, speech and language technologies with big data technologies. Several audio, speech and language technologies are available, or are gaining enough maturity, to contribute to this objective: automatic speech transcription, query by spoken example, spoken information retrieval, natural language processing, transcription and description of unstructured multimedia content, multimedia file summarization, spoken emotion detection and sentiment analysis, speech and text understanding, etc. These technologies are worth combining and putting to work on automatically captured data streams from sources such as YouTube, Facebook, Twitter, online newspapers and web search engines, in order to automatically generate reports that include both scientifically based scores and subjective but relevant summarized statements on tendency analysis and on the satisfaction with a product, a company or another entity as perceived by the general population.
AMIC: Affective multimedia analytics with inclusive and natural communication
Traditionally, textual content has been the main source for information extraction and indexing; technologies capable of extracting information from the audio and video of multimedia documents were added later. The other major axis of analysis is the emotional and affective aspect intrinsic to human communication. This information on emotions, stances, preferences, figurative language, irony, sarcasm, etc. is fundamental and irreplaceable for a complete understanding of the content of conversations, speeches, debates, discussions, etc. The objective of this project is to advance, develop and improve speech and language technologies, as well as image and video technologies, for the analysis of multimedia content, adding to this analysis the extraction of affective-emotional information. As additional steps forward, we will advance the methodologies and ways of presenting information to the user, working on technologies for language simplification, automatic report and summary generation, emotional speech synthesis, and natural and inclusive interaction.
Modelling of the analytic spectrum for speech recognition
In this paper, a new spectral representation is introduced and applied to speech recognition. Like the widely used LPC autocorrelation technique, it arises from an optimization approach that starts from a set of M + 1 autocorrelations estimated from the signal samples. This new technique models the analytic spectrum (the Fourier transform of the causal autocorrelation sequence) by assuming that its cepstral coefficients are zero beyond M, and uses an extremely simple algorithm to compute the nonzero coefficients. In speech recognition, the same Euclidean cepstral distance measure that is the object of the optimization is also used to calculate the spectral dissimilarity. Preliminary recognition tests with this technique are presented.
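The Euclidean cepstral distance that serves as both the optimization objective and the recognition dissimilarity can be sketched directly: with cepstral coefficients assumed zero beyond M, it reduces to a plain Euclidean distance over the first M coefficients.

```python
import numpy as np

# With cepstral coefficients assumed zero beyond M, the Euclidean cepstral
# distance between two spectral models reduces to the Euclidean distance
# between their first M coefficients.
def cepstral_distance(c_ref, c_test):
    """c_ref, c_test: length-M arrays of cepstral coefficients c_1..c_M."""
    c_ref, c_test = np.asarray(c_ref), np.asarray(c_test)
    return float(np.sqrt(np.sum((c_ref - c_test) ** 2)))
```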
Automatic Language Identification in Spoken Language
Automatic language identification (LID) is the task of recognizing which language is being spoken in a conversation. Two typical problems arise: identification, where we decide the language from a known set of possibilities; and detection, where we decide whether or not the conversation is spoken in a target language. The main applications are call routing in call centres, audio description, and military security. The main techniques fall into three groups: a) acoustic techniques: short-term frequency features are extracted from the signal, mainly via mel frequency cepstral coefficients (MFCC); b) token-based techniques: the signal is partitioned into pre-established groups (tokens) and their frequencies and order of appearance are studied, as in phone recognition followed by language modelling (PRLM), where the tokens are phonemes; c) prosodic techniques: long-term suprasegmental features of the signal are extracted, such as pitch, energy, duration or formants. Once one or more of these parameter sets have been extracted, pattern recognition techniques are used to build models of each language, with which classification is performed. Our group is mainly investigating acoustic and prosodic techniques, using classifiers based on iVectors, which are in turn based on factor analysis. To compare performance across research groups, national and international evaluations are held with a wide variety of languages, where our group has obtained very good results in recent editions.
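A minimal sketch of the acoustic front-end described above, using librosa to extract short-term MFCC features that would feed a downstream i-vector classifier; the file name, sample rate and feature sizes are placeholders.

```python
import numpy as np
import librosa

# Short-term MFCC front-end; "speech.wav" and the parameter values are
# placeholders. Delta features are appended, as is common before an
# i-vector (factor-analysis) back-end.
y, sr = librosa.load("speech.wav", sr=8000)          # telephone-band audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
delta = librosa.feature.delta(mfcc)
features = np.vstack([mfcc, delta])                  # shape (26, n_frames)
```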
Albayzín-2014 evaluation: audio segmentation and classification in broadcast news domains
The electronic version of this article is the complete one and can be found online at: http://dx.doi.org/10.1186/s13636-015-0076-3. Audio segmentation is important as a pre-processing task to improve the performance of many speech technology tasks and, therefore, has an undoubted research interest. This paper describes the database, the metric, the systems and the results of the Albayzín-2014 audio segmentation campaign. In contrast to previous evaluations, where the task was the segmentation of non-overlapping classes, the Albayzín-2014 evaluation proposes the delimitation of the presence of speech, music and/or noise, which can occur simultaneously. The database used in the evaluation was created by fusing different media and noises in order to increase the difficulty of the task. Seven segmentation systems from four different research groups were evaluated and combined. Their experimental results were analyzed and compared with the aim of providing a benchmark and highlighting promising directions in this field. This work has been partially funded by the Spanish Government and the European Union (FEDER) under the project TIN2011-28169-C05-02 and supported by the European Regional Development Fund and the Spanish Government ('SpeechTech4All Project' TEC2012-38939-C03