Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events are ubiquitous in human-to-human conversation, e.g., coughs,
laughs and sniffs. Compared with regular speech, these trivial events are usually
short and unclear, so they are generally regarded as not speaker-discriminative
and are largely ignored by current speaker recognition research. However, these
trivial events are highly valuable in particular circumstances such as
forensic examination, as they are less subject to intentional change and can
therefore be used to identify the genuine speaker behind disguised speech. In this paper,
we collect a trivial-event speech database that covers 75 speakers and 6
types of events, and report preliminary speaker recognition results on this
database from both human listeners and machines. In particular, the deep feature
learning technique recently proposed by our group is used to analyze and
recognize the trivial events, yielding acceptable equal error rates
(EERs) despite the extremely short durations (0.2-0.5 seconds) of these events.
Comparing the different event types, 'hmm' appears to be the most speaker-discriminative.
Comment: ICASSP 201
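The EERs reported above can be computed from genuine (same-speaker) and impostor (different-speaker) trial scores. A minimal sketch of that metric, using synthetic scores in place of a real trial list (the function name and score distributions are illustrative, not from the paper):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER from genuine (target) and impostor trial scores.

    Sweeps a decision threshold over all observed scores and returns the
    operating point where false-acceptance and false-rejection rates meet.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Toy example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(-2.0, 1.0, 1000)
print(equal_error_rate(genuine, impostor))  # low EER for these well-separated scores
```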
Query by Example of Speaker Audio Signals using Power Spectrum and MFCCs
"Search engine" is the popular term for an information retrieval (IR) system. Search engines are typically based on full-text indexing; moving from text to multimedia data types makes retrieval more complex, as in retrieving images or sounds from large databases. This paper introduces the use of language- and text-independent speech as input queries to a large sound database, using a speaker identification algorithm. The method consists of two main steps: first, vocal segments are separated from non-vocal ones; the vocal segments are then used for speaker identification, enabling audio queries by speaker voice. For identification and querying, we estimate the similarity between the example signal and the samples in the queried database by computing the Euclidean distance between their Mel-frequency cepstral coefficients (MFCCs) and energy-spectrum acoustic features. Simulations show good performance at a sustainable computational cost, with an average accuracy above 90%.
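The abstract gives no implementation details, but the core ranking step it describes can be sketched as a Euclidean nearest-neighbor search over per-clip feature vectors. In this sketch each database item is assumed to be summarized by a precomputed mean-MFCC vector; the function name, item ids, and toy vectors are hypothetical:

```python
import numpy as np

def rank_by_euclidean(query_feat, database_feats):
    """Rank database entries by Euclidean distance to the query feature.

    query_feat: 1-D array, e.g. the mean MFCC vector of the query clip.
    database_feats: dict mapping item id -> 1-D feature array of equal size.
    Returns item ids sorted from most to least similar (smallest distance first).
    """
    dists = {k: np.linalg.norm(query_feat - v) for k, v in database_feats.items()}
    return sorted(dists, key=dists.get)

# Hypothetical precomputed mean-MFCC vectors (13 coefficients each).
db = {
    "speaker_A": np.array([1.0] * 13),
    "speaker_B": np.array([5.0] * 13),
}
query = np.array([1.2] * 13)
print(rank_by_euclidean(query, db))  # speaker_A is the closer match
```

In practice the feature extraction step (framing, mel filterbank, DCT) would be done by a dedicated audio library before this ranking step.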
Study and Improvement of Speaker Verification Systems under Aphonic Voice Conditions
This project analyzes the differences between speaking modes, in particular whispering, used as a means of communication by people suffering from a condition such as aphonia, and how those differences affect automatic speaker verification systems. The goal of the project is to study the resulting performance loss and to improve the systems through the application of various techniques. The study starts from an analysis of the signals in the whispered-voice and neutral-voice domains, whose differences explain the system's degradation. To quantify it, a high-performance reference system is chosen, together with a database containing audio in both normal and whispered speech conditions. The improvement techniques studied address the problem at different points of the complete system. These techniques are introduced theoretically in the second part of the work, and the third part presents the results obtained for each of them. They are evaluated and compared using free software tools, visualization tools, and statistical model training, with Python as the main programming language. The work demonstrates the performance of alternatives to popular machine learning algorithms, which are necessary when a significant amount of data for good results is not available.
Glottal Excitation Extraction of Voiced Speech - Jointly Parametric and Nonparametric Approaches
The goal of this dissertation is to develop methods to recover glottal flow pulses, which carry biometric information about the speaker. The excitation information estimated from an observed speech utterance is modeled as the source of an inverse problem. Windowed linear prediction analysis and inverse filtering are first used to deconvolve the speech signal and obtain a rough estimate of the glottal flow pulses. Linear prediction and its inverse filter can largely eliminate the vocal-tract response, which is usually modeled as an infinite impulse response filter. Vocal-tract components that remain in the estimate after inverse filtering are then removed by maximum-phase and minimum-phase decomposition, implemented by applying the complex cepstrum to the initial estimate of the glottal pulses. The additive and residual errors from inverse filtering can be suppressed by higher-order statistics, which are used to compute the cepstrum representations. Features provided directly by the glottal source's cepstrum representation, together with fitting parameters for the estimated pulses, form feature patterns that are applied to a minimum-distance classifier to realize a speaker identification system with a very limited number of subjects.
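The first stage described above, linear prediction followed by inverse filtering, can be sketched in a few lines. This is a generic autocorrelation-method LPC on a synthetic "voiced" frame (an impulse train through a toy two-pole resonator), not the dissertation's actual pipeline; the filter coefficients, order, and signal parameters are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order):
    """Autocorrelation-method linear prediction.

    Returns a = [1, a1, ..., ap] so that filtering the frame with A(z)
    (the inverse filter) approximately whitens it; the normal equations
    are solved directly with numpy.
    """
    frame = frame * np.hamming(len(frame))                   # analysis window
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                   # predictor coefficients
    return np.concatenate(([1.0], -a))

# Synthetic voiced frame: 100 Hz pitch pulses through a toy IIR "vocal tract".
excitation = np.zeros(400)
excitation[::80] = 1.0
speech = lfilter([1.0], [1.0, -1.3, 0.9], excitation)

a = lpc_coefficients(speech, order=10)
residual = lfilter(a, [1.0], speech)   # inverse filtering -> rough excitation estimate
```

The residual approximates the pulse train, i.e. a rough glottal excitation estimate; the dissertation's subsequent cepstral decomposition would refine it further.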
Vocal Tract Filter Modeling for Dysphonic Voice Reconstruction
Analysis and modeling of the spectral envelope of the nine oral vowels
of standard European Portuguese in voiced and whispered speech modes.
Development of compact, speaker-oriented models of the nine oral vowels
of standard Portuguese in the spectral domain and in the cepstral
domain. Development and evaluation of a prototype whispered-vowel
identification algorithm, oriented toward real-time operation.
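The cepstral-domain envelope models mentioned above build on a standard technique: estimating a smooth spectral envelope by low-quefrency liftering of the real cepstrum. A minimal sketch of that baseline (the function name, lifter length, and toy "vowel" signal are assumptions, not the models developed in the work):

```python
import numpy as np

def cepstral_envelope(frame, n_lifter=30, n_fft=1024):
    """Smooth spectral envelope via low-quefrency liftering.

    Takes the real cepstrum of a windowed frame, keeps only the first
    n_lifter quefrency bins (the slowly varying envelope part, plus their
    symmetric counterparts), and transforms back to a log-magnitude spectrum.
    """
    frame = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(frame, n=n_fft)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n=n_fft)      # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0                   # keep the symmetric part
    return np.fft.rfft(cepstrum * lifter).real     # smoothed log-magnitude

# Toy vowel-like frame: two sinusoids standing in for formant peaks.
t = np.arange(512) / 8000.0
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
env = cepstral_envelope(frame)
```

A compact per-vowel model could then be built from such envelopes (or their low-order cepstral coefficients) per speaker, which is the kind of representation the abstract alludes to.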