12 research outputs found

    Text-dependent Forensic Voice Comparison: Likelihood Ratio Estimation with the Hidden Markov Model (HMM) and Gaussian Mixture Model – Universal Background Model (GMM-UBM) Approaches

    Among the more typical forensic voice comparison (FVC) approaches, the acoustic-phonetic statistical approach is suitable for text-dependent FVC, but it does not fully exploit the time-varying information available in speech. The automatic approach, on the other hand, essentially deals with text-independent cases, which means temporal information is not explicitly incorporated in the modelling. Text-dependent likelihood ratio (LR)-based FVC studies, in particular those that adopt the automatic approach, are few. This preliminary LR-based FVC study compares two statistical models, the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM), for the calculation of forensic LRs using the same speech data. FVC experiments were carried out using Japanese short words of different lengths under a forensically realistic but challenging condition: only two speech tokens for model training and LR estimation. Log-likelihood-ratio cost (Cllr) was used as the assessment metric. The study demonstrates that the HMM system consistently outperforms the GMM system in terms of average Cllr values. However, words longer than three morae are needed for the advantage of the HMM to become evident. With a seven-mora word, for example, the HMM outperformed the GMM by a Cllr value of 0.073.
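The Cllr metric named in this abstract can be illustrated with a minimal sketch. It uses the standard definition of the log-likelihood-ratio cost and is not the authors' evaluation code; the function name `cllr` and the sample LR values are hypothetical.

```python
import math

def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost computed from plain LRs (not log LRs).

    Same-speaker comparisons are penalised when the LR is small,
    different-speaker comparisons when the LR is large.
    """
    ss_penalty = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs)
    ds_penalty = sum(math.log2(1 + lr) for lr in different_speaker_lrs)
    return 0.5 * (ss_penalty / len(same_speaker_lrs)
                  + ds_penalty / len(different_speaker_lrs))

# Hypothetical LR values, for illustration only.
uninformative = cllr([1.0], [1.0])            # a system that always outputs LR = 1
well_behaved = cllr([100.0, 50.0], [0.01, 0.02])
misleading = cllr([0.01], [100.0])
```

A system that always outputs LR = 1 scores exactly 1, so values below 1 indicate that the LRs carry useful information, and lower is better.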

    DNN-Based Source Enhancement to Increase Objective Sound Quality Assessment Score

    We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been used widely for sound-quality evaluation, constructing DNNs to increase OSQA scores would be better than using the minimum-MSE criterion to create high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is also used for training a computer that plays a game. For the black-box optimization scheme, we adopt the policy gradient method, which calculates the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.
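The sampling-based gradient described in this abstract can be sketched in a heavily simplified form. Assumptions: a scalar Gaussian with a learnable mean stands in for the DNN-estimated density, and the hypothetical `black_box_score` stands in for an OSQA score such as PESQ. This is a generic score-function (REINFORCE-style) policy gradient, not the paper's training procedure.

```python
import random

def black_box_score(x):
    # Hypothetical stand-in for an OSQA score: it can only be evaluated,
    # never differentiated by the optimizer. It peaks at x = 3.
    return -abs(x - 3.0)

def train_policy(steps=2000, samples=16, sigma=0.5, lr=0.02, seed=0):
    """Move the mean of a Gaussian sampling distribution toward high scores."""
    rng = random.Random(seed)
    mu = 0.0  # the only learnable parameter in this toy setting
    for _ in range(steps):
        xs = [rng.gauss(mu, sigma) for _ in range(samples)]
        scores = [black_box_score(x) for x in xs]
        baseline = sum(scores) / samples  # reduces gradient variance
        # Score-function estimator: d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2
        grad = sum((s - baseline) * (x - mu) / sigma ** 2
                   for x, s in zip(xs, scores)) / samples
        mu += lr * grad  # gradient ascent on the expected score
    return mu

learned_mu = train_policy()
```

Samples that score above the batch average pull the mean toward themselves, so the distribution drifts toward the region where the black-box score is highest without ever requiring the score's gradient.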

    Construction of a corpus of elderly Japanese speech for analysis and recognition

    Tokushima University; Aichi Prefectural University; University of Yamanashi. LREC 2018 Special Speech Sessions "Speech Resources Collection in Real-World Situations"; Phoenix Seagaia Conference Center, Miyazaki; 2018-05-09.
    We have constructed a new speech data corpus using the utterances of 100 elderly Japanese people, in order to improve the accuracy of automatic recognition of the speech of older people. Humanoid robots are being developed for use in elder care nursing facilities because interaction with such robots is expected to help clients maintain their cognitive abilities, as well as provide them with companionship. In order for these robots to interact with the elderly through spoken dialogue, a high-performance speech recognition system for the speech of elderly people is needed. To develop such a system, we recorded speech uttered by 100 elderly Japanese people with an average age of 77.2, most of them living in nursing homes. Another corpus of elderly Japanese speech, S-JNAS (Seniors-Japanese Newspaper Article Sentences), has been developed previously, but the average age of its participants was 67.6. Since the target age for nursing home care is around 75, much higher than that of most of the S-JNAS samples, we felt a more representative corpus was needed. In this study we compare our new corpus with both the Japanese read speech corpus JNAS (Japanese Newspaper Article Speech), which consists of adult speech, and S-JNAS, the senior version of JNAS, by conducting speech recognition experiments. Data from JNAS, S-JNAS and CSJ (Corpus of Spontaneous Japanese) were each used as training data for the acoustic models. We then used our new corpus to adapt the acoustic models to elderly speech, but we were unable to achieve sufficient performance when attempting to recognize elderly speech. Based on our experimental results, we believe that development of a corpus of spontaneous elderly speech and/or special acoustic adaptation methods will likely be necessary to improve the recognition performance of dialog systems for the elderly.

    SAS: A Speaker Verification Spoofing Database Containing Diverse Attacks

    Due to copyright restrictions, the access to the full text of this article is only available via subscription.
    This paper presents the first version of a speaker verification spoofing and anti-spoofing database, named SAS corpus. The corpus includes nine spoofing techniques, two of which are speech synthesis, and seven are voice conversion. We design two protocols, one for standard speaker verification evaluation, and the other for producing spoofing materials. Hence, they allow the speech synthesis community to produce spoofing materials incrementally without knowledge of speaker verification spoofing and anti-spoofing. To provide a set of preliminary results, we conducted speaker verification experiments using two state-of-the-art systems. Without any anti-spoofing techniques, the two systems are extremely vulnerable to the spoofing attacks implemented in our SAS corpus.
    EPSRC ; CAF ; TÜBİTA

    Integración de bases de datos para la detección de ataques mediante Spoofing (Database integration for the detection of spoofing attacks)

    Security is an inherent human need that extends to our actions, our belongings and, ultimately, ourselves. Information, and its effect on our own integrity, is no exception, and researchers must integrate the concept of security into the development of every project we undertake. Two main areas can be distinguished in this field: physical security and information security. Physical security is a strategy to protect facilities, assets, resources and people from incidents or actions that can cause them loss or damage. Information security is a strategy to protect the integrity and privacy of content through digital security. Today the most common means of identification are passwords, keys and cards; one drawback of these methods is that they can be stolen or forgotten. Biometrics, a newer practice, is being used to implement both physical and information security. Compared with traditional methods such as passwords and keys, a biometric trait is something a person always carries, and that is its main advantage. Fingerprints, facial structure, the iris and the voice are all commonly used in biometric security. Voice biometrics is the science of using a person's voice as a uniquely identifying biological characteristic to authenticate them. Also known as voice verification or speaker recognition, it enables fast, non-intrusive and secure access for a variety of use cases, from call centers, mobile applications and online applications to chatbots, IoT (Internet of Things) devices and physical access. The need for security systems arises from the existence of a real risk: there is something or someone to guard against. In voice biometrics, so-called spoofing (identity impersonation) attacks constitute a major security threat. To counter these attacks, various studies and institutions implement synthetic speech detection (SSD) modules. This technology is based on a classifier with two models, one of human speech and one of synthetic speech. When a user tries to verify their identity, the signal is compared with both models and, if the difference in similarity exceeds a threshold, it is accepted as human; otherwise it is rejected and classified as synthetic. These systems must be trained, and a large number of voice recordings are used to create the models mentioned above. This Final Degree Project studies how such systems use databases to detect spoofing attacks. To this end it uses an SSD based on both MFCC spectral parameters and harmonic phase (RPS) parameters. Tests are also carried out with deep neural networks (DNNs), with the ultimate aim of reducing the probability of error in the classification task.

    Proceedings of the LREC 2018 Special Speech Sessions

    LREC 2018 Special Speech Sessions "Speech Resources Collection in Real-World Situations"; Phoenix Seagaia Conference Center, Miyazaki; 2018-05-0

    Utilización de la fase armónica en la detección de voz sintética (Use of harmonic phase in the detection of synthetic speech)

    156 p. Speaker verification (SV) systems must contend with the possibility of being attacked through spoofing techniques. Today, voice conversion and speaker-adapted speech synthesis technologies have advanced enough to create voices capable of deceiving an SV system. This thesis proposes a synthetic speech detection (SSD) module that can be used as a complement to an SV system but can also operate independently. It consists of a GMM-based classifier equipped with models of human and synthetic speech. Each input is compared with both models and, if the difference in likelihoods exceeds a given threshold, it is accepted as human; otherwise it is rejected. The system developed is speaker-independent. RPS parameters are used to build the models. A technique is proposed to reduce the complexity of the training process, avoiding the need to build an adapted TTS system or a voice converter for each speaker. Because most modern adaptation and synthesis systems rely on vocoders, the thesis proposes transcoding human signals through vocoders to obtain synthetic versions of them, which are then used to build the classifier's synthetic models. It is shown that synthetic signals can be detected by detecting that they were created with a vocoder. The system's performance is tested under different conditions: with the transcoded signals themselves or with TTS attacks. Finally, strategies for training models for SSD systems are proposed.
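The accept/reject rule in this abstract, a threshold on the difference between human-model and synthetic-model likelihoods, can be sketched as follows. Assumptions: scalar toy features stand in for RPS parameter vectors, the GMM parameters are hand-picked rather than trained, and the names `gmm_loglik`, `is_human`, `HUMAN_GMM` and `SYNTHETIC_GMM` are hypothetical.

```python
import math

def gmm_loglik(frames, gmm):
    """Average per-frame log-likelihood under a 1-D GMM.

    `gmm` is a list of (weight, mean, variance) tuples.
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) - 0.5 * (math.log(2 * math.pi * var)
                                      + (x - mean) ** 2 / var)
                 for w, mean, var in gmm]
        m = max(comps)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(c - m) for c in comps))
    return total / len(frames)

def is_human(frames, human_gmm, synthetic_gmm, threshold=0.0):
    """Accept as human when the log-likelihood difference exceeds the threshold."""
    diff = gmm_loglik(frames, human_gmm) - gmm_loglik(frames, synthetic_gmm)
    return diff > threshold

# Hand-picked toy models, for illustration only.
HUMAN_GMM = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
SYNTHETIC_GMM = [(0.5, 4.0, 1.0), (0.5, 6.0, 1.0)]

accepted = is_human([0.0, 0.5, -0.5], HUMAN_GMM, SYNTHETIC_GMM)
rejected = not is_human([5.0, 4.5, 5.5], HUMAN_GMM, SYNTHETIC_GMM)
```

Raising the threshold makes the detector stricter: fewer synthetic signals are accepted, at the cost of rejecting more genuine human speech.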