20 research outputs found
Aportación a la extracción paramétrica en reconocimiento de voz robusto basada en la aplicación de conocimiento de fonética acústica
This thesis is based on the following hypothesis: introducing direct
knowledge from the acoustic-phonetic field into the speech recognition problem,
especially in the feature extraction step, may provide a solid basis for analyzing
the behavior and capabilities of those systems, as well as for improving them.
Most of the complexity of this Ph.D. thesis comes from the different subjects
related to the speech processing area. Applying acoustic-phonetic
information to speech recognition research requires deep knowledge of both
subjects.
The research carried out in this work is divided into two main parts: an analysis
of current feature extraction methods and a study of several possible procedures
for incorporating acoustic-phonetic knowledge into those systems.
Abundant recognition and related quality measure results are presented for 50
different parameter extraction models.
Details of the real-time implementation of two different parameter extraction
models on a DSP platform (TMS320C31-60) are presented.
Finally, a set of computer tools for building and testing new speech
recognition systems has been developed. In addition, several results from
this work can be extended to other speech processing areas, such as computer-assisted
language learning, linguistic rehabilitation, etc.
---ABSTRACT---
The hypothesis underlying this thesis is that contributing direct knowledge from the
field of acoustic phonetics to the automatic speech recognition problem, specifically
to the feature extraction stage, can provide a solid basis for analyzing the behavior
and discriminative capacity of such systems, as well as a way to improve their
performance.
Part of the complexity of this doctoral thesis stems from the different disciplines
related to the area of speech processing. Applying acoustic-phonetic information to
speech recognition research requires broad knowledge of both subjects.
The research carried out in this work is divided into two main blocks: an analysis of
current methods for extracting phonetic features, and a study of some possible ways of
incorporating acoustic-phonetic knowledge into those systems.
The thesis presents abundant results on recognition rates and on measures of the
quality of this process, for a total of 50 parameter extraction models.
It also includes details of the real-time implementation of two different feature
extraction models on a DSP platform, specifically the TMS320C31-60.
In addition, a set of software tools has been developed that can serve as a basis for
building and validating new recognition systems in a simple way. Some of the results
of this work can also be applied to other areas of speech processing, such as
second-language teaching, speech therapy, etc.
Improving speaker recognition by biometric voice deconstruction
Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have had to be replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved in recent years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers, combined with a set of features derived from the components obtained by deconstructing the voice into its glottal source and vocal tract estimates, will enhance recognition rates compared to classical approaches. A general description of the main hypothesis and of the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a database recorded under highly controlled acoustic conditions and on a mobile phone network database recorded under non-controlled acoustic conditions.
Relevance of the glottal pulse and the vocal tract in gender detection
Gender detection is an important step for improving efficiency in tasks such as speech or speaker recognition, among others. Traditionally, gender detection has focused on fundamental frequency (f0) and cepstral features derived from voiced segments of speech. The methodology presented here consists of obtaining uncorrelated glottal and vocal tract components, which are parameterized as mel-frequency coefficients. K-fold cross-validation using quadratic discriminant analysis (QDA) and Gaussian mixture model (GMM) classifiers showed that better detection rates are reached when glottal source and vocal tract parameters are used on a gender-balanced database of running speech from 340 speakers.
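The cross-validation protocol mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "gender-balanced" means 170 female and 170 male speakers and that 10 folds are used; neither figure is stated in the abstract. The function name `stratified_kfold` is hypothetical, and the classifiers themselves (QDA, GMM) are not implemented here.

```python
import random

def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions in every fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

# Assumed split: 170 female and 170 male speakers (340 in total).
labels = ["F"] * 170 + ["M"] * 170
splits = list(stratified_kfold(labels, k=10))
```

Each speaker appears in exactly one test fold, so a detection rate averaged over the folds uses every speaker once for testing while keeping both genders equally represented.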
Using dysphonic voice to characterize speaker's biometry
Phonation distortion leaves relevant marks in a speaker's biometric profile. Dysphonic voice production may be used for biometric speaker characterization. In the present paper, phonation features derived from the glottal source (GS) parameterization, after vocal tract inversion, are proposed for dysphonic voice characterization in Speaker Verification tasks. The glottal source derived parameters are matched in a forensic evaluation framework defining a distance-based metric specification. The phonation segments used in the study are derived from fillers, long vowels, and other phonation segments produced in spontaneous telephone conversations. Phonated segments from a telephone database of 100 male native Spanish speakers are combined in a 10-fold cross-validation task to produce the set of quality measurements outlined in the paper. Shimmer, mucosal wave correlate, vocal fold cover biomechanical parameter unbalance and a subset of the GS cepstral profile produce accuracy rates as high as 99.57% for a wide threshold interval (62.08-75.04%). An Equal Error Rate of 0.64% can be achieved. The proposed metric framework is shown to behave more fairly than classical likelihood ratios in supporting the hypothesis of the defense vs. that of the prosecution, thus offering a more reliable evaluation scoring. Possible applications are Speaker Verification and Dysphonic Voice Grading.
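As a reminder of how an Equal Error Rate figure of this kind is typically obtained, the sketch below scans decision thresholds and reports the point where the false acceptance and false rejection rates are closest. It assumes similarity scores (higher means more likely the same speaker), whereas the paper describes a distance-based metric. The scores and function names are hypothetical and are not data from the study.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when scores at or above the threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the candidate threshold where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the mean of FAR and FRR at the closest crossing point.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Hypothetical similarity scores for same-speaker and different-speaker trials.
genuine = [0.9, 0.8, 0.75, 0.6, 0.55]
impostor = [0.7, 0.5, 0.4, 0.3, 0.2]
eer, threshold = equal_error_rate(genuine, impostor)
```

With a distance-based metric the comparison directions would be reversed (small distances accepted), but the threshold scan is otherwise the same.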
Computational approaches to Explainable Artificial Intelligence: Advances in theory, applications and trends
Deep Learning (DL), a groundbreaking branch of Machine Learning (ML), has emerged as a driving force in both theoretical and applied Artificial Intelligence (AI). DL algorithms, rooted in complex and non-linear artificial neural systems, excel at extracting high-level features from data. DL has demonstrated human-level performance in real-world tasks, including clinical diagnostics, and has unlocked solutions to previously intractable problems in virtual agent design, robotics, genomics, neuroimaging, computer vision, and industrial automation. In this paper, the most relevant advances from the last few years in AI and several applications to neuroscience, neuroimaging, computer vision, and robotics are presented, reviewed and discussed. In this way, we summarize the state of the art in AI methods, models and applications within a collection of works presented at the 9th International Conference on the Interplay between Natural and Artificial Computation (IWINAC). The works presented in this paper are excellent examples of new scientific discoveries made in laboratories that have successfully transitioned to real-life applications.
Funding for open access charge: Universidad de Granada / CBUA. The work reported here has been partially funded by many public and private bodies: by the MCIN/AEI/10.13039/501100011033/ and FEDER “Una manera de hacer Europa” under the RTI2018-098913-B100 project, by the Consejería de Economía, Innovación, Ciencia y Empleo (Junta de Andalucía) and FEDER under CV20-45250, A-TIC-080-UGR18, B-TIC-586-UGR20 and P20-00525 projects, and by the Ministerio de Universidades under the FPU18/04902 grant given to C. Jimenez-Mesa, the Margarita-Salas grant to J.E. Arco, and the Juan de la Cierva grant to D. Castillo-Barnes.
This work was supported by projects PGC2018-098813-B-C32 & RTI2018-098913-B100 (Spanish “Ministerio de Ciencia, Innovación y Universidades”), P18-RT-1624, UMA20-FEDERJA-086, CV20-45250, A-TIC-080-UGR18 and P20-00525 (Consejería de Economía y Conocimiento, Junta de Andalucía) and by European Regional Development Funds (ERDF). M.A. Formoso's work was supported by Grant PRE2019-087350 funded by MCIN/AEI/10.13039/501100011033 by “ESF Investing in your future”. The work of J.E. Arco was supported by Ministerio de Universidades, Gobierno de España through grant “Margarita Salas”.
The work reported here has been partially funded by Grant PID2020-115220RB-C22 funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union” or by the “European Union NextGenerationEU/PRTR”.
The work of Paulo Novais is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project DSAIPA/AI/0099/2019.
Ramiro Varela was supported by the Spanish State Agency for Research (AEI) grant PID2019-106263RB-I00.
José Santos was supported by the Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014–2020 Program), with grants CITIC (ED431G 2019/01), GPC ED431B 2022/33, and by the Spanish Ministry of Science and Innovation (project PID2020-116201GB-I00). The work reported here has been partially funded by Project Fondecyt 1201572 (ANID).
In [247], the project received funding from grant RTI2018-098969-B-100 from the Spanish Ministerio de Ciencia, Innovación y Universidades and from grant PROMETEO/2019/119 from the Generalitat Valenciana (Spain). In [248], the research work has been partially supported by the National Science Fund of Bulgaria (scientific project “Digital Accessibility for People with Special Needs: Methodology, Conceptual Models and Innovative Ecosystems”), Grant Number KP-06-N42/4, 08.12.2020; EC for project CybSPEED, 777720, H2020-MSCA-RISE-2017 and OP Science and Education for Smart Growth (2014–2020) for project Competence Center “Intelligent mechatronic, eco- and energy saving systems and technologies” BG05M2OP001-1.002-0023.
The work reported here has been partially funded by the support of MICIN project PID2020-116346GB-I00.
The work reported here has been partially funded by many public and private bodies: by MCIN/AEI/10.13039/501100011033 and “ERDF A way to make Europe” under the PID2020-115220RB-C21 and EQC2019-006063-P projects; by MCIN/AEI/10.13039/501100011033 and “ESF Investing in your future” under FPU16/03740 grant; by the CIBERSAM of the Instituto de Salud Carlos III; by MinCiencias project 1222-852-69927, contract 495-2020.
The work is partially supported by the Autonomous Government of Andalusia (Spain) under project UMA18-FEDERJA-084, project name Detection of anomalous behavior agents by DL in low-cost video surveillance intelligent systems. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of an RTX A6000 48 GB.
This work was conducted in the context of the Horizon Europe project PRE-ACT, and it has received funding through the European Commission Horizon Europe Program (Grant Agreement number: 101057746). In addition, this work was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22 00058.
S.B Cho was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)).
Neuromechanical Modelling of Articulatory Movements from Surface Electromyography and Speech Formants
Speech articulation is produced by the movements of muscles in the larynx, pharynx, mouth and face. Speech therefore shows acoustic features, such as formants, which are directly related to the neuromotor actions of these muscles. The first two formants are strongly related to jaw and tongue muscular activity. Speech can be used as a simple and ubiquitous signal, easy to record and process, either locally or on e-Health platforms. This may open a wide set of applications in the functional grading and monitoring of neurodegenerative diseases. A relevant question, in this sense, is how closely speech correlates and neuromotor actions are related. This preliminary study is intended to find answers to this question by using surface electromyographic recordings on the masseter and the acoustic kinematics related to the first formant. The study shows that relevant correlations can be found between the surface electromyographic activity (dynamic muscle behavior) and the positions and first derivatives of the first formant (kinematic variables related to vertical velocity and acceleration of the joint jaw and tongue biomechanical system). As an application example, it is shown that the probability density function associated with these kinematic variables is more sensitive than classical features such as Vowel Space Area (VSA) or Formant Centralization Ratio (FCR) in characterizing neuromotor degeneration in Parkinson's Disease.
This work is being funded by Grants TEC2016-77791-C4-4-R from the Ministry of Economic Affairs and Competitiveness of Spain, Teka-Park 55 02 CENIE-0348_CIE_6_E POCTEP (InterReg Programme) and 16-30805A, SIX Research Center (CZ.1.05/2.1.00/03.0072), and LO1401 from the Czech Republic Government.
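For reference, the two classical features named in this abstract can be computed from the first two formants of the corner vowels /a/, /i/ and /u/. The sketch below assumes the triangular form of the VSA and the FCR definition commonly attributed to Sapir and colleagues; the abstract does not say which variants the study used, and the formant values are hypothetical.

```python
def vowel_space_area(a, i, u):
    """Triangular VSA (Hz^2) from (F1, F2) pairs of /a/, /i/, /u/, via the shoelace formula."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = a, i, u
    return 0.5 * abs(f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a))

def formant_centralization_ratio(a, i, u):
    """FCR = (F2u + F2a + F1i + F1u) / (F2i + F1a); it grows as vowels centralize."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = a, i, u
    return (f2u + f2a + f1i + f1u) / (f2i + f1a)

# Hypothetical (F1, F2) values in Hz for one speaker; not data from the study.
vowel_a = (750.0, 1300.0)
vowel_i = (300.0, 2300.0)
vowel_u = (350.0, 800.0)

vsa = vowel_space_area(vowel_a, vowel_i, vowel_u)
fcr = formant_centralization_ratio(vowel_a, vowel_i, vowel_u)
```

Both features reduce a recording to a single static number, which is one plausible reason a distribution over formant kinematics could be more sensitive, as the abstract reports.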
Identification of Smith–Magenis syndrome cases through an experimental evaluation of machine learning methods
This research work introduces a novel, nonintrusive method for the automatic identification of Smith–Magenis syndrome, traditionally studied through genetic markers. The method utilizes cepstral peak prominence and various machine learning techniques, relying on a single metric computed by the research group. The performance of these techniques is evaluated across two case studies, each employing a unique data preprocessing approach. A proprietary data “windowing” technique is also developed to derive a more representative dataset. To address class imbalance in the dataset, the synthetic minority oversampling technique (SMOTE) is applied for data augmentation. The application of these preprocessing techniques has yielded promising results from a limited initial dataset. The study concludes that the k-nearest neighbors and linear discriminant analysis perform best, and that cepstral peak prominence is a promising measure for identifying Smith–Magenis syndrome
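The oversampling step mentioned in this abstract can be sketched as follows. This is a minimal illustration of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' pipeline. The function name, the feature vectors and the parameter values are hypothetical.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((x - y) ** 2 for x, y in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment between base and neighbour
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(base, neighbour)))
    return synthetic

# Hypothetical two-dimensional feature vectors for the minority class.
minority = [(10.2, 1.1), (11.0, 0.9), (9.7, 1.4), (10.8, 1.2), (10.1, 1.0)]
new_points = smote(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the augmented class stays inside the region already covered by the observed data.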
The Role of Data Analytics in the Assessment of Pathological Speech—A Critical Appraisal
Pathological voice characterization has received increasing attention over the last 20 years. Hundreds of studies have been published showing inventive approaches with very promising findings. Nevertheless, methodological issues might undermine the trustworthiness of performance assessment. This study reviews some critical aspects regarding data collection and processing, machine learning-oriented methods, and grounding analytical approaches, with a view to embedding developed clinical decision support tools into the diagnosis decision-making process. A set of 26 relevant studies published since 2010 was selected through critical selection criteria and evaluated. The model-driven (MD) or data-driven (DD) character of the selected approaches is examined in depth, considering novelty, originality, statistical robustness, trustworthiness, and clinical relevance. It has been found that before 2020 most of the works examined were more aligned with MD approaches, whereas over the last two years a balanced proportion of DD and MD-based studies was found. A total of 15 studies had an MD character, whereas seven were mainly DD-oriented, and four shared both profiles. Fifteen studies showed exploratory or prospective advanced statistical analysis. Eighteen included some statistical validation to support their claims. Twenty-two reported original work, whereas the remaining four were systematic reviews of others’ work. Clinical relevance and acceptability by voice specialists were found in 14 out of the 26 works commented on. Methodological issues such as detection and classification performance, training and generalization capability, explainability, preservation of semantic load, clinical acceptance, robustness, and development expenses have been identified as major issues in applying machine learning to clinical support systems. Other important aspects to be taken into consideration are trustworthiness, gender-balance issues, and statistical relevance.