20 research outputs found

    Adaptation of Whisper models to child speech recognition

    Full text link
    Automatic Speech Recognition (ASR) systems often struggle to transcribe child speech due to the lack of large child speech datasets required to accurately train child-friendly ASR models. However, large amounts of annotated adult speech data exist and have been used to train multilingual ASR models such as Whisper. Our work explores whether such models can be adapted to child speech to improve ASR for children. In addition, we compare Whisper child-adaptations with finetuned self-supervised models such as wav2vec2. We demonstrate that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech compared to non-finetuned Whisper models. Additionally, self-supervised wav2vec2 models finetuned on child speech outperform finetuned Whisper models.
    Comment: Accepted at Interspeech 202
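
    As a rough illustration of the adaptation described above (a sketch, not the authors' code), the snippet below finetunes a Whisper checkpoint on child speech with the Hugging Face transformers library; the checkpoint name, learning rate and the child_speech_batches data loader are placeholder assumptions.

    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def child_speech_batches():
        # Placeholder for a real 16 kHz child speech data loader (hypothetical).
        yield torch.randn(16000).numpy(), "hello world"

    for audio, transcript in child_speech_batches():
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
        # Whisper computes the cross-entropy loss internally when labels are given.
        outputs = model(input_features=inputs.input_features, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()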

    Towards generalisable and calibrated synthetic speech detection with self-supervised representations

    Full text link
    Generalisation -- the ability of a model to perform well on unseen data -- is crucial for building reliable deepfake detectors. However, recent studies have shown that current audio deepfake detectors fall short of this desideratum. In this paper we show that pretrained self-supervised representations followed by a simple logistic regression classifier achieve strong generalisation capabilities, reducing the equal error rate from 30% to 8% on the newly introduced In-the-Wild dataset. Importantly, this approach also produces considerably better-calibrated models than previous approaches. This means that we can trust our model's predictions more and use them for downstream tasks such as uncertainty estimation. In particular, we show that the entropy of the estimated probabilities provides a reliable way of rejecting uncertain samples and further improving accuracy.
    Comment: Submitted to ICASSP 202
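
    The pipeline described above maps naturally to a few lines of code. The sketch below (an illustration under assumed choices, not the paper's implementation) extracts frozen wav2vec2 features, fits a logistic regression on top, and rejects uncertain samples by thresholding the entropy of the predicted probabilities; the checkpoint, threshold and placeholder training data are assumptions.

    import numpy as np
    import torch
    from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
    from sklearn.linear_model import LogisticRegression

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

    def embed(waveform, sr=16000):
        # Mean-pool the frozen self-supervised representations into one vector.
        inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            hidden = ssl_model(**inputs).last_hidden_state  # (1, T, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()

    # Placeholder embeddings/labels purely so the sketch runs; in practice these
    # come from embedding a labelled bona fide (0) / fake (1) training split.
    X_train = np.random.randn(8, 768)
    y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    def predict_with_rejection(waveform, max_entropy_bits=0.5):
        # Reject predictions whose binary entropy exceeds the (assumed) threshold.
        p = clf.predict_proba(embed(waveform)[None, :])[0]
        entropy = -np.sum(p * np.log2(p + 1e-12))
        return None if entropy > max_entropy_bits else int(p.argmax())

    print(predict_with_rejection(np.random.randn(16000)))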

    Adaptive Planning Search Algorithm for Analog Circuit Verification

    Full text link
    Integrated circuit verification has garnered considerable interest in recent times. Since these circuits keep growing in complexity year by year, pre-silicon (pre-SI) verification becomes ever more important in order to ensure proper functionality. Thus, to reduce the time needed for manually verifying ICs, we propose a machine learning (ML) approach that requires fewer simulations. The method relies on an initial evaluation set of operating condition configurations (OCCs) used to train Gaussian process (GP) surrogate models. Using the surrogate models, we then propose further, more difficult OCCs. Repeating this procedure for several iterations improves the GP estimation of the circuit's responses, on both synthetic and real circuits, and increases the chance of finding the worst case, or even failures, for certain circuit responses. We show that the proposed approach provides OCCs closer to the specifications for all circuits and identifies a failure (specification violation) for one of the responses of a real circuit.
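
    The iterative surrogate loop can be illustrated as follows (a minimal sketch with an assumed acquisition rule and a toy simulator, not the paper's exact planning search algorithm): fit a GP on the evaluated OCCs, then propose candidates whose predicted response lies closest to the specification limit, boosted by model uncertainty.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    rng = np.random.default_rng(0)

    def simulate(occ):
        # Stand-in for a slow circuit simulation of one OCC (hypothetical).
        return np.sin(occ).sum() + 0.01 * rng.normal()

    spec_limit = 1.5                                  # illustrative spec threshold
    candidates = rng.uniform(-2, 2, size=(500, 3))    # pool of unevaluated OCCs
    X = rng.uniform(-2, 2, size=(10, 3))              # initial evaluation set
    y = np.array([simulate(occ) for occ in X])

    for _ in range(5):                                # a few refinement iterations
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
        mean, std = gp.predict(candidates, return_std=True)
        score = -np.abs(mean - spec_limit) + std      # prefer near-limit, uncertain OCCs
        best = candidates[np.argmax(score)]
        X = np.vstack([X, best])
        y = np.append(y, simulate(best))

    print("worst observed response:", y.max(), "violation:", y.max() > spec_limit)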

    Text spotting in large speech databases for under-resourced languages

    No full text
    Abstract—Lightly supervised acoustic modeling in under-resourced languages raises new issues due to the poor accuracy of Automatic Speech Recognition (ASR) systems for such languages and the quality of the speech transcriptions that may be found. In these conditions, the common alignment techniques are not always capable of aligning the ASR output and the approximate transcription. We propose two aligning methods that overcome these issues. In the first approach we apply an image processing algorithm on the matching matrix of the two texts to be aligned, while the second alignment approach is based on segmental DTW. The approaches outperform the current Dynamic Time Warping technique (DTW) by extracting in average 29 % and 27 % respectively more speech data than the currently used DTW

    FlexLip: A Controllable Text-to-Lip System

    Full text link
    The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed, some of them reaching close-to-natural performance in constrained tasks. In this paper, we tackle a sub-problem of the text-to-video generation task by converting the text into lip landmarks. We do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both with underlying controllable deep neural network architectures. This modularity enables the easy replacement of each component, while also ensuring fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system, taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.
    Comment: 16 pages, 4 tables, 4 figures
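
    A structural sketch of this modularity is given below (assumed interfaces, not the released FlexLip code): any text-to-speech and speech-to-lip implementation satisfying these interfaces can be trained, swapped or adapted independently of the other.

    from typing import Protocol, Sequence, Tuple

    Landmark = Tuple[float, float]  # one 2-D lip landmark point

    class TextToSpeech(Protocol):
        def synthesize(self, text: str, speaker_id: str) -> Sequence[float]:
            """Return a waveform (sample values) for the given text and speaker."""

    class SpeechToLip(Protocol):
        def predict(self, waveform: Sequence[float]) -> Sequence[Sequence[Landmark]]:
            """Return a sequence of lip-landmark frames for the waveform."""

    def text_to_lip(text: str, speaker_id: str, tts: TextToSpeech, s2l: SpeechToLip):
        # Full pipeline: either module can be replaced without touching the other.
        waveform = tts.synthesize(text, speaker_id)
        return s2l.predict(waveform)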