Automatic Transcription of Lyrics in Monophonic and Polyphonic Songs
The paper proposes the implementation of a system for automatic transcription of lyrics in monophonic and polyphonic songs. The basis of the system is an automatic speech recognizer. Taking into account the differences between singing and spoken voice, acoustic models are adapted to singing voice using several methods, and Language Models (LM) trained on song lyrics are built. Moreover, background music in polyphonic recordings is attenuated using the Robust Principal Component Analysis (RPCA) algorithm, to ease the recognition task by reducing the accompaniment's interference. The results show that, when the adaptation data consists of the same type of tracks that are later transcribed, both adaptation methods and the song-specific LMs improve the performance of the baseline system at phoneme and word level. However, applying RPCA to polyphonic songs introduces distortions in the singing voice and is therefore, in general, not useful for improving the performance of the whole system.
Master in Multimedia and Communications = Master Universitario en Multimedia y Comunicaciones. Curso 2013/201
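The RPCA step mentioned above decomposes a magnitude spectrogram into a low-rank part (the repetitive accompaniment) and a sparse part (the singing voice). A minimal NumPy sketch of that decomposition, using a crude alternating-thresholding scheme rather than the exact solver used in the paper (the `lam` default and `mu` step size are common heuristics, not values from the paper):

```python
import numpy as np

def soft(x, t):
    """Entry-wise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rpca(M, lam=None, n_iter=100):
    """Alternating-thresholding sketch of Robust PCA: split a magnitude
    spectrogram M into a low-rank part L (repetitive accompaniment)
    and a sparse part S (singing voice)."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))   # common default weight
    mu = 0.25 * np.abs(M).mean() + 1e-8     # step-size heuristic (assumption)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # low-rank update: singular-value thresholding of M - S
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U * soft(s, mu)) @ Vt
        # sparse update: entry-wise soft thresholding of the residual
        S = soft(M - L, lam * mu)
    return L, S
```

In a separation pipeline, `S` would be inverted back to a waveform (with the mixture phase) to obtain the vocal estimate.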
Monaural speech separation using source-adapted models
We propose a model-based source separation system for use on single channel speech mixtures where the precise source characteristics are not known a priori. We do this by representing the space of source variation with a parametric signal model based on the eigenvoice technique for rapid speaker adaptation. We present an algorithm to infer the characteristics of the sources present in a mixture, allowing for significantly improved separation performance over that obtained using unadapted source models. The algorithm is evaluated on the task defined in the 2006 Speech Separation Challenge [1] and compared with separation using source-dependent models.
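The eigenvoice technique referenced here represents a speaker's model as a base model plus a weighted sum of principal directions of speaker variation, with the weights inferred from a small amount of data. A toy NumPy sketch of that idea (illustrative only; the paper adapts full source models, not just mean vectors):

```python
import numpy as np

def fit_eigenvoices(speaker_means, n_components):
    """PCA over per-speaker mean vectors from training data: returns the
    mean voice and a (dim, n_components) basis of 'eigenvoices'."""
    mean_voice = speaker_means.mean(axis=0)
    centered = speaker_means - mean_voice
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean_voice, Vt[:n_components].T

def adapt(mean_voice, U, adaptation_frames):
    """Least-squares estimate of the eigenvoice weights for an unseen
    speaker, from a small amount of adaptation data."""
    target = adaptation_frames.mean(axis=0) - mean_voice
    w, *_ = np.linalg.lstsq(U, target, rcond=None)
    return mean_voice + U @ w
```

The key point is that only a handful of weights, not a full model, must be inferred from the mixture, which is what makes adaptation from very little data feasible.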
Singing Voice Recognition for Music Information Retrieval
This thesis proposes signal processing methods for analysis of singing voice audio signals, with the objectives of obtaining information about the identity and lyrics content of the singing. Two main topics are presented, singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is meant to be used for applications such as music classification, sorting and organizing music databases, music information retrieval, etc.
For singer identification, the thesis introduces methods from general audio classification and specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along with musical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification is centered around the degradation of classification performance in the presence of instruments, and the separation of the vocal line for improving performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signal-to-noise (singing-to-accompaniment) ratios, and the classification process was performed both on the polyphonic mixture and on the vocal line separated from the polyphonic mixture. Including a vocal-separation step in the classification method significantly improves performance compared to classifying the polyphonic mixtures directly, though it does not reach the performance achieved on the monophonic singing itself. Nevertheless, the results show that classification of singing voices can be done robustly in polyphonic music when using source separation.
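Mixing monophonic singing with accompaniment at a controlled singing-to-accompaniment ratio, as described above, can be sketched with a small helper (the function name and decibel convention are illustrative, not taken from the thesis):

```python
import numpy as np

def mix_at_snr(voice, accompaniment, snr_db):
    """Scale the accompaniment so the singing-to-accompaniment power
    ratio of the mixture is snr_db decibels."""
    p_voice = np.mean(voice ** 2)
    p_acc = np.mean(accompaniment ** 2)
    # choose gain g so that p_voice / (g**2 * p_acc) == 10**(snr_db / 10)
    g = np.sqrt(p_voice / (p_acc * 10 ** (snr_db / 10)))
    return voice + g * accompaniment
```

Sweeping `snr_db` over a range (e.g. from clean singing down to accompaniment-dominated mixtures) is what produces the degradation curves such a study measures.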
In the problem of lyrics transcription, the thesis introduces the general speech recognition framework and various adjustments that can be made before applying the methods to singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two different ways of recognizing the phonemes in the audio: one is alignment, where the true transcription is known and the phonemes only have to be located in time; the other is recognition, where both the transcription and the locations of the phonemes have to be found. Alignment is thus a simplified form of the recognition task.
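The distinction between alignment and recognition can be made concrete: forced alignment is a dynamic-programming search over a known phoneme sequence. A minimal Viterbi-style sketch, assuming per-frame phoneme log-likelihoods are already available (the acoustic model itself is out of scope here):

```python
import numpy as np

def forced_align(log_probs, phoneme_seq):
    """Viterbi forced alignment: given per-frame log-likelihoods for each
    phoneme (T x P array) and the known phoneme sequence, find the
    monotonic frame-to-phoneme assignment with maximal total likelihood."""
    T = log_probs.shape[0]
    N = len(phoneme_seq)
    NEG = -1e18
    dp = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=int)  # 0 = stay in phoneme, 1 = advance
    dp[0, 0] = log_probs[0, phoneme_seq[0]]
    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1, j]
            adv = dp[t - 1, j - 1] if j > 0 else NEG
            if adv > stay:
                dp[t, j], back[t, j] = adv, 1
            else:
                dp[t, j], back[t, j] = stay, 0
            dp[t, j] += log_probs[t, phoneme_seq[j]]
    # backtrace to a per-frame index into phoneme_seq
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(path[-1] - back[t, path[-1]])
    return path[::-1]
```

Full recognition would instead search over all phoneme sequences allowed by the lexicon and language model, which is why alignment is the easier task.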
Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. Word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for facilitating applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of the lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. When some key words in the query are recognized, the song can be reliably identified.
Computational Pronunciation Analysis in Sung Utterances
Recent automatic lyrics transcription (ALT) approaches focus on building stronger acoustic models or in-domain language models, while the pronunciation aspect is seldom touched upon. This paper applies a novel computational analysis to the pronunciation variances in sung utterances and further proposes a new pronunciation model adapted for singing. The singing-adapted model is tested on multiple public datasets via word recognition experiments. It performs better than the standard speech dictionary in all settings, reporting the best results on ALT in a cappella recordings using n-gram language models. For reproducibility, we share the sentence-level annotations used in testing, providing a new benchmark evaluation set for ALT.
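As a rough illustration of what a singing-adapted pronunciation model might add over a standard speech dictionary, consider generating variants that allow vowel elongation. This toy rule is an assumption for illustration, not the model proposed in the paper; ARPAbet-style phone symbols with stress digits are assumed:

```python
# ARPAbet-style vowel symbols (stress digits stripped before lookup)
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def singing_variants(pron):
    """Generate extra pronunciation variants for singing by letting each
    vowel be elongated (duplicated): a crude stand-in for the kinds of
    sung-pronunciation variation a singing-adapted lexicon models."""
    variants = [list(pron)]
    for i, ph in enumerate(pron):
        if ph.rstrip("012") in VOWELS:
            variants.append(list(pron[:i]) + [ph, ph] + list(pron[i + 1:]))
    return variants
```

A real singing-adapted lexicon would learn which variants actually occur and with what probabilities, rather than enumerating them by rule.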
Leveraging repetition for improved automatic lyric transcription in popular music
Transcribing lyrics from musical audio is a challenging research problem which has not benefited from many advances made in the related field of automatic speech recognition, owing to the prevalent musical accompaniment and differences between the spoken and sung voice. However, one aspect of this problem which has yet to be exploited by researchers is that significant portions of the lyrics will be repeated throughout the song. In this paper we investigate how this information can be leveraged to form a consensus transcription with improved consistency and accuracy. Our results show that improvements can be gained using a variety of techniques, and that relative gains are largest under the most challenging and realistic experimental conditions.
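The consensus idea can be illustrated with a deliberately simplified vote: given several recognized instances of the same repeated lyric line, pick the majority word at each position. Real systems (e.g. ROVER-style combination) first align the hypotheses; here equal-length, pre-aligned hypotheses are assumed:

```python
from collections import Counter

def consensus(hypotheses):
    """Majority vote across repeated recognitions of the same lyric line.
    Simplifying assumption: hypotheses are already word-aligned
    (equal length); a full system would align them first."""
    words = []
    for position in zip(*hypotheses):
        word, _ = Counter(position).most_common(1)[0]
        words.append(word)
    return words
```

Even this naive vote shows why repetition helps: an error has to occur in most of the repeated occurrences, at the same position, to survive into the consensus.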
Advances in deep learning methods for speech recognition and understanding
This work presents several studies in the areas of speech recognition and
understanding.
Semantic speech understanding is an important sub-domain of the
broader field of artificial intelligence.
Speech processing has long interested researchers,
because language is one of the defining characteristics of a human being.
With the development of neural networks, the domain has seen rapid progress
both in terms of accuracy and human perception.
Another important milestone was achieved with the development of
end-to-end approaches.
Such approaches allow co-adaptation of all the parts of the model
thus increasing the performance, as well as simplifying the training
procedure.
End-to-end models became feasible with the increasing amount of available
data, computational resources, and most importantly with many novel
architectural developments.
Nevertheless, traditional (non-end-to-end) approaches are still relevant
for speech processing due to challenging data in noisy environments,
accented speech, and the wide variety of dialects.
In the first work, we explore hybrid speech recognition in noisy
environments.
We propose to treat recognition under unseen noise conditions
as a domain adaptation task.
For this, we use the then-novel technique of adversarial
domain adaptation.
In a nutshell, this prior work proposed to train features in such
a way that they are discriminative for the primary task,
but non-discriminative for the secondary task.
This secondary task is constructed to be the domain recognition task.
Thus, the features trained are invariant towards the domain at hand.
In our work, we adopt this technique and modify it for the task of
noisy speech recognition.
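The gradient-reversal mechanism behind adversarial domain adaptation can be sketched in a few lines: the layer is the identity in the forward pass, but negates (and scales) the domain classifier's gradient on its way back to the shared feature extractor, so domain-revealing features are actively unlearned. All names below are illustrative, not from the thesis:

```python
import numpy as np

def grl_forward(features):
    """Gradient reversal layer: identity in the forward pass."""
    return features

def grl_backward(grad_from_domain_head, lam=1.0):
    """Negated (and scaled) gradient in the backward pass."""
    return -lam * grad_from_domain_head

def update_shared(W, grad_task, grad_domain, lr=0.1, lam=1.0):
    """One manual SGD step for parameters W shared by the task head and
    the domain head: the task gradient is applied as usual, while the
    domain gradient arrives reversed through the GRL."""
    return W - lr * (grad_task + grl_backward(grad_domain, lam))
```

The effect is that the shared parameters descend the task loss while ascending the domain-classification loss, which is exactly the adversarial objective.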
In the second work, we develop a general method for regularizing
the generative recurrent networks.
It is known that recurrent networks frequently have difficulty
staying on the same track when generating long outputs.
While it is possible to use bi-directional networks for better
sequence aggregation in feature learning, this is not applicable
to the generative case.
We developed a way to improve the consistency of generating long sequences
with recurrent networks.
We propose a way to construct a model similar to a bi-directional network.
The key insight is to use a soft L2 loss between the forward and
the backward generative recurrent networks.
We provide experimental evaluation on a multitude of tasks and datasets,
including speech recognition, image captioning, and language modeling.
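The regularizer described above can be sketched as an L2 penalty between the forward network's hidden states and the time-reversed hidden states of a backward network. A toy tanh RNN stands in for the actual generative models, and in training this penalty would be added to the generation loss:

```python
import numpy as np

def rnn_states(x_seq, W, U, h0):
    """Simple tanh RNN unrolled over a sequence; returns all hidden states."""
    h, states = h0, []
    for x in x_seq:
        h = np.tanh(W @ h + U @ x)
        states.append(h)
    return np.stack(states)

def twin_l2_penalty(x_seq, fwd_params, bwd_params):
    """L2 penalty between the forward states and the time-reversed states
    of a backward network run over the reversed sequence (sketch of the
    'twin networks' style regularizer)."""
    h_fwd = rnn_states(x_seq, *fwd_params)
    h_bwd = rnn_states(x_seq[::-1], *bwd_params)[::-1]
    return np.mean((h_fwd - h_bwd) ** 2)
```

Minimizing this term encourages the forward generator to agree with a network that has seen the future of the sequence, which is the source of the improved long-range consistency.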
In the third paper, we investigate the possibility of developing
an end-to-end intent recognizer for spoken language understanding.
Semantic spoken language understanding is an important
step towards developing a human-like artificial intelligence.
We have seen that the end-to-end approaches show high
performance on the tasks including machine translation and speech recognition.
We draw inspiration from prior work to develop
an end-to-end system for intent recognition.