
    Singing Voice Recognition for Music Information Retrieval

    This thesis proposes signal processing methods for the analysis of singing voice audio signals, with the objective of obtaining information about the identity of the singer and the lyrics content of the singing. Two main topics are presented: singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is intended for applications such as music classification, sorting and organizing music databases, and music information retrieval.

    For singer identification, the thesis introduces methods from general audio classification as well as specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along with musical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification centers on the degradation of classification performance in the presence of instruments, and on separation of the vocal line for improving performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signal-to-noise (singing-to-accompaniment) ratios, and the classification process was performed both on the polyphonic mixture and on the vocal line separated from the polyphonic mixture. The classification method that includes a vocal separation step significantly improves performance compared to classifying the polyphonic mixtures directly, although it does not reach the performance obtained on the monophonic singing itself. Nevertheless, the results show that singing voices can be classified robustly in polyphonic music when source separation is used.

    For the problem of lyrics transcription, the thesis introduces the general speech recognition framework and the various adjustments that can be made before applying these methods to the singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two ways of recognizing the phonemes in the audio: alignment, where the true transcription is known and the phonemes only have to be located in time, and recognition, where both the transcription and the locations of the phonemes have to be found. Alignment is thus a simplified form of the recognition task. Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. Word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, where a textual search is performed based on the words recognized from the query: when some key words in the query are recognized, the song can be reliably identified.
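    As a concrete illustration of the mixing step in the experimental setup described above (not code from the thesis), the following minimal Python sketch combines a separated vocal track with accompaniment at a chosen singing-to-accompaniment ratio in dB. The function name and the naive length matching are assumptions made for illustration only.

```python
import numpy as np

def mix_at_ratio(vocals, accompaniment, ratio_db):
    """Mix a vocal signal with accompaniment at a given
    singing-to-accompaniment ratio (in dB). Illustrative sketch."""
    # Truncate to a common length (real data would need proper alignment)
    n = min(len(vocals), len(accompaniment))
    v, a = vocals[:n], accompaniment[:n]

    # Mean powers of both signals
    p_v = np.mean(v ** 2)
    p_a = np.mean(a ** 2)

    # Scale the accompaniment so that 10*log10(p_v / p_a') == ratio_db
    gain = np.sqrt(p_v / (p_a * 10 ** (ratio_db / 10)))
    return v + gain * a
```

    With a setup like this, the same vocal line can be remixed at several ratios (e.g. 0 dB, 5 dB, 10 dB) and each mixture passed through the classifier, with and without a preceding source-separation step.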

    Application of automatic speech recognition technologies to singing

    The research field of Music Information Retrieval is concerned with the automatic analysis of musical characteristics. One aspect that has not received much attention so far is the automatic analysis of sung lyrics. On the other hand, the field of Automatic Speech Recognition has produced many methods for the automatic analysis of speech, but those have rarely been employed for singing. This thesis analyzes the feasibility of applying various speech recognition methods to singing, and suggests adaptations. In addition, routes to practical applications for these systems are described. Five tasks are considered: phoneme recognition, language identification, keyword spotting, lyrics-to-audio alignment, and retrieval of lyrics from sung queries.

    The main bottleneck in almost all of these tasks lies in the recognition of phonemes from sung audio. Conventional models trained on speech do not perform well when applied to singing, and training models on singing is difficult due to a lack of annotated data. This thesis offers two approaches for generating such data sets. In the first, speech recordings are made more “song-like”; in the second, textual lyrics are automatically aligned to an existing singing data set. In both cases, the new data sets are then used for training new acoustic models, offering considerable improvements over models trained on speech.

    Building on these improved acoustic models, speech recognition algorithms for the individual tasks were adapted to singing, either by improving their robustness to the differing characteristics of singing or by exploiting specific features of singing performances. Examples of improved robustness include the use of keyword-filler HMMs for keyword spotting, an i-vector approach for language identification, and a method for alignment and lyrics retrieval that allows highly varying durations. Features of singing are utilized in various ways: in an approach for language identification that is well-suited for long recordings; in a method for keyword spotting based on phoneme durations in singing; and in an algorithm for alignment and retrieval that exploits known phoneme confusions in singing.
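    The first data-generation approach transforms speech recordings to sound more “song-like”. The abstract does not specify the transformation, so the following is a hypothetical sketch of one plausible step using librosa: slowing the recording down (singing has longer phoneme durations than speech) and raising the pitch. All parameter values are assumptions for illustration and may differ from the thesis's actual pipeline.

```python
import librosa

def songify(speech, sr, n_steps=4.0, rate=0.7):
    """Roughly emulate singing-like characteristics on a speech
    recording. Illustrative only; parameters are not from the thesis."""
    # Stretch the durations (rate < 1 slows the recording down)
    stretched = librosa.effects.time_stretch(speech, rate=rate)
    # Shift the pitch upwards by n_steps semitones
    return librosa.effects.pitch_shift(stretched, sr=sr, n_steps=n_steps)

# Usage sketch:
# y, sr = librosa.load("speech.wav", sr=None)
# y_song = songify(y, sr)
```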

    Singing information processing: techniques and applications

    The singing voice is an essential component of music in all the world's cultures, since it is an incredibly natural form of musical expression. Consequently, automatic singing voice processing has a great impact from industrial, cultural, and scientific perspectives. In this context, this thesis contributes a varied set of techniques and applications related to singing voice processing, together with a review of the associated state of the art in each case.

    First, several of the best-known pitch estimators were compared for the query-by-humming use case. The results show that Boersma (1993) (with a non-obvious parameter setting) and Mauch (2014) perform very well in this use case, given the smoothness of the pitch contours they extract. In addition, a novel singing voice transcription system is proposed, based on a hysteresis process defined in time and frequency, together with a Matlab tool for singing voice evaluation. The interest of the proposed method is that it achieves error rates close to the state of the art with a very simple approach. The proposed evaluation tool, in turn, is a useful resource for defining the problem more precisely and for better evaluating the solutions proposed by future researchers.

    This thesis also presents a method for automatic assessment of vocal performance. It uses dynamic time warping to align the user's performance with a reference, thereby providing scores for intonation and rhythm accuracy. The evaluation of the system shows a high correlation between the scores given by the system and the scores annotated by a group of expert musicians.

    Furthermore, a method for realistic intensity modification of the singing voice is presented. This transformation is based on a parametric model of the spectral envelope, and substantially improves the perceived realism in comparison with commercial software such as Melodyne or Vocaloid. The drawback of the proposed approach is that it requires manual intervention, but the results obtained yield important conclusions towards automatic intensity modification with realistic results.

    Finally, a method is proposed for correcting dissonances in isolated chords. It is based on multiple-F0 analysis and a frequency shift of the sinusoidal components. The evaluation was carried out by a group of trained musicians, and it shows a clear increase in perceived consonance after the proposed transformation.
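    As a minimal illustration of the dynamic-time-warping idea behind the performance-rating method (a sketch of the general technique, not the thesis's implementation), the following code aligns a user's pitch contour to a reference contour and maps the accumulated deviation to a score. The score mapping and all constants are illustrative assumptions.

```python
import numpy as np

def dtw_cost(ref_f0, user_f0):
    """Plain dynamic time warping between two pitch contours
    (e.g. in semitones). Returns the accumulated cost matrix."""
    n, m = len(ref_f0), len(user_f0)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref_f0[i - 1] - user_f0[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

def intonation_score(ref_f0, user_f0):
    """Average per-step pitch deviation along the optimal alignment,
    mapped to a 0-100 score (the mapping is illustrative)."""
    D = dtw_cost(ref_f0, user_f0)
    avg_dev = D[-1, -1] / (len(ref_f0) + len(user_f0))
    return max(0.0, 100.0 - 20.0 * avg_dev)
```

    A rhythm score can be derived analogously from the shape of the warping path rather than from the pitch deviations along it.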

    Models and analysis of vocal emissions for biomedical applications

    This book of proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA 2003), held 10-12 December 2003 in Firenze, Italy. The workshop is organised every two years and aims to stimulate contacts between specialists active in research and industrial development in the area of voice analysis for biomedical applications. Its scope includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.

    Sociophonetics of popular music: insights from corpus analysis and speech perception experiments

    This thesis examines the flexibility and context-sensitivity of speech perception by looking at a domain not often explored in the study of language cognition: popular music. Three empirical studies are presented. The first examines the current state of sociolinguistic variation in commercial popular music, while the second and third explore everyday listeners’ perception of language in musical and non-musical contexts. The foundational assumption of the thesis is that the use of ‘American English’ in song is automatic for New Zealand singers, and constitutes a responsive style that is both accurate and consistent. The use of New Zealand English in song, by contrast, is stylised, involving an initiative act of identity and requiring effort and awareness. This will be discussed in Chapter 1, where I also introduce the term Standard Popular Music Singing Style (SPMSS) to refer to the US English-derived phonetic style dominant in popular song.

    The first empirical study will be presented in Chapter 2. Using a systematically selected corpus of commercial pop and hip hop from NZ and the USA, analysis of non-prevocalic and linking /r/, and of the vowels of the bath, lot and goat lexical sets, confirms that SPMSS is highly normative in NZ music. Most pop singers closely follow US patterns, while several hip hop artists display elements of New Zealand English. This reflects the value placed on authenticity in hip hop, and also interacts with ethnicity, showing the use of different authentication practices by Pākehā (NZ European) and Māori/Pasifika artists. By looking at co-variation amongst the variables, I explore both the apparent identity goals of the artists and the relative salience of the variables.

    Chapters 3 and 4 use the results of the corpus analysis to explore how the dominance of SPMSS affects speech processing. The first of the two perception experiments is a phonetic categorisation task. Listeners decide whether they hear the word bed or bad in a condition where the stimuli are either set to music or appear in one of two non-musical control conditions. The stimuli lie on a resynthesised continuum between the dress and trap vowels, passing through an F1 space where the vowel is ambiguous and could be perceived either as a spoken NZE trap or a sung dress. When set to music, the NZ listeners perceive the vowel according to expectations of SPMSS (i.e. expecting US-derived vowel qualities). The second perception experiment is a lexical decision task that uses the natural speech of a NZ and a US speaker, once again in musical and non-musical conditions. Participants’ processing of the US voice is facilitated in the music condition, becoming faster than reaction times to their native dialect.

    Bringing the results of the corpus and perception studies together, this thesis shows that SPMSS is highly normative in NZ popular music not just for performers, but also in the minds of the general music-listening public. I argue that many New Zealanders are bidialectal, with native-like knowledge of SPMSS. Speech and song are two highly distinct and perceptually contrastive contexts of language use. By differing from conversational language across a range of perceptual and cognitive dimensions, language heard or produced in song is likely to encode and activate a distinct subset of auditory memories. The contextual specificity of such networks may then allow for the abstraction of an independent sub-system of sociophonetic knowledge specific to the musical context.
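    To make the continuum design concrete: such stimuli are typically specified as equally spaced F1 targets between two vowel endpoints, with the ambiguous region in the middle. The sketch below shows this scheme with illustrative endpoint values; the actual formant values and step count used in the thesis are not given in this summary, and vowels generally differ in F2 and duration as well.

```python
import numpy as np

# Hypothetical endpoint F1 values in Hz (not from the thesis);
# trap is a more open vowel than dress, so it has the higher F1.
F1_DRESS = 450.0
F1_TRAP = 750.0

def f1_continuum(n_steps=7):
    """Equally spaced F1 targets between the dress and trap endpoints,
    passing through the ambiguous region in between."""
    return np.linspace(F1_DRESS, F1_TRAP, n_steps)

print(f1_continuum())  # F1 targets for resynthesising each stimulus step
```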

    Finding Our Rhythm: Contextualizing Second Language Development Through Music-Based Pedagogy

    Each person learning a second or foreign language faces a unique developmental path. Individual learning trajectories have been obscured, however, by the search for “best practices” in second language educational research and praxis (Edge & Richards, 1998). This one-size-fits-all view has been further reinforced by a predominant cognitivist tradition, which orients to cognition mainly through mechanical input-and-output processing, or a “mind as machine” metaphor (Boden, 2006). My dissertation aims to offer an alternative to this tradition. In my dissertation, I introduce a music-based intervention designed to develop students’ pronunciation (speech rhythm) in a U.S. college-level English as a second language classroom. The intervention draws heavily on the rhythmic properties of rap and other forms of popular music. Rather than contending solely with the binary question of whether the intervention works (or results in “best practices”), I use mixed methods to examine individual student outcomes through the lens of three major complex subsystems of second language development (Larsen-Freeman, 2011): ideological, interactional, and speech production. I aim to demonstrate that students’ rich in-class interactional practices and ideological understandings of the (African American) language associated with the music in the intervention (and with my own Blackness as a teacher-researcher) reveal as much about their second language development as does an assessment of their speech rhythm production. Building on the premise that language learning is an endeavor that is not only cognitive in nature, but also social, my dissertation advocates for a much fuller contextualization of second language development and classroom practices.
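    The abstract does not say how speech rhythm production is assessed; one standard metric in the rhythm literature is the normalized Pairwise Variability Index (nPVI) over vocalic interval durations. The following is a hedged sketch of that metric only, not a description of the dissertation's actual assessment.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index over a sequence of
    interval durations (e.g. vocalic intervals in seconds).
    Higher values indicate more variable, 'stress-timed' rhythm."""
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(d1 - d2) / ((d1 + d2) / 2) for d1, d2 in pairs]
    return 100.0 * sum(terms) / len(terms)

# Usage: npvi([0.12, 0.08, 0.15, 0.10]) -> a single rhythm score
```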