102 research outputs found

    An Automatic Real-time Synchronization of Live Speech with Its Transcription Approach

    Most studies in automatic synchronization of speech and transcription focus on synchronization at the sentence or phrase level. Nevertheless, in some languages, like Thai, the boundaries of such levels are difficult to define linguistically, especially when synchronizing speech with its transcription. Consequently, synchronization at a finer level, such as the syllabic level, is promising. In this article, an approach to synchronize live speech with its corresponding transcription in real time at the syllabic level is proposed. Our approach employs a modified version of the real-time syllable detection procedure from our previous work; a transcription verification procedure is then applied to verify correctness and to recover errors caused by the real-time syllable detection. In experiments, the acoustic features and parameters were tuned empirically. Results are compared with two baselines that have previously been applied to the Thai scenario. Experimental results indicate that our approach outperforms the two baselines, with error rate reductions of 75.9% and 41.9% respectively, and that it can deliver results in real time. In addition, our approach has been applied in a practical application, namely ChulaDAISY. Practical experiments show that ChulaDAISY, combined with our approach, reduces the time needed to produce audio books.
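The syllable-level detection at the heart of such an approach can be illustrated with a toy sketch: short-time energy peaks serve as syllable-nucleus candidates. This is an illustrative simplification, not the authors' actual detection or verification procedure; the signal, frame sizes, and threshold below are invented.

```python
def short_time_energy(signal, frame, hop):
    """Sum-of-squares energy over sliding frames."""
    return [sum(x * x for x in signal[i:i + frame])
            for i in range(0, len(signal) - frame + 1, hop)]

def syllable_nuclei(energy, threshold):
    """Indices of frames that are local energy maxima above a threshold."""
    return [i for i in range(1, len(energy) - 1)
            if energy[i] > threshold
            and energy[i] >= energy[i - 1]
            and energy[i] > energy[i + 1]]

# Toy signal: two rectangular "syllable" bursts separated by silence.
sig = ([0.0] * 800 + [1.0] * 1600) * 2 + [0.0] * 800

e = short_time_energy(sig, frame=400, hop=200)
peaks = syllable_nuclei(e, threshold=100.0)
```

A real system would of course operate on band-filtered speech and stream frames incrementally; the point here is only the peak-picking idea.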

    Lyrics-to-Audio Alignment and its Application

    Automatic lyrics-to-audio alignment techniques have been drawing attention in recent years, and various studies have been conducted in this field. The objective of lyrics-to-audio alignment is to estimate a temporal relationship between lyrics and musical audio signals; such techniques can be applied to various applications, such as karaoke-style lyrics display. In this contribution, we provide an overview of recent developments in this research topic, with a particular focus on the categorization of the various methods and on their applications.
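The temporal relationship at the core of lyrics-to-audio alignment is commonly estimated with dynamic time warping (DTW) over feature sequences. Below is a minimal, generic DTW sketch with placeholder scalar features, not any specific system from the surveyed literature.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) aligning sequences a and b by DTW."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from (n, m) to (0, 0) along cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        steps = [(D[i - 1][j - 1], i - 1, j - 1),
                 (D[i - 1][j], i - 1, j),
                 (D[i][j - 1], i, j - 1)]
        _, i, j = min(s for s in steps if s[1] >= 0 and s[2] >= 0)
    return D[n][m], path[::-1]

# A repeated element in one sequence is absorbed by the warping path.
cost, path = dtw([1.0, 2.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In a real aligner, `a` and `b` would be frame-level acoustic features and phoneme-model posteriors rather than scalars, and the path yields the word- or syllable-level timestamps.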

    The influence of music in the development of phonetics, phonology and phonological awareness in 3-year-olds with typical development and 3- to 6-year-olds with speech or language disorder

    The relation between music and language has been intensively studied in recent years. Research has revealed that both domains engage similar processing mechanisms, including auditory processing and higher cognitive functions, and recruit partially overlapping brain structures. Furthermore, different authors have shown that linguistic functions can be positively influenced by music training in children older than 4 years. Music has also been shown to be associated with the promotion of linguistic skills in 7- to 10-year-old children with specific language-related pathologies, such as dyslexia. In this thesis, we conducted four studies on the effect of music training on different aspects of language development. We studied this influence on phonological awareness in children with typical development and with speech or language disorder (Studies 1 and 2), and on phonetic and phonological development in the same populations (Studies 3 and 4). In this randomized controlled study, with a test-training-retest methodology, 49 typically developing children, aged between 3 and 4, were assessed with a phonological awareness test (Study 1) and a phonetic and phonological test (Study 3). These tests were applied before and after a school year with weekly Music Classes (experimental group) or Visual Arts Classes (control group) in kindergarten. Two additional studies with the same methodology were conducted with 16 atypically developing children, aged between 3 and 6. This group included children with phonologically based Speech Sound Disorder (SSD) and/or Developmental Language Disorder (DLD). The goal of investigating this topic in this group of atypical children was to understand the relation of music with phonological awareness development (Study 2) and with phonetic and phonological development (Study 4), at stages earlier than those at which the beneficial effect of music on language-related skills had already been found.
When comparing pre- and post-assessment, and irrespective of the type of lessons attended (music or visual arts), results always showed significant differences within each group, as expected for general developmental reasons. In Study 1, Music Classes' students outperformed the control group, showing significantly larger differences between the beginning and the end of the lessons period. This indicates that music lessons influenced phonological awareness, and points to a causal relation between music training and phonological awareness as early as 3 years of age. In Studies 2, 3 and 4, no significant differences were found between groups at the post-assessment moment. This may suggest the need for more intensive music training or, as regards atypically developing children, for a specific music curriculum designed to bridge their linguistic rhythm (and possibly pitch) processing deficits. Globally, this thesis supports the hypothesis that music training may promote language abilities, in particular phonological awareness, in 3-year-olds with typical development, that is, at ages earlier than those previously studied.

    Signal Processing Methods for Music Synchronization, Audio Matching, and Source Separation

    The field of music information retrieval (MIR) aims at developing techniques and tools for organizing, understanding, and searching multimodal information in large music collections in a robust, efficient and intelligent manner. In this context, this thesis presents novel, content-based methods for music synchronization, audio matching, and source separation. In general, music synchronization denotes a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. Here, the thesis presents three complementary synchronization approaches, which improve upon previous methods in terms of robustness, reliability, and accuracy. The first approach employs a late-fusion strategy based on multiple, conceptually different alignment techniques to identify those music passages that allow for reliable alignment results. The second approach is based on the idea of employing musical structure analysis methods in the context of synchronization to derive reliable synchronization results even in the presence of structural differences between the versions to be aligned. Finally, the third approach employs several complementary strategies for increasing the accuracy and time resolution of synchronization results. Given a short query audio clip, the goal of audio matching is to automatically retrieve all musically similar excerpts in different versions and arrangements of the same underlying piece of music. In this context, chroma-based audio features are a well-established tool as they possess a high degree of invariance to variations in timbre. This thesis describes a novel procedure for making chroma features even more robust to changes in timbre while keeping their discriminative power. Here, the idea is to identify and discard timbre-related information using techniques inspired by the well-known MFCC features, which are usually employed in speech processing. 
Given a monaural music recording, the goal of source separation is to extract musically meaningful sound sources corresponding, for example, to a melody, an instrument, or a drum track from the recording. To facilitate this complex task, one can exploit additional information provided by a musical score. Based on this idea, this thesis presents two novel, conceptually different approaches to source separation. Using score information provided by a given MIDI file, the first approach employs a parametric model to describe a given audio recording of a piece of music. The resulting model is then used to extract sound sources as specified by the score. As a computationally less demanding alternative that is also easier to implement, the second approach employs the additional score information to guide a decomposition based on non-negative matrix factorization (NMF).
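The score-guided NMF idea can be sketched in miniature: the score is reduced to a binary activity mask that zeroes forbidden entries of the activation matrix H during otherwise standard Lee-Seung multiplicative updates. The toy spectrogram, mask, and iteration count here are invented for illustration and are not the thesis's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_informed_nmf(V, mask, n_iter=200, eps=1e-9):
    """Factor V ~ W @ H, with H zeroed wherever the score mask is 0."""
    F, T = V.shape
    K = mask.shape[0]
    W = rng.random((F, K)) + eps
    H = (rng.random((K, T)) + eps) * mask   # score silences forbidden notes
    for _ in range(n_iter):
        # Euclidean multiplicative updates (Lee & Seung).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        H *= mask                            # re-impose the score constraint
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Two "notes" active in disjoint frame ranges, as a score would indicate.
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H_true = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
V = W_true @ H_true
mask = (H_true > 0).astype(float)
W, H = score_informed_nmf(V, mask)
err = np.linalg.norm(V - W @ H)
```

Because masked entries of H start at zero and the updates are multiplicative, they stay exactly zero; the score thus decides which template may explain which frames, and separation amounts to reconstructing `W[:, k:k+1] @ H[k:k+1, :]` per source.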

    Singing information processing: techniques and applications

    The singing voice is an essential component of music in all cultures of the world, as it is a remarkably natural form of musical expression. Automatic singing voice processing therefore has great impact from industrial, cultural, and scientific perspectives. In this context, this thesis contributes a varied set of techniques and applications related to singing voice processing, together with a review of the associated state of the art in each case. First, several of the best-known pitch estimators were compared for the query-by-humming use case. The results show that \cite{Boersma1993} (with a non-obvious parameter setting) and \cite{Mauch2014} perform very well in that use case, given the smoothness of the extracted pitch contours. In addition, a novel singing transcription system based on a hysteresis process defined in time and frequency is proposed, together with a Matlab tool for singing evaluation. 
The interest of the proposed method is that it achieves error rates close to the state of the art with a very simple approach. The proposed evaluation tool, in turn, is a useful resource for better defining the problem and for better evaluating the solutions proposed by future researchers. This thesis also presents a method for automatic assessment of vocal performance. It uses dynamic time warping to align the user's performance with a reference, thereby providing scores for pitch and rhythm accuracy. The evaluation of the system shows a high correlation between the scores given by the system and those annotated by a group of expert musicians. Furthermore, a method for realistic intensity transformation of the singing voice is presented. The transformation is based on a parametric model of the spectral envelope and substantially improves the perceived realism compared with commercial software such as Melodyne or Vocaloid. The drawback of the proposed approach is that it requires manual intervention, but the results obtained yield important conclusions towards automatic intensity modification with realistic results. Finally, a method for correcting dissonances in isolated chords is proposed. It is based on multiple-F0 analysis and a frequency shift of their sinusoidal components. The evaluation was carried out by a group of trained musicians and shows a clear increase in perceived consonance after the proposed transformation.
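A vocal-performance assessment system of this kind first aligns the user's singing with a reference and then scores intonation. Once frames are aligned, the scoring step might be as simple as counting aligned pitch pairs that fall within a tolerance in cents; the formula, tolerance, and numbers below are an invented placeholder, not the thesis's actual scoring.

```python
import math

def cents(f_user, f_ref):
    """Signed pitch deviation of f_user from f_ref, in cents."""
    return 1200.0 * math.log2(f_user / f_ref)

def intonation_score(pairs, tol=50.0):
    """Fraction of aligned frames within `tol` cents of the reference."""
    hits = sum(1 for u, r in pairs if abs(cents(u, r)) <= tol)
    return hits / len(pairs)

# Reference A4 = 440 Hz; the user sings slightly sharp on two frames,
# badly sharp on one.
pairs = [(440.0, 440.0), (446.0, 440.0), (470.0, 440.0), (440.0, 440.0)]
score = intonation_score(pairs)
```

A rhythm score could be computed analogously from the alignment path itself, e.g. from how far the warped note onsets deviate from the reference onsets.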

    Advances in the neurocognition of music and language


    Studies in ambient intelligent lighting

    The revolution in lighting we are arguably experiencing is led by technical developments in the area of solid-state lighting technology. The improved lifetime, efficiency, and environmentally friendly raw materials make LEDs the main contender for the light source of the future. The core of the change is, however, not in the basic technology, but in the way users interact with it and the way the quality of the produced effect on the environment is judged. With this newfound freedom, users can shift their focus from the confines of the technology to the expression of their needs, regardless of the details of the lighting system. Identifying user needs, creating an effective language to communicate them to the system, translating them into control signals that fulfill them, and defining the means to measure the quality of the produced result are the subject of a new multidisciplinary area of study, Ambient Intelligent Lighting. This thesis describes a series of studies in the field of Ambient Intelligent Lighting, divided into two parts. The first part of the thesis demonstrates how, by adopting a user-centric design philosophy, the traditional control paradigms can be superseded by novel, so-called effect-driven controls. Chapter 3 describes an algorithm that, using statistical methods and image processing, generates a set of colors based on a term or set of terms. The algorithm uses Internet image search engines (Google Images, Flickr) to acquire a set of images that represent a term and subsequently extracts representative colors from the set. Additionally, an estimate of the quality of the extracted set of colors is computed. Based on the algorithm, a system that automatically enriches music with lyrics-based images and lighting was built and is described. Chapter 4 proposes a novel effect-driven control algorithm, giving users an easy, natural, and system-agnostic means to create a spatial light distribution. 
By using an emerging technology, visible light communication, and an intuitive effect definition, a real time interactive light design system was developed. Usability studies on a virtual prototype of the system demonstrated the perceived ease of use and increased efficiency of an effect driven approach. In chapter 5, using stochastic models, natural temporal light transitions are modeled and reproduced. Based on an example video of a natural light effect, a Markov model of the transitions between colors of a single light source representing the effect is learned. The model is a compact, easy to reproduce, and as the user studies show, recognizable representation of the original light effect. The second part of the thesis studies the perceived quality of one of the unique capabilities of LEDs, chromatic temporal transitions. Using psychophysical methods, existing spatial models of human color vision were found to be unsuitable for predicting the visibility of temporal artifacts caused by the digital controls. The chapters in this part demonstrate new perceptual effects and make the first steps towards building a temporal model of human color vision. In chapter 6 the perception of smoothness of digital light transitions is studied. The studies presented demonstrate the dependence of the visibility of digital steps in a temporal transition on the frequency of change, chromaticity, intensity and direction of change of the transition. Furthermore, a clear link between the visibility of digital steps and flicker visibility is demonstrated. Finally, a new, exponential law for the dependence of the threshold speed of smooth transitions on the changing frequency is hypothesized and proven in subsequent experiments. Chapter 7 studies the discrimination and preference of different color transitions between two colors. Due to memory effects, the discrimination threshold for complete transitions was shown to be larger than the discrimination threshold for two single colors. 
Two linear transitions in different color spaces were shown to be significantly preferred over a set of other, curved, transitions. Chapter 8 studies chromatic and achromatic flicker visibility in the periphery. A complex change in both the absolute visibility thresholds at different frequencies and the critical flicker frequency is observed. Finally, an increase in the absolute visibility thresholds caused by the addition of a mental task in central vision is demonstrated.
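The Markov model of temporal light transitions described in chapter 5 can be sketched as follows: transition probabilities between quantized colors are estimated from an example sequence and then sampled to reproduce the effect. The color alphabet and observed sequence here are invented; the thesis's actual model is learned from video of a natural light effect.

```python
import random
from collections import Counter, defaultdict

def learn_markov(seq):
    """Estimate P(next | current) from an observed color sequence."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1
    return {c: {n: k / sum(cnt.values()) for n, k in cnt.items()}
            for c, cnt in counts.items()}

def sample(model, start, steps, rng):
    """Generate a new sequence by walking the learned chain."""
    out = [start]
    for _ in range(steps):
        nxts, probs = zip(*model[out[-1]].items())
        out.append(rng.choices(nxts, weights=probs)[0])
    return out

observed = ["amber", "amber", "red", "amber", "red", "red", "amber"]
model = learn_markov(observed)
generated = sample(model, "amber", 10, random.Random(0))
```

The learned table is exactly the "compact, easy to reproduce" representation the abstract mentions: only the states and their transition probabilities need to be stored, not the original video.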

    Application of automatic speech recognition technologies to singing

    The research field of Music Information Retrieval is concerned with the automatic analysis of musical characteristics. One aspect that has not received much attention so far is the automatic analysis of sung lyrics. On the other hand, the field of Automatic Speech Recognition has produced many methods for the automatic analysis of speech, but those have rarely been employed for singing. This thesis analyzes the feasibility of applying various speech recognition methods to singing, and suggests adaptations. In addition, the routes to practical applications for these systems are described. Five tasks are considered: Phoneme recognition, language identification, keyword spotting, lyrics-to-audio alignment, and retrieval of lyrics from sung queries. The main bottleneck in almost all of these tasks lies in the recognition of phonemes from sung audio. Conventional models trained on speech do not perform well when applied to singing. Training models on singing is difficult due to a lack of annotated data. This thesis offers two approaches for generating such data sets. For the first one, speech recordings are made more “song-like”. In the second approach, textual lyrics are automatically aligned to an existing singing data set. In both cases, these new data sets are then used for training new acoustic models, offering considerable improvements over models trained on speech. Building on these improved acoustic models, speech recognition algorithms for the individual tasks were adapted to singing by either improving their robustness to the differing characteristics of singing, or by exploiting the specific features of singing performances. Examples of improving robustness include the use of keyword-filler HMMs for keyword spotting, an i-vector approach for language identification, and a method for alignment and lyrics retrieval that allows highly varying durations. 
Features of singing are utilized in various ways: in an approach for language identification that is well-suited for long recordings; in a method for keyword spotting based on phoneme durations in singing; and in an algorithm for alignment and retrieval that exploits known phoneme confusions in singing.
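One way to exploit known phoneme confusions in alignment and retrieval is a weighted edit distance in which confusable phoneme pairs cost less than arbitrary substitutions, so a recognizer's systematic errors are penalized lightly. This is a generic sketch of that idea; the confusion table, costs, and example strings are invented, not taken from the thesis.

```python
def weighted_edit_distance(ref, hyp, confusions, sub=1.0, indel=1.0):
    """Edit distance where listed (ref, hyp) confusions get a reduced cost."""
    n, m = len(ref), len(hyp)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                c = 0.0
            else:
                # Confusable pairs (e.g. sung /n/ vs /m/) cost less.
                c = confusions.get((ref[i - 1], hyp[j - 1]), sub)
            D[i][j] = min(D[i - 1][j - 1] + c,
                          D[i - 1][j] + indel,
                          D[i][j - 1] + indel)
    return D[n][m]

confusions = {("n", "m"): 0.2, ("m", "n"): 0.2}
d_plain = weighted_edit_distance(list("singin"), list("singim"), {})
d_aware = weighted_edit_distance(list("singin"), list("singim"), confusions)
```

In a retrieval setting, the lyrics candidate with the smallest confusion-aware distance to the recognized phoneme string would be returned.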