    Analysis and resynthesis of polyphonic music

    This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a system that can analyse, transcribe, process, and resynthesise monaural polyphonic music, and describe and compare the possible hardware and software platforms. After this I describe a prototype hybrid system that attempts to carry out these tasks using a method based on additive synthesis. Next I present results from its application to a variety of musical examples and critically assess its performance and limitations. I then address these issues in the design of a second system based on Gabor wavelets. I conclude by summarising the research and outlining suggestions for future developments.
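
    A minimal sketch of the additive-synthesis principle the prototype system builds on (in Python; the partial frequencies, amplitudes, and the decaying envelope are invented for illustration and are not taken from the thesis): each analysed partial is resynthesised as a sinusoid with a time-varying amplitude, and the partials are summed.

        import numpy as np

        # Illustrative additive resynthesis: frequencies, amplitudes, and the
        # simple decaying envelope are invented for the example.
        sr = 44100                                   # sample rate in Hz
        t = np.arange(sr) / sr                       # one second of time stamps
        partials = [(220.0, 0.5), (440.0, 0.25), (660.0, 0.12)]  # (frequency Hz, amplitude)

        audio = np.zeros_like(t)
        for freq, amp in partials:
            env = np.linspace(1.0, 0.0, t.size)      # time-varying amplitude envelope
            audio += amp * env * np.sin(2 * np.pi * freq * t)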

    A computational framework for sound segregation in music signals

    Doctoral thesis in Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Learning to Behave: Internalising Knowledge


    Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis

    Wydział Neofilologii (Faculty of Modern Languages). This work presents an attempt to build a neurobiologically inspired Convolutional Neural Network-based model of the mappings from discrete high-level linguistic categories to a continuous fundamental-frequency signal in Polish neutral read speech. After a brief introduction to the research problem in the context of intonation, speech synthesis, and the phonetics-phonology gap, the work describes the training of the model on a special speech corpus and an evaluation of the naturalness of the F0 contour produced by the trained model through ABX and MOS perception experiments, conducted with the help of a specially built Neural Source Filter resynthesizer. Finally, an in-depth exploration of the phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance Propagation explainability method was used to perform an extensive quantitative analysis of the relevance of 1297 specially engineered linguistic input features, and of their groupings at various levels of abstraction, to the specific contours of the fundamental frequency. The work ends with an in-depth interpretation of these results, a discussion of the advantages and disadvantages of the current method, and a list of potential future improvements. The research presented in this work was partially carried out under Harmonia research grant no. UMO-2014/14/M/HS2/00631 awarded by the National Science Centre (Narodowe Centrum Nauki).
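
    As a hedged illustration of the general idea, and not the thesis's actual architecture (the layer sizes, kernel widths, and feature encoding below are assumptions), a 1-D convolutional network can map a per-frame matrix of binary linguistic features to a continuous F0 contour:

        import torch

        # Assumed, illustrative architecture; layer sizes, kernel widths, and the
        # feature encoding are placeholders, not the thesis's model.
        N_FEATS, T = 1297, 400                       # engineered input features, frames

        f0_model = torch.nn.Sequential(
            torch.nn.Conv1d(N_FEATS, 256, kernel_size=9, padding=4),
            torch.nn.ReLU(),
            torch.nn.Conv1d(256, 64, kernel_size=9, padding=4),
            torch.nn.ReLU(),
            torch.nn.Conv1d(64, 1, kernel_size=1),   # one F0 value per frame
        )

        features = torch.zeros(1, N_FEATS, T)        # one utterance of binary linguistic features
        f0_contour = f0_model(features).squeeze(1)   # shape (1, T): predicted F0 contour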

    Initial CONNECT Architecture

    Interoperability remains a fundamental challenge when connecting heterogeneous systems that encounter and spontaneously communicate with one another in pervasive computing environments. This challenge is exacerbated by the highly heterogeneous technologies employed by each of the interacting parties, i.e., in terms of hardware, operating system, middleware protocols, and application protocols. The key aim of the CONNECT project is to remove this heterogeneity barrier and achieve universal interoperability. Here we report on the development of the overall CONNECT architecture that will underpin this solution; in this respect, we present the following contributions: i) an elicitation of interoperability requirements from a set of pervasive computing scenarios, ii) a survey of existing solutions to interoperability, iii) an initial view of the CONNECT architecture, and iv) a series of experiments to provide initial validation of the architecture.

    Prosody in text-to-speech synthesis using fuzzy logic

    For over a thousand years, inventors, scientists, and researchers have tried to reproduce human speech. Today, the quality of synthesized speech is still not equivalent to that of real speech. Most research on speech synthesis focuses on improving the quality of the speech produced by Text-to-Speech (TTS) systems. The best TTS systems use unit-selection concatenation to synthesize speech; however, this method is very time-consuming and requires a very large speech database. Diphone-concatenation synthesis requires less memory, but it sounds robotic. This thesis explores the use of fuzzy logic to make diphone-concatenated speech sound more natural. A TTS system is built using both neural networks and fuzzy logic: text is converted into phonemes using neural networks, and fuzzy logic is used to control the fundamental frequency for three types of sentences. In conclusion, the fuzzy system produces F0 contours that make the diphone-concatenated speech sound more natural.
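
    A minimal sketch of how such a fuzzy rule base might shape F0 (the rules, ranges, and membership functions below are invented for illustration and are not the thesis's system): a tiny Mamdani-style controller maps normalised position within a declarative sentence to an F0 offset, yielding a falling contour.

        import numpy as np

        # Invented rules, ranges, and membership functions; the thesis's actual
        # rule base is not reproduced here.
        def trimf(x, a, b, c):
            """Triangular membership function peaking at b."""
            return np.maximum(np.minimum((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0)

        def f0_offset(position):
            """Map normalised sentence position (0.0 = start, 1.0 = end) to an F0 offset in Hz."""
            universe = np.linspace(-40.0, 40.0, 201)           # candidate F0 offsets
            early  = trimf(position, -0.5, 0.0, 0.5)           # rule strengths for the input
            middle = trimf(position,  0.0, 0.5, 1.0)
            late   = trimf(position,  0.5, 1.0, 1.5)
            agg = np.maximum.reduce([                          # early -> raised, middle -> neutral,
                np.minimum(early,  trimf(universe,   0.0, 30.0, 60.0)),   # late -> lowered F0
                np.minimum(middle, trimf(universe, -20.0,  0.0, 20.0)),
                np.minimum(late,   trimf(universe, -60.0, -30.0, 0.0)),
            ])
            return float(np.sum(universe * agg) / (np.sum(agg) + 1e-9))   # centroid defuzzification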

    Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE)

    The Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE) model is a new control model for the VocalTractLab (VTL) [15] speech synthesizer, a simulator of the human speech system. It is capable of synthesizing single words in the German language. The speech synthesis can be based on a target semantic vector or on target acoustics, i.e., a recorded word token; the recording may come from any speaker, while resynthesis always uses the VTL's standard speaker, and synthesis quality varies with the word meaning and the audio file. VTL is controlled by 30 parameters, which have to be estimated for each time point during the production of a word, roughly every 2.5 milliseconds. The time series of these 30 control parameters (cps) of the VTL are the control parameter trajectories (cp-trajectories).
    The high dimensionality of the cp-trajectories, in combination with non-linear interactions, leads to a many-to-one mapping problem in which many sets of cp-trajectories produce highly similar synthesized audio. PAULE solves this many-to-one mapping problem by anticipating the effects of cp-trajectories and minimizing a semantic and acoustic error between this anticipation and a targeted meaning and acoustics. The quality of the anticipation is improved by an outer loop in which PAULE listens to itself. PAULE has three central design features that distinguish it from other control models. First, PAULE does not use any symbolic units: neither motor primitives, articulatory targets, nor gestural scores on the movement side, nor any phone or syllable representation on the acoustic side. Second, PAULE is a learning model that accumulates experience with articulated words; as a consequence, it will not find a global optimum for the inverse kinematic optimization task it has to solve, but a local optimum conditioned on its past experience. Third, PAULE uses gradient-based internal prediction errors of a predictive forward model to plan cp-trajectories for a given semantic or acoustic target. Thus, PAULE is an error-driven model that takes its previous experience into account.
    Pilot study results indicate that PAULE is able to minimize the semantic and acoustic error in the resynthesized audio. This allows PAULE to find cp-trajectories that a classification model identifies as the correct word with an accuracy of 60 %, close to the 63 % accuracy obtained for human recordings. Furthermore, PAULE seems to model vowel-to-vowel anticipatory coarticulation in terms of formant shifts correctly and can be compared to human electromagnetic articulography (EMA) recordings in a straightforward way. In addition, with PAULE it is possible to condition on already executed past cp-trajectories and to smoothly continue the cp-trajectories from the current state. As a side effect of developing PAULE, it is possible to create large amounts of training data for the VTL through an automated segment-based approach. Next steps in the development of PAULE include adding a somatosensory feedback channel, extending PAULE from producing single words to the articulation of small utterances, and adding a thorough evaluation.
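
    A minimal sketch of the gradient-based planning idea described above, under assumed shapes, layer sizes, and loss weighting (this is not the PAULE implementation): a frozen forward model predicts acoustics from cp-trajectories, and planning descends the acoustic error with respect to the trajectories themselves, with a smoothness penalty standing in for the near-stationary-movement constraint.

        import torch

        # Assumed shapes, layer sizes, and loss weighting; this only illustrates
        # the planning-by-gradient-descent pattern, not the actual PAULE code.
        T, N_CP, N_MEL = 200, 30, 60                 # frames, VTL control params, mel bins

        forward_model = torch.nn.Sequential(          # stand-in for the learned predictive model
            torch.nn.Linear(N_CP, 128), torch.nn.Tanh(), torch.nn.Linear(128, N_MEL)
        )
        for p in forward_model.parameters():
            p.requires_grad_(False)                   # the forward model stays fixed while planning

        target_acoustics = torch.randn(T, N_MEL)      # placeholder target spectrogram
        cp_traj = torch.zeros(T, N_CP, requires_grad=True)   # trajectories to be planned
        opt = torch.optim.Adam([cp_traj], lr=0.01)

        for step in range(500):
            opt.zero_grad()
            pred = forward_model(cp_traj)             # anticipated acoustics
            acoustic_err = torch.mean((pred - target_acoustics) ** 2)
            smoothness = torch.mean(torch.diff(cp_traj, dim=0) ** 2)   # prefer near-stationary moves
            loss = acoustic_err + 0.1 * smoothness
            loss.backward()                           # gradients flow into cp_traj only
            opt.step()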