
127 research outputs found

Analysis and resynthesis of polyphonic music

Author: Nunn Douglas John Edgar
Publication date: 01/01/1997

This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a system that can analyse, transcribe, process, and resynthesise monaural polyphonic music. I then describe and compare the possible hardware and software platforms. After this I describe a prototype hybrid system that attempts to carry out these tasks using a method based on additive synthesis. Next I present results from its application to a variety of musical examples, and critically assess its performance and limitations. I then address these issues in the design of a second system based on Gabor wavelets. I conclude by summarising the research and outlining suggestions for future developments.

Durham e-Theses

A computational framework for sound segregation in music signals

Author: Martins Luís Gustavo Pereira Marques
Publication date: 01/01/2008

Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia. Universidade do Porto. 200

Repositório Aberto da Universidade do Porto


A new user interface for musical timbre design

Author: Seago Allan
Publication date: 01/01/2009

This thesis characterises and addresses problems and issues associated with the design of intuitive user interfaces for timbral control. The usability of a range of synthesis methods and representative implementations of these methods is assessed, and three interface architectures - fixed architecture, architecture specification and direct specification - are identified. The characteristics of each of these architectures, as well as the usability problems inherent to each of them, are discussed; it is argued that none of them provides intuitive tools for the manipulation and control of timbre. The study examines the nature of timbre and the notion of timbre space; different kinds of timbre space are considered and criteria are proposed for the selection of suitable timbre spaces as vehicles for synthesis. A number of listening tests, designed to demonstrate the feasibility of subsequent work, were devised and carried out; the results of these tests provide evidence that, where Euclidean distances between sounds located in a given timbre space are reflected in perceptual distances, the ability of subjects to detect relative distances in different parts of the space varies with the perceptual granularity of the space. Three contrasting timbre spaces conforming to the proposed criteria for use in synthesis are constructed; the purpose of these spaces is to provide an environment for a novel user interaction approach for timbral design which incorporates a search strategy based on weighted centroid localization. Two prototypes which exemplify the proposed approach in alternative ways are designed, implemented and tested with potential users in order to validate the approach; a third prototype, which represents a simple contrasting alternative, is tested for purposes of comparison. The results of these tests are evaluated and discussed, and areas of further work are identified.

Open Research Online (The Open University)

OpenGrey Repository

Learning to Behave: Internalising Knowledge

Publication venue: University Library/University of Twente
Publication date: 21/11/2000

University of Twente Research Information

DMRN+11: Digital Music Research Network Workshop Proceedings 2016

Author: DMRN+11: Digital Music Research Network Workshop 2016
KUDUMAKIS P
SANDLER M
Publication venue: Queen Mary University of London
Publication date: 26/01/2017

Crossref

Queen Mary Research Online

Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis

Author: Kuczmarski Tomasz
Publication date: 01/01/2022

Wydział Neofilologii (Faculty of Modern Languages)
This work presents an attempt to build a neurobiologically inspired, Convolutional Neural Network-based model of the mappings from discrete high-level linguistic categories to a continuous fundamental frequency signal in Polish neutral read speech.
After a brief introduction to the research problem in the context of intonation, speech synthesis and the phonetics-phonology gap, the work describes the training of the model on a special speech corpus, and an evaluation of the naturalness of the F0 contour produced by the trained model through ABX and MOS perception experiments conducted with the help of a specially built Neural Source Filter resynthesizer. Next, an in-depth exploration of the phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance Propagation explainability method was used to perform an extensive quantitative analysis of the relevance of 1297 specially engineered linguistic input features and their groupings at various levels of abstraction for the specific contours of the fundamental frequency. The work ends with an in-depth interpretation of these results and a discussion of the advantages and disadvantages of the current method, and lists a number of potential future improvements.
The research presented in this work was partially carried out under the Harmonia research grant no. UMO-2014/14/M/HS2/00631 awarded by Narodowe Centrum Nauki.

Adam Mickiewicz University Repository

Repozytorium Uniwersytetu im. Adama Mickiewicza (AMUR)


The effect of electronically mediated sound on group musical interaction: A case study of the practice and development of the Automatic Writing Circle

Author: Gardner Thomas

The interaction between musicians has been one of the traditional strengths of music: it stretches to include an audience and ritual participants but has its origins in group activity, the interpersonal responses of one musician to another. This thesis examines the way that electronic media have transformed the interactions between musicians, particularly in the context of live performance. A central theme is the way in which mediatisation creates new splits within previously integrated musical situations and also merges differences usually defined by physical boundaries. The theories of Gregory Bateson on schizophrenia and Erving Goffman on Situationism are brought together to create a new understanding of the term "schizophonia". This rehabilitated concept is proposed as the key to a creative exploration of new situations and discontinuities which make up group performance in a mediatised environment. In practical terms the exploration of new musical situations is documented in the following projects: the material created for the group "Automatic Writing Circle" during its evolution over a period of six years (compositions, software, instruments), development of the Ouija Board and accompanying software, composition of the piece Lipsync and the earlier piece I slept by numbers for flute and live electronics.

City Research Online

Initial CONNECT Architecture

Author: Bertolino Antonia
Blair Gordon
Chauvel Franck
Flores Cortes Carlos
Georgantas Nikolaos
Grace Paul
Howar Falk
Huyn Tran
Jonsson Bengt
Paolucci Massimo
Pathak Animesh
Souville Bertrand
Tivoli Massimo
Publication venue: HAL CCSD
Publication date: 15/02/2010

Interoperability remains a fundamental challenge when connecting heterogeneous systems which encounter and spontaneously communicate with one another in pervasive computing environments. This challenge is exacerbated by the highly heterogeneous technologies employed by each of the interacting parties, i.e., in terms of hardware, operating system, middleware protocols, and application protocols. The key aim of the CONNECT project is to drop this heterogeneity barrier and achieve universal interoperability. Here we report on the development of the overall CONNECT architecture that will underpin this solution; in this respect, we present the following contributions: i) an elicitation of interoperability requirements from a set of pervasive computing scenarios, ii) a survey of existing solutions to interoperability, iii) an initial view of the CONNECT architecture, and iv) a series of experiments to provide initial validation of the architecture.

INRIA a CCSD electronic archive server

Prosody in text-to-speech synthesis using fuzzy logic

Author: Williams Jonathan Brent
Publication venue: The Research Repository @ WVU
Publication date: 01/12/2005

For over a thousand years, inventors, scientists and researchers have tried to reproduce human speech. Today, the quality of synthesized speech is not equivalent to the quality of real speech. Most research on speech synthesis focuses on improving the quality of the speech produced by Text-to-Speech (TTS) systems. The best TTS systems use unit selection-based concatenation to synthesize speech. However, this method is very time-consuming and the speech database is very large. Diphone-concatenated synthesized speech requires less memory, but sounds robotic. This thesis explores the use of fuzzy logic to make diphone-concatenated speech sound more natural. A TTS system is built using both neural networks and fuzzy logic. Text is converted into phonemes using neural networks. Fuzzy logic is used to control the fundamental frequency for three types of sentences. In conclusion, the fuzzy system produces f0 contours that make the diphone-concatenated speech sound more natural.

The Research Repository @ WVU (West Virginia University)

Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE)

Author: Sering Konstantin Florian
Publication venue: Universität Tübingen
Publication date: 21/12/2023

The Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE) model is a new model for controlling the articulatory speech synthesizer VocalTractLab (VTL) [15]. PAULE can synthesize German words. Word synthesis can either be started from a semantic vector that encodes the word meaning, together with the desired duration of the synthesized word, or it can resynthesize an audio file. The audio file may contain arbitrary speaker recordings, but the resynthesis is always produced with the default speaker of the VTL. Synthesis quality varies depending on the word meaning and the audio file. What is new about PAULE is its predictive approach: from the planned articulation it predicts the corresponding perceptual acoustics and derives the word meaning from them. Both the acoustics and the word meaning are implemented as metric vector spaces. This makes it possible to compute and minimize an error relative to a desired target acoustics and target meaning. The minimized error is not the actual error that results from synthesis with the VTL but the error generated from the predictions of a predictive model. Although it is not the actual error, it can be used to improve the actual articulation. To bring the predictive model in line with the actual acoustics, PAULE listens to itself. A central one-to-many problem in speech synthesis is that a given acoustics can be produced by many different articulations. PAULE resolves this one-to-many problem through prediction error minimization, together with the constraint that the articulation be as stationary as possible and executed with as constant a force as possible.
PAULE works without any symbolic representation in the acoustics (phonemes) or in the articulation (motor gestures or targets). PAULE thereby shows that spoken words can be modelled without a symbolic level of description. Spoken language might therefore rest on a fundamentally different level of processing than written language. PAULE integrates experiential knowledge successively. As a result, PAULE does not find the globally best articulation but locally good articulations. Internally, PAULE relies on artificial neural networks and their gradients, which are used for error correction. PAULE can neither synthesize whole sentences nor take somatosensory feedback into account. Preliminary work exists for both and is to be integrated into future versions.
The Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE) model is a new control model for the VocalTractLab (VTL) [15] speech synthesizer, a simulator of the human speech system. It is capable of synthesizing single words in the German language. The speech synthesis can be based on a target semantic vector or on target acoustics, i.e., a recorded word token. VTL is controlled by 30 parameters. These parameters have to be estimated for each time point during the production of a word, which is roughly every 2.5 milliseconds. The time-series of these 30 control parameters (cps) of the VTL are the control parameter trajectories (cp-trajectories). The high dimensionality of the cp-trajectories in combination with non-linear interactions leads to a many-to-one mapping problem, where many sets of cp-trajectories produce highly similar synthesized audio. PAULE solves this many-to-one mapping problem by anticipating the effects of cp-trajectories and minimizing a semantic and acoustic error between this anticipation and a targeted meaning and acoustics.
The quality of the anticipation is improved by an outer loop, where PAULE listens to itself. PAULE has three central design features that distinguish it from other control models: First, PAULE does not use any symbolic units: neither motor primitives, articulatory targets, nor gestural scores on the movement side, nor any phone or syllable representation on the acoustic side. Second, PAULE is a learning model that accumulates experience with articulated words. As a consequence, PAULE will not find a global optimum for the inverse kinematic optimization task it has to solve. Instead, it finds a local optimum that is conditioned on its past experience. Third, PAULE uses gradient-based internal prediction errors of a predictive forward model to plan cp-trajectories for a given semantic or acoustic target. Thus, PAULE is an error-driven model that takes its previous experiences into account. Pilot study results indicate that PAULE is able to minimize a semantic and acoustic error in the resynthesized audio. This allows PAULE to find cp-trajectories that are correctly classified by a classification model as the correct word with an accuracy of 60 %, which is close to the accuracy for human recordings of 63 %. Furthermore, PAULE seems to model vowel-to-vowel anticipatory coarticulation in terms of formant shifts correctly and can be compared to human electromagnetic articulography (EMA) recordings in a straightforward way. In addition, with PAULE it is possible to condition on already executed past cp-trajectories and to smoothly continue the cp-trajectories from the current state. As a side-effect of developing PAULE, it is possible to create large amounts of training data for the VTL through an automated segment-based approach. Next steps in the development of PAULE include adding a somatosensory feedback channel, extending PAULE from producing single words to the articulation of small utterances, and adding a thorough evaluation.

Publikationsserver der Universität Tübingen