127 research outputs found
Analysis and resynthesis of polyphonic music
This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a system that can analyse, transcribe, process, and resynthesise monaural polyphonic music. I then describe and compare the possible hardware and software platforms. After this I describe a prototype hybrid system that attempts to carry out these tasks using a method based on additive synthesis. Next I present results from its application to a variety of musical examples, and critically assess its performance and limitations. I then address these issues in the design of a second system based on Gabor wavelets. I conclude by summarising the research and outlining suggestions for future developments.
A computational framework for sound segregation in music signals
Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200
A new user interface for musical timbre design
This thesis characterises and addresses problems and issues associated with the design of intuitive user interfaces for timbral control. The usability of a range of synthesis methods and representative implementations of these methods is assessed, and three interface architectures - fixed architecture, architecture specification and direct specification - are identified. The characteristics of each of these architectures, as well as problems of usability inherent to each of them are discussed; it is argued that none of them provide intuitive tools for the manipulation and control of timbre.
The study examines the nature of timbre and the notion of timbre space; different kinds of timbre space are considered and criteria are proposed for the selection of suitable timbre spaces as vehicles for synthesis.
A number of listening tests, designed to demonstrate the feasibility of subsequent work, were devised and carried out; the results of these tests provide evidence that, where Euclidean distances between sounds located in a given timbre space are reflected in perceptual distances, the ability of subjects to detect relative distances in different parts of the space varies with the perceptual granularity of the space.
Three contrasting timbre spaces conforming to the proposed criteria for use in synthesis are constructed; the purpose of these spaces is to provide an environment for a novel user interaction approach for timbral design which incorporates a search strategy based on weighted centroid localization. Two prototypes which exemplify the proposed approach in alternative ways are designed, implemented and tested with potential users in order to validate the approach; a third prototype, which represents a simple contrasting alternative, is tested for purposes of comparison. The results of these tests are evaluated and discussed, and areas of further work identified.
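As a rough illustration of the search strategy named above, weighted centroid localization estimates a position as the weighted mean of known reference points. The sketch below is not taken from the thesis: the function name, the three-dimensional coordinates, and the use of user similarity ratings as weights are all assumptions made for the example.

```python
from typing import List, Sequence


def weighted_centroid(points: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Return the weighted mean of `points`; larger weights pull the result closer."""
    if not points or len(points) != len(weights):
        raise ValueError("points and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dims = len(points[0])
    return [
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    ]


# Hypothetical probe sounds in a 3-D timbre space, with hypothetical
# similarity ratings (higher = judged closer to the sound being sought).
probes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ratings = [1.0, 3.0, 1.0]
estimate = weighted_centroid(probes, ratings)
```

In an interactive setting, the estimate could serve as the centre of the next, narrower set of probe sounds, so that repeated ratings gradually converge on the desired timbre.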
Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis
Faculty of Modern Languages (Wydział Neofilologii). This work presents an attempt to build a neurobiologically inspired Convolutional Neural Network-based model of the mappings from discrete high-level linguistic categories to a continuous fundamental frequency signal in Polish neutral read speech. After a brief introduction to the research problem in the context of intonation, speech synthesis and the phonetics-phonology gap, the work describes the training of the model on a special speech corpus and an evaluation of the naturalness of the F0 contour produced by the trained model through ABX and MOS perception experiments conducted with the help of a specially built Neural Source Filter resynthesizer. Finally, an in-depth exploration of the phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance Propagation explainability method was used to perform an extensive quantitative analysis of the relevance of 1297 specially engineered linguistic input features and their groupings at various levels of abstraction for the specific contours of the fundamental frequency. The work ends with an in-depth interpretation of these results, a discussion of the advantages and disadvantages of the current method, and a list of potential future improvements.
The research presented in this work was partially carried out under the Harmonia research grant no. UMO-2014/14/M/HS2/00631 awarded by the Narodowe Centrum Nauki (National Science Centre).
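To give a sense of how Layer-wise Relevance Propagation assigns relevance to input features, the sketch below applies the commonly used epsilon rule to a single bias-free linear layer. It is only an illustration under stated assumptions: the thesis uses a convolutional network, and the function name, toy weights, and choice of the epsilon rule are not taken from it.

```python
def lrp_epsilon(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance onto the inputs of a bias-free linear layer.

    weights[i][j] connects input i to output j. Each input receives a share of
    an output's relevance proportional to its contribution a_i * w_ij.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_j = sum_i a_i * w_ij.
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The epsilon term keeps the denominator away from zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


# Toy values (assumptions): two input features, two output units.
a = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 1.0]]
R_out = [1.0, 1.0]
R_in = lrp_epsilon(a, W, R_out)
```

With a small epsilon and no bias, the total relevance is approximately conserved from the outputs to the inputs, which is what makes per-feature relevance scores comparable across input categories.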
The effect of electronically mediated sound on group musical interaction: A case study of the practice and development of the Automatic Writing Circle
The interaction between musicians has been one of the traditional strengths of music: it stretches to include an audience and ritual participants but has its origins in group activity, the interpersonal responses of one musician to another. This thesis examines the way that electronic media have transformed the interactions between musicians, particularly in the context of live performance. A central theme is the way in which mediatisation creates new splits within previously integrated musical situations and also merges differences usually defined by physical boundaries.
The theories of Gregory Bateson on schizophrenia and Erving Goffman on Situationism are brought together to create a new understanding of the term "schizophonia". This rehabilitated concept is proposed as the key to a creative exploration of new situations and discontinuities which make up group performance in a mediatised environment.
In practical terms the exploration of new musical situations is documented in the following projects: the material created for the group "Automatic Writing Circle" during its evolution over a period of six years (compositions, software, instruments), development of the Ouija Board and accompanying software, composition of the piece Lipsync and the earlier piece I slept by numbers for flute and live electronics.
Initial CONNECT Architecture
Interoperability remains a fundamental challenge when connecting heterogeneous systems which encounter and spontaneously communicate with one another in pervasive computing environments. This challenge is exacerbated by the highly heterogeneous technologies employed by each of the interacting parties, i.e., in terms of hardware, operating system, middleware protocols, and application protocols. The key aim of the CONNECT project is to drop this heterogeneity barrier and achieve universal interoperability. Here we report on the development of the overall CONNECT architecture that will underpin this solution; in this respect, we present the following contributions: i) an elicitation of interoperability requirements from a set of pervasive computing scenarios, ii) a survey of existing solutions to interoperability, iii) an initial view of the CONNECT architecture, and iv) a series of experiments to provide initial validation of the architecture.
Prosody in text-to-speech synthesis using fuzzy logic
For over a thousand years, inventors, scientists and researchers have tried to reproduce human speech. Today, the quality of synthesized speech is still not equivalent to the quality of real speech. Most research on speech synthesis focuses on improving the quality of the speech produced by Text-to-Speech (TTS) systems. The best TTS systems use unit selection-based concatenation to synthesize speech. However, this method is very time-consuming and requires a very large speech database. Diphone concatenated synthesized speech requires less memory, but sounds robotic. This thesis explores the use of fuzzy logic to make diphone concatenated speech sound more natural. A TTS system is built using both neural networks and fuzzy logic. Text is converted into phonemes using neural networks. Fuzzy logic is used to control the fundamental frequency for three types of sentences. In conclusion, the fuzzy system produces f0 contours that make the diphone concatenated speech sound more natural.
Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE)
The Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE) model is a new model for controlling the articulatory speech synthesizer VocalTractLab (VTL) [15]. PAULE can synthesize German words. Word synthesis can either be started from a semantic vector that encodes the word meaning, together with the desired duration of the synthesis, or be performed as a resynthesis of an audio file. The audio file can contain recordings of any speaker, but the resynthesis is always produced with the default speaker of the VTL. The synthesis quality varies depending on the word meaning and the audio file.
What is new in PAULE is its predictive approach: from the planned articulation it predicts the corresponding perceptual acoustics and derives the word meaning from them. Both the acoustics and the word meaning are implemented as metric vector spaces. This makes it possible to compute and minimize an error with respect to a desired target acoustics and target meaning. The minimized error is not the actual error that results from synthesis with the VTL, but the error generated from the predictions of a predictive model. Although it is not the actual error, it can be used to improve the actual articulation. To bring the predictive model in line with the actual acoustics, PAULE listens to itself.
A central one-to-many problem in speech synthesis is that one acoustic result can be produced by many different articulations. PAULE resolves this one-to-many problem through prediction-error minimization, together with the constraint that the articulation be as stationary as possible and be executed with as constant a force as possible. PAULE works without any symbolic representation in the acoustics (phonemes) or in the articulation (motor gestures or targets). PAULE thereby shows that spoken words can be modeled without a symbolic level of description. Spoken language might therefore be based on a fundamentally different level of processing than written language. PAULE integrates experiential knowledge successively. As a result, PAULE finds not the globally best articulation but locally good articulations. Internally, PAULE relies on artificial neural networks and their gradients, which are used for error correction. PAULE can neither synthesize whole sentences nor take somatosensory feedback into account. Preliminary work exists on both, and it is to be integrated into future versions.
The Predictive Articulatory speech synthesis Utilizing Lexical Embeddings (PAULE)
model is a new control model for the VocalTractLab (VTL) [15] speech synthesizer, a simulator of the human speech system. It is capable of synthesizing single words in the German language. The speech synthesis can be based on a target semantic vector or on target acoustics, i.e., a recorded word token. VTL is controlled by 30 parameters. These parameters have to be estimated for each time point during the production of a word, which is roughly every 2.5 milliseconds. The time-series of these 30 control parameters (cps) of the VTL are the control parameter trajectories (cp-trajectories). The high dimensionality of the cp-trajectories in combination with non-linear interactions leads to a many-to-one mapping problem, where many sets of cp-trajectories produce highly similar synthesized audio.
PAULE solves this many-to-one mapping problem by anticipating the effects of cp-trajectories and minimizing a semantic and acoustic error between this anticipation and a targeted meaning and acoustics. The quality of the anticipation is improved by an outer loop, where PAULE listens to itself. PAULE has three central design features that distinguish it from other control models: First, PAULE does not use any symbolic units, neither motor primitives, articulatory targets, nor gestural scores on the movement side, nor any phone or syllable representation on the acoustic side. Second, PAULE is a learning model that accumulates experience with articulated words. As a consequence, PAULE will not find a global optimum for the inverse kinematic optimization task it has to solve. Instead, it finds a local optimum that is conditioned on its past experience. Third, PAULE uses gradient-based internal prediction errors of a predictive forward model to plan cp-trajectories for a given semantic or acoustic target. Thus, PAULE is an error-driven model that takes its previous experiences into account.
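The idea of planning a trajectory by descending the gradient of a predicted error can be sketched in a few lines. This is a deliberately simplified toy and not PAULE's implementation. A single control parameter stands in for the 30 VTL parameters, a linear gain stands in for the neural forward model, and the smoothness penalty is an assumed regulariser. All names and numeric values are hypothetical.

```python
def predicted_error(cp, target, gain, smooth):
    """Squared error between predicted and target acoustics, plus a penalty
    on step-to-step changes in the control trajectory."""
    acoustic = sum((gain * c - t) ** 2 for c, t in zip(cp, target))
    velocity = sum((b - a) ** 2 for a, b in zip(cp, cp[1:]))
    return acoustic + smooth * velocity


def gradient(cp, target, gain, smooth):
    """Analytic gradient of predicted_error with respect to each trajectory point."""
    grad = [2.0 * gain * (gain * c - t) for c, t in zip(cp, target)]
    for i in range(len(cp) - 1):
        diff = cp[i + 1] - cp[i]
        grad[i] -= 2.0 * smooth * diff
        grad[i + 1] += 2.0 * smooth * diff
    return grad


def plan(target, gain=2.0, smooth=0.1, lr=0.05, steps=200):
    """Start from a flat trajectory and follow the negative gradient."""
    cp = [0.0] * len(target)
    for _ in range(steps):
        g = gradient(cp, target, gain, smooth)
        cp = [c - lr * gi for c, gi in zip(cp, g)]
    return cp


# Toy target "acoustics" over five time steps (assumed values).
target = [0.0, 1.0, 2.0, 1.0, 0.0]
initial_error = predicted_error([0.0] * len(target), target, 2.0, 0.1)
planned = plan(target)
final_error = predicted_error(planned, target, 2.0, 0.1)
```

The error being minimized here is the forward model's prediction, not the output of an actual synthesizer, which mirrors the distinction the abstract draws between internal prediction errors and the real acoustic result.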
Pilot study results indicate that PAULE is able to minimize a semantic and acoustic error in the resynthesized audio. This allows PAULE to find cp-trajectories that are correctly classified by a classification model as the correct word with an accuracy of 60 %, which is close to the accuracy for human recordings of 63 %. Furthermore, PAULE seems to model vowel-to-vowel anticipatory coarticulation in terms of formant shifts correctly and can be compared to human electromagnetic articulography (EMA) recordings in a straightforward way. With PAULE it is also possible to condition on already executed past cp-trajectories and to smoothly continue the cp-trajectories from the current state. As a side-effect of developing PAULE, it is possible to create large amounts of training data for the VTL through an automated segment-based approach.
Next steps in the development of PAULE include adding a somatosensory feedback channel, extending PAULE from producing single words to the articulation of small utterances, and adding a thorough evaluation.
- …