
    Measuring musics: Notes on modes, motifs, and melodies

    This dissertation develops computational methods to measure properties of musical traditions, with the aim of comparing them. It analyzes sheet music from a range of traditions, to which end two corpora of Western plainchant are introduced (Cantus Corpus and GregoBase Corpus). These corpora are used to confirm the melodic arch hypothesis, explore regularity in antiphon-differentia connections, and compose artificial chant using a recurrent neural language model. The central chant study proposes a distributional approach to mode classification that can determine mode fairly accurately even when all pitch information has been discarded; however, this works best when the chants are segmented into ‘natural units’ corresponding to textual units such as syllables and words. Breaking down music into smaller units, or motifs, is the second theme of the dissertation. It is shown how rhythmic motifs can be used to effectively visualize rhythmic data, from music and animal vocalizations, in a rhythm triangle, an idea that is also extended to melodic data. The third theme concerns the shapes of melodies. The dissertation introduces Cosine Contours: a continuous representation of melodic contour, motivated by the observation that the principal components of melodic datasets approximate cosines. A second study on contour suggests that, contrary to several previous studies, contour should indeed be considered a continuous phenomenon, as no evidence is found that contours cluster into distinct types. The dissertation ends with a case study that applies a formal analysis to the ‘formal’ music of Arvo Pärt by reconstructing almost the entire score of ‘Summa’ using formal procedures.
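
    A minimal sketch of the cosine-contour idea described above: a melody's pitch sequence is resampled to a fixed length and summarised by its first few DCT (cosine) coefficients. The function and parameter names are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np
from scipy.fft import dct

def cosine_contour(pitches, n_samples=100, n_coefficients=10):
    """Approximate a melodic contour by low-order cosine (DCT-II) coefficients."""
    pitches = np.asarray(pitches, dtype=float)
    # Resample the pitch sequence onto a fixed-length grid.
    positions = np.linspace(0, len(pitches) - 1, n_samples)
    resampled = np.interp(positions, np.arange(len(pitches)), pitches)
    # The leading DCT coefficients capture the overall shape (e.g. an arch).
    return dct(resampled, norm="ortho")[:n_coefficients]

# Example: a simple arch-shaped melody (MIDI pitch numbers).
melody = [60, 62, 64, 67, 69, 67, 64, 62, 60]
print(cosine_contour(melody, n_coefficients=4))
```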

    Advances in the neurocognition of music and language


    Three-dimensional point-cloud room model in room acoustics simulations


    Corpus-based unit selection for natural-sounding speech synthesis

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 179-196). This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. Over the past decade, a trend toward non-parametric, corpus-based approaches has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs, which grade which unit contexts are interchangeable and which concatenations are imperceptible. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been successfully used in speech and language processing applications, including speech recognition. A proposed constraint kernel topology connects all units in the corpus with associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances and enhance corpus coverage. This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability. Experimental subjects completing tasks in a given air travel planning scenario by interacting in real time with a spoken dialogue system over the telephone found the system "easiest to understand" out of eight competing systems. In more detailed listening evaluations, subjective opinions garnered from human participants are found to be correlated with objective measures calculable by machine. By Jon Rong-Wei Yi, Ph.D.
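
    The thesis itself casts unit selection as an FST-based Viterbi search; the following is only a minimal dynamic-programming sketch of the underlying idea, assuming precomputed substitution (target) and concatenation (join) cost functions. All names and the data layout are illustrative.

```python
def select_units(targets, candidates, sub_cost, concat_cost):
    """targets: list of target specifications (e.g. phones in context).
    candidates: candidates[i] is the list of corpus units usable for targets[i].
    Returns the unit sequence minimising total substitution + concatenation cost."""
    # best[i][j]: cheapest total cost of a path ending in candidates[i][j]
    best = [[sub_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            scores = [best[i - 1][k] + concat_cost(prev, u)
                      for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(scores)), key=scores.__getitem__)
            row.append(scores[k_best] + sub_cost(targets[i], u))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back the cheapest path from the final target.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        if i > 0:
            j = back[i][j]
    return list(reversed(path))
```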

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The proceedings of the MAVEBA Workshop, held every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images in support of clinical diagnosis and the classification of vocal pathologies.

    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained from examples. Despite recent successes in neural network-based end-to-end learning on the raw audio signal, models relying on hand-crafted acoustic features remain superior in some domains, especially for tasks where data is scarce. One major issue, however, is that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static, fixed-length input. Even for dynamic classifiers, summarising the LLDs over a temporal block can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which represents a document as a histogram of the terms it contains. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, covering a wide range of potential applications and corresponding databases, from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while preserving advantages of functionals such as data-independence. Furthermore, the two representations are shown to be complementary, and their fusion improves the performance of a machine listening system.
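
    A minimal sketch of the bag-of-audio-words idea, assuming each recording is already represented as a (frames x features) matrix of acoustic LLDs. The actual experiments in the thesis use the openXBOW toolkit; this only illustrates the codebook-plus-histogram construction, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(training_llds, n_words=500, seed=0):
    """Cluster pooled training frames into 'audio words'."""
    frames = np.vstack(training_llds)          # pool frames from all recordings
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(frames)

def boaw_histogram(llds, codebook):
    """Assign each frame to its nearest audio word and count occurrences."""
    words = codebook.predict(llds)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)         # term-frequency normalisation

# Usage (hypothetical data): a fixed-length vector regardless of recording length.
# codebook = learn_codebook([lld_matrix_1, lld_matrix_2, ...])
# x = boaw_histogram(lld_matrix_new, codebook)
```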

    Intonation Modelling for Speech Synthesis and Emphasis Preservation

    Speech-to-speech translation is a framework which recognises speech in an input language, translates it to a target language, and synthesises speech in this target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost as the information goes through the different building blocks of the translation process. The work presented in this thesis addresses aspects of speech synthesis which are lost in traditional speech-to-speech translation approaches. The main research axis of this thesis is the study of prosody for speech synthesis and emphasis preservation. A first investigation of regional accents of spoken French is carried out to understand the sensitivity of native listeners with respect to accented speech synthesis. Listening tests show that standard adaptation methods for speech synthesis are not sufficient for listeners to perceive accentedness. On the other hand, combining adaptation with original prosody allows perception of accents. Addressing the need for a more suitable prosody model, a physiologically plausible intonation model is proposed. Inspired by the command-response model, it has basic components which can be related to muscle responses to nerve impulses. These components are assumed to be a representation of muscle control of the vocal folds. A motivation for such a model is its theoretical language independence, based on the fact that humans share the same vocal apparatus. An automatic parameter extraction method which integrates a perceptually relevant measure is proposed with the model. This approach is evaluated and compared with the standard command-response model. Two corpora including sentences with emphasised words are presented in the context of the SIWIS project. The first is a multilingual corpus with speech from multiple speakers; the second is a high-quality, speech-synthesis-oriented corpus from a professional speaker. Two broad uses of the model are evaluated. The first shows that it is difficult to predict model parameters; however, the second shows that parameters can be transferred in the context of emphasis synthesis. A relation between model parameters and linguistic features such as stress and accent is demonstrated, and similar observations are made between the parameters and emphasis. We then investigate the extraction of atoms in emphasised speech and their transfer to neutral speech, which turns out to elicit the perception of emphasis. Using clustering methods, this is extended to the emphasis of other words, using linguistic context. This approach is validated by listening tests for English.
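
    For context, a minimal sketch of the standard (Fujisaki-style) command-response model that the proposed intonation model takes as its starting point: log F0 is a baseline plus phrase-command and accent-command responses. The parameter values and command lists below are illustrative only, not the thesis's own formulation.

```python
import numpy as np

def phrase_response(t, alpha=2.0):
    """Second-order critically damped response to an impulse phrase command."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Response to a stepwise accent command, saturating at gamma."""
    step = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(step, gamma), 0.0)

def f0_contour(t, fb, phrase_cmds, accent_cmds):
    """phrase_cmds: list of (onset_time, amplitude);
    accent_cmds: list of (onset, offset, amplitude)."""
    log_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(log_f0)

# Example: one phrase command and one accent command over two seconds.
t = np.linspace(0, 2, 400)
f0 = f0_contour(t, fb=110.0, phrase_cmds=[(0.0, 0.5)], accent_cmds=[(0.3, 0.7, 0.4)])
```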

    Learning disentangled speech representations

    A variety of informational factors are contained within the speech signal, and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, a method may capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody. The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstructing, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counterfactual questions. In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed; in a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks. This thesis explores a variety of use cases for disentangled representations, including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS rating or deepfake detection. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech); sometimes it is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.
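
    A toy illustration (not any model from the thesis) of how disentangled representations can be re-assembled for voice conversion: the content code of one utterance is combined with another speaker's code. The encoder and decoder classes are hypothetical placeholders showing only the structural idea.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        codes, _ = self.net(mel)
        return codes                         # frame-level content codes

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.net = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, mel):
        _, last = self.net(mel)
        return last[-1]                      # one utterance-level speaker code

class Decoder(nn.Module):
    def __init__(self, content_dim=128, speaker_dim=64, n_mels=80):
        super().__init__()
        self.net = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)
    def forward(self, content, speaker):
        speaker = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.net(torch.cat([content, speaker], dim=-1))
        return out

# Voice conversion: content from utterance A, speaker identity from utterance B.
# mel_a, mel_b = ...  (mel-spectrogram tensors of shape (1, frames, 80))
# converted = Decoder()(ContentEncoder()(mel_a), SpeakerEncoder()(mel_b))
```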