37 research outputs found
Models and analysis of vocal emissions for biomedical applications
This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies
Bag-of-words representations for computer audition
Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as, e.g., the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms as they require a static and fixed-length input. Moreover, also for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired from the bag-of-words method in natural language processing, forming a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits, while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and their fusion improves the performance of a machine listening system.Maschinelles HĂśren ist im täglichen Leben allgegenwärtig, mit Anwendungen, die von personalisierten virtuellen Agenten bis hin zum Gesundheitswesen reichen. Aus technischer Sicht besteht das Ziel darin, den Inhalt eines Audiosignals hinsichtlich einer Auswahl definierter Labels robust zu klassifizieren. Die Labels beschreiben bspw. die akustische Umgebung der Aufnahme, eine medizinische Diagnose oder - im Falle von Sprache - was gesagt wird oder wie es gesagt wird. Ăbliche Ansätze hierzu verwenden maschinelles Lernen, d.h., es werden anwendungsspezifische Modelle anhand von Beispieldaten trainiert. Trotz jĂźngster Erfolge beim Ende-zu-Ende-Lernen mittels neuronaler Netze, in welchen das unverarbeitete Audiosignal als Eingabe benutzt wird, sind Modelle, die auf definierten akustischen Merkmalen basieren, in manchen Bereichen weiterhin Ăźberlegen. Dies gilt im Besonderen fĂźr Einsatzzwecke, fĂźr die nur wenige Daten vorhanden sind. Allerdings besteht dabei das Problem, dass Zeitfolgen von akustischen Deskriptoren in viele Algorithmen des maschinellen Lernens nicht direkt eingespeist werden kĂśnnen, da diese eine statische Eingabe fester Länge benĂśtigen. AuĂerdem kann es auch fĂźr dynamische (zeitabhängige) Klassifikatoren vorteilhaft sein, die Deskriptoren Ăźber ein gewisses Zeitintervall zusammenzufassen. Jedoch hat die Art der Merkmalsdarstellung einen grundlegenden Einfluss auf die Leistungsfähigkeit des Modells. In der vorliegenden Dissertation wird der sogenannte Bag-of-Audio-Words-Ansatz (BoAW) als Alternative zum Standardansatz der statistischen Funktionale untersucht. BoAW ist eine Methode des unĂźberwachten Lernens von Merkmalsdarstellungen, die von der Bag-of-Words-Methode in der Computerlinguistik inspiriert wurde, bei der ein Textdokument als Histogramm der vorkommenden WĂśrter beschrieben wird. Das Toolkit openXBOW wird vorgestellt, welches systematisches Training und Optimierung dieser Merkmalsdarstellungen - vereinheitlicht fĂźr beliebige Modalitäten mit numerischen oder symbolischen Deskriptoren - erlaubt. Es werden einige Experimente zum BoAW-Ansatz durchgefĂźhrt und diskutiert, die sich auf eine groĂe Zahl mĂśglicher Anwendungen und entsprechende Datensätze beziehen, von der Emotionserkennung in gesprochener Sprache bis zur medizinischen Diagnostik. Die Auswertungen beinhalten einen Vergleich verschiedener akustischer Deskriptoren und Konfigurationen der BoAW-Methode. Die wichtigsten Erkenntnisse sind, dass BoAW-Merkmalsvektoren eine geeignete Alternative zu statistischen Funktionalen darstellen, gewisse VorzĂźge bieten und gleichzeitig wichtige Eigenschaften der Funktionale, wie bspw. die Datenunabhängigkeit, erhalten kĂśnnen. Zudem wird gezeigt, dass beide Darstellungen komplementär sind und eine Fusionierung die Leistungsfähigkeit eines Systems des maschinellen HĂśrens verbessert
Proceedings of the 7th Sound and Music Computing Conference
Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010
Towards the automated analysis of simple polyphonic music : a knowledge-based approach
PhDMusic understanding is a process closely related to the knowledge and experience
of the listener. The amount of knowledge required is relative to the
complexity of the task in hand.
This dissertation is concerned with the problem of automatically decomposing
musical signals into a score-like representation. It proposes that, as
with humans, an automatic system requires knowledge about the signal and
its expected behaviour to correctly analyse music.
The proposed system uses the blackboard architecture to combine the
use of knowledge with data provided by the bottom-up processing of the
signal's information. Methods are proposed for the estimation of pitches,
onset times and durations of notes in simple polyphonic music.
A method for onset detection is presented. It provides an alternative to
conventional energy-based algorithms by using phase information. Statistical
analysis is used to create a detection function that evaluates the expected
behaviour of the signal regarding onsets.
Two methods for multi-pitch estimation are introduced. The first concentrates
on the grouping of harmonic information in the frequency-domain.
Its performance and limitations emphasise the case for the use of high-level
knowledge.
This knowledge, in the form of the individual waveforms of a single
instrument, is used in the second proposed approach. The method is based
on a time-domain linear additive model and it presents an alternative to
common frequency-domain approaches.
Results are presented and discussed for all methods, showing that, if
reliably generated, the use of knowledge can significantly improve the quality
of the analysis.Joint Information Systems Committee (JISC) in the UK National Science Foundation (N.S.F.) in the United states. Fundacion Gran Mariscal Ayacucho in Venezuela
Stochastic suprasegmentals: relationships between redundancy, prosodic structure and care of articulation in spontaneous speech
Within spontaneous speech there are wide variations in the articulation of the same word by the same speaker. This paper explores two related factors which influence variation in articulation, prosodic structure and redundancy. We argue that the constraint of producing robust communication while efficiently expending articulatory effort leads to an inverse relationship between language redundancy and care of articulation. The inverse relationship improves robustness by spreading the information more evenly across the speech signal leading to a smoother signal redundancy profile. We argue that prosodic prominence is a linguistic means of achieving smooth signal redundancy. Prosodic prominence increases care of articulation and coincides with unpredictable sections of speech. By doing so, prosodic prominence leads to a smoother signal redundancy. Results confirm the strong relationship between prosodic prominence and care of articulation as well as an inverse relationship between langu..
Towards the Automatic Analysis of Metric Modulations
PhDThe metrical structure is a fundamental aspect of music, yet its automatic analysis
from audio recordings remains one of the great challenges of Music Information Retrieval
(MIR) research. This thesis is concerned with addressing the automatic analysis
of changes of metrical structure over time, i.e. metric modulations. The evaluation of
automatic musical analysis methods is a critical element of the MIR research and is
typically performed by comparing the machine-generated estimates with human expert
annotations, which are used as a proxy for ground truth. We present here two new
datasets of annotations for the evaluation of metrical structure and metric modulation
estimation systems. Multiple annotations allowed for the assessment of inter-annotator
(dis)agreement, thereby allowing for an evaluation of the reference annotations used to
evaluate the automatic systems. The rhythmogram has been identified in previous research
as a feature capable of capturing characteristics of rhythmic content of a music
recording. We present here a direct evaluation of its ability to characterise the metrical
structure and as a result we propose a method to explicitly extract metrical structure
descriptors from it. Despite generally good and increasing performance, such rhythm
features extraction systems occasionally fail. When unpredictable, the failures are a
barrier to usability and development of trust in MIR systems. In a bid to address this
issue, we then propose a method to estimate the reliability of rhythm features extraction.
Finally, we propose a two-fold method to automatically analyse metric modulations from
audio recordings. On the one hand, we propose a method to detect metrical structure
changes from the rhythmogram feature in an unsupervised fashion. On the other hand,
we propose a metric modulations taxonomy rooted in music theory that relies on metrical
structure descriptors that can be automatically estimated. Bringing these elements
together lays the ground for the automatic production of a musicological interpretation
of metric modulations.EPSRC award 1325200 and Omnifone Ltd
Recommended from our members
Hardward and algorithm architectures for real-time additive synthesis
Additive synthesis is a fundamental computer music synthesis paradigm tracing its origins to the work of Fourier and Helmholtz. Rudimentary implementation linearly combines harmonic sinusoids (or partials) to generate tones whose perceived timbral characteristics are a strong function of the partial amplitude spectrum. Having evolved over time, additive synthesis describes a collection of algorithms each characterised by the time-varying linear combination of basis components to generate temporal evolution of timbre. Basis components include exactly harmonic partials, inharmonic partials with time-varying frequency or non-sinusoidal waveforms each with distinct spectral characteristics. Additive synthesis of polyphonic musical instrument tones requires a large number of independently controlled partials incurring a large computational overhead whose investigation and reduction is a key motivator for this work. The thesis begins with a review of prevalent synthesis techniques setting additive synthesis in context and introducing the spectrum modelling paradigm which provides baseline spectral data to the additive synthesis process obtained from the analysis of natural sounds. We proceed to investigate recursive and phase accumulating digital sinusoidal oscillator algorithms, defining specific metrics to quantify relative performance. The concepts of phase accumulation, table lookup phase-amplitude mapping and interpolated fractional addressing are introduced and developed and shown to underpin an additive synthesis subclass - wavetable lookup synthesis (WLS). WLS performance is simulated against specific metrics and parameter conditions peculiar to computer music requirements. We conclude by presenting processing architectures which accelerate computational throughput of specific WLS operations and the sinusoidal additive synthesis model. In particular, we introduce and investigate the concept of phase domain processing and present several âpipeline friendlyâ arithmetic architectures using this technique which implement the additive synthesis of sinusoidal partials
Applications of Dynamical Systems to Music Composition
Mathematics and music have long enjoyed a close working relationship: mathematicians have frequently taken an interest in the organisational principles used in music, while musicians often utilise mathematical formalisms and structures in their works. This relationship has thrived in recent years, particularly since the advent of the computer, which has allowed mathematicians and musicians alike to explore the creative aspects of various mathematical structures quickly and easily. One class of mathematical structure that is of particular interest to the technologically-minded musician is the class of dynamical systems - those that change some feature with time. This class includes fractal zooms, evolutionary computing techniques and cellular automata, each of which holds some potential as the basis of a composition algorithm. The studies that comprise this thesis were undertaken in order to further examine the relationship between mathematics and music. In particular we explore the notion that music can essentially be thought of as a type of pattern propagation: we begin with initial themes and motifs - the musical patterns - which, during the course of the composition, are subjected to certain transformations and developments according to the rules dictated by the composer or the musical form. This is exactly analogous to the process which occurs within a cellular automaton: initial configurations of cells are transformed and developed according to a set of evolution rules. We begin our study by describing the development of the CAMUS v2.0 composition software, which was based on an earlier system by Dr. Eduardo Miranda, and discuss how best to use the system to compose new musical works. The next step in our study is concerned with highlighting the limitations of CAMUS as it currently stands, and suggesting techniques for improving the capabilities of the system. We then chart the development of CAMUS 3D. At each stage we justify the changes made to the system using both aesthetic and technical arguments. We also provide a composition example, which illustrates not only the changes in operation, but also in interface. The system is then re-evaluated, and further developments are suggested
Perceptual models in speech quality assessment and coding
The ever-increasing demand for good communications/toll
quality speech has created a renewed interest into the
perceptual impact of rate compression. Two general areas are
investigated in this work, namely speech quality assessment
and speech coding.
In the field of speech quality assessment, a model is
developed which simulates the processing stages of the
peripheral auditory system. At the output of the model a
"running" auditory spectrum is obtained. This represents
the auditory (spectral) equivalent of any acoustic sound such
as speech. Auditory spectra from coded speech segments serve
as inputs to a second model. This model simulates the
information centre in the brain which performs the speech
quality assessment. [Continues.