736 research outputs found
The wisdom of crowds versus the madness of crowds
Declining trust in northern liberal democratic institutions poses serious challenges to legislatures (parliaments). That mistrust extends to traditional media at a time when new digital media are fanning ‘fake news’ and a ‘madness of crowds’. Will the ‘wisdom of crowds’ on which liberal democracy critically depends prevail over the ‘madness’? Can parliaments resolve that tension positively? In New Zealand trust in political institutions is still high, but voter turnout has slid, especially among the young. Parliament has work to do
A Comparison Between Convolutional and Transformer Architectures for Speech Emotion Recognition
© 2022, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This is the accepted manuscript version of a conference paper which has been published in final form at https://doi.org/10.1109/IJCNN55064.2022.9891882 Creating speech emotion recognition models comparable to the capability of how humans recognise emotions is a long-standing challenge in the field of speech technology with many potential commercial applications. As transformer-based architectures have recently become the state-of-the-art for many natural language processing related applications, this paper investigates their suitability for acoustic emotion recognition and compares them to the well-known AlexNet convolutional approach. This comparison is made using several publicly available speech emotion corpora. Experimental results demonstrate the efficacy of the different architectural approaches for particular emotions. The results show that the transformer-based models outperform their convolutional counterparts yielding F1-scores in the range [70.33%, 75.76%]. This paper further provides insights via dimensionality reduction analysis of output layer activations in both architectures and reveals significantly improved clustering in transformer-based models whilst highlighting the nuances with regard to the separability of different emotion classes
Active Learning for Auditory Hierarchy
Much audio content today is rendered as a static stereo mix: fundamentally a fixed single entity. Object-based audio envisages the delivery of sound content using a collection of individual sound ‘objects’ controlled by accompanying metadata. This offers potential for audio to be delivered in a dynamic manner providing enhanced audio for consumers. One example of such treatment is the concept of applying varying levels of data compression to sound objects thereby reducing the volume of data to be transmitted in limited bandwidth situations. This application motivates the ability to accurately classify objects in terms of their ‘hierarchy’. That is, whether or not an object is a foreground sound, which should be reproduced at full quality if possible, or a background sound, which can be heavily compressed without causing a deterioration in the listening experience. Lack of suitably labelled data is an acknowledged problem in the domain. Active Learning is a method that can greatly reduce the manual effort required to label a large corpus by identifying the most effective instances to train a model to high accuracy levels. This paper compares a number of Active Learning methods to investigate which is most effective in the context of a hierarchical labelling task on an audio dataset. Results show that the number of manual labels required can be reduced to 1.7% of the total dataset while still retaining high prediction accuracy
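As a rough illustration of how Active Learning selects "the most effective instances" to label, the sketch below shows least-confidence uncertainty sampling, one common query strategy. The abstract does not say which strategies the paper compares, so this choice, the function name `least_confident`, and the probability values are all assumptions made for illustration only.

```python
def least_confident(probabilities, k):
    """Return indices of the k unlabelled items the model is least sure about."""
    # Confidence of an item = predicted probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank item indices from least to most confident and keep the first k.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical [foreground, background] probabilities for four unlabelled sound objects.
pool = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.51, 0.49]]

# Indices of the two objects that would be sent to a human annotator next.
queries = least_confident(pool, k=2)
```

In a full loop, the selected items would be labelled manually, added to the training set, and the model retrained before the next round of queries.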
Sonic Elongation: Creative Audition in Documentary Film
This paper investigates documentary films in which real-world sound captured from the location shoot has been treated more creatively than the captured image; in particular, instances when real-world noises pass freely between sound and musical composition. I call this process the sonic elongation from sound to music; a blurring that allows the soundtrack to keep one foot in the image, thus allowing the film to retain a loose grip on the traditional nonfiction aesthetic. With reference to several recent documentary feature films, I argue that such moments rely on a confusion between hearing and listening
Guanabara Bay: For all hopes to a new awakening of paradise
Exclusion Territories are geographical areas under the action of degenerative environmental phenomena of anthropogenic origin, which compromise quality of life in general. One of the greatest examples of such areas is Guanabara Bay and its surroundings, the scene of some of the worst disasters and of frequent episodes of human misery. This article presents a brief description of the main characteristics of the region and offers some technological suggestions for biogeographic recovery, to be adopted by public policies that intend to align themselves with good practice in ecological economics, sustainability and quality of life. The work falls within the context of macro-engineering cum eco-innovation applied to the preservation and management of water sources and water bodies that serve productive purposes as natural niches and breeding grounds. Key words: Exclusion Territories, Guanabara Bay, waste management, quality of life.
Conditioning Text-to-Speech synthesis on dialect accent: a case study
Modern text-to-speech systems are modular in many different ways. In recent years, end-users gained the ability to control speech attributes such as degree of emotion, rhythm and timbre, along with other suprasegmental features. More ambitious objectives are related to modelling a combination of speakers and languages, e.g. to enable cross-speaker language transfer. However, no prior work has been done on the more fine-grained analysis of regional accents. To fill this gap, in this thesis we present practical end-to-end solutions to synthesise speech while controlling within-country variations of the same language, and we do so for six different dialects of the British Isles. In particular, we first conduct an extensive study of the speaker verification field and tweak state-of-the-art embedding models to work with dialect accents. Then, we adapt standard acoustic models and voice conversion systems by conditioning them on dialect accent representations and finally compare our custom pipelines with a cutting-edge end-to-end architecture from the multi-lingual world. Results show that the adopted models are suitable and have enough capacity to accomplish the task of regional accent conversion. Indeed, we are able to produce speech closely resembling the selected speaker and dialect accent, where the most accurate synthesis is obtained via careful fine-tuning of the multi-lingual model to the multi-dialect case. Finally, we delineate limitations of our multi-stage approach and propose practical mitigations, to be explored in future work
The Biometric Evolution of Sound and Space
Auditoria in the late 20th and 21st centuries have evolved into a series of spatial conventions that are an established and accepted norm. The relationship between space and music now exists in a decoupled condition, and music is no longer reliant on volumetric and material conditions to define its form (Glantz 2000).
This thesis looks at a series of novel approaches to investigate how the links between music and space can be reconnected through evolutionary computation, parametric modelling, virtual acoustics and biometric sensing. The thesis describes in detail the experiments undertaken in developing methodologies for linking music, space and the body.
The thesis will show how it is possible to develop new form finding and musical generation tools that allow new room shapes and acoustic measures to inform how new acoustic and musical forms can be developed unconsciously and objectively by a listener, in response to sound and site
Designing for quality in real-world mobile crowdsourcing systems
PhD Thesis. Crowdsourcing has emerged as a popular means to collect and analyse data at scale for
problems that require human intelligence to resolve. Its prompt response and low cost have
made it attractive to businesses and academic institutions. In response, various online
crowdsourcing platforms, such as Amazon MTurk, Figure Eight and Prolific have successfully
emerged to facilitate the entire crowdsourcing process. However, the quality of results has
been a major concern in crowdsourcing literature. Previous work has identified various key
factors that contribute to issues of quality and need to be addressed in order to produce high
quality results. Crowd task design, in particular, is a key factor that affects the
efficiency and effectiveness of crowd workers as well as the entire crowdsourcing process.
This research investigates crowdsourcing task designs to collect and analyse two distinct types
of data, and examines the value of creating high-quality crowdwork activities on new
crowdsourcing-enabled systems for end-users. The main contributions of this research include 1)
a set of guidelines for designing crowdsourcing tasks that support quality collection, analysis
and translation of speech and eye tracking data in real-world scenarios; and 2) crowdsourcing
applications that capture real-world data and coordinate the entire crowdsourcing process to
analyse and feed quality results back. Furthermore, this research proposes a new quality control
method based on workers' trust and self-verification. To achieve this, the research follows the
case study approach with a focus on two real-world data collection and analysis case studies.
The first case study, Speeching, explores real-world speech data collection, analysis, and
feedback for people with speech disorders, particularly those with Parkinson’s. The second case study,
CrowdEyes, examines the development and use of a hybrid system combining crowdsourcing
and low-cost DIY mobile eye trackers for real-world visual data collection, analysis, and
feedback. Both case studies have established the capability of crowdsourcing to obtain high
quality responses comparable to that of an expert. The Speeching app, and the provision of
feedback in particular, were well received by the participants. This opens up new opportunities
in digital health and wellbeing. In addition, the proposed crowd-powered eye tracker is fully
functional under real-world settings. The results showed how this approach outperforms all
current state-of-the-art algorithms under all conditions, which opens up the technology for a wide
variety of eye tracking applications in real-world settings
The quality of experience of next generation audio: exploring system, context and human influence factors
PhD Thesis. The next generation of audio reproduction technology has the potential to deliver
immersive and personalised experiences to the user; multichannel with-height loudspeaker
arrays and binaural techniques offer 3D audio experiences, whereas object-based
techniques offer possibilities of adapting content to suit the system, context
and user. A fundamental process in the advancement of such technology is perceptual
evaluation. It is crucial to understand how listeners perceive new technology in
order to drive future developments. This thesis explores the experience provided by
next generation audio technology by taking a quality of experience (QoE) approach
to evaluation. System, context and human factors all influence QoE and in this thesis
three case studies are presented to explore the role of these categories of influence factors
(IFs) in the context of next generation audio evaluation. Furthermore, these case
studies explore suitable methods and approaches for the evaluation of the QoE of
next generation audio with respect to its various IFs. Specific contributions delivered
from these individual studies include a subjective comparison between soundbar and
discrete surround sound technology, the application of the Open Profiling of Quality
method to the field of audio evaluation, an understanding of both how and why environmental
noise influences preferred audio object balance, an understanding of how
the influence of technical audio quality on overall listening experience is related to
a range of psychographic variables and an assessment of the impact of binaural processing
on overall listening experience. When considering these studies as a whole,
the research presented here contributes the thesis that to effectively evaluate the perceived
quality of next generation audio, a QoE mindset should be taken that considers
system, context and human IFs.
Engineering and Physical Sciences Research Council (EPSRC) and the British Broadcasting Corporation
Research & Development department (BBC R&D)
- …