56 research outputs found
Music Emotion Recognition based on Feature Combination, Deep Learning and Chord Detection
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. As one of the most classic human inventions, music appears in many artworks, such as songs, films and theatre. It can be seen as another language, used to express the author's thoughts and emotions: in many cases, music conveys both the meaning the author hopes to express and the feeling the audience experiences. However, the emotions that arise while enjoying music are complex and difficult to explain precisely. Music Emotion Recognition (MER), the task of recognising emotions in music, is therefore an interesting research topic in the field of artificial intelligence. Recognition methods and tools for music signals have grown rapidly in recent years, and with developments in signal processing, machine learning and algorithm optimisation, recognition accuracy continues to improve. This thesis focuses on three significant parts of MER, namely features, learning methods and music emotion theory, to explain and illustrate how to build effective MER systems. Firstly, an automatic MER system for classifying four emotions was proposed, in which OpenSMILE was used for feature extraction and the IS09 feature set was selected. After combination with STAT statistical features, a Random Forest classifier produced the best performance, outperforming previous systems. Under suitable parameter settings, this approach to feature selection and machine learning improved MER accuracy by at least 3.5% over other feature combinations, with the new combination of IS09 and STAT features reaching 83.8% accuracy. Secondly, another MER system for four emotions was proposed based on the dynamic properties of music signals, in which features are extracted from segments of the music signal instead of the whole recording in the APM database. A Long Short-Term Memory (LSTM) deep learning model was then used for classification.
The model can exploit the dynamic, continuous information between different time-frame segments for more effective emotion recognition. However, the final performance reached only 65.7%, which was not as good as expected. The reason may be that the database does not suit the LSTM as well as initially thought: the information between segments may not be sufficient to improve recognition performance compared with traditional methods. This shows that a complex deep learning method is not suitable for every database, since the LSTM dynamic deep learning method did not work well on this continuous database. Finally, the research targeted recognising emotion through the identification of chords, as particular chords carry emotional information according to previous theoretical work. The research started by building a new chord database, using Adobe Audition to extract chord clips from piano chord teaching audio. FFT features based on 1000-point sampled pre-processed data, together with STAT features, were then extracted for the selected samples from the database. After calculation and comparison using Euclidean distance and correlation, the results showed that the STAT features work well for most chords, with the exception of the augmented chord. This new approach of recognising six emotions from music was used for the first time in this research and achieved 75% accuracy in chord identification. In summary, the research proposed new MER methods through three different approaches; some achieved good recognition performance and some have broad application prospects.
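The chord-matching step can be pictured as a nearest-template search. The sketch below is a minimal illustration assuming toy feature vectors (the names and numbers are invented, not the thesis's data), using the Euclidean-distance measure; the correlation measure compared in the thesis would instead rank templates by similarity.

```python
import math

def match_chord(query, templates):
    """Return the chord label whose stored feature vector lies closest
    to the query clip's features by Euclidean distance."""
    return min(templates, key=lambda name: math.dist(query, templates[name]))

# Toy per-chord templates standing in for STAT feature vectors; the
# values are illustrative, not taken from the thesis's chord database.
templates = {
    "major": [0.8, 0.1, 0.3],
    "minor": [0.2, 0.7, 0.4],
    "augmented": [0.5, 0.5, 0.9],
}

query = [0.75, 0.15, 0.35]            # features of an unknown chord clip
print(match_chord(query, templates))  # nearest template here is "major"
```

A correlation-based variant would compute a similarity score per template and take the maximum instead of the minimum distance.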
Platform, culture, identities: exploring young people's game-making
Digital games are an important component of the contemporary media landscape. They are cultural artefacts and, as such, are subject to specific conventions. These conventions shape our imaginary about games, defining, for example, what a game is, who can play them and where. Different strands of research have been developed to understand and challenge these conventions, and one of the strategies often adopted is fostering game-making among "gaming minorities". By popularising games and their means of production, critical skills towards these objects could be developed, these conventions could be contested, and our perceptions of those artefacts could be transformed. Nevertheless, digital games, as obvious as it sounds, are also digital: they depend on technology to exist and are subject to different technologies' affordances and constraints. Technologies, however, are not neutral and objective, but are also cultural: they too are influenced by values and conventions. This means that, even if the means of production of digital games are distributed among more diverse groups, we should not ignore the role played by technology in shaping our imaginary about games. Cultural and technical aspects of digital media are not, therefore, as conflicting as it might seem, finding themselves entangled in digital games. They are also equally influential in our understanding and our cultural uses of these artefacts; but how influential are they? How easily can one go against cultural and technical conventions when producing a game as a non-professional? Can anyone make any kind of game? In this research, I explore young people's game-making practices in non-professional contexts to understand how repertoires, gaming conventions, and platform affordances and constraints can influence this creative process.
I organised two different game-making clubs for young people in London, UK (one at a community-led centre for Latin American migrants and the other at a comprehensive primary school). The clubs consisted of a series of workshops offered on a weekly basis, totalling a minimum of 12 hours of instruction and production at each research site. The participants were aged between 11 and 18 and produced a total of 11 games across the two sites with MissionMaker, software that facilitates the creation of 3D games by non-specialists through ready-made 3D assets, custom audio and image files, and a simplified drop-down-list-based scripting language. Three games and their production teams were selected as case studies and investigated through qualitative methods under a descriptive-interpretive approach. Throughout the game-making clubs, short surveys, observations, unstructured and semi-structured interviews and a game archive (with week-by-week saves of participants' games) were employed to generate data, which was then analysed through a Multimodal Sociosemiotics framework to explore how cultural and technical conventions were appropriated by participants during this experience. Discourses, gaming conventions and MissionMaker's affordances and constraints were appropriated in different ways by participants in the process of game production, culminating in the realisation of different discourses and the construction of diverse identities. These results are relevant because they restate the value of a more holistic approach, one that looks at both culture and technology, to critical videogame production within non-professional contexts. They are also useful for mapping the influence of repertoires, conventions and platforms in non-professional game-making contexts, highlighting how these elements are influential but not prescriptive of the games produced, and how game development processes within these contexts are better understood as dialogical.
One Deep Music Representation to Rule Them All? A comparative analysis of different representation learning strategies
Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also efficient manner, deep transfer learning has become a common approach. In this approach, it is possible to reuse the output of a pre-trained neural network as the basis for a new learning task. The underlying hypothesis is that if the initial and new learning tasks show commonalities and are applied to the same type of input data (e.g. music audio), the generated deep representation of the data is also informative for the new task. Since, however, most of the networks used to generate deep representations are trained using a single initial learning source, their representation is unlikely to be informative for all possible future tasks. In this paper, we present the results of our investigation of what are the most important factors to generate deep representations for the data and learning tasks in the music domain. We conducted this investigation via an extensive empirical study that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations. We then validate these representations considering multiple target datasets for evaluation. The results of our experiments yield several insights on how to approach the design of methods for learning widely deployable deep data representations in the music domain.
Comment: This work has been accepted to "Neural Computing and Applications: Special Issue on Deep Learning for Music and Audio".
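The deep transfer learning setup the abstract describes, reusing a pretrained network's output as the basis for a new task, can be sketched as follows. The frozen encoder, data and labels are toy stand-ins, not the paper's architectures or learning sources.

```python
import random

random.seed(0)

# Stand-in for a pretrained network: a frozen feature map whose output
# ("deep representation") is reused for a new task. Hypothetical, not
# one of the paper's actual architectures.
def encoder(x0, x1):
    return [max(0.0, x0 + x1), max(0.0, x0 - x1), 1.0]  # ReLU units + bias

# New-task data: toy points labelled by the sign of x0 + x1, filtered to
# keep a margin so the task is cleanly learnable from the representation.
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(300)]
points = [p for p in points if abs(p[0] + p[1]) > 0.2]
labels = [1 if x0 + x1 > 0 else 0 for x0, x1 in points]

# Transfer learning: only a lightweight linear head is trained
# (perceptron updates); the encoder itself is never modified.
w = [0.0, 0.0, 0.0]
for _ in range(20):
    for (x0, x1), y in zip(points, labels):
        h = encoder(x0, x1)
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) > 0 else 0
        w = [wi + 0.1 * (y - pred) * hi for wi, hi in zip(w, h)]

accuracy = sum(
    (1 if sum(wi * hi for wi, hi in zip(w, encoder(x0, x1))) > 0 else 0) == y
    for (x0, x1), y in zip(points, labels)
) / len(points)
print(f"head-only training accuracy: {accuracy:.2f}")
```

The paper's question then becomes: which initial learning sources and architectures produce an encoder whose frozen representation transfers well across many such downstream tasks.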
Understanding Agreement and Disagreement in Listeners' Perceived Emotion in Live Music Performance
Emotion perception of music is subjective and time dependent. Most computational music emotion recognition (MER) systems overlook time- and listener-dependent factors by averaging emotion judgments across listeners. In this work, we investigate the influence of music, setting (live vs lab vs online), and individual factors on music emotion perception over time. In an initial study, we explore changes in perceived music emotions among audience members during live classical music performances. Fifteen audience members used a mobile application to annotate time-varying emotion judgments based on the valence-arousal model. Inter-rater reliability analyses indicate that consistency in emotion judgments varies significantly across rehearsal segments, with systematic disagreements in certain segments. In a follow-up study, we examine listeners' reasons for their ratings in segments with high and low agreement. We relate these reasons to acoustic features and individual differences. Twenty-one listeners annotated perceived emotions while watching a recorded video of the live performance. They then reflected on their judgments and provided explanations retrospectively. Disagreements were attributed to listeners attending to different musical features or being uncertain about the expressed emotions. Emotion judgments were significantly associated with personality traits, gender, cultural background, and music preference. Thematic analysis of explanations revealed cognitive processes underlying music emotion perception, highlighting attributes less frequently discussed in MER studies, such as instrumentation, arrangement, musical structure, and multimodal factors related to performer expression. Exploratory models incorporating these semantic features and individual factors were developed to predict perceived music emotion over time. 
Regression analyses confirmed the significance of listener-informed semantic features as independent variables, with individual factors acting as moderators between loudness, pitch range, and arousal. In our final study, we analyzed the effects of individual differences on music emotion perception among 128 participants with diverse backgrounds. Participants annotated perceived emotions for 51 piano performances of different compositions from the Western canon, spanning various eras. Linear mixed-effects models revealed significant variations in valence and arousal ratings, as well as in the frequency of emotion ratings, with regard to several individual factors: music sophistication, music preferences, personality traits, and mood states. Additionally, participants' ratings of arousal, valence, and emotional agreement were significantly associated with the historical time periods of the examined clips. This research highlights the complexity of music emotion perception, revealing it to be a dynamic, individual and context-dependent process. It paves the way for the development of more individually nuanced, time-based models in music psychology, opening up new avenues for personalised music emotion recognition and recommendation, music emotion-driven generation, and therapeutic applications.
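The moderation effect reported above, in which individual factors moderate the link between loudness and arousal, can be illustrated with a simple-slopes sketch on synthetic data; the "sophistication" grouping and every number below are invented purely for illustration.

```python
import random

random.seed(0)

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

# Synthetic listeners: a toy individual factor ("sophistication") moderates
# how strongly loudness drives arousal ratings. Invented generative model.
rows = []
for _ in range(400):
    loud = random.uniform(-1, 1)
    soph = random.choice([0.0, 1.0])                    # low vs high group
    arousal = (0.2 + 0.6 * soph) * loud + random.gauss(0, 0.05)
    rows.append((loud, soph, arousal))

low = [(l, a) for l, s, a in rows if s == 0.0]
high = [(l, a) for l, s, a in rows if s == 1.0]

slope_low = slope([l for l, _ in low], [a for _, a in low])
slope_high = slope([l for l, _ in high], [a for _, a in high])
print(f"loudness->arousal slope: low={slope_low:.2f}, high={slope_high:.2f}")
```

In a regression framework the same moderation appears as a significant loudness-by-sophistication interaction term; here it shows up directly as the gap between the two subgroup slopes.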
Handbook of Stemmatology
Stemmatology studies aspects of textual criticism that use genealogical methods. This handbook is the first to cover the entire field, encompassing both theoretical and practical aspects, ranging from traditional to digital methods. Authors from all the disciplines involved examine topics such as the material aspects of text traditions, methods of traditional textual criticism and their genesis, and modern digital approaches used in the field.
Music Encoding Conference Proceedings 2021, 19–22 July 2021, University of Alicante (Spain): Onsite & Online
This document includes the articles and posters presented at the Music Encoding Conference 2021, held in Alicante from 19 to 22 July 2021. Funded by project Multiscore, MCIN/AEI/10.13039/50110001103
2018 FSDG Combined Abstracts
Low-resource speech translation
We explore the task of speech-to-text translation (ST), where speech in one language
(source) is converted to text in a different one (target). Traditional ST systems go
through an intermediate step where the source language speech is first converted to
source language text using an automatic speech recognition (ASR) system, which
is then converted to target language text using a machine translation (MT) system.
However, this pipeline-based approach is impractical for unwritten languages spoken by
millions of people around the world, leaving them without access to free and automated
translation services such as Google Translate. The lack of such translation services can
have important real-world consequences. For example, in the aftermath of a disaster
scenario, easily available translation services can help better co-ordinate relief efforts.
How can we expand the coverage of automated ST systems to include scenarios which
lack source language text? In this thesis we investigate one possible solution: we
build ST systems to directly translate source language speech into target language text,
thereby forgoing the dependency on source language text. To build such a system, we
use only speech data paired with text translations as training data. We also specifically
focus on low-resource settings, where we expect at most tens of hours of training data
to be available for unwritten or endangered languages.
Our work can be broadly divided into three parts. First we explore how we can leverage
prior work to build ST systems. We find that neural sequence-to-sequence models are
an effective and convenient method for ST, but produce poor quality translations when
trained in low-resource settings.
In the second part of this thesis, we explore methods to improve the translation performance
of our neural ST systems which do not require labeling additional speech
data in the low-resource language, a potentially tedious and expensive process. Instead
we exploit labeled speech data for high-resource languages which is widely available
and relatively easier to obtain. We show that pretraining a neural model with ASR data
from a high-resource language, different from both the source and target ST languages,
improves ST performance.
In the final part of our thesis, we study whether ST systems can be used to build
applications which have traditionally relied on the availability of ASR systems, such
as information retrieval, clustering audio documents, or question answering. We build
proof-of-concept systems for two downstream applications: topic prediction for speech
and cross-lingual keyword spotting. Our results indicate that low-resource ST systems
can still outperform simple baselines for these tasks, leaving the door open for further
exploratory work.
This thesis provides, for the first time, an in-depth study of neural models for the
task of direct ST across a range of training data settings on a realistic multi-speaker
speech corpus. Our contributions include a set of open-source tools to encourage further
research.
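The contrast between the cascaded pipeline and direct ST at the heart of the thesis can be sketched as follows; the lookup-table "models" are hypothetical stand-ins for trained components, not the systems actually built in this work.

```python
# Toy lookup-table "models": hypothetical stand-ins for trained ASR,
# MT and direct ST components; "speech" is a tuple of acoustic tokens.
asr_model = {("s1", "s2"): "bonjour le monde"}   # speech -> source text
mt_model = {"bonjour le monde": "hello world"}   # source -> target text
st_model = {("s1", "s2"): "hello world"}         # speech -> target text

def cascade_st(speech):
    """Traditional pipeline: ASR into source-language text, then MT.
    Unusable when the source language has no written form."""
    return mt_model[asr_model[speech]]

def direct_st(speech):
    """Direct ST: speech straight to target-language text, trainable
    from speech paired only with its translations."""
    return st_model[speech]

speech = ("s1", "s2")
print(cascade_st(speech), "|", direct_st(speech))
```

The key asymmetry is in the training data each path requires: the cascade needs source-language transcripts to train its ASR component, while the direct model needs only speech paired with target-language translations.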
A Critical Look at the Music Classification Experiment Pipeline: Using Interventions to Detect and Account for Confounding Effects
PhD Thesis
This dissertation focuses on the problem of confounding in the design and analysis of music
classification experiments. Classification experiments dominate evaluation of music
content analysis systems and methods, but achieving high performance on such experiments
does not guarantee systems properly address the intended problem. The research
presented here proposes and illustrates modifications to the conventional experimental
pipeline, which aim at improving the understanding of the evaluated systems and methods,
facilitating valid conclusions on their suitability for the target problem.
Firstly, multiple analyses are conducted to determine which cues scattering-based systems
use to predict the annotations of the GTZAN music genre collection. In-depth system
analysis informs empirical approaches that alter the experimental pipeline. In particular,
deflation manipulations and targeted interventions on the partitioning strategy,
the learning algorithm and the frequency content of the data reveal that systems using
scattering-based features exploit faults in GTZAN and previously unknown information
at inaudible frequencies.
Secondly, the use of interventions on the experimental pipeline is extended and systematised
into a procedure for characterising the effects of confounding information in the
results of classification experiments. Regulated bootstrap, a novel resampling strategy,
is proposed to address challenges associated with interventions dealing with partitioning.
The procedure is demonstrated on GTZAN, analysing the effect of artist replication
and infrasonic information on performance measurements using a wide range of
system-construction methods.
Finally, mathematical models relating measurements from classification experiments
and potentially contributing factors are proposed and discussed. Such models enable decomposing
measurements into contributions of interest, which may differ depending on
the goals of the study, including those from pipeline interventions. The adequacy of some
conventional assumptions underlying such models for classification experiments is also
examined.
The reported research highlights the need for evaluation procedures that go beyond
performance maximisation. Accounting for the effects of confounding information using
procedures grounded on the principles of experimental design promises to facilitate the
development of systems that generalise beyond the restricted experimental settings.
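One family of interventions discussed above, those targeting the partitioning strategy, can be sketched in its simplest form as an artist-filtered split; the track data below are invented, and this sketch is not the dissertation's regulated bootstrap procedure.

```python
# Hedged sketch of a partitioning intervention: an artist-filtered split
# ensures no artist appears in both train and test, so a classifier
# cannot exploit artist identity as a confound (cf. artist replication
# in GTZAN). Track data are invented for illustration.
def artist_filtered_split(tracks, test_artists):
    """Partition tracks so the train and test sets share no artists."""
    train = [t for t in tracks if t["artist"] not in test_artists]
    test = [t for t in tracks if t["artist"] in test_artists]
    return train, test

tracks = [
    {"id": 1, "artist": "A", "genre": "jazz"},
    {"id": 2, "artist": "A", "genre": "jazz"},
    {"id": 3, "artist": "B", "genre": "rock"},
    {"id": 4, "artist": "C", "genre": "rock"},
]
train, test = artist_filtered_split(tracks, {"B"})
print(len(train), len(test))  # 3 1
```

Comparing performance under random and artist-filtered partitions is one way to expose how much of a measured accuracy is owed to the confound rather than to the intended genre cues.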