Generating Labels for Regression of Subjective Constructs using Triplet Embeddings
Human annotations serve an important role in computational models where the
target constructs under study are hidden, such as dimensions of affect. This is
especially relevant in machine learning, where subjective labels derived from
related observable signals (e.g., audio, video, text) are needed to support
model training and testing. Current research trends focus on correcting
artifacts and biases introduced by annotators during the annotation process
while fusing them into a single annotation. In this work, we propose a novel
annotation approach using triplet embeddings. By lifting the absolute
annotation process to relative annotations where the annotator compares
individual target constructs in triplets, we leverage the accuracy of
comparisons over absolute ratings by human annotators. We then build a
1-dimensional embedding in Euclidean space that is indexed in time and serves
as a label for regression. In this setting, the annotation fusion occurs
naturally as a union of sets of sampled triplet comparisons among different
annotators. We show that by using our proposed sampling method to find an
embedding, we are able to accurately represent synthetic hidden constructs in
time under noisy sampling conditions. We further validate this approach using
human annotations collected from Mechanical Turk and show that we can recover
the underlying structure of the hidden construct up to bias and scaling
factors. Comment: 9 pages, 5 figures, accepted journal paper
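A minimal sketch of the triplet idea described in the abstract above, assuming a triplet (i, j, k) records the judgment "frame i is closer to frame j than to frame k". The function names, the noise model (a random flip with probability `flip_prob`), and the synthetic sine signal are illustrative assumptions, not the authors' implementation. The sketch shows annotation fusion as a set union and why triplets can only determine a signal up to bias and scaling.

```python
import math
import random

def sample_triplets(signal, n, rng, flip_prob=0.0):
    """Sample n triplets (i, j, k): 'frame i is closer to j than to k'."""
    triplets = set()
    while len(triplets) < n:
        i, j, k = rng.sample(range(len(signal)), 3)
        d_ij = abs(signal[i] - signal[j])
        d_ik = abs(signal[i] - signal[k])
        if d_ij == d_ik:
            continue  # skip ties, which carry no ordering information
        if d_ij > d_ik:
            j, k = k, j  # order the pair so that j is the closer frame
        if rng.random() < flip_prob:
            j, k = k, j  # simulate an annotator error (assumed noise model)
        triplets.add((i, j, k))
    return triplets

def agreement(embedding, triplets):
    """Fraction of triplets that a 1-D embedding satisfies."""
    satisfied = sum(
        abs(embedding[i] - embedding[j]) < abs(embedding[i] - embedding[k])
        for i, j, k in triplets
    )
    return satisfied / len(triplets)

# Synthetic hidden construct indexed in time.
hidden = [math.sin(0.2 * t) for t in range(50)]

# Three simulated noisy annotators; fusion is the union of their triplet sets.
annotators = [sample_triplets(hidden, 200, random.Random(seed), flip_prob=0.1)
              for seed in range(3)]
fused = set().union(*annotators)

# A biased and rescaled copy satisfies the same triplets as the original.
shifted = [2.0 * x + 4.0 for x in hidden]
print(agreement(hidden, fused), agreement(shifted, fused))
```

Because only comparisons of distances are stored, any positive rescaling and any constant offset of the embedding leave the agreement score unchanged, which is consistent with the abstract's statement about recovery up to bias and scaling factors.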
Methodological considerations concerning manual annotation of musical audio in function of algorithm development
In research on musical audio-mining, annotated music databases are needed to support the development of computational tools that extract from the musical audio stream the kind of high-level content that users can deal with in Music Information Retrieval (MIR) contexts. The notion of musical content, and therefore the notion of annotation, is however ill-defined in both the syntactic and the semantic sense. As a consequence, annotation has been approached from a variety of perspectives (mainly linguistic-symbolic ones), and a general methodology is lacking. This paper is a step towards the definition of a general framework for manual annotation of musical audio in function of a computational approach to musical audio-mining that is based on algorithms that learn from annotated data.
Music emotion recognition: a multimodal machine learning approach
Music emotion recognition (MER) is an emerging domain within the Music Information Retrieval (MIR) community, and searching for music by emotion is one of the selection methods most preferred by web users. As the world goes digital, the musical content in online databases such as Last.fm has expanded exponentially, and managing and updating it requires substantial manual effort. Demand for innovative and adaptable search mechanisms that can be personalized to users' emotional state has therefore grown in recent years. This thesis addresses the music emotion recognition problem by presenting several classification models fed with textual features as well as audio attributes extracted from the music. We build both supervised and semi-supervised classification designs in four research experiments that address the emotional role of audio features such as tempo, acousticness, and energy, and the impact of textual features extracted by two different approaches, TF-IDF and Word2Vec. Furthermore, we propose a multimodal approach that uses a combined feature set consisting of features from the audio content as well as from context-aware data. For this purpose, we generated a ground-truth dataset containing over 1,500 labeled song lyrics, together with a large unlabeled corpus of more than 2.5 million Turkish documents, in order to build an accurate automatic emotion classification system. The models were built in Python by applying several algorithms to cross-validated data. In the experiments, the best accuracy was 44.2% when employing only audio features, whereas textual features yielded better accuracies of 46.3% and 51.3% under the supervised and semi-supervised learning paradigms, respectively.
Finally, even though we created a comprehensive feature set by combining audio and textual features, this approach did not yield any significant improvement in classification performance.
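As an illustration of the TF-IDF textual features mentioned in the abstract above, here is a minimal sketch using one common variant (relative term frequency multiplied by the unsmoothed logarithm of inverse document frequency). The thesis does not state which variant or library it used, so the weighting scheme, the whitespace tokenizer, and the function name `tfidf` are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

lyrics = ["love love song", "sad song"]
for vector in tfidf(lyrics):
    print(vector)
```

A term that occurs in every document, such as "song" here, receives zero weight under this variant, while terms concentrated in a few documents are weighted up.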
Emotional classification of music using neural networks with the MediaEval dataset
The proven ability of music to transmit emotions has prompted increasing interest in the development of new algorithms for music emotion recognition (MER). In this work, we present an automatic system for the emotional classification of music based on a neural network. This work builds on a previous implementation of a dimensional emotion prediction system in which a multilayer perceptron (MLP) was trained with the freely available MediaEval database. Although these previous results are good in terms of the metrics of the predicted values, they are not good enough to obtain a classification by quadrant based on the valence and arousal values predicted by the neural network, mainly due to the imbalance between classes in the dataset. To achieve better classification results, a pre-processing phase was implemented to stratify and balance the dataset. Three different classifiers were compared: linear support vector machine (SVM), random forest, and MLP. The best results are obtained with the MLP. An averaged F-measure of 50% is obtained in a four-quadrant classification schema. Two binary classification approaches are also presented: a one-vs.-rest (OvR) approach over the four quadrants and a binary classifier for valence and for arousal. The OvR approach has an average F-measure of 69%, and the second approach obtained F-measures of 73% and 69% for valence and arousal, respectively. Finally, a dynamic classification analysis with different time windows was performed using the temporal annotation data of the MediaEval database. The results show that the classification F-measures in four quadrants are practically constant, regardless of the duration of the time window. This work also reflects some limitations related to the characteristics of the dataset, including size, class balance, quality of the annotations, and the sound features available.
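The quadrant classification and averaged F-measure described in the abstract above can be sketched as follows. The quadrant numbering (Q1 for positive valence and high arousal, proceeding counter-clockwise), the zero-centered scale, and the use of macro averaging are assumptions; the abstract does not specify them.

```python
def quadrant(valence, arousal, center=0.0):
    """Map a (valence, arousal) pair to a quadrant label (assumed convention)."""
    if valence >= center:
        return "Q1" if arousal >= center else "Q4"
    return "Q2" if arousal >= center else "Q3"

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F-measures."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for any
        # label that appears in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

# Hypothetical (valence, arousal) annotations and regressor predictions.
annotated = [(0.6, 0.4), (0.2, 0.7), (-0.5, 0.3), (-0.4, -0.6)]
predicted = [(0.5, 0.1), (-0.1, 0.6), (-0.3, 0.2), (-0.2, -0.5)]
y_true = [quadrant(v, a) for v, a in annotated]
y_pred = [quadrant(v, a) for v, a in predicted]
print(y_true, y_pred, macro_f1(y_true, y_pred))
```

Small regression errors near an axis can move a prediction into a different quadrant, which is one reason good regression metrics do not guarantee good quadrant classification.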
Cooperative Learning and its Application to Emotion Recognition from Speech
In this paper, we propose Cooperative Learning, a novel method for highly efficient exploitation of unlabeled data. Our approach combines Active Learning and Semi-Supervised Learning techniques with the aim of reducing the cost of human annotation. The core idea of Cooperative Learning is to share the labeling work efficiently between human and machine, such that instances predicted with insufficient confidence are subject to human labeling and those predicted with high confidence are machine labeled. We conducted various test runs on two emotion recognition tasks with a variable number of initial supervised training instances and two different feature sets. The results show that Cooperative Learning consistently outperforms individual Active and Semi-Supervised Learning techniques in all test cases. In particular, we show that our method based on the combination of Active Learning and Co-Training achieves the same performance as a model trained on the whole training set while using 75% fewer labeled instances. Our method therefore efficiently and robustly reduces the need for human annotations.
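The division of labeling work described in the abstract above can be sketched as a confidence-based routing step. The function name `route_instances`, the fixed threshold of 0.8, and the use of the maximum class probability as the confidence value are illustrative assumptions, not details taken from the paper.

```python
def route_instances(probabilities, threshold=0.8):
    """Split unlabeled instances into machine-labeled and human-labeled sets.

    probabilities: one list of class probabilities per unlabeled instance.
    Returns (machine, human), where machine holds (index, predicted_class)
    pairs and human holds the indices left for a human annotator.
    """
    machine, human = [], []
    for index, probs in enumerate(probabilities):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if probs[predicted] >= threshold:
            machine.append((index, predicted))  # confident: accept the model's label
        else:
            human.append(index)  # uncertain: request a human label
    return machine, human

# Hypothetical classifier outputs for four unlabeled utterances.
outputs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.48, 0.52]]
machine, human = route_instances(outputs)
print(machine, human)
```

In a full training loop, both sets would be added to the labeled pool and the model retrained before the next round; the threshold controls how much of the work falls to the human annotator.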
The ordinal nature of emotions
Computationally representing everyday emotional
states is a challenging task and, arguably, one of the most fundamental
for affective computing. Standard practice in emotion annotation
is to ask humans to assign an absolute value of intensity
to each emotional behavior they observe. Psychological theories
and evidence from multiple disciplines including neuroscience,
economics and artificial intelligence, however, suggest that the
task of assigning reference-based (relative) values to subjective
notions is better aligned with the underlying representations
than assigning absolute values. Evidence also shows that we
use reference points, or anchors, against which we evaluate
values such as the emotional state of a stimulus; suggesting
again that ordinal labels are a more suitable way to represent
emotions. This paper draws together the theoretical reasons to
favor relative over absolute labels for representing and annotating
emotion, reviewing the literature across several disciplines. We
go on to discuss good and bad practices of treating ordinal
and other forms of annotation data, and make the case for
preference learning methods as the appropriate approach for
treating ordinal labels. We finally discuss the advantages of
relative annotation with respect to both reliability and validity
through a number of case studies in affective computing, and
address common objections to the use of ordinal data. Overall,
the thesis that emotions are by nature relative is supported by
both theoretical arguments and evidence, and opens new horizons
for the way emotions are viewed, represented and analyzed
computationally. Peer-reviewed.
Multiple Instance Learning for Emotion Recognition using Physiological Signals
The problem of continuous emotion recognition has been the subject of several studies. The proposed affective computing approaches employ sequential machine learning algorithms to improve the classification stage, accounting for the time ambiguity of emotional responses. Modeling and predicting the affective state over time is not a trivial problem because continuous data labeling is costly and not always feasible. This is a crucial issue in real-life applications, where data labeling is sparse and possibly captures only the most important events rather than the typical continuous, subtle affective changes that occur. In this work, we introduce a framework from the machine learning literature called Multiple Instance Learning, which is able to model time intervals by capturing the presence or absence of relevant states, without the need to label the affective responses continuously (as required by standard sequential learning approaches). This choice offers a viable and natural solution for learning in a weakly supervised setting, taking into account the ambiguity of affective responses. We demonstrate the reliability of the proposed approach in a gold-standard scenario and towards real-world usage by employing an existing dataset (DEAP) and a purpose-built one (Consumer). We also outline the advantages of this method with respect to standard supervised machine learning algorithms.
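A minimal sketch of the interval-level labeling idea described in the abstract above, under the standard Multiple Instance Learning assumption that a bag is positive when at least one of its instances is positive. The fixed-length bags, the max-based bag score, and the 0.5 threshold are assumptions chosen for illustration and may differ from the paper's formulation.

```python
def make_bags(frame_scores, bag_size):
    """Group consecutive per-frame scores into fixed-length time intervals (bags)."""
    return [frame_scores[start:start + bag_size]
            for start in range(0, len(frame_scores), bag_size)]

def bag_label(instance_scores, threshold=0.5):
    """Label a bag positive (1) if any instance score reaches the threshold."""
    return int(max(instance_scores) >= threshold)

# Hypothetical per-frame scores for a relevant affective state.
frame_scores = [0.1, 0.2, 0.1, 0.1, 0.9, 0.2, 0.3, 0.1]
bags = make_bags(frame_scores, bag_size=3)
labels = [bag_label(bag) for bag in bags]
print(bags, labels)
```

Only one label per interval is needed for training, so the annotator does not have to mark the exact frame at which the affective response occurs.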
Crowdsourcing Emotions in Music Domain
An important source of intelligence for music emotion recognition today comes from user-provided
community tags about songs or artists. Recent crowdsourcing approaches such as harvesting social tags,
designing collaborative games and web services, or using Mechanical Turk are becoming popular in
the literature. They provide a cheap, quick and efficient method, contrary to professional labeling of songs
which is expensive and does not scale for creating large datasets. In this paper we discuss the viability of
various crowdsourcing instruments providing examples from research works. We also share our own
experience, illustrating the steps we followed using tags collected from Last.fm for the creation of two
music mood datasets, which we have made public. While processing affect tags from Last.fm, we observed that
they tend to be biased towards positive emotions; the resulting datasets thus contain more positive songs
than negative ones.
Text-based Sentiment Analysis and Music Emotion Recognition
Nowadays, with the expansion of social media, large amounts of user-generated
texts like tweets, blog posts or product reviews are shared online. Sentiment polarity
analysis of such texts has become highly attractive and is utilized in recommender
systems, market predictions, business intelligence and more. We also witness deep
learning techniques becoming top performers on those types of tasks. There are
however several problems that need to be solved for efficient use of deep neural
networks on text mining and text polarity analysis.
First of all, deep neural networks are data hungry. They need to be fed with
datasets that are big in size, cleaned and preprocessed as well as properly labeled.
Second, the modern natural language processing concept of word embeddings as a
dense and distributed text feature representation solves sparsity and dimensionality
problems of the traditional bag-of-words model. Still, there are various uncertainties
regarding the use of word vectors: should they be generated from the same dataset
that is used to train the model, or is it better to source them from big and popular
collections that work as generic text feature representations? Third, it is not easy for
practitioners to find a simple and highly effective deep learning setup for various
document lengths and types. Recurrent neural networks are weak with longer texts
and optimal convolution-pooling combinations are not easily conceived. It is thus
convenient to have generic neural network architectures that are effective and can
adapt to various texts, encapsulating much of design complexity.
This thesis addresses the above problems to provide methodological and practical
insights for utilizing neural networks on sentiment analysis of texts and achieving
state of the art results. Regarding the first problem, the effectiveness of various
crowdsourcing alternatives is explored and two medium-sized and emotion-labeled
song datasets are created utilizing social tags. One of the research interests of Telecom
Italia was the exploration of relations between music emotional stimulation and
driving style. Consequently, a context-aware music recommender system that aims
to enhance driving comfort and safety was also designed. To address the second
problem, a series of experiments with large text collections of various contents and
domains were conducted. Word embeddings of different parameters were exercised
and results revealed that their quality is influenced (mostly but not only) by the
size of texts they were created from. When working with small text datasets, it is
thus important to source word features from popular and generic word embedding
collections. Regarding the third problem, a series of experiments involving convolutional
and max-pooling neural layers were conducted. Various patterns relating
text properties and network parameters with optimal classification accuracy were
observed. Combining convolutions of words, bigrams, and trigrams with regional
max-pooling layers in a couple of stacks produced the best results. The derived
architecture achieves competitive performance on sentiment polarity analysis of
movie, business and product reviews.
Given that labeled data are becoming the bottleneck of the current deep learning
systems, a future research direction could be the exploration of various data programming
possibilities for constructing even bigger labeled datasets. Investigation
of feature-level or decision-level ensemble techniques in the context of deep neural
networks could also be fruitful. Different feature types do usually represent complementary
characteristics of data. Combining word embedding and traditional text
features or utilizing recurrent networks on document splits and then aggregating the
predictions could further increase the prediction accuracy of such models.