PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network
Music creation is typically composed of two parts: composing the musical
score, and then performing the score with instruments to make sounds. While
recent work has made much progress in automatic music generation in the
symbolic domain, few attempts have been made to build an AI model that can
render realistic music audio from musical scores. Directly synthesizing audio
with sound sample libraries often leads to mechanical and deadpan results,
since musical scores do not contain performance-level information, such as
subtle changes in timing and dynamics. Moreover, while the task may sound like
a text-to-speech synthesis problem, there are fundamental differences since
music audio has rich polyphonic sounds. To build such an AI performer, we
propose in this paper a deep convolutional model that learns in an end-to-end
manner the score-to-audio mapping between a symbolic representation of music,
the piano roll, and an audio representation of music, the spectrogram. The
model consists of two subnets: the ContourNet, which uses a
U-Net structure to learn the correspondence between piano rolls and
spectrograms and to give an initial result; and the TextureNet, which further
uses a multi-band residual network to refine the result by adding the spectral
texture of overtones and timbre. We train the model to generate music clips of
the violin, cello, and flute, with a dataset of moderate size. We also present
the result of a user study that shows our model achieves higher mean opinion
score (MOS) in naturalness and emotional expressivity than a WaveNet-based
model and two commercial sound libraries. We open our source code at
https://github.com/bwang514/PerformanceNet
Comment: 8 pages, 6 figures, AAAI 2019 camera-ready version
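To make the input side of the score-to-audio mapping concrete, the following is a minimal sketch of how a list of notes might be turned into a piano roll. It is not taken from the PerformanceNet code: the function name `notes_to_piano_roll`, the `(pitch, onset, offset)` note format, the binary encoding, and the frame rate of 10 frames per second are all assumptions made for illustration.

```python
def notes_to_piano_roll(notes, frames_per_second=10, n_pitches=128):
    """Build a binary piano roll (hypothetical helper, not the authors' code).

    notes: list of (midi_pitch, onset_seconds, offset_seconds) tuples.
    Returns roll[pitch][frame], which is 1 while that pitch is sounding.
    """
    if not notes:
        return [[] for _ in range(n_pitches)]
    # The roll is as long as the latest note offset.
    end_time = max(offset for _, _, offset in notes)
    n_frames = int(round(end_time * frames_per_second))
    roll = [[0] * n_frames for _ in range(n_pitches)]
    for pitch, onset, offset in notes:
        start = int(round(onset * frames_per_second))
        stop = int(round(offset * frames_per_second))
        # Mark every frame between onset and offset as active.
        for frame in range(start, min(stop, n_frames)):
            roll[pitch][frame] = 1
    return roll


# Assumed example: C4 then E4, with G4 held throughout (one second in total).
notes = [(60, 0.0, 0.5), (64, 0.5, 1.0), (67, 0.0, 1.0)]
roll = notes_to_piano_roll(notes)
```

A matrix of this kind (pitch by time) has the same time axis as a spectrogram (frequency by time), which is what lets a convolutional model treat the task as an image-to-image mapping. Note that it carries no timing or dynamics nuance beyond the note boundaries, which is the gap the abstract says the model has to fill.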
Early Relationships, Pathologies of Attachment, and the Capacity to Love
Psychologists often characterize the infant’s attachment to her primary caregiver as love. Philosophical accounts of love, however, tend to speak against this possibility. Love is typically thought to require sophisticated cognitive capacities that infants do not possess. Nevertheless, there are important similarities between the infant-primary caregiver bond and mature love, and the former is commonly thought to play an important role in one’s capacity for the latter. In this work, I examine the relationship between the infant-primary caregiver bond and love. I argue that while these very early attachments do not represent genuine love, a fuller understanding of them can inform extant philosophical views of love.
I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.
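The evaluation protocol named in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: it assumes "average recall" means the unweighted mean of per-class recalls, the function names `leave_one_speaker_out` and `average_recall` are hypothetical, the speaker IDs and labels are made up, and the SVM itself is left out so that the block stays dependency-free.

```python
def leave_one_speaker_out(speakers):
    """Yield (held_out, train_idx, test_idx), holding out one speaker per fold.

    speakers: list giving the speaker ID of each utterance.
    """
    for held_out in sorted(set(speakers)):
        train_idx = [i for i, s in enumerate(speakers) if s != held_out]
        test_idx = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train_idx, test_idx


def average_recall(y_true, y_pred):
    """Unweighted mean of per-class recalls (assumed meaning of 'average recall')."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Made-up data: three speakers with two utterances each.
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(leave_one_speaker_out(speakers))

# Made-up predictions for the binary task (eating vs. not eating).
y_true = ["eating", "eating", "eating", "not_eating"]
y_pred = ["eating", "eating", "not_eating", "not_eating"]
uar = average_recall(y_true, y_pred)  # (2/3 + 1/1) / 2
```

Holding out all utterances of one speaker per fold keeps a classifier from scoring well simply by recognising a voice it has already seen. Averaging recall per class rather than per utterance keeps the score from being dominated by whichever eating condition has the most examples.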