Why I tense up when you watch me: inferior parietal cortex mediates an audience's influence on motor performance
The presence of an evaluative audience can alter skilled motor performance through changes in force output. To investigate how this is mediated within the brain, we emulated real-time social monitoring of participants' performance of a fine grip task during functional magnetic resonance neuroimaging. We observed an increase in force output during social evaluation that was accompanied by focal reductions in activity within bilateral inferior parietal cortex. Moreover, deactivation of the left inferior parietal cortex predicted both inter- and intra-individual differences in socially-induced change in grip force. Social evaluation also enhanced activation within the posterior superior temporal sulcus, which conveys visual information about others' actions to the inferior parietal cortex. Interestingly, functional connectivity between these two regions was attenuated by social evaluation. Our data suggest that social evaluation can vary force output through the altered engagement of the inferior parietal cortex, a region implicated in the sensorimotor integration necessary for object manipulation and a component of the action-observation network that integrates and facilitates performance of observed actions. Social-evaluative situations may induce high-level representational incoherence between one's own intended action and the perceived intention of others which, by uncoupling the dynamics of sensorimotor facilitation, could ultimately perturb motor output.
Music Therapy Techniques for Memory Stabilization in Diverse Dementias
Music contains certain unmistakable healing properties pertaining specifically to the matured body and soul affected by various types of dementia. Music therapy aids in memory retention or the retarding of the loss of mental function as a result of Alzheimer's disease, Dementia with Lewy bodies, and Senile Dementia. Music can help subjects access lost memories through interaction with a music therapist. Certain music therapy techniques have been shown to yield additional physical, communicative, and psychological benefits. The progression of Alzheimer's disease, Dementia with Lewy bodies, and Senile Dementia may be further delayed by music therapy when paired with pharmaceutical interventions such as previously established memory-enhancing medications.
Object Referring in Visual Scene with Spoken Language
Object referring has important applications, especially for human-machine
interaction. While it has received great attention, the task is mainly attacked
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully breaking down the ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise.
Comment: 10 pages, Submitted to WACV 201
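As a rough illustration of a cascaded speech-to-region approach (not the authors' actual pipeline), the following Python sketch transcribes a spoken query and grounds the resulting text against region proposals; the transcribe and ground_expression functions and the Region type are hypothetical placeholders.

# Minimal sketch of a cascaded ORSpoken-style pipeline (speech -> text -> region).
# The transcribe and ground_expression functions and the Region type are
# hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    x: float
    y: float
    w: float
    h: float
    score: float

def transcribe(waveform: List[float]) -> str:
    """Stand-in for an ASR front end; a real system would run a speech recognizer here."""
    return "the red car on the left"

def ground_expression(expression: str, proposals: List[Region]) -> Region:
    """Stand-in for text-based referring-expression grounding over region proposals."""
    # Here we simply pick the highest-scoring proposal; a real model would score
    # each proposal against the expression with a vision-language module.
    return max(proposals, key=lambda r: r.score)

if __name__ == "__main__":
    proposals = [Region(0.1, 0.2, 0.3, 0.4, 0.7), Region(0.5, 0.5, 0.2, 0.2, 0.4)]
    expression = transcribe(waveform=[0.0] * 16000)  # 1 s of dummy audio
    print(expression, "->", ground_expression(expression, proposals))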
EMID: An Emotional Aligned Dataset in Audio-Visual Modality
In this paper, we propose Emotionally paired Music and Image Dataset (EMID),
a novel dataset designed for the emotional matching of music and images, to
facilitate auditory-visual cross-modal tasks such as generation and retrieval.
Unlike existing approaches that primarily focus on semantic correlations or
roughly divided emotional relations, EMID emphasizes the significance of
emotional consistency between music and images using an advanced 13-dimension
emotional model. By incorporating emotional alignment into the dataset, it aims
to establish pairs that closely align with human perceptual understanding,
thereby improving the performance of auditory-visual cross-modal tasks. We also
design a supplemental module named EMI-Adapter to optimize existing cross-modal
alignment methods. To validate the effectiveness of EMID, we conducted a
psychological experiment, which demonstrated that considering the emotional
relationship between the two modalities effectively improves matching accuracy
from an abstract, perceptual perspective. This research lays the foundation for future
cross-modal research in domains such as psychotherapy and contributes to
advancing the understanding and utilization of emotions in cross-modal
alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID
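A minimal sketch of the pairing idea, assuming each music clip and image already carries a score vector under a 13-dimension emotional model; the cosine-similarity pairing rule below is an illustrative assumption, not the dataset's actual construction procedure or the EMI-Adapter.

# Minimal sketch of emotion-based music-image pairing in the spirit of EMID.
# The random 13-dimensional vectors and the cosine pairing rule are illustrative
# assumptions, not the dataset's construction procedure or the EMI-Adapter.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pair_by_emotion(music_emotions: np.ndarray, image_emotions: np.ndarray) -> list:
    """For each music clip, return the index of the most emotionally similar image.
    Both inputs are (N, 13) arrays of scores under a 13-dimension emotional model."""
    return [int(np.argmax([cosine(m, img) for img in image_emotions]))
            for m in music_emotions]

music = np.random.rand(5, 13)   # emotion scores for 5 music clips (dummy data)
images = np.random.rand(8, 13)  # emotion scores for 8 candidate images (dummy data)
print(pair_by_emotion(music, images))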
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and propose a tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 201
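As an illustration of the Representation dimension discussed above, the following sketch encodes a short monophonic melody as a one-hot piano roll; the 128-pitch MIDI range and one note per time step are arbitrary example choices, not recommendations from the survey.

# Illustrative sketch of one representation choice: encoding a short monophonic
# melody as a one-hot piano roll.
import numpy as np

def melody_to_piano_roll(midi_notes, num_pitches=128):
    """Encode a list of MIDI pitch numbers (one per time step) as a one-hot
    matrix of shape (time_steps, num_pitches)."""
    roll = np.zeros((len(midi_notes), num_pitches), dtype=np.float32)
    for t, pitch in enumerate(midi_notes):
        roll[t, pitch] = 1.0
    return roll

scale = [60, 62, 64, 65, 67, 69, 71, 72]  # C major scale from middle C (MIDI 60)
roll = melody_to_piano_roll(scale)
print(roll.shape)           # (8, 128)
print(roll.argmax(axis=1))  # recovers the original pitches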
Generating Realistic Images from In-the-wild Sounds
Representing wild sounds as images is an important but challenging task due
to the lack of paired datasets between sound and images and the significant
differences in the characteristics of these two modalities. Previous studies
have focused on generating images from sound in limited categories or music. In
this paper, we propose a novel approach to generate images from in-the-wild
sounds. First, we convert sound into text using audio captioning. Second, we
propose audio attention and sentence attention to represent the rich
characteristics of sound and visualize the sound. Lastly, we propose a direct
sound optimization with CLIPscore and AudioCLIP and generate images with a
diffusion-based model. Experiments show that our model is able to generate
high-quality images from wild sounds and outperforms baselines in both
quantitative and qualitative evaluations on wild audio datasets.
Comment: Accepted to ICCV 202
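A minimal sketch of the caption-then-generate idea, using a stubbed audio captioner and an off-the-shelf Stable Diffusion pipeline as a generic stand-in; the paper's attention mechanisms and direct sound optimization with CLIPscore and AudioCLIP are omitted here.

# Minimal sketch: caption the sound, then feed the caption to a text-to-image
# diffusion model. The captioner is stubbed, Stable Diffusion is only a generic
# stand-in, and the paper's sound-optimization steps are omitted.
import torch
from diffusers import StableDiffusionPipeline

def caption_audio(waveform: torch.Tensor) -> str:
    """Stand-in for an audio captioning model (hypothetical placeholder)."""
    return "waves crashing on a rocky beach at sunset"

def sound_to_image(waveform: torch.Tensor):
    caption = caption_audio(waveform)
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    return pipe(caption).images[0]  # a PIL image conditioned on the caption

if __name__ == "__main__":
    dummy_sound = torch.zeros(16000)  # 1 s of silence as a stand-in waveform
    sound_to_image(dummy_sound).save("generated.png")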
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken
communication. Machine learning models such as neural networks have already been proposed for audio signal
modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the
implementation of several neural network-based systems for speech and music event detection over a collection of
77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to
YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The
first one is the training of two different neural networks, one for speech detection and another for music detection.
The second approach consists of training a single neural network to tackle both tasks at the same time. The studied
architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks.
Comparative results are provided in terms of classification performance and model complexity. We would like to
highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid
convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore,
a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most
harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.
This work has been supported by project "DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz" (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitiveness of Spain and FEDE
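A minimal sketch of a hybrid convolutional-LSTM detector operating on mel-spectrogram input, in the spirit of the architectures compared above; the layer sizes, 64 mel bands, and joint two-label output are illustrative assumptions rather than the study's exact configuration.

# Minimal sketch of a hybrid convolutional-LSTM speech/music detector over
# mel-spectrogram input. All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvLSTMDetector(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                 # logits: [speech, music]

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, frames)
        x = self.conv(spec)                              # (batch, 64, n_mels/4, frames)
        x = x.permute(0, 3, 1, 2).flatten(2)             # (batch, frames, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])                          # per-clip logits

model = ConvLSTMDetector()
logits = model(torch.randn(4, 1, 64, 250))               # 4 dummy spectrogram clips
print(torch.sigmoid(logits))                             # speech / music probabilities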
- …