Multi-Sensory Interaction for Blind and Visually Impaired People
This book conveys the visual elements of artwork to visually impaired people through various sensory elements, opening a new perspective for appreciating visual artwork. It also explores a technique for expressing a color code by integrating patterns, temperatures, scents, music, and vibrations, and presents future research topics. Multi-sensory interaction gives people with visual impairment a holistic experience that conveys the meaning and content of a work through rich multi-sensory appreciation. A method that allows people with visual impairments to engage with artwork using a variety of senses, including touch, temperature, tactile pattern, and sound, helps them appreciate artwork at a deeper level than can be achieved with hearing or touch alone. The development of such art appreciation aids will ultimately improve their cultural enjoyment and strengthen their access to culture and the arts. These new aids also expand opportunities for sighted as well as visually impaired people to enjoy works of art, and continuous efforts to enhance accessibility break down the boundaries between disabled and non-disabled people in the field of culture and the arts. In addition, the multi-sensory expression and delivery tool developed here can be used as an educational tool to increase product and artwork accessibility and usability through multi-modal interaction. Training in the multi-sensory experiences introduced in this book may lead to more vivid visual imagery, or seeing with the mind’s eye.
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker and three-speaker speech mixtures.
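The utterance-level cost described in this abstract can be illustrated with a minimal sketch. It assumes a plain mean-squared error between estimated and reference magnitude spectrograms; the paper's actual training targets and network are not reproduced here, and the function name `upit_loss` and the array layout `(speakers, frames, frequency_bins)` are hypothetical choices made for illustration only.

```python
import itertools

import numpy as np


def upit_loss(estimates, targets):
    """Utterance-level permutation-invariant MSE (illustrative sketch).

    estimates, targets: arrays of shape (speakers, frames, frequency_bins).
    Returns (best_loss, best_perm), where best_perm[s] is the index of the
    output stream assigned to reference speaker s.
    """
    num_speakers = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # Reorder the output streams, then average the error over the WHOLE
        # utterance, so one assignment is chosen for all frames at once.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because a single permutation is selected per utterance rather than per frame, every frame of a given output stream is scored against the same speaker, which is the property the abstract credits with removing the extra permutation step at inference time. The exhaustive search over permutations is practical only for a small number of speakers.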
From holism to compositionality: memes and the evolution of segmentation, syntax, and signification in music and language
Steven Mithen argues that language evolved from an antecedent he terms “Hmmmmm, [meaning it was] Holistic, manipulative, multi-modal, musical and mimetic”. Owing to certain innate and learned factors, a capacity for segmentation and cross-stream mapping in early Homo sapiens broke the continuous line of Hmmmmm, creating discrete replicated units which, with the initial support of Hmmmmm, eventually became the semantically freighted words of modern language. That which remained after what was a bifurcation of Hmmmmm arguably survived as music, existing as a sound stream segmented into discrete units, although one without the explicit and relatively fixed semantic content of language. All three types of utterance – the parent Hmmmmm, language, and music – are amenable to a memetic interpretation which applies Universal Darwinism to what are understood as language and musical memes. On the basis of Peter Carruthers’ distinction between ‘cognitivism’ and ‘communicativism’ in language, and William Calvin’s theories of cortical information encoding, a framework is hypothesized for the semantic and syntactic associations between, on the one hand, the sonic patterns of language memes (‘lexemes’) and of musical memes (‘musemes’) and, on the other hand, ‘mentalese’ conceptual structures, in Chomsky’s ‘Logical Form’ (LF)
16th Sound and Music Computing Conference SMC 2019 (28–31 May 2019, Malaga, Spain)
The 16th Sound and Music Computing Conference (SMC 2019) took place in Malaga, Spain, on 28-31 May 2019 and was organized by the Application of Information and Communication Technologies Research group (ATIC) of the University of Malaga (UMA). The associated SMC 2019 Summer School took place on 25-28 May 2019. The First International Day of Women in Inclusive Engineering, Sound and Music Computing Research (WiSMC 2019) took place on 28 May 2019. The SMC 2019 topics of interest included a wide selection of topics related to acoustics, psychoacoustics, music, technology for music, audio analysis, musicology, sonification, music games, machine learning, serious games, immersive audio, sound synthesis, etc.
Language as the Medium: Multimodal Video Classification through text only
Despite an exciting new wave of multimodal machine learning models, current
approaches still struggle to interpret the complex contextual relationships
between the different modalities present in videos. Going beyond existing
methods that emphasize simple activities or objects, we propose a new
model-agnostic approach for generating detailed textual descriptions that
captures multimodal video information. Our method leverages the extensive
knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason
about textual descriptions of the visual and aural modalities, obtained from
BLIP-2, Whisper and ImageBind. Without needing additional finetuning of
video-text models or datasets, we demonstrate that available LLMs have the
ability to use these multimodal textual descriptions as proxies for "sight"
or "hearing" and perform zero-shot multimodal classification of videos
in-context. Our evaluations on popular action recognition benchmarks, such as
UCF-101 or Kinetics, show these context-rich descriptions can be successfully
used in video understanding tasks. This method points towards a promising new
research direction in multimodal classification, demonstrating how an interplay
between textual, visual and auditory machine learning models can enable more
holistic video understanding.
Comment: Accepted at "What is Next in Multimodal Foundation Models?" (MMFM)
workshop at ICCV 202
Sonification of Samba dance using periodic pattern analysis
In this study we focus on the sonification of
Samba dance, using a multi-modal analysis-by-synthesis
approach. In the analysis we use periodic pattern analysis to
decompose the Samba dance movements into basic
movement gestures along the music’s metric layers. In the
synthesis we start from the basic movement gestures and
extract peaks and valleys, which we use as basic material for
the sonification. This leads to a matrix of repetitive dance
gestures from which we select the proper cues that trigger
samples of a Samba ensemble. The straightforward
sonification procedure suggests that Samba rhythms may be
mirrored in choreographic forms or vice versa.
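The peak-and-valley extraction step mentioned in this abstract can be sketched as follows. This is only an assumption about how such extrema might be located in a one-dimensional movement signal (a simple slope sign change); the study's own detection procedure is not described in the abstract, and the function name `peaks_and_valleys` is hypothetical.

```python
import numpy as np


def peaks_and_valleys(signal):
    """Return indices of strict local maxima and minima of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    slope = np.diff(x)
    # A peak is where the slope turns from positive to negative;
    # a valley is where it turns from negative to positive.
    peaks = np.where((slope[:-1] > 0) & (slope[1:] < 0))[0] + 1
    valleys = np.where((slope[:-1] < 0) & (slope[1:] > 0))[0] + 1
    return peaks, valleys
```

In a sonification setting, the returned indices could serve as trigger times for audio samples. Flat plateaus are ignored by this sketch, so a real movement recording would likely need smoothing first.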