
    Multimodal Fusion as Communicative Acts during Human-Robot Interaction

    Research on dialogue systems is a very active area in social robotics. During the last two decades, these systems have evolved from those based only on speech recognition and synthesis to modern systems that include new components and multimodality. By multimodal dialogue we mean the interchange of information among several interlocutors, using not only voice as the means of transmission but also all the other available channels, such as gestures, facial expressions, touch, sounds, etc. These channels add information to the message transmitted in every dialogue turn. The dialogue manager (IDiM) is one of the components of the robotic dialog system (RDS) and is in charge of managing the dialogue flow during the conversational turns. To do so, it must coherently handle the inputs and outputs of information that flow through the different communication channels: audio, vision, radio frequency, touch, etc. In our approach, this multichannel input of information is temporally fused into communicative acts (CAs). Each CA groups the information that flows through the different input channels into a single pack that transmits a unique message or global idea. This temporal fusion of information therefore allows the IDiM to abstract away from the channels used during the interaction, focusing only on the message, not on the way it is transmitted. This article presents the whole RDS and describes how the multimodal fusion of information into CAs is performed. Finally, several scenarios where multimodal dialogue is used are presented.
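
    As a rough illustration of the fusion idea described above, the following Python sketch (not taken from the authors' implementation) groups timestamped inputs from several channels into a single communicative act whenever they fall within the same fusion window; the one-second window and the channel names are assumptions made for the example.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ChannelInput:
    channel: str      # e.g. "audio", "vision", "touch", "rfid" (illustrative names)
    payload: str      # recognised content carried by this channel
    timestamp: float  # seconds

@dataclass
class CommunicativeAct:
    inputs: List[ChannelInput] = field(default_factory=list)

    def message(self) -> str:
        # The dialogue manager only sees the fused message, not the channels it came from.
        return " | ".join(i.payload for i in sorted(self.inputs, key=lambda i: i.timestamp))

def fuse(inputs: List[ChannelInput], window: float = 1.0) -> List[CommunicativeAct]:
    """Group inputs whose timestamps fall within `window` seconds of each other."""
    acts: List[CommunicativeAct] = []
    for item in sorted(inputs, key=lambda i: i.timestamp):
        if acts and item.timestamp - acts[-1].inputs[-1].timestamp <= window:
            acts[-1].inputs.append(item)       # same communicative act
        else:
            acts.append(CommunicativeAct([item]))  # start a new communicative act
    return acts

if __name__ == "__main__":
    stream = [
        ChannelInput("audio", "hello robot", 0.2),
        ChannelInput("vision", "waving gesture", 0.5),
        ChannelInput("touch", "tap on shoulder", 5.1),
    ]
    for act in fuse(stream):
        print(act.message())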

    Multimodal Hate Speech Detection from Bengali Memes and Texts

    Numerous works have been proposed to employ machine learning (ML) and deep learning (DL) techniques on textual data from social media for anti-social behavior analysis, such as cyberbullying, fake news propagation, and hate speech, mainly for highly resourced languages like English. However, despite having great diversity and millions of native speakers, some languages such as Bengali are under-resourced, owing to a lack of computational resources for natural language processing (NLP). Like English, Bengali social media content also includes images along with text (e.g., multimodal content is posted by embedding short texts into images on Facebook), so the textual data alone is not always enough to judge it (e.g., to determine whether it is hate speech). In such cases, images may provide the extra context needed for a proper judgment. This paper addresses hate speech detection from multimodal Bengali memes and texts. We prepared the only multimodal hate speech detection dataset for this kind of problem in Bengali. We train several neural architectures (i.e., neural networks like Bi-LSTM/Conv-LSTM with word embeddings, and EfficientNet combined with transformer architectures such as monolingual Bangla BERT, multilingual BERT-cased/uncased, and XLM-RoBERTa) to jointly analyze textual and visual information for hate speech detection. The Conv-LSTM and XLM-RoBERTa models performed best for texts, yielding F1 scores of 0.78 and 0.82, respectively. As for memes, the ResNet152 and DenseNet201 models yield F1 scores of 0.78 and 0.7, respectively. The multimodal fusion of mBERT-uncased + EfficientNet-B1 performed best, yielding an F1 score of 0.80. Our study suggests that memes are moderately useful for hate speech detection in Bengali, but none of the multimodal models outperform unimodal models analyzing only textual data.
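
    The following PyTorch sketch shows the general fusion pattern the abstract's best-performing combination suggests: pooled text and image features are concatenated and passed to a shared classification head. The feature dimensions (768 for a BERT-style encoder, 1280 for EfficientNet-B1) and the two-class head are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=1280, hidden=256, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_feat, image_feat):
        # Fuse modalities by concatenating their pooled feature vectors.
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.head(fused)

if __name__ == "__main__":
    model = FusionClassifier()
    text_feat = torch.randn(4, 768)    # stand-in for pooled BERT-style embeddings
    image_feat = torch.randn(4, 1280)  # stand-in for pooled EfficientNet-B1 features
    logits = model(text_feat, image_feat)
    print(logits.shape)  # torch.Size([4, 2])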

    Analysis of Brain Imaging Data for the Detection of Early Age Autism Spectrum Disorder Using Transfer Learning Approaches for Internet of Things

    In recent years, advanced magnetic resonance imaging (MRI) methods, including functional MRI (fMRI) and structural MRI (sMRI), have indicated an increase in the prevalence of neuropsychiatric disorders such as autism spectrum disorder (ASD), which affects one out of six children worldwide. Data-driven techniques, together with medical image analysis techniques such as computer-assisted diagnosis (CAD), benefit from deep learning. With the use of artificial intelligence (AI) and IoT-based intelligent approaches, it becomes practical to support autistic children in adapting to new environments. In this paper, we apply classification and representation learning using powerful deep learning methods, namely a convolutional neural network (CNN) and a transfer learning algorithm, to a combination of data from the autism brain imaging data exchange (ABIDE I and ABIDE II) datasets. Owing to their four-dimensional nature (three spatial dimensions and one temporal dimension), resting-state fMRI (rs-fMRI) data can be used to develop diagnostic biomarkers for brain dysfunction. ABIDE is a collaboration of global scientists; ABIDE-I and ABIDE-II consist of 1112 rs-fMRI datasets from 573 typical control (TC) and 539 autism individuals, and 1114 rs-fMRI datasets from 521 autism and 593 typical control individuals, respectively, collected from 17 different sites. Our proposed optimized version of the CNN achieved 81.56% accuracy, outperforming prior conventional approaches evaluated only on the ABIDE I dataset.
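
    The transfer-learning recipe mentioned in the abstract can be sketched in a few lines of PyTorch: reuse a pretrained CNN backbone, freeze its weights, and train only a new classification head. The resnet18 backbone, the 2D input shape, and the binary ASD/control head below are assumptions for illustration; the paper's optimized CNN and rs-fMRI preprocessing pipeline are not reproduced here.

import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes: int = 2) -> nn.Module:
    backbone = models.resnet18(weights=None)  # in practice, pass pretrained weights here
    for p in backbone.parameters():
        p.requires_grad = False               # freeze the transferred feature extractor
    # Replace the final layer; the new head's parameters remain trainable.
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
    return backbone

if __name__ == "__main__":
    model = build_classifier()
    dummy = torch.randn(1, 3, 224, 224)  # stand-in for a 2D projection of rs-fMRI data
    print(model(dummy).shape)            # torch.Size([1, 2])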

    Gesture retrieval and its application to the study of multimodal communication

    Comprehending communication depends on analyzing the different modalities of conversation, including audio, visual, and others. This is a natural process for humans, but in digital libraries, where preservation and dissemination of digital information are crucial, it is a complex task. A rich conversational model, encompassing all modalities and their co-occurrences, is required to effectively analyze and interact with digital information. Currently, the analysis of co-speech gestures in videos is done through manual annotation by linguistic experts based on textual searches. However, this approach is limited and does not fully utilize the visual modality of gestures. This paper proposes a visual gesture retrieval method using a deep learning architecture to extend current research in this area. The method is based on body keypoints and uses an attention mechanism to focus on specific groups of keypoints. Experiments were conducted on a subset of the NewsScape dataset, which presents challenges such as multiple people, camera perspective changes, and occlusions. A user study was conducted to assess the usability of the results, establishing a baseline for future gesture retrieval methods in real-world video collections. The results of the experiment demonstrate the high potential of the proposed method in multimodal communication research and highlight the significance of visual gesture retrieval in enhancing interaction with video content. The integration of visual similarity search for gestures into the open-source multimedia retrieval stack vitrivr can greatly contribute to the field of computational linguistics. This research advances the understanding of the role of the visual modality in co-speech gestures and highlights the need for further development in this area.
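
    As a simple point of reference for how retrieval by visual similarity of gestures can work, the sketch below ranks clips by cosine similarity between pose descriptors built from raw body keypoints. The descriptor is deliberately naive (flattened, centred coordinates), and the 17-joint layout is an assumption; the paper's learned, attention-weighted representation over keypoint groups is not reproduced here.

import numpy as np

def pose_descriptor(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (frames, joints, 2) array of x/y coordinates."""
    centred = keypoints - keypoints.mean(axis=(0, 1), keepdims=True)  # remove global offset
    flat = centred.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-8)  # unit-normalise for cosine similarity

def retrieve(query: np.ndarray, collection: list, k: int = 3) -> list:
    q = pose_descriptor(query)
    scores = [float(q @ pose_descriptor(c)) for c in collection]  # cosine similarity
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = [rng.normal(size=(30, 17, 2)) for _ in range(10)]  # 30 frames, 17 COCO-style joints
    print(retrieve(clips[0], clips, k=3))  # the query clip itself ranks first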

    Making modality: transmodal composing in a digital media studio.

    The multiple media that exist for communication have historically been theorized as possessing different available means for persuasion and meaning-making. The exigence of these means has been the object of theoretical debate that ranges across cultural studies, language studies, semiology, and philosophies of the mind. This dissertation contributes to such debates by sharing the results of an ethnographically informed study of multimedia composing in a digital media studio. Drawing from Cultural Historical Activity Theory and theories of enactive perception, I analyze the organizational and infrastructural design of a media studio as well as the activity of composer/designers working in said studio. Throughout this analysis I find that implicit in the organization and infrastructure of the media studio is an ethos of conceptualizing communication technology as a legitimizing force. Such an ethos is troubled by my analysis of composer/designers working in the studio, whose activities do not seek outside legitimization but instead contribute to the media milieu. Following these analyses, I conclude that media’s means for persuasion and meaning-making emerge from local practices of communication and design. Finally, I provide a framework for studying the emergence of such means.

    Textbooks: changing materiality, changing meaning

    In a print-based society, the physical characteristics of books not only controlled how knowledge was accessed, stored and disseminated, but also exerted a powerful unconscious influence over our perceptions about the status of knowledge. Textbooks, encyclopaedias, journals, comics and magazines were allocated status according to their material composition - for example, the amount of time it took to produce and disseminate the publication. The knowledge tools of Western pedagogy (textbooks) have traditionally been accorded powerful credibility and respectability because they were presented in tomes that imply content with enduring value. This pre-weighting of knowledge was difficult to discern in a print-based society but now, as meaning is increasingly digitally mediated and teachers and learners are freed from the exigencies of print/textbook-based knowledge, we are reassessing and reallocating our often unconscious perceptions of the status of knowledge. This paper examines the changing materiality of three sample pedagogical texts over a 16-year period in order to reveal, through an analysis of the changing dialectics in their material modes of representation, the epistemology embedded in each text. It then generalises these findings in order to shed light on the question: how does the changing materiality of texts shape our perceptions of knowledge?

    Immersed in Pop! Excursions into Compositional Design

    Recent changes in consumer audio and music technology and distribution - for example, the addition of 3D audio formats such as Dolby Atmos to music streaming services, the recent release of “Spatial Audio” on Apple and Beats products, and the proliferation of musical content in virtual reality and 360° videos - have reignited a public discourse on concepts of immersion and interactivity in popular music and media. This raises questions and necessitates a deepening of popular musicological discourse in these areas. This thesis thus asks: what is the relationship between so-called immersive media and immersive experience? How are immersive and interactive experiences of audiovisual popular music compositionally designed? And to what degree do interpretations of immersion and interactivity in popular music imply agency on the part of the listener/viewer? To address these questions, Bresler has authored or co-authored four articles and book chapters on music in immersive and interactive media, with a focus on compositional design and immersion in pop music. In the framing chapter, these articles are contextualized through the coining of the term immersive staging, a framework for understanding how the perceived relationship between the performer and listener is mediated through technology, performativity, audiovisual compositional design, and aesthetics. Additionally, the chapter makes a case for the hermeneutic methodologies employed throughout.