
    Multimodal Fusion as Communicative Acts during Human-Robot Interaction

    Research on dialogue systems is a very active area in social robotics. Over the last two decades, these systems have evolved from ones based only on speech recognition and synthesis to modern systems that include new components and multimodality. By multimodal dialogue we mean the interchange of information among several interlocutors, using not just the voice as the means of transmission but all available channels, such as gestures, facial expressions, touch, and sounds. These channels add information to the message transmitted in every dialogue turn. The dialogue manager (IDiM) is one of the components of the robotic dialogue system (RDS) and is in charge of managing the dialogue flow across conversational turns. To do so, it must coherently handle the inputs and outputs of information flowing over different communication channels: audio, vision, radio frequency, touch, etc. In our approach, this multichannel input of information is temporally fused into communicative acts (CAs). Each CA groups the information flowing through the different input channels into a single pack, transmitting one message or global idea. This temporal fusion of information allows the IDiM to abstract away from the channels used during the interaction and focus only on the message, not on the way it is transmitted. This article presents the whole RDS and describes how the multimodal fusion of information into CAs is carried out. Finally, several scenarios where the multimodal dialogue is used are presented.
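
    As an illustration of this idea, the following minimal Python sketch groups time-adjacent inputs from different channels into a single communicative act; the class names, fields and the one-second fusion window are illustrative assumptions, not the actual IDiM implementation.

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class ChannelInput:
            channel: str      # e.g. "audio", "vision", "touch", "rfid"
            payload: str      # recognized content from that channel
            timestamp: float  # seconds since the start of the interaction

        @dataclass
        class CommunicativeAct:
            inputs: List[ChannelInput] = field(default_factory=list)

            @property
            def channels(self):
                return {i.channel for i in self.inputs}

        def fuse_into_cas(inputs, window=1.0):
            """Group time-adjacent channel inputs into communicative acts,
            so a dialogue manager can reason about the message itself
            rather than the channel it arrived on."""
            acts = []
            for item in sorted(inputs, key=lambda i: i.timestamp):
                if acts and item.timestamp - acts[-1].inputs[0].timestamp <= window:
                    acts[-1].inputs.append(item)
                else:
                    acts.append(CommunicativeAct(inputs=[item]))
            return acts

        # A spoken "yes" and a head nod arrive close together and form one act;
        # a later shoulder tap forms a second act.
        acts = fuse_into_cas([
            ChannelInput("audio", "yes", 10.2),
            ChannelInput("vision", "head_nod", 10.5),
            ChannelInput("touch", "shoulder_tap", 14.0),
        ])
        print(len(acts), [sorted(a.channels) for a in acts])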

    Multimodal Hate Speech Detection from Bengali Memes and Texts

    Numerous works have employed machine learning (ML) and deep learning (DL) techniques on textual social media data to analyze anti-social behavior such as cyberbullying, fake news propagation, and hate speech, mainly for highly resourced languages like English. However, despite their diversity and millions of native speakers, some languages such as Bengali remain under-resourced, owing to a lack of computational resources for natural language processing (NLP). Like English, Bengali social media content also includes images along with text (e.g., multimodal content is posted by embedding short texts into images on Facebook), so the textual data alone is not always enough to judge it (e.g., to determine whether it constitutes hate speech). In such cases, images may provide the extra context needed for a proper judgement. This paper addresses hate speech detection from multimodal Bengali memes and texts. We prepared the only multimodal hate speech detection dataset of its kind for Bengali. We train several neural architectures (i.e., neural networks such as Bi-LSTM/Conv-LSTM with word embeddings, and EfficientNet combined with transformer architectures such as monolingual Bangla BERT, multilingual BERT-cased/uncased, and XLM-RoBERTa) to jointly analyze textual and visual information for hate speech detection. The Conv-LSTM and XLM-RoBERTa models performed best for texts, yielding F1 scores of 0.78 and 0.82, respectively. For memes, the ResNet152 and DenseNet201 models yield F1 scores of 0.78 and 0.7, respectively. The multimodal fusion of mBERT-uncased + EfficientNet-B1 performed the best, yielding an F1 score of 0.80. Our study suggests that memes are moderately useful for hate speech detection in Bengali, but none of the multimodal models outperform unimodal models analyzing only textual data.
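
    For context, the sketch below shows one way such a text-image fusion classifier can be wired together in PyTorch: precomputed text features (e.g. from a BERT-style encoder) and image features (e.g. from an EfficientNet) are concatenated and passed through a small classification head. The feature dimensions and layer sizes are placeholders, not the architecture reported in the paper.

        import torch
        import torch.nn as nn

        class MultimodalHateClassifier(nn.Module):
            """Fuse a text embedding and an image embedding by concatenation."""
            def __init__(self, text_dim=768, image_dim=1280, hidden=256, num_classes=2):
                super().__init__()
                self.head = nn.Sequential(
                    nn.Linear(text_dim + image_dim, hidden),
                    nn.ReLU(),
                    nn.Dropout(0.3),
                    nn.Linear(hidden, num_classes),
                )

            def forward(self, text_feat, image_feat):
                # text_feat: (batch, text_dim), image_feat: (batch, image_dim)
                fused = torch.cat([text_feat, image_feat], dim=-1)
                return self.head(fused)

        # Toy forward pass with random tensors standing in for encoder outputs.
        model = MultimodalHateClassifier()
        logits = model(torch.randn(4, 768), torch.randn(4, 1280))
        print(logits.shape)  # torch.Size([4, 2])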

    An exploratory assessment of multistream deep neural network fusion : design and applications

    Doctoral thesis, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Mecânica, 2022. Machine-learning methods depend heavily on how well the selected feature extractor can represent the raw input data. Nowadays, we have more data and more computational capacity to deal with it. With Convolutional Neural Networks (CNNs), we have a network that is easier to train and generalizes much better than usual. However, a good amount of essential features are discarded in this process, even when using a powerful CNN. Multistream Convolutional Neural Networks (M-CNNs) can process more than one input using separate streams and are designed using any classical CNN architecture as a base. The use of M-CNNs generates more features and thus improves the overall outcome. This work explored M-CNN architectures and how the stream signals behave during processing, arriving at a novel M-CNN cross-fusion strategy. The new module is first validated on a standard dataset, CIFAR-10, and compared with the corresponding networks (single-stream CNN and late-fusion M-CNN). Early results in this scenario showed that our adapted model outperformed all of the aforementioned models by at least 28%. Expanding the test, we used the backbones of former state-of-the-art image classification networks and additional datasets to investigate whether the technique can put these designs back in the game. On the NORB dataset, we showed that we could increase accuracy by up to 63.21% compared to basic M-CNN structures. Varying our applications, the mAP@75 on the BDD100K multi-object detection and recognition dataset improved by 50.16% compared to its unadapted version, even when trained from scratch. The proposed fusion demonstrated robustness and stability, even when distractors were used as inputs. While our goal is to reuse previous state-of-the-art architectures with few modifications, we also expose the disadvantages of our explored strategy.
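
    As a rough illustration of the cross-fusion idea, the PyTorch sketch below builds two convolutional streams that exchange intermediate feature maps before a final joint head, in contrast to a late-fusion M-CNN that only merges at the end. The layer sizes and the single fusion point are illustrative assumptions, not the exact module proposed in the thesis.

        import torch
        import torch.nn as nn

        class TwoStreamCrossFusion(nn.Module):
            """Two CNN streams that exchange intermediate feature maps
            (cross-fusion) before the final classification head."""
            def __init__(self, num_classes=10):
                super().__init__()
                def block(cin, cout):
                    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                         nn.ReLU(), nn.MaxPool2d(2))
                self.a1, self.b1 = block(3, 32), block(3, 32)
                # Each second-stage block sees its own 32 channels plus the
                # other stream's 32 channels.
                self.a2, self.b2 = block(64, 64), block(64, 64)
                self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(128, num_classes))

            def forward(self, xa, xb):
                fa, fb = self.a1(xa), self.b1(xb)
                fa2 = self.a2(torch.cat([fa, fb], dim=1))  # cross-fusion
                fb2 = self.b2(torch.cat([fb, fa], dim=1))
                return self.head(torch.cat([fa2, fb2], dim=1))

        # Toy forward pass with two 32x32 RGB inputs (e.g. two views of a sample).
        model = TwoStreamCrossFusion()
        print(model(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)).shape)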

    Atlas construction and spatial normalisation to facilitate radiation-induced late effects research in childhood cancer

    Reducing radiation-induced side effects is one of the most important challenges in paediatric cancer treatment. Recently, there has been growing interest in using spatial normalisation to enable voxel-based analysis of radiation-induced toxicities in a variety of patient groups. Considering the three-dimensional distribution of doses, rather than dose-volume histograms, is desirable but has not yet been explored in paediatric populations. In this paper, we investigate the feasibility of atlas construction and spatial normalisation in paediatric radiotherapy. We used planning computed tomography (CT) scans from twenty paediatric patients historically treated with craniospinal irradiation to generate a template CT that is suitable for spatial normalisation. This template, representative of a childhood cancer population, was constructed using groupwise image registration. An independent set of 53 subjects with a variety of childhood malignancies was then used to assess the quality of the propagation of new subjects to this common reference space using deformable image registration (i.e., spatial normalisation). The method was evaluated in terms of overall image similarity metrics, contour similarity and preservation of dose-volume properties. After spatial normalisation, we report Dice similarity coefficients of 0.95±0.05, 0.85±0.04, 0.96±0.01, 0.91±0.03, 0.83±0.06 and 0.65±0.16 for brain and spinal canal, ocular globes, lungs, liver, kidneys and bladder, respectively. We then demonstrated the potential advantages of an atlas-based approach to study the risk of second malignant neoplasms after radiotherapy. Our findings indicate satisfactory mapping between a heterogeneous group of patients and the template CT. The poorest performance was for organs in the abdominal and pelvic region, likely due to respiratory and physiological motion and to the highly deformable nature of abdominal organs. More specialised algorithms should be explored in the future to improve mapping in these regions. This study is the first step toward voxel-based analysis of radiation-induced toxicities following paediatric radiotherapy.
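
    For reference, the Dice similarity coefficient used to score the propagated organ contours can be computed with a few lines of NumPy; the masks below are synthetic placeholders rather than actual patient data.

        import numpy as np

        def dice_coefficient(mask_a, mask_b):
            """Dice similarity coefficient between two binary organ masks:
            DSC = 2|A intersect B| / (|A| + |B|); 1.0 means perfect overlap."""
            a = np.asarray(mask_a, dtype=bool)
            b = np.asarray(mask_b, dtype=bool)
            denom = a.sum() + b.sum()
            if denom == 0:
                return 1.0  # both masks empty: treat as perfect agreement
            return 2.0 * np.logical_and(a, b).sum() / denom

        # Toy example: a propagated contour partially overlapping the reference.
        ref = np.zeros((10, 10), dtype=bool); ref[2:8, 2:8] = True
        warped = np.zeros((10, 10), dtype=bool); warped[3:9, 3:9] = True
        print(round(dice_coefficient(ref, warped), 3))  # ~0.694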

    Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities

    The vast proliferation of sensor devices and the Internet of Things enables applications of sensor-based activity recognition. However, substantial challenges can influence the performance of the recognition system in practical scenarios. Recently, as deep learning has demonstrated its effectiveness in many areas, many deep learning methods have been investigated to address the challenges in activity recognition. In this study, we present a survey of the state-of-the-art deep learning methods for sensor-based human activity recognition. We first introduce the multi-modality of the sensory data and provide information on public datasets that can be used for evaluation in different challenge tasks. We then propose a new taxonomy to structure the deep methods by challenges. Challenges and challenge-related deep methods are summarized and analyzed to form an overview of the current research progress. At the end of this work, we discuss the open issues and provide some insights for future directions.

    Advanced Signal Processing in Wearable Sensors for Health Monitoring

    Smart wearable devices on a miniature scale are becoming increasingly widely available, typically in the form of smart watches and other connected devices. Consequently, devices that assist in measurements such as electroencephalography (EEG), electrocardiography (ECG), electromyography (EMG), blood pressure (BP), photoplethysmography (PPG), heart rhythm, respiration rate, apnoea, and motion detection are becoming more available and play a significant role in healthcare monitoring. The industry is placing great emphasis on making these devices and technologies available on smart devices such as phones and watches. Such measurements are clinically and scientifically useful for real-time monitoring, long-term care, and diagnostic and therapeutic techniques. However, a persistent issue is that recorded data are usually noisy, contain many artefacts, and are affected by external factors such as movement and physical conditions. To obtain accurate and meaningful indicators, the signal has to be processed and conditioned so that the measurements are free from noise and disturbances. In this context, many researchers have utilized recent technological advances in wearable sensors and signal processing to develop smart and accurate wearable devices for clinical applications. The processing and analysis of physiological signals is a key issue for these smart wearable devices. Consequently, ongoing work in this field includes research on filtering, quality checking, signal transformation and decomposition, feature extraction and, most recently, machine learning-based methods.
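
    As a small example of the kind of conditioning step described above, the SciPy sketch below applies a zero-phase Butterworth band-pass filter to a noisy PPG-like signal; the sampling rate, cut-off frequencies and synthetic signal are illustrative choices, not values from any particular device.

        import numpy as np
        from scipy.signal import butter, filtfilt

        def bandpass(x, fs, low=0.5, high=8.0, order=4):
            """Zero-phase Butterworth band-pass filter, e.g. to keep the
            cardiac band of a PPG signal while removing baseline drift
            and high-frequency noise."""
            b, a = butter(order, [low, high], btype="bandpass", fs=fs)
            return filtfilt(b, a, x)

        # Synthetic 1.2 Hz "pulse" buried in drift and noise, sampled at 100 Hz.
        fs = 100.0
        t = np.arange(0, 10, 1 / fs)
        raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * t + 0.3 * np.random.randn(t.size)
        clean = bandpass(raw, fs)
        print(raw.shape, clean.shape)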

    Distributed Sensing and Stimulation Systems Towards Sense of Touch Restoration in Prosthetics

    Modern prostheses aim at restoring the functional and aesthetic characteristics of the lost limb. To foster prosthesis embodiment and functionality, it is necessary to restore both volitional control and sensory feedback. Contemporary feedback interfaces presented in research use few sensors and stimulation units and feed back at most two discrete variables (e.g. grasping force and aperture), whereas the human sense of touch relies on a distributed network of mechanoreceptors providing high-fidelity spatial information. To provide this type of feedback in prosthetics, it is necessary to sense tactile information from artificial skin placed on the prosthesis and transmit tactile feedback above the amputation, mapping the interaction between the prosthesis and the environment. This thesis proposes the integration of distributed sensing systems (e-skin) to acquire tactile sensation, together with non-invasive multichannel electrotactile feedback and virtual reality, to deliver high-bandwidth information to the user. Its core focus is the development and testing of a closed-loop sensory feedback human-machine interface, based on the latest distributed sensing and stimulation techniques, for restoring the sense of touch in prosthetics. To this end, the thesis comprises two introductory chapters that describe the state of the art in the field, the objectives, the methodology used and the contributions, as well as three studies spanning the stimulation-system and sensing-system levels.

    The first study presents the development of a closed-loop compensatory tracking system to evaluate the usability and effectiveness of electrotactile sensory feedback in enabling real-time closed-loop control in prosthetics. It examines and compares the subjects' adaptive performance and tolerance to random latencies while performing a dynamic control task (i.e. position control) and simultaneously receiving either visual or electrotactile feedback communicating the momentary tracking error. It also reports the minimum time delay at which users' performance abruptly deteriorates. The experimental results show that performance with electrotactile feedback is less sensitive to longer delays, whereas performance with visual feedback degrades faster as the delay increases. This is a good indication of the effectiveness of electrotactile feedback for enabling closed-loop control in prosthetics, since some delays are inevitable.

    The second study describes the development of a novel non-invasive compact multichannel interface for electrotactile feedback, containing a 24-pad electrode matrix with a fully programmable stimulation unit, and investigates the ability of able-bodied human subjects to localize electrotactile stimuli delivered through the matrix. It introduces a novel dual-parameter modulation scheme (interleaved frequency and intensity) and compares it to conventional stimulation (the same frequency for all pads). In addition, and for the first time, it compares electrotactile stimulation to mechanical stimulation. It also presents the integration of a virtual prosthesis with the developed system to improve user experience and object manipulation, by mapping the tactile data collected in real time and feeding it back simultaneously to the user. The experimental results demonstrate that the proposed interleaved coding substantially improved spatial localization compared to same-frequency stimulation. Furthermore, same-frequency stimulation was equivalent to mechanical stimulation, whereas performance with dual-parameter modulation was significantly better.

    The third study presents the realization of a novel, flexible, screen-printed e-skin based on P(VDF-TrFE) piezoelectric polymers, intended to cover the fingertips and the palm of the prosthetic hand (in particular the Michelangelo hand by Ottobock) and an assistive sensorized glove for stroke patients. It also develops a new validation methodology to examine the sensors' behavior under mechanical load. The characterization results showed agreement between the expected (modeled) electrical response of each sensor and the mechanical (normal) force measured at the skin surface, which in turn proved that the combination of fabrication and assembly processes was successful. This paves the way to a practical, simplified and reproducible characterization protocol for e-skin patches.

    In conclusion, by adopting innovative methodologies in sensing and stimulation systems, this thesis advances the overall development of closed-loop sensory feedback human-machine interfaces for restoring the sense of touch in prosthetics. Moreover, this research could lead to high-bandwidth, high-fidelity transmission of tactile information for modern dexterous prostheses, improving the end-user experience and facilitating acceptance in daily life.
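
    To make the dual-parameter coding concrete, the Python sketch below assigns alternating stimulation frequencies to active electrode pads and maps the pressure measured by the e-skin onto pulse intensity; the pad indices, frequencies and current range are illustrative assumptions, not the stimulation parameters used in the thesis.

        from dataclasses import dataclass

        @dataclass
        class PadStimulus:
            pad: int            # index within the 24-pad electrode matrix
            frequency_hz: float
            intensity_ma: float

        def interleaved_schedule(active_pads, pressures,
                                 freqs=(30.0, 60.0), i_min=1.0, i_max=4.0):
            """Assign alternating frequencies to active pads (interleaved
            coding) and map normalized pressure (0..1) to pulse intensity,
            so simultaneous nearby stimuli differ in frequency as well as
            location."""
            schedule = []
            for k, (pad, p) in enumerate(zip(active_pads, pressures)):
                p = max(0.0, min(1.0, p))
                schedule.append(PadStimulus(
                    pad=pad,
                    frequency_hz=freqs[k % len(freqs)],
                    intensity_ma=i_min + (i_max - i_min) * p,
                ))
            return schedule

        # Three fingertip pads touched with increasing pressure.
        for s in interleaved_schedule([3, 4, 9], [0.2, 0.5, 0.9]):
            print(s)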