1,531 research outputs found

    Bioinspired auditory sound localisation for improving the signal to noise ratio of socially interactive robots

    In this paper we describe a bioinspired hybrid architecture for acoustic sound source localisation and tracking that increases the signal-to-noise ratio (SNR) between the speaker and background sources for a socially interactive robot's speech recogniser. The model incorporates Interaural Time Difference for azimuth estimation and Recurrent Neural Networks for trajectory prediction. Results are presented showing the difference in SNR between a localised and a non-localised speaker source, together with the corresponding recognition rates. These results show that by orientating towards the sound source of interest, the recognition rates for that source can be increased.
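
    The ITD-based azimuth estimation described above can be sketched as follows. The microphone spacing, sampling rate, and cross-correlation approach are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def estimate_azimuth(left, right, fs, mic_distance=0.2, speed_of_sound=343.0):
    """Estimate azimuth (degrees) from a two-microphone pair via the
    Interaural Time Difference (ITD).

    The inter-channel delay is found with cross-correlation, and the
    angle follows from theta = arcsin(c * ITD / d). Parameters here
    (0.2 m spacing) are assumed for illustration.
    """
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)  # samples; positive => left channel leads
    itd = lag / fs                           # seconds
    # Clip to the physically valid range before arcsin
    s = np.clip(speed_of_sound * itd / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Synthetic check: a noise burst reaching the right mic 5 samples late
fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(1024)
delay = 5
left = np.concatenate([sig, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), sig])
az = estimate_azimuth(left, right, fs)  # roughly 32 degrees to the left
```

    A 5-sample delay at 16 kHz is 312.5 µs; with a 0.2 m baseline this corresponds to an azimuth of about 32°.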

    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and further track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of social and human activities around them and enable machines to be context-aware in a similar way. Such methods can be employed to assist the hearing impaired to visualize sounds, for robot navigation, and to monitor biodiversity, the home, and cities. A real-life acoustic scene is complex in nature, with multiple sound events that are temporally and spatially overlapping, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class can vary considerably: different cars have different horns, and even within the same model the duration and temporal structure of the horn sound are driver-dependent. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, in this thesis we investigate the SELDT task with a data-driven approach based on deep neural networks (DNNs). The sound event detection (SED) task requires the detection of onset and offset times for individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED using two different DNNs, recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that using multichannel audio features improves the SED performance for overlapping sound events in comparison to traditional single-channel audio features.
The proposed novel features and methods produced state-of-the-art performance for the real-life SED task and won the IEEE AASP DCASE challenge in both 2016 and 2017. Sound event localization is the task of spatially locating the position of individual sound events. Traditionally, this has been approached using parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization in the complete azimuth and elevation space. In comparison to parametric methods, which require the number of active sources to be known, the proposed method learns this directly from the input data and estimates the sources' respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method and is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets that represent anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.
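
    As a minimal illustration of the SED output format (onset and offset times per event), frame-wise class probabilities such as those a CRNN would emit can be post-processed by thresholding. The threshold and hop size are assumed values, not those used in the thesis:

```python
import numpy as np

def frames_to_events(probs, threshold=0.5, hop_s=0.02):
    """Convert frame-wise activity probabilities for one sound event class
    into (onset, offset) times in seconds, the output format SED requires.

    probs : 1-D array of per-frame probabilities for a single event class.
    """
    active = probs >= threshold
    # Pad with inactive frames so transitions at the edges are detected
    padded = np.concatenate([[False], active, [False]])
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    onsets, offsets = changes[0::2], changes[1::2]
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]

# Two events: frames 2-4 and frames 8-9 are active
p = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.6, 0.9])
events = frames_to_events(p)  # [(0.04 s, 0.10 s), (0.16 s, 0.20 s)]
```

    Real SED systems typically add median filtering or minimum-duration rules before this step to suppress spurious single-frame activations.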

    Adaptive and learning-based formation control of swarm robots

    Autonomous aerial and wheeled mobile robots play a major role in tasks such as search and rescue, transportation, monitoring, and inspection. However, these operations face several open challenges, including robust autonomy and adaptive coordination based on the environment and operating conditions, particularly in swarm robots with limited communication and perception capabilities. Furthermore, the computational complexity increases exponentially with the number of robots in the swarm. This thesis examines two different aspects of the formation control problem. On the one hand, we investigate how formation could be performed by swarm robots with limited communication and perception (e.g., the Crazyflie nano quadrotor). On the other hand, we explore human-swarm interaction (HSI) and different shared-control mechanisms between humans and swarm robots (e.g., the BristleBot) for artistic creation. In particular, we combine bio-inspired techniques (i.e., flocking, foraging) with learning-based control strategies (using artificial neural networks) for adaptive control of multiple robots. We first review how learning-based control and networked dynamical systems can be used to assign distributed and decentralized policies to individual robots such that the desired formation emerges from their collective behavior. We proceed by presenting a novel flocking control for UAV swarms using deep reinforcement learning. We formulate the flocking formation problem as a partially observable Markov decision process (POMDP) and consider a leader-follower configuration, where consensus among all UAVs is used to train a shared control policy, and each UAV performs actions based on the local information it collects. In addition, to avoid collisions among UAVs and guarantee flocking and navigation, the reward function combines a global flocking-maintenance term, a mutual reward, and a collision penalty.
We adapt deep deterministic policy gradient (DDPG) with centralized training and decentralized execution to obtain the flocking control policy using actor-critic networks and a global state space matrix. In the context of swarm robotics in the arts, we investigate how the formation paradigm can serve as an interaction modality for artists to aesthetically utilize swarms. In particular, we explore particle swarm optimization (PSO) and random walks to control the communication between a team of robots with swarming behavior for musical creation.
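
    A toy sketch of the kind of composite reward described above, with flocking-maintenance, collision-penalty, and navigation terms. All weights and distance thresholds are illustrative assumptions, not the thesis's actual reward design:

```python
import numpy as np

def flocking_reward(positions, goal, safe_dist=1.0, coh_dist=5.0):
    """Composite shaping reward for a flock: penalize near-collisions,
    penalize loss of cohesion, and reward progress toward a goal.

    positions : (n, 2) array of agent positions
    goal      : (2,) array, the navigation target
    All constants are made-up illustrative values.
    """
    n = len(positions)
    reward = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            if d < safe_dist:        # collision penalty
                reward -= 10.0
            elif d > coh_dist:       # flock is breaking apart
                reward -= 1.0
    # Navigation term: negative mean distance of the flock to the goal
    reward -= np.mean([np.linalg.norm(p - goal) for p in positions])
    return reward

# Three agents in a well-spaced triangle, goal at the origin
pos = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
goal = np.array([0.0, 0.0])
r = flocking_reward(pos, goal)  # no penalties, only the navigation term
```

    In a DDPG setting this scalar would be shared across agents during centralized training, while each UAV's actor network acts only on its local observation at execution time.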

    A Review of Deep Learning Methods and Applications for Unmanned Aerial Vehicles

    Deep learning has recently shown outstanding results for solving a wide variety of robotic tasks in the areas of perception, planning, localization, and control. Its excellent capability for learning representations from the complex data acquired in real environments makes it extremely suitable for many kinds of autonomous robotic applications. In parallel, Unmanned Aerial Vehicles (UAVs) are currently being extensively applied to several types of civilian tasks, ranging from security, surveillance, and disaster rescue to parcel delivery and warehouse management. In this paper, a thorough review is performed of recently reported uses and applications of deep learning for UAVs, including the most relevant developments as well as their performance and limitations. In addition, a detailed explanation of the main deep learning techniques is provided. We conclude with a description of the main challenges for the application of deep learning to UAV-based solutions.

    A Review on Human-Computer Interaction and Intelligent Robots

    In the field of artificial intelligence, human–computer interaction (HCI) technology and its related intelligent robot technologies are essential and active areas of research. From the perspectives of software algorithms and hardware systems, these technologies aim to build a natural HCI environment. The purpose of this research is to provide an overview of HCI and intelligent robots. This research highlights the existing technologies of listening, speaking, reading, writing, and other senses, which are widely used in human interaction. Based on these same technologies, this research introduces some intelligent robot systems and platforms. This paper also forecasts some vital challenges of researching HCI and intelligent robots. The authors hope that this work will help researchers in the field to acquire the necessary information and technologies to further conduct more advanced research.

    Low-Cost Indoor Localisation Based on Inertial Sensors, Wi-Fi and Sound

    The average life expectancy has been increasing in the last decades, creating the need for new technologies to improve the quality of life of the elderly. In the Ambient Assisted Living scope, indoor location systems emerged as a promising technology capable of supporting the elderly, providing them a safer environment to live in, and promoting their autonomy. Current indoor location technologies are divided into two categories, depending on their need for additional infrastructure. Infrastructure-based solutions require expensive deployment and maintenance. On the other hand, most infrastructure-free systems rely on a single source of information, being highly dependent on its availability. Such systems will hardly be deployed in real-life scenarios, as they cannot handle the absence of their source of information. An efficient solution must, thus, guarantee the continuous indoor positioning of the elderly. This work proposes a new room-level low-cost indoor location algorithm. It relies on three information sources: inertial sensors, to reconstruct users' trajectories; environmental sound, to exploit the unique characteristics of each home division; and Wi-Fi, to estimate the distance to the Access Point in the neighbourhood. Two data collection protocols were designed to resemble a real living scenario, and a data processing stage was applied to the collected data. Then, each source was used to train individual Machine Learning (including Deep Learning) algorithms to identify room-level positions. As each source provides different information to the classification, the data were merged to produce a more robust localization. Three data fusion approaches (input-level, early, and late fusion) were implemented for this goal, providing a final output containing complementary contributions from all data sources.
Experimental results show that the performance improved when more than one source was used, attaining a weighted F1-score of 81.8% in localization across seven home divisions. In conclusion, the evaluation of the developed algorithm shows that it can achieve accurate room-level indoor localization and is thus suitable for Ambient Assisted Living scenarios.
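
    A minimal sketch of late (decision-level) fusion as described above: each source's classifier emits a probability distribution over rooms, and the fused decision is a weighted average. The source names, weights, and probabilities below are made up for illustration:

```python
import numpy as np

def late_fusion(source_probs, weights=None):
    """Decision-level fusion of per-source classifiers.

    source_probs : (n_sources, n_rooms) array-like, one probability
                   distribution over rooms per information source.
    weights      : optional per-source weights (equal by default).
    Returns the fused distribution and the index of the predicted room.
    """
    P = np.asarray(source_probs, dtype=float)
    if weights is None:
        weights = np.ones(len(P)) / len(P)
    fused = np.average(P, axis=0, weights=weights)
    fused = fused / fused.sum()  # renormalize for safety
    return fused, int(np.argmax(fused))

# Three sources (inertial, sound, Wi-Fi) scoring three rooms
inertial = [0.6, 0.3, 0.1]
sound    = [0.2, 0.5, 0.3]
wifi     = [0.3, 0.4, 0.3]
fused, room = late_fusion([inertial, sound, wifi])  # room 1 wins on average
```

    Input-level and early fusion differ in that they merge raw features or intermediate representations before a single classifier, rather than combining finished decisions as done here.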

    Learning Attention Mechanisms and Context: An Investigation into Vision and Emotion

    Attention mechanisms for context modelling are becoming ubiquitous in neural architectures in machine learning. The attention mechanism is a technique that filters out information irrelevant to a given task and focuses on learning task-dependent fixation points or regions. Furthermore, attention mechanisms pose questions about a given task: 'what' to learn and 'where/how' to learn for task-specific context modelling. The context comprises the conditional variables instrumental in deciding the categorical distribution for the given data. Why is learning task-specific context necessary? To answer these questions, context modelling with attention in the vision and emotion domains is explored in this thesis using attention mechanisms with different hierarchical structures. The three main goals of this thesis are building superior classifiers using attention-based deep neural networks (DNNs), investigating the role of context modelling in the given tasks, and developing a framework for interpreting hierarchies and attention in deep attention networks. In the vision domain, gesture and posture recognition tasks in diverse environments are chosen. In the emotion domain, visual and speech emotion recognition tasks are chosen. These tasks are selected for their sequential properties, which suit modelling of a spatiotemporal context. One of the key challenges from a machine learning standpoint is to extract patterns that bear maximum correlation with the information encoded in a signal while being as insensitive as possible to other types of information the signal carries. A possible way to overcome this problem is to learn task-dependent representations. To achieve that, novel spatiotemporal context modelling networks and the mixture of multi-view attention (MOMA) networks are proposed using bidirectional long short-term memory networks (BLSTM), convolutional neural networks (CNN), Capsule networks, and attention networks.
A framework has been proposed to interpret the internal attention states with respect to the given task. The results of the classifiers on the assigned tasks are compared with state-of-the-art DNNs, and the proposed classifiers achieve superior results. The context in speech emotion recognition is explored in depth with the attention interpretation framework, which shows that the proposed model can assign word importance based on acoustic context. Furthermore, it has been observed that the internal states of the attention correlate with human perception of acoustic cues for speech emotion recognition. Overall, the results demonstrate superior classifiers and context learning models with interpretable frameworks. These findings are very important for speech emotion recognition systems. In this thesis, not only are better models produced, but the interpretability of those models is also explored and their internal states analysed. The phones and words are aligned with the attention vectors, and it is seen that vowel sounds are more important than consonants for defining emotional acoustic cues. The role of word importance in predicting emotion is demonstrated by visualising attention weights over the words. In a broader perspective, the findings of the thesis on gesture, posture, and emotion recognition may be helpful in tasks like human-robot interaction (HRI) and conversational artificial agents (such as Siri and Alexa). Communication is grounded in symbolic and sub-symbolic cues of intent from visual, audio, or haptic channels, and understanding intent depends heavily on reasoning about the situational context. Emotion, i.e. speech and visual emotion, provides context to a situation and is a deciding factor in response generation.
Emotional intelligence and information from vision, audio, and other modalities are essential for making human-human and human-robot communication more natural and feedback-driven.
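
    The attention weights that get visualised over words arise from a learned weighting over inputs. A minimal NumPy sketch of standard scaled dot-product attention (a common formulation, not necessarily the thesis's exact MOMA architecture) shows where those weights come from:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V.

    The softmax weights are the internal attention states that can be
    visualised over words or frames to interpret the model.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Three "word" keys/values and one query that strongly matches word 0
K = np.eye(3)
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Q = np.array([[10.0, 0.0, 0.0]])
out, w = scaled_dot_product_attention(Q, K, V)
# w concentrates almost all mass on the first word, so the output is
# dominated by that word's value vector
```

    Plotting `w` against the aligned words is precisely the kind of attention-weight visualisation used to argue that, for example, vowel-bearing words carry the emotional acoustic cues.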