    Acoustic localization of people in reverberant environments using deep learning techniques

    Localizing people from acoustic information is increasingly important in real-world applications such as security, surveillance, and human-robot interaction. In many cases, it is necessary to accurately localize people or objects based on the sound they generate, especially in noisy and reverberant environments where traditional localization methods may fail, or in scenarios where video-based methods are not feasible because such sensors are unavailable or significant occlusions exist. For example, in security and surveillance, the ability to accurately localize a sound source can help identify potential threats or intruders. In healthcare settings, acoustic localization can be used to monitor the movements and activities of patients, especially those with mobility problems. In human-robot interaction, robots equipped with acoustic localization capabilities can better perceive and respond to their environment, enabling more natural and intuitive interactions with humans. The development of accurate and robust acoustic localization systems using advanced techniques such as deep learning is therefore of great practical importance. This doctoral thesis addresses the problem along three main research lines: (i) the design of an end-to-end system based on neural networks capable of improving the localization rates of existing state-of-the-art systems; (ii) the design of a system capable of localizing one or several simultaneous speakers in environments with different characteristics and different sensor-array geometries without retraining; and (iii) the design of systems capable of refining the acoustic power maps used to localize acoustic sources, in order to improve subsequent localization. To evaluate these objectives, several realistic databases with different characteristics were used, in which the people involved in the scenes can act without any restriction. All the proposed systems were evaluated under the same conditions and outperform current state-of-the-art systems in terms of localization error.
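
    The acoustic power maps mentioned in research line (iii) are classically built with steered-response-power methods such as SRP-PHAT. As a hedged illustration only, not the thesis's actual pipeline, the Python sketch below accumulates GCC-PHAT correlations over microphone pairs for a grid of candidate source positions; the array geometry, function names, and grid are all assumptions.

```python
# Illustrative SRP-PHAT power map (a sketch, not the thesis's pipeline).
# Assumes a small microphone array with known geometry; names are hypothetical.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat(x, y, fs, max_tau):
    """GCC-PHAT cross-correlation between two microphone signals."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = max(1, int(fs * max_tau))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return cc, lags

def srp_phat_map(signals, mic_pos, grid, fs):
    """Steered response power for each candidate source position on a grid."""
    power = np.zeros(len(grid))
    n_mics = len(mic_pos)
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            d = np.linalg.norm(mic_pos[i] - mic_pos[j])
            cc, lags = gcc_phat(signals[i], signals[j], fs, d / SPEED_OF_SOUND)
            for k, p in enumerate(grid):
                # expected TDOA for this candidate position and mic pair
                tau = (np.linalg.norm(p - mic_pos[i])
                       - np.linalg.norm(p - mic_pos[j])) / SPEED_OF_SOUND
                power[k] += cc[np.argmin(np.abs(lags - tau))]
    return power  # the grid position with maximum power is the estimate
```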

    Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos

    We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.
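
    Contribution (3) synthesizes new audio to grow the training set and reduce class imbalance. A minimal sketch of one common recipe, mixing a phrase recording with background noise at a random signal-to-noise ratio, is shown below; the function and file names are hypothetical, not the authors' code.

```python
# Minimal sketch of noise-mixing augmentation at a random SNR.
# File paths and names are hypothetical; not the paper's actual code.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Mix a phrase recording with background noise at a target SNR (dB)."""
    # Loop or trim the noise so it covers the whole speech segment
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    return mixed / max(1.0, np.max(np.abs(mixed)))  # avoid clipping

# Usage: augment one crowdsourced phrase with classroom background noise
phrase, fs = sf.read("good_job.wav")        # hypothetical file
noise, _ = sf.read("classroom_noise.wav")   # hypothetical file
augmented = mix_at_snr(phrase, noise, snr_db=np.random.uniform(0, 20))
sf.write("good_job_noisy.wav", augmented, fs)
```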

    Designing instruments towards networked music practices

    It is commonly noted in New Interfaces for Musical Expression (NIME) research that few such interfaces reach the mainstream or are adopted by the general public. Some research in Sound and Music Computing (SMC) suggests that a lack of humanistic research guiding technological development may be one of the causes: many new technologies are invented with no real aim beyond technical innovation, whereas great products emphasize user-friendliness and user involvement in the design process, or User-Centred Design (UCD), which seeks to guarantee that innovation addresses real, existing needs among users. Such an approach includes not only traditionally quantifiable usability goals but also qualitative, psychological, philosophical, and musical ones. The latter approach has come to be called experience design, while the former is referred to as interaction design. Although the Human-Computer Interaction (HCI) community has broadly recognized the significance of qualitative needs and experience design, NIME has been slower to adopt this paradigm. This thesis therefore investigates its relevance to NIME, and specifically to Computer Supported Cooperative Work (CSCW) for music applications, by devising a prototype for group music making based on needs elicited from pianists engaged in piano duets, one of the more common forms of group creation in the Western musical tradition. These needs, some of which are socio-emotional in nature, are addressed through our prototype in the context of computers and global networks, allowing composers from all over the world to submit music to a group concert on a Yamaha Disklavier located in Porto, Portugal. Although this prototype is not a new gestural controller per se, and therefore not a traditional NIME, but rather a platform that interfaces groups of composers with a remote audience, the aim of this research is to investigate how contextual parameters such as venue, audience, joint concerts, and technologies affect the overall user experience of such a system. The results of this research have been important not only for understanding the processes, services, events, and environments in which NIMEs operate, but also for understanding reciprocity, creativity, and experience design in networked music practices.

    Privacy-preserving and Privacy-attacking Approaches for Speech and Audio -- A Survey

    In contemporary society, voice-controlled devices such as smartphones and home assistants have become pervasive due to their advanced capabilities and functionality. The always-on nature of their microphones offers users the convenience of readily accessing these devices. However, recent research and events have revealed that such voice-controlled devices are prone to various forms of malicious attack, making safeguarding against such attacks a growing concern for both users and researchers. Despite the numerous studies that have investigated adversarial attacks and privacy preservation for images, no comparably conclusive study has been conducted for the audio domain. Therefore, this paper examines existing privacy-preserving and privacy-attacking strategies for audio and speech. To achieve this goal, we classify the attack and defense scenarios into several categories and provide a detailed analysis of each approach. We also interpret the dissimilarities between the various approaches, highlight their contributions, and examine their limitations. Our investigation reveals that voice-controlled devices based on neural networks are inherently susceptible to specific types of attacks. Although it is possible to enhance the robustness of such models to certain forms of attack, more sophisticated approaches are required to comprehensively safeguard user privacy.
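
    Among the attack families such surveys cover are adversarial perturbations of the input waveform. As a hedged illustration only (the survey itself is not tied to one method), the sketch below applies the basic fast gradient sign method (FGSM) to a generic audio classifier; `model`, the tensor shapes, and `epsilon` are stand-in assumptions.

```python
# Hedged FGSM sketch against a generic audio classifier (PyTorch).
# `model` and the input tensors are stand-ins, not any specific system.
import torch
import torch.nn.functional as F

def fgsm_audio(model, waveform, label, epsilon=0.002):
    """One signed-gradient step that pushes the model away from `label`."""
    waveform = waveform.clone().detach().requires_grad_(True)
    logits = model(waveform)              # (batch, n_classes)
    loss = F.cross_entropy(logits, label)
    loss.backward()
    # Perturbation bounded by epsilon in amplitude, then kept in [-1, 1]
    adversarial = waveform + epsilon * waveform.grad.sign()
    return adversarial.clamp(-1.0, 1.0).detach()
```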

    Should Acoustic Simulation Technology be Utilised in Architectural Practice? Does it have the Potential for BIM Integration?

    The research presented in this paper aims, firstly, to convey the importance of our acoustic environment by focusing on the effects of undesirable acoustic conditions on cognitive abilities in spaces where cognitive performance is of the utmost concern: our learning environments. Secondly, it investigates current state-of-the-art acoustic simulation methods, available platforms, and their levels of interoperability with architectural BIM authoring software. Structured interviews were carried out with seven Irish architects and architectural technologists to determine whether a disconnect exists between architectural design and acoustic performance and to identify the advantages and disadvantages of current workflows for acoustic performance evaluation. Additionally, industry opinions were gathered on whether our acoustic environments measurably suffer from the apparent gap in integrated acoustic evaluation solutions for a BIM-enabled design workflow, and on industry demand for better integration of acoustic evaluation tools with BIM authoring platforms.

    Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

    We study a novel neural architecture and its training strategies for a speaker encoder that performs speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of varying length. Contrastive learning is a typical self-supervised learning technique, but the quality of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance; unfortunately, such poor-man's positive pairs (PPP) lack the diversity necessary for training a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EER) of 2.89%, 3.17%, and 6.27% under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77%, and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
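
    The core training signal here is a contrastive loss over positive pairs. A compact sketch of a symmetric InfoNCE-style loss, with each row of two embedding batches treated as a positive pair, is shown below; the paired rows stand in for the paper's diverse positive pairs, while the cross-modal sampling of those pairs from face data is not reproduced.

```python
# Sketch of a symmetric InfoNCE-style contrastive loss over speaker
# embeddings. Rows emb_a[i] and emb_b[i] form a positive pair; every
# other row in the batch serves as a negative.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric cross-entropy over a cosine-similarity matrix."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Each embedding must rank its own positive above all negatives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```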

    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of the social and human activities around them and enable machines to be context-aware in a similar way. Such methods can be employed to assist the hearing impaired in visualizing sounds, for robot navigation, and to monitor biodiversity, the home, and cities. A real-life acoustic scene is complex in nature, with multiple sound events that overlap temporally and spatially, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class can exhibit considerable variability; for example, different cars have different horns, and even within the same car model, the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, in this thesis we investigate the SELDT task with a data-driven approach using deep neural networks (DNNs). The sound event detection (SED) task requires detecting the onset and offset times of individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED with two different DNNs: recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that multichannel audio features improve SED performance for overlapping sound events in comparison to traditional single-channel audio features. The proposed novel features and methods produced state-of-the-art performance on the real-life SED task and won the IEEE AASP DCASE challenge in both 2016 and 2017. Sound event localization is the task of spatially locating individual sound events. Traditionally, this has been approached with parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events; this is the first DNN-based method performing localization over the complete azimuth and elevation space. In contrast to parametric methods, which require knowledge of the number of active sources, the proposed method learns this information directly from the input data and estimates the sources' respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method, and it is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets covering anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.
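
    As a rough illustration of the CRNN family used throughout this thesis, the following PyTorch sketch combines a convolutional front end, bidirectional recurrent layers, and parallel SED and DOA output heads; the layer sizes and feature dimensions are illustrative assumptions, not the thesis's exact configuration.

```python
# Compact CRNN sketch with parallel SED and DOA heads (PyTorch).
# Layer sizes and feature dimensions are illustrative only.
import torch
import torch.nn as nn

class SELDCRNN(nn.Module):
    def __init__(self, n_channels=4, n_classes=11, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool frequency, keep time
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # 64 freq bins -> 4
        )
        self.gru = nn.GRU(64 * 4, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(2 * hidden, n_classes)      # event activity
        self.doa_head = nn.Linear(2 * hidden, 2 * n_classes)  # azimuth/elevation

    def forward(self, x):                    # x: (batch, channels, time, 64)
        z = self.conv(x)                     # (batch, 64, time, 4)
        z = z.permute(0, 2, 1, 3).flatten(2) # (batch, time, 256)
        z, _ = self.gru(z)
        # Per-frame class activity in [0, 1] and normalized angles in [-1, 1]
        return torch.sigmoid(self.sed_head(z)), torch.tanh(self.doa_head(z))
```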

    Collaborative Artificial Intelligence Development for Social Robots

    The main aim of this doctoral thesis was to investigate how to involve a community in the collaborative artificial intelligence (AI) development of a social robot. The work was initiated by the author's personal interest in developing for the Sony AIBO robots, which are no longer available on the retail market, although user communities with a special interest in these robots remain on the internet. At first, to attract people's attention, the author developed three specific features for the robot: teaching the robot 1) sound event recognition, so it can react to environmental audio stimuli, 2) a method to detect the surface underneath it, and 3) how to recognize its own body states. As this AI development proved very challenging, the author decided to start a community project for artificial intelligence development. Community involvement has a long history in open-source software projects, and some robotics companies have tried to benefit from their user base in product development. An active online community of Sony AIBO owners was approached to investigate the factors that engage its members in creative processes. For this purpose, 78 Sony AIBO owners were recruited online to fill in a questionnaire, and their data were analyzed with respect to age, gender, culture, length of ownership, user contribution, and model preference. The results revealed the motives for owning these robots for many years and how these heavy users perceived their social robots after a long period in the robot acceptance phase. For example, female participants tended to have a more emotional relationship with their robots, while male participants' long-term engagement was more technically motivated. The questionnaire answers were also analyzed to explore user expectations and discover the key needs of this user group. The results revealed that the most-wanted skills were interaction with humans and autonomous operation. Integration with AI agents and Internet services was important, but long-term memory and learning capabilities were less relevant to the participants. The diverse preferences for robot skills led to a prioritized recommendation list that complements the design guidelines for social robots in the literature. In sum, the findings of this thesis show that developing AI features for a discontinued robot is possible but takes a lot of time and shared community effort. To involve a specific community, one first needs to build up trust by working with and for the community. Trust in the long-term endurance of the development project was also found to be a precondition for community commitment. The discoveries of this thesis can be applied to similar collaborative AI developments in the future. This dissertation makes significant contributions to robotics. First, long-term robot usage had not previously been studied on a years-long scale, as even the most extended human-robot interaction studies followed test subjects for only a few months. Here, a questionnaire investigated robot owners with 1-10+ years of ownership and their attitudes toward robot acceptance, and the survey results helped identify viable strategies for engaging users over the long term. Second, innovative ways were explored to involve online communities in robotics development. Past approaches introduced community ideas and opinions into product design and innovation iterations; the community in this dissertation tested the developed AI engine, provided input on further development directions, created content for the actual AI, and gave feedback on product quality. These contributions advance the field of social robotics.
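
    The abstract names sound event recognition as one of the robot's features but does not spell out a method; as a loosely related sketch only, the snippet below shows one generic recipe (averaged MFCC features plus a small SVM classifier) for reacting to environmental audio. All file names and labels are hypothetical, and this is not the author's AIBO implementation.

```python
# Generic sound-event-recognition recipe (MFCC features + SVM);
# a sketch only, not the thesis's actual implementation.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, sr=16000, n_mfcc=20):
    """Average MFCCs over time to get one fixed-size vector per clip."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical labelled clips of environmental sounds
train_paths = ["doorbell_01.wav", "speech_01.wav", "clap_01.wav"]
train_labels = ["doorbell", "speech", "clap"]

X = np.stack([mfcc_features(p) for p in train_paths])
clf = SVC(kernel="rbf").fit(X, train_labels)

# On the robot, classify each incoming clip and react to the predicted label
print(clf.predict(mfcc_features("unknown.wav").reshape(1, -1)))
```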