
    Audio-based localization for ubiquitous sensor networks

    Thesis (S.M.) -- Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2005 (Benjamin Christopher Dalton). Includes bibliographical references (p. 97-101). This research presents novel techniques for acoustic-source location for both actively triggered and passively detected signals using pervasive, distributed networks of devices, and investigates the combination of existing resources available in personal electronics to build a digital sensing 'commons'. By connecting personal resources with those of the people nearby, tasks can be achieved, through distributed placement and statistical improvement, that a single device could not accomplish alone. The utility and benefits of spatio-temporal acoustic sensing are presented in the context of ubiquitous computing and machine listening history. An active audio self-localisation algorithm is described which is effective in distributed sensor networks even if only coarse temporal synchronisation can be established. Pseudo-noise 'chirps' are emitted and recorded at each of the nodes. Pair-wise distances are calculated by comparing the difference in the audio delays between the peaks measured in each recording. By removing dependence on fine-grained temporal synchronisation, it is hoped that this technique can be used concurrently across a wide range of devices to better leverage the existing audio sensing resources that surround us. A passive acoustic source location estimation method is then derived which is suited to the microphone resources of network-connected heterogeneous devices containing asynchronous processors and uncalibrated sensors. Under these constraints, position coordinates must be determined simultaneously for pairs of sounds recorded at each microphone, forming a chain of acoustic events. It is shown that an iterative, numerical least-squares estimator can be used. Initial position estimates of the source pair are first found from the previous estimate in the chain using a closed-form least-squares approach, improving the convergence rate of the iterative step. Implementations of these methods using the Smart Architectural Surfaces development platform are described and assessed. The viability of the active ranging technique is further demonstrated in a mixed-device ad-hoc sensor network case using existing off-the-shelf technology. Finally, drawing on human-centric onset detection as a means of discovering suitable sound features to be passed between nodes for comparison, the extension of the source location algorithm beyond the use of pseudo-noise test sounds, to enable the location of extraneous noises and acoustic streams, is discussed for further study.
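    The active ranging step lends itself to a brief illustration. Below is a minimal sketch of the two-recording chirp ranging idea, assuming each node emits a distinct, known pseudo-noise chirp and that each node's speaker and microphone are co-located; the matched-filter peak picking and all names here are our assumptions, not the thesis implementation.

```python
# Minimal sketch: pair-wise ranging from two local recordings.
# Assumptions (ours): known chirp templates, shared sample rate fs,
# speaker and microphone co-located at each node.
import numpy as np
from scipy.signal import fftconvolve

def peak_time(recording, template, fs):
    """Time (s) of the strongest matched-filter response of `template`."""
    corr = fftconvolve(recording, template[::-1], mode="valid")
    return np.argmax(np.abs(corr)) / fs

def pairwise_distance(rec_a, rec_b, chirp_a, chirp_b, fs, c=343.0):
    """Distance between nodes A and B from their own recordings only.

    Each node measures the interval between the two chirp peaks within a
    single local recording, so inter-node clock offsets cancel:
        I_A = (T_B + tau) - T_A,  I_B = (T_A + tau) - T_B
        I_A + I_B = 2 * tau  =>  d = c * (I_A + I_B) / 2
    """
    i_a = peak_time(rec_a, chirp_b, fs) - peak_time(rec_a, chirp_a, fs)
    i_b = peak_time(rec_b, chirp_a, fs) - peak_time(rec_b, chirp_b, fs)
    return c * (i_a + i_b) / 2.0
```

    Because each interval is measured within one local recording, only coarse synchronisation (enough to window the chirp exchange) is required, which is the property the abstract highlights.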

    Online Audio-Visual Multi-Source Tracking and Separation: A Labeled Random Finite Set Approach

    The dissertation proposes an online solution for separating an unknown and time-varying number of moving sources using audio and visual data. The random finite set framework is used for the modeling and fusion of audio and visual data. This enables an online tracking algorithm to estimate the source positions and identities at each time point. With this information, a set of beamformers can be designed to separate each desired source and suppress the interfering sources.
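    As a rough illustration of the separation stage described above, the sketch below steers one delay-and-sum beamformer per tracked source; the dissertation pairs its labelled random-finite-set tracker with beamformers, but the beamformer type, array geometry and names here are our assumptions.

```python
# Sketch: steer one beamformer per tracked, labelled source position.
import numpy as np

def delay_and_sum(frame_fft, freqs, mic_xy, src_xy, c=343.0):
    """Delay-and-sum beamformer aimed at one estimated source position.

    frame_fft : (n_mics, n_freqs) one STFT frame per microphone
    freqs     : (n_freqs,) bin frequencies in Hz
    mic_xy    : (n_mics, 2) microphone coordinates in metres
    src_xy    : (2,) source position estimate from the tracker
    """
    delays = np.linalg.norm(mic_xy - src_xy, axis=1) / c      # seconds
    steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return (frame_fft * steer).mean(axis=0)   # phase-align, then average

# One output per track label, e.g.:
# separated = {label: delay_and_sum(X, freqs, mics, pos)
#              for label, pos in tracks.items()}
```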

    Simulating realistic multiparty speech data: for the development of distant microphone ASR systems

    Automatic speech recognition has become a ubiquitous technology integrated into our daily lives. However, the problem remains challenging when the speaker is far away from the microphone. In such scenarios, the speech is degraded both by reverberation and by the presence of additive noise. The situation is particularly challenging when competing speakers are present (i.e. multi-party scenarios). Acoustic scene simulation has been a major tool for training and developing distant microphone speech recognition systems, and is now being used to develop solutions for multi-party scenarios. It has been used both in training -- as it allows cheap generation of limitless amounts of data -- and for evaluation -- because it can provide easy access to a ground truth (i.e. a noise-free target signal). However, whilst much work has been conducted to produce realistic artificial scene simulators, the signals produced by such simulators are only as good as the `metadata' used to define the setups, i.e., the data describing, for example, the number of speakers and their distribution relative to the microphones. This thesis looks at how realistic metadata can be derived by analysing how speakers behave in real domestic environments; in particular, how to produce scenes with realistic distributions for the factors known to influence the 'difficulty' of a scene, including the separation angle between speakers, the absolute and relative distances of speakers to microphones, and the pattern of temporal overlap of speech. Using an existing audio-visual multi-party conversational dataset, CHiME-5, each of these aspects has been studied in turn. First, producing a realistic angular separation between speakers allows algorithms which enhance signals based on the direction of arrival to be fairly evaluated, reducing the mismatch between real and simulated data. This separation was estimated using automatic people-detection techniques on video recordings from CHiME-5. Results show that commonly used datasets of simulated signals do not follow a realistic distribution, and when a realistic distribution is enforced, a significant drop in performance is observed. Second, using multiple cameras made it possible to estimate the 2-D positions of people inside each scene. This allowed the estimation of realistic distributions for the absolute distance to the microphone and the relative distance to the competing speaker. The results show grouping behaviour among participants within a room, and the impact this has on performance depends on the room size considered. Finally, the amount of overlap and the points in the mixture which contain overlap were explored using finite-state models. These models allowed mixtures to be generated which approach the overlap patterns observed in the real data. Features derived from these models were also shown to be a predictor of the difficulty of the mixture. At each stage of the project, simulated datasets derived using the realistic metadata distributions have been compared to existing standard datasets that use naive or uninformed metadata distributions, and the implications for speech recognition performance are observed and discussed. This work has demonstrated how unrealistic approaches can produce over-promising results and can bias research towards techniques that might not work well in practice. The results will also be valuable in informing the design of future simulated datasets.
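    The finite-state overlap modelling mentioned above can be illustrated with a small sketch: a Markov chain over conversational states from which per-frame speaker-activity patterns are sampled. The states and transition probabilities below are invented placeholders, not the CHiME-5-derived values from the thesis.

```python
# Sketch: sample who-is-talking patterns from a finite-state model.
import numpy as np

STATES = ["silence", "spk1", "spk2", "overlap"]
T = np.array([                  # row-stochastic transitions (placeholders)
    [0.70, 0.14, 0.14, 0.02],   # from silence
    [0.10, 0.80, 0.05, 0.05],   # from spk1
    [0.10, 0.05, 0.80, 0.05],   # from spk2
    [0.05, 0.20, 0.20, 0.55],   # from overlap
])

def sample_activity(n_frames, seed=0):
    """Sample a per-frame speaker-activity sequence from the chain."""
    rng = np.random.default_rng(seed)
    seq, state = [], 0
    for _ in range(n_frames):
        state = rng.choice(len(STATES), p=T[state])
        seq.append(STATES[state])
    return seq

pattern = sample_activity(500)
overlap_ratio = pattern.count("overlap") / len(pattern)  # a 'difficulty' feature
```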

    Exploring the use of speech in audiology: A mixed methods study

    This thesis aims to advance the understanding of how speech testing is, and can be, used for hearing device users within the audiological test battery. To address this, I engaged with clinicians and patients to understand the current role that speech testing plays in audiological testing in the UK, and developed a new listening test which combined speech testing with localisation judgments in a dual-task design. Normal-hearing listeners and hearing aid users were tested, and a series of technical measurements were made to understand how advanced hearing aid settings might determine task performance. A questionnaire was completed by public and private sector hearing healthcare professionals in the UK to explore the use of speech testing. Overall, results revealed this assessment tool was underutilised by UK clinicians, but there was significantly greater use in the private sector. Through a focus group and semi-structured interviews with hearing aid users, I identified a mismatch between their common listening difficulties and the assessment tools used in audiology, and highlighted a lack of deaf awareness in UK adult audiology. The Spatial Speech in Noise Test (SSiN) is a dual-task paradigm to simultaneously assess relative localisation and word identification performance. Testing normal-hearing listeners to investigate the impact of the dual-task design showed that the SSiN increases cognitive load and therefore better reflects challenging listening situations. A comparison of relative localisation and word identification performance showed that hearing aid users benefitted less than normal-hearing listeners from spatially separating speech and noise in the SSiN. To investigate how the SSiN could be used to assess advanced hearing aid features, a subset of hearing aid users were fitted with the same hearing aid type and completed the SSiN once with adaptive directionality and once with omnidirectionality. The SSiN results differed between conditions, but a larger sample size is needed to confirm these effects. Hearing aid technical measurements were used to quantify how hearing aid output changed in response to the SSiN paradigm.

    Multimodal Probabilistic Person Tracking and Identification in Smart Spaces

    In this thesis, a new methodology is introduced for the multimodal tracking and identification of multiple persons by seeking and integrating reliable ID cues whenever they become observable. The method opportunistically integrates person-specific identification cues that can only sparsely be observed for each person over time, and keeps track of the location of identified persons while ID cues are not available.
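    A minimal sketch of the opportunistic identification idea, under our own simplifications: each track carries a categorical belief over enrolled identities and performs a Bayes update only when an ID cue happens to be observed; between cues the belief simply persists while the tracker follows position. The uniform prior and example likelihoods are illustrative, not from the thesis.

```python
# Sketch: fuse sparse ID cues into a per-track identity belief.
import numpy as np

class TrackIdentity:
    def __init__(self, n_people):
        self.belief = np.full(n_people, 1.0 / n_people)  # uninformed prior

    def update(self, cue_likelihood):
        """One Bayes step: p(id | cues) is prop. to p(cue | id) * p(id | earlier cues)."""
        posterior = self.belief * np.asarray(cue_likelihood)
        self.belief = posterior / posterior.sum()

    def best(self):
        return int(np.argmax(self.belief)), float(self.belief.max())

track = TrackIdentity(n_people=4)
track.update([0.6, 0.2, 0.1, 0.1])   # a face-ID cue, when the face is visible
track.update([0.5, 0.3, 0.1, 0.1])   # a later voice-ID cue
print(track.best())                  # identity sharpens as cues accumulate
```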

    State of the art of audio- and video-based solutions for AAL

    Working Group 3. Audio- and Video-based AAL Applications
    It is a matter of fact that Europe is facing more and more crucial challenges regarding health and social care due to the demographic change and the current economic context. The recent COVID-19 pandemic has stressed this situation even further, thus highlighting the need for taking action. Active and Assisted Living (AAL) technologies come as a viable approach to help face these challenges, thanks to the high potential they have in enabling remote care and support. Broadly speaking, AAL can be referred to as the use of innovative and advanced Information and Communication Technologies to create supportive, inclusive and empowering applications and environments that enable older, impaired or frail people to live independently and stay active longer in society. AAL capitalizes on the growing pervasiveness and effectiveness of sensing and computing facilities to supply the persons in need with smart assistance, by responding to their needs for autonomy, independence, comfort, security and safety. The application scenarios addressed by AAL are complex, due to the inherent heterogeneity of the end-user population, their living arrangements, and their physical conditions or impairment. Despite aiming at diverse goals, AAL systems should share some common characteristics. They are designed to provide support in daily life in an invisible, unobtrusive and user-friendly manner. Moreover, they are conceived to be intelligent, to be able to learn and adapt to the requirements and requests of the assisted people, and to synchronise with their specific needs. Nevertheless, to ensure the uptake of AAL in society, potential users must be willing to use AAL applications and to integrate them in their daily environments and lives. In this respect, video- and audio-based AAL applications have several advantages, in terms of unobtrusiveness and information richness. Indeed, cameras and microphones are far less obtrusive than wearable sensors, which may hinder one’s activities. In addition, a single camera placed in a room can record most of the activities performed in the room, thus replacing many other non-visual sensors. Currently, video-based applications are effective in recognising and monitoring the activities, the movements, and the overall conditions of the assisted individuals, as well as in assessing their vital parameters (e.g., heart rate, respiratory rate). Similarly, audio sensors have the potential to become one of the most important modalities for interaction with AAL systems, as they can have a large range of sensing, do not require physical presence at a particular location and are physically intangible. Moreover, relevant information about individuals’ activities and health status can be derived from processing audio signals (e.g., speech recordings). Nevertheless, as the other side of the coin, cameras and microphones are often perceived as the most intrusive technologies from the viewpoint of the privacy of the monitored individuals. This is due to the richness of the information these technologies convey and the intimate setting where they may be deployed. Solutions able to ensure privacy preservation by context and by design, as well as to ensure high legal and ethical standards, are in high demand. After the review of the current state of play and the discussion in GoodBrother, we may claim that the first solutions in this direction are starting to appear in the literature. 
A multidisciplinary debate among experts and stakeholders is paving the way towards AAL that ensures ergonomics, usability, acceptance and privacy preservation. The DIANA, PAAL, and VisuAAL projects are examples of this fresh approach. This report provides the reader with a review of the most recent advances in audio- and video-based monitoring technologies for AAL. It has been drafted as a collective effort of WG3 to supply an introduction to AAL, its evolution over time and its main functional and technological underpinnings. In this respect, the report contributes to the field with the outline of a new generation of ethical-aware AAL technologies and a proposal for a novel comprehensive taxonomy of AAL systems and applications. Moreover, the report allows non-technical readers to gather an overview of the main components of an AAL system and how these function and interact with the end-users. The report illustrates the state of the art of the most successful AAL applications and functions based on audio and video data, namely (i) lifelogging and self-monitoring, (ii) remote monitoring of vital signs, (iii) emotional state recognition, (iv) food intake monitoring, activity and behaviour recognition, (v) activity and personal assistance, (vi) gesture recognition, (vii) fall detection and prevention, (viii) mobility assessment and frailty recognition, and (ix) cognitive and motor rehabilitation. For these application scenarios, the report illustrates the state of play in terms of scientific advances, available products and research projects. The open challenges are also highlighted. The report ends with an overview of the challenges, the hindrances and the opportunities posed by the uptake of AAL technologies in real-world settings. In this respect, the report illustrates the current procedural and technological approaches to cope with acceptability, usability and trust in AAL technology, by surveying strategies and approaches to co-design, to privacy preservation in video and audio data, to transparency and explainability in data processing, and to data transmission and communication. User acceptance and ethical considerations are also debated. Finally, the potentials coming from the silver economy are overviewed.

    Bayesian framework for multiple acoustic source tracking

    Acoustic source (speaker) tracking in the room environment plays an important role in many speech and audio applications such as multimedia, hearing aids and hands-free speech communication and teleconferencing systems; the position information can be fed into a higher processing stage for high-quality speech acquisition, enhancement of a specific speech signal in the presence of other competing talkers, or keeping a camera focused on the speaker in a video-conferencing scenario. Most existing systems focus on the single-source tracking problem, which assumes that one and only one source is active all the time, and the state to be estimated is simply the source position. However, in practical scenarios multiple speakers may be simultaneously active, and the tracking algorithm should be able to localise each individual source and estimate the number of sources. This thesis contains three contributions towards solutions to multiple acoustic source tracking in moderately noisy and reverberant environments. The first contribution is a time-difference-of-arrival (TDOA) estimation approach for multiple sources. Although the phase transform (PHAT) weighted generalised cross-correlation (GCC) method has been employed to extract the TDOAs of multiple sources, it is primarily used in single-source scenarios and its performance for multiple-TDOA estimation has not been comprehensively studied. The proposed approach combines the degenerate unmixing estimation technique (DUET) and the GCC method. Since the speech mixtures are assumed window-disjoint orthogonal (WDO) in the time-frequency domain, the spectrograms can be separated by employing DUET, and the GCC method can then be applied to the spectrogram of each individual source. Probabilities of detection and false alarm are also proposed to evaluate the TDOA estimation performance under a series of experimental parameters. Next, considering that multiple acoustic sources may appear nonconcurrently, an extended Kalman particle filter (EKPF) is developed for a special multiple acoustic source tracking problem, namely “nonconcurrent multiple acoustic tracking (NMAT)”. The extended Kalman filter (EKF) is used to approximate the optimum weights, and the subsequent particle filtering (PF) naturally takes the previous position estimates as well as the current TDOA measurements into account. The proposed approach is thus able to lock on to sharp changes in the source position quickly, and avoids the tracking lag of the general sequential importance resampling (SIR) PF. Finally, these investigations are extended into an approach to track an unknown and time-varying number of acoustic sources. The DUET-GCC method is used to obtain the TDOA measurements for multiple sources, and a random finite set (RFS) based Rao-Blackwellised PF is employed and modified to track the sources. Each particle has an RFS form encapsulating the states of all sources and is capable of addressing source dynamics: source survival, new source appearance and source deactivation. A data association variable is defined to describe the source dynamics and their relation to the measurements. The Rao-Blackwellisation step is used to decompose the state: the source positions are marginalised out using an EKF, and only the data association variable needs to be handled by a PF. The performance of all the proposed approaches is studied extensively under different noisy and reverberant environments, and compares favourably with existing tracking techniques.
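    For concreteness, the PHAT-weighted GCC step can be sketched as follows. The thesis applies GCC to each DUET-separated spectrogram; for brevity this sketch estimates a single TDOA from one raw microphone pair, and the signature and sign convention are ours.

```python
# Sketch: GCC-PHAT time-difference-of-arrival estimate for one mic pair.
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12        # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # >0: x lags y
```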