214 research outputs found
Deep Room Recognition Using Inaudible Echos
Recent years have seen the increasing need of location awareness by mobile
applications. This paper presents a room-level indoor localization approach
based on the measured room's echos in response to a two-millisecond single-tone
inaudible chirp emitted by a smartphone's loudspeaker. Different from other
acoustics-based room recognition systems that record full-spectrum audio for up
to ten seconds, our approach records audio in a narrow inaudible band for 0.1
seconds only to preserve the user's privacy. However, the short-time and
narrowband audio signal carries limited information about the room's
characteristics, presenting challenges to accurate room recognition. This paper
applies deep learning to effectively capture the subtle fingerprints in the
rooms' acoustic responses. Our extensive experiments show that a two-layer
convolutional neural network fed with the spectrogram of the inaudible echos
achieve the best performance, compared with alternative designs using other raw
data formats and deep models. Based on this result, we design a RoomRecognize
cloud service and its mobile client library that enable the mobile application
developers to readily implement the room recognition functionality without
resorting to any existing infrastructures and add-on hardware.
Extensive evaluation shows that RoomRecognize achieves 99.7%, 97.7%, 99%, and
89% accuracy in differentiating 22 and 50 residential/office rooms, 19 spots in
a quiet museum, and 15 spots in a crowded museum, respectively. Compared with
the state-of-the-art approaches based on support vector machine, RoomRecognize
significantly improves the Pareto frontier of recognition accuracy versus
robustness against interfering sounds (e.g., ambient music).Comment: 29 page
Intelligent ultrasound hand gesture recognition system
With the booming development of technology, hand gesture recognition has become a hotspot in Human-Computer Interaction (HCI) systems. Ultrasound hand gesture recognition is an innovative method that has attracted ample interest due to its strong real-time performance, low cost, large field of view, and illumination independence. Well-investigated HCI applications include external digital pens, game controllers on smart mobile devices, and web browser control on laptops. This thesis probes gesture recognition systems on multiple platforms to study the behavior of system performance with various gesture features. Focused on this topic, the contributions of this thesis can be summarized from the perspectives of smartphone acoustic field and hand model simulation, real-time gesture recognition on smart devices with speed categorization algorithm, fast reaction gesture recognition based on temporal neural networks, and angle of arrival-based gesture recognition system.
Firstly, a novel pressure-acoustic simulation model is developed to examine its potential for use in acoustic gesture recognition. The simulation model is creating a new system for acoustic verification, which uses simulations mimicking real-world sound elements to replicate a sound pressure environment as authentically as possible. This system is fine-tuned through sensitivity tests within the simulation and validate with real-world measurements. Following this, the study constructs novel simulations for acoustic applications, informed by the verified acoustic field distribution, to assess their effectiveness in specific devices. Furthermore, a simulation focused on understanding the effects of the placement of sound devices and hand-reflected sound waves is properly designed. Moreover, a feasibility test on phase control modification is conducted, revealing the practical applications and boundaries of this model.
Mobility and system accuracy are two significant factors that determine gesture recognition performance. As smartphones have high-quality acoustic devices for developing gesture recognition, to achieve a portable gesture recognition system with high accuracy, novel algorithms were developed to distinguish gestures using smartphone built-in speakers and microphones. The proposed system adopts Short-Time-Fourier-Transform (STFT) and machine learning to capture hand movement and determine gestures by the pretrained neural network. To differentiate gesture speeds, a specific neural network was designed and set as part of the classification algorithm. The final accuracy rate achieves 96% among nine gestures and three speed levels. The proposed algorithms were evaluated comparatively through algorithm comparison, and the accuracy outperformed state-of-the-art systems.
Furthermore, a fast reaction gesture recognition based on temporal neural networks was designed. Traditional ultrasound gesture recognition adopts convolutional neural networks that have flaws in terms of response time and discontinuous operation. Besides, overlap intervals in network processing cause cross-frame failures that greatly reduce system performance. To mitigate these problems, a novel fast reaction gesture recognition system that slices signals in short time intervals was designed. The proposed system adopted a novel convolutional recurrent neural network (CRNN) that calculates gesture features in a short time and combines features over time. The results showed the reaction time significantly reduced from 1s to 0.2s, and accuracy improved to 100% for six gestures.
Lastly, an acoustic sensor array was built to investigate the angle information of performed gestures. The direction of a gesture is a significant feature for gesture classification, which enables the same gesture in different directions to represent different actions. Previous studies mainly focused on types of gestures and analyzing approaches (e.g., Doppler Effect and channel impulse response, etc.), while the direction of gestures was not extensively studied. An acoustic gesture recognition system based on both speed information and gesture direction was developed. The system achieved 94.9% accuracy among ten different gestures from two directions. The proposed system was evaluated comparatively through numerical neural network structures, and the results confirmed that incorporating additional angle information improved the system's performance.
In summary, the work presented in this thesis validates the feasibility of recognizing hand gestures using remote ultrasonic sensing across multiple platforms. The acoustic simulation explores the smartphone acoustic field distribution and response results in the context of hand gesture recognition applications. The smartphone gesture recognition system demonstrates the accuracy of recognition through ultrasound signals and conducts an analysis of classification speed. The fast reaction system proposes a more optimized solution to address the cross-frame issue using temporal neural networks, reducing the response latency to 0.2s. The speed and angle-based system provides an additional feature for gesture recognition. The established work will accelerate the development of intelligent hand gesture recognition, enrich the available gesture features, and contribute to further research in various gestures and application scenarios
Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (blackbox) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks
Acoustic-channel attack and defence methods for personal voice assistants
Personal Voice Assistants (PVAs) are increasingly used as interface to digital environments. Voice commands are used to interact with phones, smart homes or cars. In the US alone the number of smart speakers such as Amazonâs Echo and Google Home has grown by 78% to 118.5 million and 21% of the US population own at least one device. Given the increasing dependency of society on PVAs, security and privacy of these has become a major concern of users, manufacturers and policy makers. Consequently, a steep increase in research efforts addressing security and privacy of PVAs can be observed in recent years. While some security and privacy research applicable to the PVA domain predates their recent increase in popularity and many new research strands have emerged, there lacks research dedicated to PVA security and privacy. The most important interaction interface between users and a PVA is the acoustic channel and acoustic channel related security and privacy studies are desirable and required. The aim of the work presented in this thesis is to enhance the cognition of security and privacy issues of PVA usage related to the acoustic channel, to propose principles and solutions to key usage scenarios to mitigate potential security threats, and to present a novel type of dangerous attack which can be launched only by using a PVA alone. The five core contributions of this thesis are: (i) a taxonomy is built for the research domain of PVA security and privacy issues related to acoustic channel. An extensive research overview on the state of the art is provided, describing a comprehensive research map for PVA security and privacy. It is also shown in this taxonomy where the contributions of this thesis lie; (ii) Work has emerged aiming to generate adversarial audio inputs which sound harmless to humans but can trick a PVA to recognise harmful commands. The majority of work has been focused on the attack side, but there rarely exists work on how to defend against this type of attack. A defence method against white-box adversarial commands is proposed and implemented as a prototype. It is shown that a defence Automatic Speech Recognition (ASR) can work in parallel with the PVAâs main one, and adversarial audio input is detected if the difference in the speech decoding results between both ASR surpasses a threshold. It is demonstrated that an ASR that differs in architecture and/or training data from the the PVAâs main ASR is usable as protection ASR; (iii) PVAs continuously monitor conversations which may be transported to a cloud back end where they are stored, processed and maybe even passed on to other service providers. A user has limited control over this process when a PVA is triggered without userâs intent or a PVA belongs to others. A user is unable to control the recording behaviour of surrounding PVAs, unable to signal privacy requirements and unable to track conversation recordings. An acoustic tagging solution is proposed aiming to embed additional information into acoustic signals processed by PVAs. A user employs a tagging device which emits an acoustic signal when PVA activity is assumed. Any active PVA will embed this tag into their recorded audio stream. The tag may signal a cooperating PVA or back-end system that a user has not given a recording consent. The tag may also be used to trace when and where a recording was taken if necessary. A prototype tagging device based on PocketSphinx is implemented. Using Google Home Mini as the PVA, it is demonstrated that the device can tag conversations and the tagging signal can be retrieved from conversations stored in the Google back-end system; (iv) Acoustic tagging provides users the capability to signal their permission to the back-end PVA service, and another solution inspired by Denial of Service (DoS) is proposed as well for protecting user privacy. Although PVAs are very helpful, they are also continuously monitoring conversations. When a PVA detects a wake word, the immediately following conversation is recorded and transported to a cloud system for further analysis. An active protection mechanism is proposed: reactive jamming. A Protection Jamming Device (PJD) is employed to observe conversations. Upon detection of a PVA wake word the PJD emits an acoustic jamming signal. The PJD must detect the wake word faster than the PVA such that the jamming signal still prevents wake word detection by the PVA. An evaluation of the effectiveness of different jamming signals and overlap between wake words and the jamming signals is carried out. 100% jamming success can be achieved with an overlap of at least 60% with a negligible false positive rate; (v) Acoustic components (speakers and microphones) on a PVA can potentially be re-purposed to achieve acoustic sensing. This has great security and privacy implication due to the key role of PVAs in digital environments. The first active acoustic side-channel attack is proposed. Speakers are used to emit human inaudible acoustic signals and the echo is recorded via microphones, turning the acoustic system of a smartphone into a sonar system. The echo signal can be used to profile user interaction with the device. For example, a victimâs finger movement can be monitored to steal Android unlock patterns. The number of candidate unlock patterns that an attacker must try to authenticate herself to a Samsung S4 phone can be reduced by up to 70% using this novel unnoticeable acoustic side-channel
Recommended from our members
From active to passive spatial acoustic sensing and applications
The active acoustic sensing system emits modulated acoustic waves and analyzes reflection signals. It is dominant in acoustic spatial sensing. On the other side, the passive acoustic sensing system receives and investigates nature sounds directly. It is good at semantic tasks but has weak performance on spatial sensing. In this dissertation, we manage to bridge three gaps in existing systems. They are the gap between the assumption of signal processing algorithms and the real acoustic environment, the gap between powerful active spatial sensing and limited passive spatial sensing, and the gap between the semantic features and spatial information. We evolve the acoustic sensing system design and extend the functionalities by three novel systems.
First, we develop a fully active spatial sensing system DeepRange which can adapt to the real environment easily. We develop an effective mechanism to generate synthetic training data that captures noise, speaker/mic distortion, and interference in the signals. It removes the need of collecting a large volume of data. We then design a deep range neural network (DRNet) to estimate the distance from raw acoustic signals. It is inspired by signal processing that an ultra-long convolution kernel size helps to combat noise and interference. The model is fully trained over synthetic data, but it can achieve sub-centimeter error robustly in real data despite various environments, background noise, interference, and mobile phone models.
Second, we develop a fused active and passive spatial sensing system for speech separation noted as Spatial Aware Multi-task learning-based Separation (SAMS). We leverage both active sensing and passive sensing to improve AoA estimation and jointly optimize the semantic task and the spatial task. SAMS estimates the spatial location and extracts speech for the target user during teleconferencing simultaneously. We first generate fine-grained spatial embeddings from the userâs voice and inaudible tracking sound, which contains the userâs position and rich multipath information. Furthermore, we develop a deep neural network with multi-task learning to jointly optimize source separation and location. We significantly speed up inference to provide a real-time guarantee.
Finally, we deeply fuse the semantic features and spatial cues to combat the interference and noise in the real environment as well as enable depth sensing in a fully passive setup. Inspired by the âflash-to-bangâ phenomenon (i.e.hearing the thunder after seeing the lightning), we propose FBDepth to measure the depth of the sound source. We formulate the problem as an audio-visual event localization task for collision events. Specifically, FBDepth first aligns correspondence between the video track and audio track to locate the target object and target sound in a coarse granularity. Based on the observation of moving objectsâ trajectories, it proposes to estimate the intersection of optical flow before and after the collision to locate video events in time. It feeds the estimated timestamp of the video event and the other modalities for the final depth estimation. We use a mobile phone to collect the 3.6K+ video clips involving 24 different objects at up to 60m. FBDepth shows superior performance especially at a long range compared to monocular and stereo methods.Computer Science
Recommended from our members
Enhancing touch interactions with passive finger acoustics
In this report, we introduce a unique method of interacting with mobile devices through passive finger acoustics. Phone users typically use one of two fingers when interacting with a touchscreen device, leaving the other fingers idle. The motivation is that users can make use of their idle fingers to enhance the experience of an application. By wearing minimally obtrusive rings on the thumb and index finger, users can make distinguishable clicking sounds to quickly perform actions, without having to interact directly with the screen. Our system leverages the microphone embedded in the mobile device to capture sound and recognize sound in real-time without requiring Internet connection. The different sounds introduce new ways for users to interact with their devices, without cluttering the screen. We evaluated our system on an Android drawing application, which allows the user to switch tools based on clicks. We optimized our system based on accuracy, classification speed, and ease of use, in order to create a comfortable user experienceInformatio
PnP Maxtools: Autonomous Parameter Control in MaxMSP Utilizing MIR Algorithms
This research presents a new approach to computer automation through the implementation of novel real-time music information retrieval algorithms developed for this project. It documents the development of the PnP.Maxtools package, a set of open source objects designed within the popular programming environment MaxMSP. The package is a set of pre/post processing filters, objective and subjective timbral descriptors, audio effects, and other objects that are designed to be used together to compose music or improvise without the use of external controllers or hardware. The PnP.Maxtools package objects are designed to be used quickly and easily using a `plug and play\u27 style with as few initial arguments needed as possible. The PnP.Maxtools package is designed to take incoming audio from a microphone, analyze it, and use the analysis to control an audio effect on the incoming signal in real-time. In this way, the audio content has a real musical and analogous relationship with the resulting musical transformations while the control parameters become more multifaceted and better able to serve the needs of artists. The term Reflexive Automation is presented that describes this unsupervised relationship between the content of the sound being analyzed and the analogous and automatic control over a specific musical parameter. A set of compositions are also presented that demonstrate ideal usage of the object categories for creating reflexive systems and achieving fully autonomous control over musical parameters
- âŠ