Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (black-box) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc.), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
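As background, a minimal sketch of the signal-processing fact such perturbations exploit: for a real-valued signal, reversing a short window in the time domain leaves that window's FFT magnitude unchanged, so magnitude-based acoustic features are largely preserved while the audio becomes much harder for a human to recognize. The window size and test signal below are illustrative, not values from the paper.

```python
import numpy as np

def time_domain_inversion(audio, window_size=256):
    """Reverse each fixed-length window of the signal in time.

    Reversing a real-valued window does not change its FFT magnitude,
    so features derived from magnitude spectra (the front end of many
    VPS pipelines) are largely preserved.
    """
    out = audio.copy()
    for start in range(0, len(audio) - window_size + 1, window_size):
        # Read from the untouched input, write to the output copy.
        out[start:start + window_size] = audio[start:start + window_size][::-1]
    return out

# Per-window FFT magnitudes match to numerical precision.
x = np.random.randn(4 * 256)
y = time_domain_inversion(x)
for s in range(0, len(x), 256):
    assert np.allclose(np.abs(np.fft.rfft(x[s:s + 256])),
                       np.abs(np.fft.rfft(y[s:s + 256])))
```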
Improved acoustics for semi-enclosed spaces in the proximity of residential buildings
Continuous urban densification exacerbates acoustic challenges for residents of housing complexes, who are confronted with higher noise immission from railway and road traffic and construction, as well as louder neighborhood acoustic environments. Not only is indoor noise immission associated with stress, annoyance, and sleep disturbance; the immediate outdoor living environment (e.g., courtyards, private gardens, and playgrounds) can also be acoustically unpleasant and annoying. This non-exhaustive narrative review paper elaborates on the role of a number of design parameters in improving the quality of the outdoor soundscape of housing complexes: architectural and morphological design, facade material characteristics, balconies, greenery, ground, background sounds, and several factors concerning the quality of sounds (e.g., multisensory perception, holistic design, the relevance of space, context, social factors, and co-creation). It mainly covers literature addressing both acoustical (e.g., sound pressure level and room acoustical parameters) and human/perceptual (e.g., comfort and annoyance) factors. A series of recommendations is presented as to how semi-enclosed outdoor spaces in the proximity of residential complexes can be acoustically improved.
Physiology-based model of multi-source auditory processing
Our auditory systems have evolved to process a myriad of acoustic environments. In complex listening scenarios, we can tune our attention to one sound source (e.g., a conversation partner) while monitoring the entire acoustic space for cues we might be interested in (e.g., our names being called, or the fire alarm going off). While normal-hearing listeners handle complex listening scenarios remarkably well, hearing-impaired listeners experience difficulty even when wearing hearing-assist devices. This thesis presents both theoretical work towards understanding the neural mechanisms behind this process and the application of neural models to segregate mixed sources and potentially help the hearing-impaired population.
On the theoretical side, auditory spatial processing has been studied primarily up to the midbrain region, and studies have shown how individual neurons can localize sounds using spatial cues. Yet how higher brain regions such as the cortex use this information to process multiple sounds in competition is not clear. This thesis demonstrates a physiology-based spiking neural network model, which provides a mechanism illustrating how the auditory cortex may organize upstream spatial information when there are multiple competing sound sources in space.
Based on this model, an engineering solution to help hearing-impaired listeners segregate mixed auditory inputs is proposed. Using the neural model to perform sound segregation in the neural domain, the neural outputs (representing the source of interest) are reconstructed back to the acoustic domain using a novel stimulus reconstruction method.
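For illustration only (this is not the thesis's novel reconstruction method), a minimal sketch of classic linear stimulus reconstruction: a ridge-regularised, time-lagged linear decoder is fitted to map model firing rates back to the stimulus envelope. The toy data, sizes, and regulariser are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_units, n_lags = 2000, 8, 16

# Toy stimulus envelope and toy "neural" responses driven by it.
envelope = np.convolve(np.abs(rng.standard_normal(T)),
                       np.ones(20) / 20, mode="same")
rates = np.maximum(0.0, envelope[:, None] * rng.uniform(0.5, 1.5, n_units)
                   + 0.1 * rng.standard_normal((T, n_units)))

# Design matrix of time-lagged firing rates.
X = np.zeros((T, n_units * n_lags))
for lag in range(n_lags):
    X[lag:, lag * n_units:(lag + 1) * n_units] = rates[:T - lag]

# Ridge regression: w = (X'X + lam*I)^(-1) X' s
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ envelope)
reconstruction = X @ w
print("correlation:", np.corrcoef(reconstruction, envelope)[0, 1])
```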
CELP and speech enhancement
This thesis addresses the intelligibility enhancement of speech that is heard within an acoustically noisy environment. In particular, a realistic target situation of a police vehicle interior, with speech generated from a CELP (codebook-excited linear prediction) speech compression-based communication system, is adopted.
The research has centred on the role of the CELP speech compression algorithm and its transmission parameters. In particular, novel methods of LSP-based (line spectral pair) speech analysis and speech modification are developed and described. CELP parameters have been utilised in the analysis and processing stages of a speech intelligibility enhancement system to minimise additional computational complexity over existing CELP coder requirements.
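For context, a minimal sketch of the standard LPC-to-LSP conversion that underlies such analysis (the thesis's own algorithms are not reproduced here): the LPC polynomial A(z) is split into a palindromic sum polynomial P and an anti-palindromic difference polynomial Q, whose unit-circle root angles are the line spectral frequencies. The example filter is illustrative.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] to line spectral
    frequencies in radians. For a stable A(z) the roots of P and Q
    interlace on the unit circle, which is why LSPs quantise and
    interpolate so well in CELP coders."""
    a = np.asarray(a, dtype=float)
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    # Keep one angle per conjugate pair, dropping trivial roots at 0 and pi.
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])

print(lpc_to_lsf([1.0, -1.6, 0.9]))  # two interlaced LSFs for a 2nd-order filter
```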
Details are given of the CELP analysis process and its effects on speech, and of the development, effects, and performance of speech analysis and alteration algorithms designed to coexist with a CELP system.
Both objective and subjective tests have been used to characterise the effectiveness of the analysis and processing methods. Subjective testing of a complete simulation enhancement system indicates its effectiveness under the tested conditions, and the results are extrapolated to predict real-life performance.
The developed system presents a novel integrated solution to the intelligibility enhancement of speech, and can provide, on average, a doubling of intelligibility under the tested conditions of very low intelligibility.
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading or lipreading is the technique of understanding and getting
phonetic features from a speaker's visual features such as movement of lips,
face, teeth and tongue. It has a wide range of multimedia applications such as
in surveillance, Internet telephony, and as an aid to a person with hearing
impairments. However, most of the work in speechreading has been limited to
text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaking user may be available, these feeds have not been used to deal with the different poses. To this end, this paper presents the world's first multi-view speech reading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of
exploiting multiple camera views in building an efficient speech reading and
reconstruction system. It further shows the optimal placement of cameras that would lead to the maximum intelligibility of speech. Finally, it lays out various innovative applications for the proposed system, focusing on its potentially prodigious impact not just in the security arena but in many other multimedia analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea
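As a rough illustration of the multi-view idea (a sketch under assumed dimensions, not the authors' architecture), the PyTorch model below encodes each camera view separately, weights the views at each time step with a learned attention score, and decodes the fused features into an acoustic target such as a mel spectrogram. All module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiViewSpeechReconstructor(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.view_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.view_scorer = nn.Linear(hidden, 1)     # attention over views
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, views):
        # views: (batch, n_views, time, feat_dim) visual features per camera
        b, v, t, f = views.shape
        enc, _ = self.view_encoder(views.reshape(b * v, t, f))
        enc = enc.reshape(b, v, t, -1)
        weights = torch.softmax(self.view_scorer(enc), dim=1)  # per-view weights
        fused = (weights * enc).sum(dim=1)                     # (b, t, hidden)
        dec, _ = self.decoder(fused)
        return self.to_mel(dec)                                # (b, t, n_mels)

model = MultiViewSpeechReconstructor()
dummy = torch.randn(2, 3, 50, 128)   # 2 clips, 3 camera views, 50 frames
print(model(dummy).shape)            # torch.Size([2, 50, 80])
```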
Speech Enhancement for Automatic Analysis of Child-Centered Audio Recordings
Analysis of child-centred daylong naturalistic audio recordings has become a de facto research protocol in the scientific study of child language development. Researchers are increasingly using these recordings to understand the linguistic environment a child encounters in her routine interactions with the world. The recordings are captured by a microphone that the child wears throughout a day. Being naturalistic, they contain many unwanted sounds of everyday life, which degrade the performance of speech analysis tasks. The purpose of this thesis is to investigate the utility of speech enhancement (SE) algorithms in the automatic analysis of such recordings. To this effect, several classical signal processing and modern machine learning-based SE methods were employed 1) as a denoiser for speech corrupted with additive noise sampled from real-life child-centred daylong recordings and 2) as a front-end for the downstream speech processing tasks of addressee classification (infant- vs. adult-directed speech) and automatic syllable count estimation from speech. The downstream tasks were conducted on data derived from a set of geographically, culturally, and linguistically diverse child-centred daylong audio recordings. Denoising performance was evaluated through objective quality metrics (spectral distortion and instrumental intelligibility) and through downstream task performance. Finally, the objective evaluation results were compared with the downstream task results to determine whether objective metrics can serve as a reasonable proxy for selecting an SE front-end for a downstream task. The results show that a recently proposed Long Short-Term Memory (LSTM)-based progressive learning architecture provides the largest performance gains in the downstream tasks in comparison with the other SE methods and baseline results. Classical signal processing-based SE methods also lead to competitive performance. From the comparison of objective assessment and downstream task performance results, no predictive relationship between task-independent objective metrics and downstream task performance was found.
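As an example of the kind of task-independent objective metric referred to above (not necessarily the exact distortion measure used in the thesis), a minimal sketch of frame-wise log-spectral distance between a clean and an enhanced signal; frame length, hop, and FFT size are illustrative.

```python
import numpy as np

def log_spectral_distance(clean, enhanced, n_fft=512, hop=256, eps=1e-10):
    """RMS log-spectral error per frame, averaged over frames; lower is better."""
    win = np.hanning(n_fft)
    dists = []
    for s in range(0, min(len(clean), len(enhanced)) - n_fft + 1, hop):
        C = np.abs(np.fft.rfft(clean[s:s + n_fft] * win)) ** 2
        E = np.abs(np.fft.rfft(enhanced[s:s + n_fft] * win)) ** 2
        d = 10.0 * np.log10((C + eps) / (E + eps))
        dists.append(np.sqrt(np.mean(d ** 2)))
    return float(np.mean(dists))

# A lightly perturbed copy should score close to zero.
x = np.random.randn(16000)
print(log_spectral_distance(x, x + 0.05 * np.random.randn(16000)))
```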
Perceptually Motivated, Intelligent Audio Mixing Approaches for Hearing Loss
The growing population of listeners with hearing loss, along with the limitations of current audio enhancement solutions, has created the need for novel approaches that take into consideration the perceptual aspects of hearing loss while taking advantage of the benefits of intelligent audio mixing. The aim of this thesis is to explore perceptually motivated intelligent approaches to audio mixing for listeners with hearing loss, through the development of a hearing loss simulation and its use as a referencing tool in automatic audio mixing. To achieve this aim, a real-time hearing loss simulation was designed and tested for its accuracy and effectiveness through listening studies with participants with real and simulated hearing loss. The simulation was then used by audio engineering students and professionals during mixing, in order to gather information on the techniques and practices engineers use to combat the effects of hearing loss while mixing content through the simulation. The extracted practices were then used to inform four automatic mixing approaches: a deep learning approach utilising a differentiable digital signal processing architecture, a knowledge-based approach to gain mixing utilising fuzzy logic, a genetic algorithm approach to equalisation, and finally a combined system of the fuzzy mixer and genetic equaliser. The outputs of all four systems were analysed, and each approach's strengths and weaknesses are discussed in the thesis. The results of this work demonstrate the potential of integrating perceptual information into intelligent audio mixing for hearing loss, paving the way for further exploration of this approach's capabilities.
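As a rough illustration of the genetic-algorithm equalisation idea (a sketch under assumed values, not the thesis's implementation), the code below evolves per-band gains so that band levels, after a simulated hearing-loss attenuation, approach the original mix levels; the band levels and loss profile are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands = 8
mix_levels = rng.uniform(50.0, 70.0, n_bands)   # toy band levels (dB)
loss = np.linspace(0.0, 40.0, n_bands)          # toy sloping hearing loss (dB)

def fitness(gains):
    heard = mix_levels + gains - loss           # level perceived after the loss
    return -np.mean((heard - mix_levels) ** 2)  # match the original mix levels

pop = rng.uniform(0.0, 40.0, (60, n_bands))     # initial gain candidates
for _ in range(200):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-20:]]                  # selection
    pa = parents[rng.integers(0, 20, 40)]
    pb = parents[rng.integers(0, 20, 40)]
    mask = rng.random((40, n_bands)) < 0.5                   # uniform crossover
    children = np.where(mask, pa, pb)
    children += rng.normal(0.0, 1.0, children.shape)         # mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(g) for g in pop])]
print(np.round(best, 1))   # should approach the per-band loss values
```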
Electrode Stimulation Strategy for Cochlear Implants Based on a Human Auditory Model
Cochlear implants (CIs) combined with professional rehabilitation have
enabled several hundreds of thousands of hearing-impaired individuals to
re-enter the world of verbal communication. Though very successful, current
CI systems seem to have reached their peak potential. The fact that most
recipients claim not to enjoy listening to music and are not capable of
carrying on a conversation in noisy or reverberant environments shows
that there is still room for improvement. This dissertation presents a new
cochlear implant signal processing strategy called Stimulation based on
Auditory Modeling (SAM), which is completely based on a computational model
of the human peripheral auditory system. SAM has been evaluated through
simplified models of CI listeners, with five cochlear implant users, and
with 27 normal-hearing subjects using an acoustic model of CI perception.
Results have always been compared to those acquired using Advanced
Combination Encoder (ACE), which is today’s most prevalent CI strategy.
Initial simulations showed that the speech intelligibility of CI users fitted
with SAM should be just as good as that of CI listeners fitted with ACE.
Furthermore, it has been shown that SAM provides more accurate binaural
cues, which can potentially enhance the sound source localization ability
of bilaterally fitted implantees. Simulations have also revealed an
increased amount of temporal pitch information provided by SAM. The
subsequent pilot study with five CI users revealed several benefits of
using SAM. First, there was a significant improvement in pitch
discrimination of pure tones and sung vowels. Second, CI users fitted with
a contralateral hearing aid reported a more natural sound of both speech
and music. Third, all subjects became accustomed to SAM in a very short
period of time (in the order of 10 to 30 minutes), which is particularly
important given that a successful CI strategy change typically takes weeks
to months. An additional test with 27 normal-hearing listeners using an
acoustic model of CI perception delivered further evidence for improved
pitch discrimination ability with SAM as compared to ACE. Although SAM is
not yet a market-ready alternative, it strives to pave the way for future
strategies based on auditory models and it is a promising candidate for
further research and investigation.
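For context, a minimal sketch of a noise-band vocoder, the standard kind of acoustic model of CI perception used to test strategies on normal-hearing listeners: speech is split into a few bands, and each band's envelope is extracted and used to modulate band-limited noise. The channel count and band edges below are illustrative, not the dissertation's values.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocoder(x, fs=16000, edges=(100, 300, 700, 1500, 3100, 6000)):
    """Envelope-modulated noise-band simulation of CI hearing."""
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)                        # analysis band
        env = np.abs(hilbert(band))                   # channel envelope
        carrier = sosfilt(sos, np.random.randn(len(x)))
        out += env * carrier                          # modulated noise band
    return out / (np.max(np.abs(out)) + 1e-9)

# Usage: vocoded = noise_vocoder(speech_samples, fs=16000)
```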