EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals
The general objective of this work is the design, implementation, improvement and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech.
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science in which the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech, which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
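The framing idea can be sketched directly: given GCI sample indices (however obtained), frames are cut so that each spans a whole number of glottal cycles rather than a fixed window length. A minimal illustrative sketch, using idealised GCIs for a synthetic voiced segment rather than the output of any particular detection algorithm:

```python
import numpy as np

def glottal_synchronous_frames(signal, gcis, periods=2):
    """Extract pitch-synchronous frames spanning `periods` glottal cycles.

    `gcis` is a sorted array of glottal closure instants (sample indices);
    each frame runs from one GCI to the GCI `periods` cycles later, so the
    frame length tracks the local pitch period instead of a fixed size.
    """
    frames = []
    for i in range(len(gcis) - periods):
        start, end = gcis[i], gcis[i + periods]
        frames.append(np.asarray(signal[start:end], dtype=float))
    return frames

# Toy example: a 100 Hz voiced tone at 8 kHz has one GCI per 80 samples.
fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 100 * t)
gcis = np.arange(0, fs, 80)            # idealised GCIs, one per pitch period
frames = glottal_synchronous_frames(speech, gcis)
print(len(frames), len(frames[0]))     # 98 frames, 160 samples (two periods) each
```

In real use the GCIs would come from a detector operating on the EGG or speech signal, and frame lengths would vary with the local pitch.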
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the electroglottograph signal, with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and prosodic manipulation, where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment in
real-world applications. The technique is shown to be applicable in the areas of speech coding,
identification and artificial bandwidth extension of telephone speech.
Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture
This paper presents a configurable version of Extreme Bandwidth Extension
Network (EBEN), a Generative Adversarial Network (GAN) designed to improve
audio captured with body-conduction microphones. We show that although these
microphones significantly reduce environmental noise, this insensitivity to
ambient noise comes at the expense of the bandwidth of the speech signal
acquired by the wearer of the devices. The captured signals therefore
require signal enhancement techniques to recover the full-bandwidth
speech. EBEN leverages a configurable multiband decomposition of the raw
captured signal. This decomposition allows the time-domain dimension of the data to
be reduced and the full-band signal to be better controlled. The multiband
representation of the captured signal is processed through a U-Net-like model,
which combines feature and adversarial losses to generate an enhanced speech
signal. We also benefit from this original representation in the proposed
configurable discriminator architecture. The configurable EBEN approach can
achieve state-of-the-art enhancement results on synthetic data with a
lightweight generator that allows real-time processing.Comment: Accepted in IEEE/ACM Transactions on Audio, Speech and Language
Processing on 14/08/202
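EBEN's actual decomposition is a filterbank over the raw signal; the sketch below substitutes a naive FFT-based band split (not the paper's method) purely to illustrate the dimensional effect the abstract describes: splitting into M critically decimated bands shortens the time axis by a factor of M, which is what lets a lightweight generator process shorter sequences.

```python
import numpy as np

def naive_multiband(signal, n_bands):
    """Split `signal` into `n_bands` uniform frequency bands, each critically
    decimated by `n_bands`. Illustrative stand-in for a proper multiband
    filterbank: the output time dimension is len(signal) / n_bands per band.
    """
    n = len(signal) - len(signal) % n_bands
    spectrum = np.fft.rfft(signal[:n])
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]     # keep one frequency band
        band = np.fft.irfft(masked, n)
        bands.append(band[::n_bands])       # critical decimation
    return np.stack(bands)                  # shape: (n_bands, n // n_bands)

x = np.random.randn(16000)                  # 1 s of audio at 16 kHz
sub = naive_multiband(x, 4)
print(sub.shape)                            # (4, 4000): 4 bands, 4x shorter
```

A real system would use a filterbank designed for near-perfect reconstruction; this sketch only shows the shape transformation.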
Models and Analysis of Vocal Emissions for Biomedical Applications
The MAVEBA Workshop proceedings, held on a biannual basis, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are: the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.
Systems And Methods For Detecting Call Provenance From Call Audio
Various embodiments of the invention are detection systems and methods for detecting call provenance based on call audio. An exemplary embodiment of the detection system can comprise a characterization unit, a labeling unit, and an identification unit. The characterization unit can extract various characteristics of the networks through which a call traversed, based on the call audio. The labeling unit can be trained on prior call data and can identify one or more codecs used to encode the call, based on the call audio. The identification unit can utilize the characteristics of the traversed networks and the identified codecs, and based on this information, can provide a provenance fingerprint for the call. Based on the call provenance fingerprint, the detection system can identify, verify, or provide forensic information about a call audio source.
Georgia Tech Research Corporation
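The three-unit pipeline can be sketched structurally. Every function body below is a placeholder: the real characterization unit would measure noise and packet-loss artefacts in the audio, and the labeling unit would run a classifier trained on prior call data; all names and values here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceFingerprint:
    network_traits: dict   # path characteristics inferred from call audio
    codecs: list           # codec chain inferred from coding artefacts

def characterize(audio):
    """Characterization unit (placeholder): derive network-path traits
    from the call audio. Returns fixed illustrative statistics."""
    return {"noise_floor_db": -52.0, "loss_rate": 0.01}

def label_codecs(audio):
    """Labeling unit (placeholder): a trained codec classifier would go
    here; we return a fixed illustrative codec chain."""
    return ["G.711", "AMR-NB"]

def identify(audio):
    """Identification unit: fuse network traits and codec labels into a
    provenance fingerprint for the call."""
    return ProvenanceFingerprint(characterize(audio), label_codecs(audio))

fp = identify(audio=None)
print(fp.codecs)           # ['G.711', 'AMR-NB']
```

The fingerprint object is what a downstream system would match against known sources to verify where a call originated.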
Non-Intrusive Subscriber Authentication for Next Generation Mobile Communication Systems
The last decade has witnessed massive growth in both the technological
development and the consumer adoption of mobile devices such as mobile handsets and PDAs. The recent
introduction of wideband mobile networks has enabled the deployment of new services
with access to traditionally well protected personal data, such as banking details or
medical records. Secure user access to this data has however remained a function of the
mobile device's authentication system, which is only protected from masquerade abuse by
the traditional PIN, originally designed to protect against telephony abuse.
This thesis presents novel research in relation to advanced subscriber authentication for
mobile devices. The research began by assessing the threat of masquerade attacks on
such devices by way of a survey of end users. This revealed that the current methods of
mobile authentication remain extensively unused, leaving terminals highly vulnerable to
masquerade attack. Further investigation revealed that, in the context of the more
advanced wideband-enabled services, users are receptive to many advanced
authentication techniques and principles, including the discipline of biometrics, which
naturally lends itself to the area of advanced subscriber-based authentication.
To address the requirement for a more personal authentication capable of being applied
in a continuous context, a novel non-intrusive biometric authentication technique was
conceived, drawn from the discrete disciplines of biometrics and Auditory Evoked
Responses. The technique forms a hybrid multi-modal biometric where variations in the
behavioural stimulus of the human voice (due to the propagation effects of acoustic
waves within the human head) are used to verify the identity of a user. The resulting
approach is known as the Head Authentication Technique (HAT).
Evaluation of the HAT authentication process is realised in two stages. Firstly, the
generic authentication procedures of registration and verification are automated within a
prototype implementation. Secondly, a HAT demonstrator is used to evaluate the
authentication process through a series of experimental trials involving a representative
user community. The results from the trials confirm that multiple HAT samples from
the same user exhibit a high degree of correlation, yet samples between users exhibit a
high degree of discrepancy. Statistical analysis of the prototype's performance realised
early system error rates of FNMR = 6% and FMR = 0.025%. The results clearly
demonstrate the authentication capabilities of this novel biometric approach and the
contribution this new work can make to the protection of subscriber data in next
generation mobile networks.
Orange Personal Communication Services Ltd
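The reported FNMR and FMR follow the standard verification-biometric definitions: the fraction of genuine attempts rejected, and of impostor attempts accepted, at a given decision threshold. A small sketch with synthetic correlation scores (not HAT trial data), mirroring the observation that same-user samples correlate highly while cross-user samples do not:

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """FNMR: genuine attempts scoring below the threshold.
    FMR: impostor attempts scoring at or above it."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    fnmr = np.mean(genuine < threshold)
    fmr = np.mean(impostor >= threshold)
    return fnmr, fmr

# Synthetic similarity scores: high within-user, low between-user.
rng = np.random.default_rng(0)
genuine = rng.normal(0.9, 0.05, 1000)    # same-user comparisons
impostor = rng.normal(0.3, 0.10, 1000)   # cross-user comparisons
fnmr, fmr = error_rates(genuine, impostor, threshold=0.75)
print(fnmr, fmr)                         # both near zero for well-separated scores
```

Sweeping the threshold trades FNMR against FMR; a deployed system picks the operating point that matches its security requirements.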
Speech Enhancement for Automatic Analysis of Child-Centered Audio Recordings
Analysis of child-centred daylong naturalistic audio recordings has become a de facto research protocol in the scientific study of child language development. Researchers are increasingly using these recordings to understand the linguistic environment a child encounters in her routine interactions with the world. These audio recordings are captured by a microphone that a child wears throughout a day. The recordings, being naturalistic, contain many unwanted sounds from everyday life, which degrade the performance of speech analysis tasks. The purpose of this thesis is to investigate the utility of speech enhancement (SE) algorithms in the automatic analysis of such recordings. To this effect, several classical signal processing and modern machine learning-based SE methods were employed 1) as denoisers for speech corrupted with additive noise sampled from real-life child-centred daylong recordings and 2) as front-ends for the downstream speech processing tasks of addressee classification (infant- vs. adult-directed speech) and automatic syllable count estimation from speech. The downstream tasks were conducted on data derived from a set of geographically, culturally, and linguistically diverse child-centred daylong audio recordings. The performance of denoising was evaluated through objective quality metrics (spectral distortion and instrumental intelligibility) and through the downstream task performance. Finally, the objective evaluation results were compared with the downstream task performance results to determine whether objective metrics can serve as a reasonable proxy for selecting an SE front-end for a downstream task. The results show that a recently proposed Long Short-Term Memory (LSTM)-based progressive learning architecture provides the largest performance gains in the downstream tasks in comparison with the other SE methods and baseline results. Classical signal processing-based SE methods also lead to competitive performance.
From the comparison of objective assessment and downstream task performance results, no predictive relationship between task-independent objective metrics and the performance of downstream tasks was found.
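As one concrete instance of the objective spectral-distortion metrics mentioned, a common formulation is the frame-averaged log-spectral distance between clean and enhanced signals (one plausible variant; the thesis may use a different definition):

```python
import numpy as np

def log_spectral_distance(clean, enhanced, frame=512, hop=256, eps=1e-10):
    """Mean log-spectral distance (dB) between two equal-length signals,
    computed over windowed frames. Lower means less spectral distortion."""
    n = min(len(clean), len(enhanced))
    window = np.hanning(frame)
    dists = []
    for start in range(0, n - frame + 1, hop):
        c = np.abs(np.fft.rfft(clean[start:start + frame] * window))
        e = np.abs(np.fft.rfft(enhanced[start:start + frame] * window))
        diff = 20 * np.log10((c + eps) / (e + eps))
        dists.append(np.sqrt(np.mean(diff ** 2)))
    return float(np.mean(dists))

# Identical signals have zero distortion; added noise raises it.
rng = np.random.default_rng(1)
x = rng.normal(size=8000)
lsd_same = log_spectral_distance(x, x)
lsd_noisy = log_spectral_distance(x, x + 0.1 * rng.normal(size=8000))
print(lsd_same, lsd_noisy > 0)          # 0.0 True
```

The thesis's finding is precisely that such task-independent numbers did not predict downstream task performance, which is why both kinds of evaluation were run.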