DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Degraded Audio Signals
Automatic speaker recognition algorithms typically use pre-defined
filterbanks, such as Mel-Frequency and Gammatone filterbanks, for
characterizing speech audio. The design of these filterbanks is based on
domain-knowledge and limited empirical observations. The resultant features,
therefore, may not generalize well to different types of audio degradation. In
this work, we propose a deep learning-based technique to induce the filterbank
design from vast amounts of speech audio. The purpose of such a filterbank is
to extract features robust to degradations in the input audio. To this effect,
a 1D convolutional neural network is first designed to learn a time-domain
filterbank, called DeepVOX, directly from raw speech audio. Second, an adaptive
triplet mining technique is developed to efficiently mine the data samples best
suited to train the filterbank. Third, a detailed ablation study of the DeepVOX
filterbanks reveals the presence of both vocal source and vocal tract
characteristics in the extracted features. Experimental results on VOXCeleb2,
NIST SRE 2008 and 2010, and Fisher speech datasets demonstrate the efficacy of
the DeepVOX features across a variety of audio degradations, multi-lingual
speech data, and varying-duration speech audio. The DeepVOX features also
improve the performance of existing speaker recognition algorithms, such as
xVector-PLDA and iVector-PLDA.
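The abstract does not spell out the paper's adaptive mining rule, but the idea behind triplet mining in general can be illustrated with a standard semi-hard mining sketch: for each anchor-positive pair, select a negative that is farther from the anchor than the positive yet still inside the margin, so the triplet contributes a small but non-zero loss. All names and the toy embeddings below are illustrative, not from the paper.

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance matrix between row embeddings
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def mine_semi_hard_triplets(X, labels, margin=0.5):
    """Return (anchor, positive, negative) index triples where the
    negative is 'semi-hard': farther from the anchor than the positive
    but still inside the margin, i.e. the samples most informative
    for triplet-loss training."""
    D = pairwise_dist(X)
    labels = np.asarray(labels)
    triplets = []
    for a in range(len(labels)):
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue
            d_ap = D[a, p]
            cand = [m for m in np.where(labels != labels[a])[0]
                    if d_ap < D[a, m] < d_ap + margin]
            if cand:
                # keep the hardest of the semi-hard candidates
                n = min(cand, key=lambda m: D[a, m])
                triplets.append((a, int(p), int(n)))
    return triplets

# Toy "speaker embeddings": two speakers, two utterances each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.3, 0.0], [1.0, 0.0]])
labels = [0, 0, 1, 1]
print(mine_semi_hard_triplets(X, labels, margin=0.5))
```

Every returned triple satisfies d(a, p) < d(a, n) < d(a, p) + margin; pairs whose negatives are all too easy or too hard are skipped, which is what makes mining efficient for training.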
Treatise on Hearing: The Temporal Auditory Imaging Theory Inspired by Optics and Communication
A new theory of mammalian hearing is presented, which accounts for the
auditory image, formed in the midbrain (inferior colliculus), of objects in the
listener's acoustical environment. It is shown that the ear is a temporal
imaging system that comprises three transformations of the envelope functions:
cochlear group-delay dispersion, cochlear time lensing, and neural group-delay
dispersion. These elements are analogous to the optical transformations in
vision of diffraction between the object and the eye, spatial lensing by the
lens, and second diffraction between the lens and the retina. Unlike the eye,
it is established that the human auditory system is naturally defocused, so
that coherent stimuli are unaffected by the defocus, whereas completely
incoherent stimuli are impacted by it and may be blurred by design. It is
argued that the auditory system can use this differential focusing to enhance
or degrade the images of real-world acoustical objects that are partially
coherent. The theory is founded on coherence and temporal imaging theories that
were adopted from optics. In addition to the imaging transformations, the
corresponding inverse-domain modulation transfer functions are derived and
interpreted with consideration to the nonuniform neural sampling operation of
the auditory nerve. These ideas are used to rigorously introduce the concepts of
sharpness and blur in auditory imaging, auditory aberrations, and auditory
depth of field. In parallel, ideas from communication theory are used to show
that the organ of Corti functions as a multichannel phase-locked loop (PLL)
that constitutes the point of entry for auditory phase locking and hence
conserves the signal coherence. It provides an anchor for a dual coherent and
noncoherent auditory detection in the auditory brain that culminates in
auditory accommodation. Implications for hearing impairments are discussed as
well.

Comment: 603 pages, 131 figures, 13 tables, 1570 references
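The treatise's multichannel PLL model of the organ of Corti is not detailed in this abstract, but the phase-locking principle it invokes can be illustrated with a generic single-channel digital PLL: a numerically controlled oscillator is mixed against the input, the mixer angle serves as the phase error, and a proportional-integral loop filter steers the oscillator until it tracks, and thereby preserves, the input's phase. The gains and signal parameters below are illustrative assumptions, not values from the treatise.

```python
import numpy as np

def pll_track(x, fs, f0, kp=0.2, ki=0.01):
    """Track the phase of a complex (analytic) tone with a digital PLL.
    Returns the per-sample phase error in radians; for a locked loop
    the error decays toward zero."""
    phase = 0.0                       # NCO phase (rad)
    step = 2.0 * np.pi * f0 / fs      # nominal phase increment per sample
    integ = 0.0                       # loop-filter integrator state
    errs = np.empty(len(x))
    for n, s in enumerate(x):
        err = np.angle(s * np.exp(-1j * phase))  # phase detector (mixer)
        integ += ki * err                        # integral path
        phase += step + kp * err + integ         # steer the NCO
        errs[n] = err
    return errs

# A 210 Hz tone tracked by a PLL tuned to 200 Hz: the integral term
# absorbs the 10 Hz frequency offset and the phase error decays to zero,
# i.e. the loop locks onto and conserves the signal's phase.
fs = 8000
t = np.arange(4000) / fs
x = np.exp(1j * (2 * np.pi * 210 * t + 0.5))
errs = pll_track(x, fs, f0=200)
```

Because the loop filter has an integral term, the PLL is type 2 and achieves zero steady-state phase error even under a frequency offset, which is the sense in which a phase-locked stage can preserve coherence for downstream coherent detection.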