Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression
Echo cancellation and noise reduction are essential for full-duplex
communication, yet most existing neural networks have high computational costs
and are inflexible in tuning model complexity. In this paper, we introduce
time-frequency dual-path compression to achieve a wide range of compression
ratios on computational cost. Specifically, for frequency compression,
trainable filters are used to replace manually designed filters for dimension
reduction. For time compression, frame-skipped prediction alone causes
large performance degradation, which can be alleviated by a post-processing
network with full-sequence modeling. We found that, under fixed compression
ratios, dual-path compression combining the time and frequency methods
yields further performance improvements, covering compression ratios from 4x
to 32x with little change in model size. Moreover, the proposed models show
competitive performance compared with fast FullSubNet and DeepFilterNet. A demo
page can be found at
hangtingchen.github.io/ultra_dual_path_compression.github.io/.
Comment: Accepted by Interspeech 202
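As a hedged, minimal sketch of the two compression axes described above (illustrative shapes and random weights standing in for the authors' trained model; the ratio names r_f and r_t are assumptions, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Spectrogram of T frames x F frequency bins (illustrative sizes only)
T, F = 100, 256
spec = np.abs(rng.standard_normal((T, F)))

# Frequency compression: replace hand-designed filters with a trainable
# F -> F/r_f linear projection (random weights stand in for learned ones)
r_f = 4
W = 0.1 * rng.standard_normal((F, F // r_f))
freq_compressed = spec @ W                 # shape (T, F // r_f)

# Time compression: frame-skipped prediction keeps every r_t-th frame;
# the paper pairs this with a post-processing full-sequence network
r_t = 2
time_compressed = freq_compressed[::r_t]   # shape (T // r_t, F // r_f)

# Combined dual-path compression ratio on the time-frequency plane
ratio = (T * F) / time_compressed.size
print(ratio)                               # 8.0, i.e. r_f * r_t
```

Combining a moderate ratio on each axis is how the paper spans 4x to 32x overall while keeping the model size nearly unchanged.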
Linguistically-constrained formant-based i-vectors for automatic speaker recognition
This is the author's version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication, Vol. 76 (2016), DOI 10.1016/j.specom.2015.11.002.

This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems providing well-calibrated likelihood ratios per comparison of the occurrences of the same isolated linguistic units in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of units which are most discriminative varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic-regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers: 219 male and 298 female), we report equal error rates of 9.57% and 12.89% for male and female speakers, respectively, using only formant frequencies as speaker-discriminative information.
Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279) compared to the cepstral system alone.

This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs that made this work possible.
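The two score-level fusion rules named in the abstract, average and logistic regression, can be sketched as follows on synthetic subsystem scores (toy Gaussian data and a hand-rolled gradient-descent fit; none of this is the paper's actual calibration pipeline or NIST data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scores from three linguistically-constrained subsystems (columns),
# for 200 target and 200 non-target trials (synthetic, not NIST SRE data)
n = 200
X = np.vstack([rng.normal(1.0, 1.0, (n, 3)),    # target trials
               rng.normal(-1.0, 1.0, (n, 3))])  # non-target trials
y = np.concatenate([np.ones(n), np.zeros(n)])

# Rule 1: average fusion, an equal-weight combination of subsystem scores
avg_fused = X.mean(axis=1)

# Rule 2: logistic-regression fusion, learning per-subsystem weights
# by plain gradient descent on the log-loss
w, b = np.zeros(3), 0.0
for _ in range(2000):
    z = np.clip(X @ w + b, -30.0, 30.0)   # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)
lr_fused = X @ w + b   # calibrated log-odds scores per trial

# Sanity check: accuracy when thresholding fused log-odds at zero
acc = np.mean((lr_fused > 0) == y)
```

Logistic-regression fusion both weights and calibrates the subsystems, which is why its outputs can be read as well-calibrated log-likelihood ratios.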
Deep Learning for Quality Assessment in Live Video Streaming
Video content providers put stringent requirements on the quality assessment methods realized on their services. They need to be accurate, real-time, adaptable to new content, and scalable as the video set grows. In this letter, we introduce a novel automated and computationally efficient video assessment method. It enables accurate real-time (online) analysis of delivered quality in an adaptable and scalable manner. Offline deep unsupervised learning processes are employed at the server side and inexpensive no-reference measurements at the client side. This provides both real-time assessment and performance comparable to the full-reference counterpart, while maintaining its no-reference characteristics. We tested our approach on the LIMP Video Quality Database (an extensive packet-loss-impaired video set), obtaining a correlation between 78% and 91% with the FR benchmark (the video quality metric). Due to its unsupervised-learning essence, our method is flexible, dynamically adaptable to new content, and scalable with the number of videos.

Index Terms: Deep learning (DL), multimedia video services, unsupervised learning (UL), video quality assessment.
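A minimal sketch of the kind of agreement the letter quantifies: the linear correlation between an inexpensive no-reference (NR) estimate and a full-reference (FR) benchmark score, here on synthetic numbers (the 50-video set size and 0.15 noise level are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic quality scores for 50 impaired videos: a full-reference (FR)
# benchmark and a no-reference (NR) estimate that tracks it with noise
fr = rng.uniform(0.0, 1.0, 50)
nr = fr + rng.normal(0.0, 0.15, 50)

# Pearson linear correlation between the NR estimate and the FR benchmark;
# the letter reports values between 78% and 91% for its method
r = np.corrcoef(fr, nr)[0, 1]
```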
Representation learning for unsupervised speech processing
Automatic speech recognition for our most widely used languages has recently seen
substantial improvements, driven by improved training procedures for deep artificial
neural networks, cost-effective availability of computational power at large scale, and,
crucially, availability of large quantities of labelled training data. This success cannot
be transferred to low- and zero-resource languages, where the requisite transcriptions are
unavailable.
Unsupervised speech processing promises better methods for dealing with under-resourced
languages. Here we investigate unsupervised neural-network-based models
for learning frame- and sequence-level representations with the goal of improving
zero-resource speech processing. Good representations eliminate differences in accent,
gender, channel characteristics, and other factors to model subword or whole-term units
for within- and across-speaker speech unit discrimination.
We present two contributions focussing on unsupervised learning of frame-level
representations: (1) an improved version of the correspondence autoencoder applied
to the INTERSPEECH 2015 Zero Resource Challenge, and (2) a proposed model for
learning representations that explicitly optimize speech unit discrimination.
We also present two contributions focussing on efficiency and scalability of unsupervised
speech processing: (1) a proposed model and pilot experiments for learning a
linear-time approximation of the quadratic-time dynamic time warping algorithm, and
(2) a series of model proposals for learning fixed-size representations of variable-length
speech segments, enabling efficient vector-space similarity measures.
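For background, the algorithm whose quadratic cost the thesis targets is standard dynamic time warping. The sketch below shows the classic O(n*m) recursion together with a Sakoe-Chiba band restriction for contrast; the band is a well-known heuristic speed-up, not the learned linear-time approximation the thesis proposes.

```python
import numpy as np

def dtw(a, b, band=None):
    """DTW alignment cost between frame sequences a (n, d) and b (m, d).

    band=None gives the standard O(n*m) recursion; an integer band width
    restricts the search to a Sakoe-Chiba diagonal band.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if band is None else max(1, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            # Local frame distance plus the cheapest monotone predecessor
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(3)
x = rng.standard_normal((20, 4))
y = rng.standard_normal((25, 4))
assert dtw(x, x) == 0.0                         # identical sequences cost nothing
assert dtw(x, y, band=5) >= dtw(x, y) - 1e-9    # banding can only raise the cost
```

Because the banded search space is a subset of the full one, the band trades exactness for speed; a learned linear-time approximation instead tries to predict the alignment cost without exhaustive search.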