18 research outputs found

    A Bandpass Transform for Speaker Normalization

    Get PDF
    One of the major challenges for Automatic Speech Recognition is to handle speech variability. Inter-speaker variability is partly due to differences in speakers' anatomy and especially in their Vocal Tract geometry. Dissimilarities in Vocal Tract Length (VTL) are a known source of speech variation. Vocal Tract Length Normalization is a popular Speaker Normalization technique that can be implemented as a transformation of a spectrum frequency axis. We introduce in this document a new spectral transformation for Speaker Normalization. We use the Bilinear Transformation to introduce a new frequency warping resulting from a mapping of a prototype Band-Pass (BP) filter into a general BP filter. This new transformation called the Bandpass Transformation (BPT) offers two degrees of freedom enabling complex warpings of the frequency axis that are different from previous works with the Bilinear Transform. We then define a procedure to use BPT for Speaker Normalization based on the Nelder-Mead algorithm for the estimation of the BPT parameters. We present a detailed study of the performance of our new approach on two test sets with gender dependent and independent systems. Our results demonstrate clear improvements compared to standard methods used in VTL Normalization. A score compensation procedure is presented and results in further improvements of our results by refining our BPT parameter estimation

    Wasserstein Barycenter Model Ensembling

    Get PDF
    In this paper we propose to perform model ensembling in a multiclass or a multilabel learning setting using Wasserstein (W.) barycenters. Optimal transport metrics, such as the Wasserstein distance, allow incorporating semantic side information such as word embeddings. Using W. barycenters to find the consensus between models allows us to balance confidence and semantics in finding the agreement between the models. We show applications of Wasserstein ensembling in attribute-based classification, multilabel learning and image captioning generation. These results show that the W. ensembling is a viable alternative to the basic geometric or arithmetic mean ensembling.Comment: ICLR 201
    corecore