A Monte-Carlo Method For Score Normalization in Automatic Speaker Verification Using Kullback-Leibler Distances
In this paper, we propose a new score normalization technique for Automatic Speaker Verification (ASV): the D-Norm. The main advantage of this score normalization is that it requires neither additional speech data nor an external speaker population, as opposed to the state-of-the-art approaches. The D-Norm is based on the use of Kullback-Leibler (KL) distances in an ASV context. In a first step, we estimate the KL distances with a Monte-Carlo method and experimentally show that they are correlated with the verification scores. In a second step, we use this correlation to implement a score normalization procedure, the D-Norm. We analyse its performance and compare it to that of a conventional normalization, the Z-Norm. The results show that the performance of the D-Norm is comparable to that of the Z-Norm. We conclude with a discussion of the results and the applications of this work.
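The Monte-Carlo estimation step the abstract mentions can be illustrated with a minimal sketch: draw samples from one model and average the log-density ratio. This uses 1-D Gaussians rather than the speaker GMMs of the paper, purely so the estimate can be checked against the closed form; all function names are illustrative.

```python
import numpy as np

def mc_kl_gaussians(mu_p, sig_p, mu_q, sig_q, n=100_000, seed=0):
    """Monte-Carlo estimate of KL(p || q): sample from p and average
    log p(x) - log q(x). Toy 1-D Gaussian stand-in for the paper's models."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_p, sig_p, size=n)
    log_p = -0.5 * ((x - mu_p) / sig_p) ** 2 - np.log(sig_p)
    log_q = -0.5 * ((x - mu_q) / sig_q) ** 2 - np.log(sig_q)
    return float(np.mean(log_p - log_q))

def exact_kl_gaussians(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL(p || q) for 1-D Gaussians, as a sanity check."""
    return (np.log(sig_q / sig_p)
            + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5)
```

For GMMs no closed form exists, which is precisely why the paper resorts to Monte-Carlo sampling.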
Zero-resource audio-only spoken term detection based on a combination of template matching techniques
Keywords: spoken term detection, template matching, unsupervised learning, posterior features. Spoken term detection is a well-known information retrieval task that seeks to extract contentful information from audio by locating occurrences of known query words of interest. This paper describes a zero-resource approach to this task based on pattern matching of spoken term queries at the acoustic level. The template matching module comprises the cascade of a segmental variant of dynamic time warping and a self-similarity matrix comparison to further improve robustness to speech variability. This solution notably differs from more traditional train-and-test methods that, while shown to be very accurate, rely upon the availability of large amounts of linguistic resources. We evaluate our framework on different parameterizations of the speech templates: raw MFCC features and Gaussian posteriorgrams, as well as French and English phonetic posteriorgrams output by two different state-of-the-art phoneme recognizers.
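The core primitive behind template matching of spoken queries is dynamic time warping over frame-level features. The sketch below is plain DTW with Euclidean frame distances, not the segmental variant or the self-similarity comparison the paper cascades on top; it only shows the alignment-cost recursion.

```python
import numpy as np

def dtw_cost(query, template):
    """Dynamic time warping between two feature sequences (frames x dims),
    returning the length-normalized alignment cost. Euclidean frame
    distance here; cosine is also common for posteriorgram features."""
    n, m = len(query), len(template)
    # Pairwise frame-to-frame distance matrix.
    d = np.linalg.norm(query[:, None, :] - template[None, :, :], axis=-1)
    # Accumulated-cost table with the classic three-way recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return acc[n, m] / (n + m)
```

A query is then detected wherever the warping cost against a stretch of the search audio falls below a threshold.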
Supplementary material to the article: Estimating the structural segmentation of popular music pieces under regularity constraints
This document gathers descriptions of the structural segmentation systems considered in the IEEE/ACM TASLP paper by the same authors.
Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm
Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to "chorus", "verse", "solo", etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm, introduced in (Marmoret et al., 2020, 2022b). The CBM algorithm is a dynamic programming algorithm that segments self-similarity matrices, which are a standard description used in MSA and in numerous other applications. In this work, self-similarity matrices are computed from the feature representation of an audio signal, and time is sampled at the bar scale. This study examines three different standard similarity functions for the computation of self-similarity matrices. Results show that, in optimal conditions, the proposed algorithm achieves a level of performance competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions. In addition, the algorithm is made open-source and is highly customizable.
Comment: 19 pages, 13 figures, 11 tables, 1 algorithm; published in Transactions of the International Society for Music Information Retrieval.
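The self-similarity matrices that the CBM algorithm segments can be sketched in a few lines: compare every bar-level feature vector against every other one with a similarity function. Cosine similarity below is one standard choice (the study compares three); the function name and shapes are illustrative, not the toolbox's API.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix of a (bars x dims) feature array.
    Blocks of high similarity along the diagonal correspond to
    homogeneous sections ("chorus", "verse", ...)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.maximum(norms, 1e-12)  # guard zero vectors
    return normed @ normed.T
```

The dynamic program then scores candidate block boundaries on this matrix and picks the segmentation with maximal total score.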
Convolutive Block-Matching Segmentation Algorithm with Application to Music Structure Analysis
Music Structure Analysis (MSA) consists of representing a song in sections (such as "chorus", "verse", "solo", etc.), and can be seen as the retrieval of a simplified organization of the song. This work presents a new algorithm, called the Convolutive Block-Matching (CBM) algorithm, devoted to MSA. In particular, the CBM algorithm is a dynamic programming algorithm that operates on autosimilarity matrices, a standard tool in MSA. In this work, autosimilarity matrices are computed from the feature representation of an audio signal, and time is sampled at the bar scale. We study three different similarity functions for the computation of autosimilarity matrices. We report that the proposed algorithm achieves a level of performance competitive with that of supervised state-of-the-art methods on three out of four metrics, while being fully unsupervised.
Comment: 4 pages, 5 figures, 1 table; submitted at ICASSP 2023. The associated toolbox is available at https://gitlab.inria.fr/amarmore/autosimilarity_segmentatio
Methodological and musicological investigation of the System & Contrast model for musical form description
The semiotic description of music structure aims at representing the high-level organization of music pieces in a concise, generic and reproducible way, as a low-rate stream of arbitrary symbols from a limited alphabet, which results in a sequence of "semiotic units". In this context, the purpose of the System & Contrast model is to address the internal organization of the semiotic units. In this report, the System & Contrast model is approached from different angles in relation to varied disciplines: cognitive psychology, music analysis and information theory. After establishing a number of links between the System & Contrast model and other approaches to music structure, the model is illustrated on studio-based popular music pieces, as well as on music from the classical Viennese period.
Well-posedness of the permutation problem in sparse filter estimation with lp minimization
Convolutive source separation is often done in two stages: 1) estimation of the mixing filters and 2) estimation of the sources. Traditional approaches suffer from ambiguities of arbitrary permutation and scaling in each frequency bin of the estimated filters and/or the sources, which are usually corrected by taking into account some special properties of the filters/sources. This paper focuses on the filter permutation problem in the absence of scaling, investigating the possible use of the temporal sparsity of the filters as a property enabling permutation correction. Theoretical and experimental results highlight the potential as well as the limits of sparsity as a hypothesis to obtain a well-posed permutation problem.
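The sparsity criterion at the heart of this idea can be sketched simply: among candidate time-domain filter sets produced by different per-bin permutation hypotheses, prefer the one with the smallest l_p cost, since p <= 1 rewards filters whose energy is concentrated on few taps. Both helper functions are hypothetical illustrations, not the paper's formulation.

```python
import numpy as np

def lp_cost(filters, p=0.5):
    """Sparsity measure: sum of |h|^p over all filter taps.
    With p <= 1, sparse tap patterns score lower than diffuse ones."""
    return float(np.sum(np.abs(filters) ** p))

def pick_permutation(candidates, p=0.5):
    """Keep the candidate filter set (one per permutation hypothesis)
    with the smallest l_p cost, i.e. the sparsest in time."""
    return min(candidates, key=lambda h: lp_cost(h, p))
```

A wrong bin permutation mixes two filters' frequency responses, which typically smears energy across many time-domain taps and thus raises the l_p cost.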
Likelihood ratio adjustment for the compensation of model mismatch in speaker verification
This article presents a method for adjusting speaker verification thresholds based on a Gaussian model of the distributions of the log-likelihood ratio. The article states the assumptions under which this model is valid, describes several threshold adjustment methods, and illustrates their benefits and limits through verification experiments on a database of 20 speakers.
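Under a Gaussian model of the log-likelihood-ratio scores, one simple adjustment rule places the threshold where the two fitted densities cross. The sketch below assumes equal priors and equal variances, in which case the crossing point is the midpoint of the two means; this is one possible rule for illustration, not the article's exact method.

```python
import numpy as np

def adjusted_threshold(client_scores, impostor_scores):
    """Fit a Gaussian to each score distribution and return the
    equal-prior, equal-variance crossing point of the two densities,
    i.e. the midpoint of the fitted means."""
    mu_c = np.mean(client_scores)
    mu_i = np.mean(impostor_scores)
    return 0.5 * (mu_c + mu_i)
```

Model mismatch shifts both score distributions, and re-estimating the threshold from the fitted means compensates for that shift.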
Robust adaptation of HMM models for text-dependent speaker verification
When deploying a secure system based on speaker verification, the limited amount of training data is usually critical. Indeed, the enrollment procedure must be fast and user-friendly. An incremental training of HMM speaker models, based on a MAP (Maximum A Posteriori) adaptation technique, is used in order to make the enrollment more robust with only one or two utterances of the client password. This paper presents the improvements which can be achieved, in terms of verification performance and stability of the decision thresholds. Our results highlight the benefits of MAP adaptation in conjunction with a synchronous alignment approach.
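The MAP adaptation idea with scarce enrollment data can be sketched with the standard mean-update formula: interpolate between the prior (world-model) mean and the sample mean of the few enrollment frames, weighted by a relevance factor. This shows the generic formula only, applied to a single Gaussian mean rather than full HMM parameters; the tau value is illustrative.

```python
import numpy as np

def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP update of a Gaussian mean: with n observations, the adapted
    mean is (tau * prior + n * sample_mean) / (tau + n). With little
    data the prior dominates, which is what makes one- or two-utterance
    enrollment robust."""
    n = len(data)
    return (tau * prior_mean + n * np.mean(data, axis=0)) / (tau + n)
```

With n = tau the prior and the data contribute equally; as n grows the estimate converges to the plain sample mean.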
Signal modeling with Non Uniform Topology lattice filters
This article presents a new class of constrained and specialized Auto-Regressive (AR) processes. They are derived from lattice filters in which some reflection coefficients are forced to zero at a priori locations. Optimizing the filter topology makes it possible to build parametric spectral models that have a greater number of poles than the number of parameters needed to describe their locations. These NUT (Non-Uniform Topology) models are assessed by evaluating the reduction of modeling error with respect to conventional AR models.
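The link between reflection coefficients and AR poles is the standard step-up (Levinson) recursion, sketched below. Forcing some reflection coefficients to zero, as in the NUT models, still produces a full-order AR polynomial while reducing the number of free parameters; the function name is illustrative.

```python
import numpy as np

def reflection_to_ar(k):
    """Step-up recursion: convert reflection coefficients k_1..k_p to
    direct-form AR coefficients a_1..a_p, with A(z) = 1 + a_1 z^-1 + ...
    Each stage m updates a <- [a + k_m * reversed(a), k_m]."""
    a = np.array([])
    for km in k:
        a = np.concatenate([a + km * a[::-1], [km]])
    return a
```

For example, the zeroed middle stage in k = [0.5, 0, 0.3] still yields a three-coefficient (three-pole) polynomial described by only two free parameters.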