223 research outputs found
LP-TRAPs in all senses
This report describes additional experiments with LP-TRAPs -- speech features derived from an autoregressive model applied to approximate the temporal evolution of speech spectra in critical-band-sized frequency sub-bands. The importance of free parameters, such as the model order, the length of the approximated temporal pattern, the compression factor, or the number of resulting cepstral coefficients, is investigated and evaluated using the TANDEM-ASR approach.
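The linear-prediction step at the heart of LP-TRAPs, fitting an autoregressive model to a long temporal trajectory of sub-band energy, can be sketched as follows. This is a minimal pure-Python sketch: the synthetic trajectory, the order of 8, and the helper names are illustrative assumptions, not details taken from the report.

```python
import math

def autocorr(x, max_lag):
    """Biased autocorrelation of a (mean-removed) temporal trajectory."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a[1..order].
    Returns (coefficients, final prediction error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, err

# Hypothetical slowly varying sub-band energy trajectory (100 frames)
traj = [math.sin(0.2 * t) + 0.1 * math.sin(1.3 * t) for t in range(100)]
mean = sum(traj) / len(traj)
traj = [v - mean for v in traj]

order = 8  # the "model order" free parameter studied in the report
r = autocorr(traj, order)
a, err = levinson_durbin(r, order)
```

The AR coefficients (or cepstral coefficients derived from them) would then serve as the feature vector for one sub-band, with the trajectory length and model order as the tunable parameters the report investigates.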
Automatic Out-of-Language Detection Based on Confidence Measures Derived from LVCSR Word and Phone Lattices
Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, the target language, with a medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). The results indicate that employing a specific temporal context (integrated at the word or phone level) significantly increases detection accuracy. Furthermore, we show that combining several CMs can further improve the efficiency of detection.
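The use of frame-based posteriors with a temporal context can be illustrated with a toy detector: average frame posteriors into a word-level CM, smooth the CMs over a context of neighboring words, and flag low-confidence stretches as out-of-language. This is a hypothetical sketch of the general idea, not the paper's exact estimator; the threshold and context size are assumed values.

```python
def word_confidence(frame_posteriors):
    """Average frame-level posteriors over a word's temporal span
    (one of several possible word-level CM definitions)."""
    return sum(frame_posteriors) / len(frame_posteriors)

def is_out_of_language(word_cms, threshold=0.5, context=3):
    """Flag each word as out-of-language when the CM, smoothed over a
    temporal context of neighboring words, falls below a threshold."""
    flags = []
    for i in range(len(word_cms)):
        lo = max(0, i - context // 2)
        hi = min(len(word_cms), i + context // 2 + 1)
        smoothed = sum(word_cms[lo:hi]) / (hi - lo)
        flags.append(smoothed < threshold)
    return flags
```

With smoothing, a single confidently recognized word inside a non-English stretch no longer breaks the detected segment, which mirrors the paper's finding that temporal context helps.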
Claim-Dissector: An Interpretable Fact-Checking System with Joint Re-ranking and Veracity Prediction
We present Claim-Dissector: a novel latent variable model for fact-checking
and analysis, which given a claim and a set of retrieved evidences jointly
learns to identify: (i) the relevant evidences to the given claim, (ii) the
veracity of the claim. We propose to disentangle the per-evidence relevance
probability and its contribution to the final veracity probability in an
interpretable way -- the final veracity probability is proportional to a linear
ensemble of per-evidence relevance probabilities. In this way, the individual
contributions of evidences towards the final predicted probability can be
identified. Within the per-evidence relevance probability, our model can further
distinguish whether each relevant evidence is supporting (S) or refuting (R)
the claim. This makes it possible to quantify how much the S/R probability
contributes to the final verdict, or to detect disagreeing evidence.
Despite its interpretable nature, our system achieves results competitive
with state-of-the-art on the FEVER dataset, as compared to typical two-stage
system pipelines, while using significantly fewer parameters. It also sets a new
state-of-the-art on the FAVIQ and RealFC datasets. Furthermore, our analysis shows
that our model can learn fine-grained relevance cues while using coarse-grained
supervision, and we demonstrate this in two ways. (i) We show that our model can
achieve competitive sentence recall while using only paragraph-level relevance
supervision. (ii) Traversing towards the finest granularity of relevance, we
show that our model is capable of identifying relevance at the token level. To
do this, we present a new benchmark TLR-FEVER focusing on token-level
interpretability -- humans annotate the tokens in relevant evidences that they
considered essential when making their judgment. We then measure how similar
these annotations are to the tokens our model focuses on.
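The disentangled combination described above, where the final veracity probability is proportional to a linear ensemble of per-evidence relevance probabilities, can be sketched with hypothetical numbers. The helper and the normalization choice are illustrative assumptions; the real model learns these probabilities jointly.

```python
def veracity_from_evidence(relevance, support_prob):
    """Combine per-evidence relevance p(rel_i) with per-evidence
    support-vs-refute probability p(S_i) into a veracity score.
    Because the score is a relevance-weighted linear ensemble, each
    evidence's contribution can be read off directly."""
    contributions = [r * s for r, s in zip(relevance, support_prob)]
    total_relevance = sum(relevance)
    veracity = (sum(contributions) / total_relevance
                if total_relevance else 0.5)
    return veracity, contributions

# Three retrieved evidences: two supporting, one refuting (made-up values)
relevance = [0.9, 0.7, 0.2]
support = [0.95, 0.85, 0.10]
v, contrib = veracity_from_evidence(relevance, support)
```

Inspecting `contrib` shows which evidence drove the verdict, which is the interpretability property the abstract emphasizes: a large supporting contribution next to a non-zero refuting one signals disagreeing evidence.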
Audio Coding Based on Long Temporal Segments: Experiments With Quantization of Excitation Signal
In this paper, we describe additional experiments based on a novel audio coding technique that uses an autoregressive model to approximate an audio signal's Hilbert envelope. This technique is performed over long segments (1000 ms) in critical-band-sized sub-bands. We have performed a series of experiments to find more efficient methods of quantizing the frequency components of the Hilbert carrier, which is the excitation found in the temporal audio signal. When using linear quantization, it was found that allocating 5 bits for transmitting the Hilbert carrier every 200 ms was sufficient. Other techniques, such as quantizing the first derivative of the phase and using an iterative adaptive threshold, were examined.
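The 5-bit linear quantization mentioned above amounts to a uniform scalar quantizer. A minimal sketch follows; the [0, 1] value range and the function names are assumptions for illustration, whereas the actual codec quantizes frequency components of the Hilbert carrier.

```python
def uniform_quantize(value, lo, hi, bits=5):
    """Linearly map `value` in [lo, hi] to a `bits`-bit integer code
    (5 bits gives 32 levels, matching the allocation in the paper)."""
    levels = (1 << bits) - 1
    value = min(max(value, lo), hi)   # clip out-of-range inputs
    return round((value - lo) / (hi - lo) * levels)

def uniform_dequantize(code, lo, hi, bits=5):
    """Reconstruct the quantized value from its integer code."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)
```

With 31 intervals over the range, the round-trip error is bounded by half a quantization step, which is the trade-off behind choosing how many bits to spend per 200 ms update.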
Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews
We propose a simple approach for weighting self-connecting edges in a Graph
Convolutional Network (GCN) and show its impact on depression detection from
transcribed clinical interviews. To this end, we use a GCN for modeling
non-consecutive and long-distance semantics to classify the transcriptions into
depressed or control subjects. The proposed method aims to mitigate the
limiting assumptions of locality and the equal importance of self-connections
vs. edges to neighboring nodes in GCNs, while preserving attractive properties
such as low computational cost, data agnosticism, and interpretability.
We perform an exhaustive evaluation on two benchmark datasets.
Results show that our approach consistently outperforms the vanilla GCN model
as well as previously reported results, achieving an F1=0.84 on both datasets.
Finally, a qualitative analysis illustrates the interpretability capabilities
of the proposed approach and its alignment with previous findings in
psychology.
Comment: Paper Accepted to Interspeech 202
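One way to realize weighted self-connections is to scale the identity term before the usual symmetric normalization of the adjacency matrix. The sketch below is an assumption about how such a weighting could look (plain-list matrices, hypothetical function name); the paper's exact scheme may differ.

```python
import math

def normalized_adjacency(adj, self_weight=1.0):
    """Symmetrically normalized adjacency D^(-1/2) (A + w*I) D^(-1/2),
    where w sets the importance of the self-connection relative to
    edges to neighboring nodes (a vanilla GCN fixes w = 1)."""
    n = len(adj)
    a = [[adj[i][j] + (self_weight if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j])
             if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]
```

Raising `self_weight` above 1 makes each node rely more on its own features than on its neighbors', which is exactly the locality assumption the paper sets out to relax.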
Application of Out-Of-Language Detection To Spoken-Term Detection
This paper investigates the detection of English spoken terms in a conversational multi-language scenario. The speech is processed using a large vocabulary continuous speech recognition system. The recognition output is represented in the form of word recognition lattices, which are then used to search for the required terms. Because the input may contain multi-lingual speech segments, the spoken term detection system is combined with a module performing out-of-language detection to adjust its confidence scores. First, experimental results of spoken term detection are provided on the conversational telephone speech database distributed by NIST in 2006. Then, the system is evaluated on a multi-lingual database with and without the out-of-language detection module, where we are only interested in detecting English terms (stored in the index database). Several strategies to combine these two systems in an efficient way are proposed and evaluated. Around 7% relative improvement over a stand-alone STD system is achieved.
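A simple way to let OOL detection adjust STD confidence scores, in the spirit of the combination strategies evaluated here, is to down-weight each detection by the estimated out-of-language probability. The function below and its `alpha` parameter are illustrative assumptions, not the paper's exact formula.

```python
def adjust_std_score(std_score, ool_prob, alpha=1.0):
    """Down-weight a spoken-term detection confidence by the probability
    that the surrounding segment is out-of-language. `alpha` controls
    how aggressively non-English segments are penalized (one of several
    conceivable combination strategies)."""
    return std_score * (1.0 - ool_prob) ** alpha
```

A detection inside a segment the OOL module is sure is non-English is suppressed entirely, while detections in confidently English segments keep their original score, reducing false alarms on German or Czech speech.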
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
State-of-the-art ASR systems have achieved promising results by modeling
local and global interactions separately. While the former can be computed
efficiently, global interactions are usually modeled via attention mechanisms,
which are expensive for long input sequences. Here, we address this by
extending HyperMixer, an efficient alternative to attention exhibiting linear
complexity, to the Conformer architecture for speech recognition, leading to
HyperConformer. In particular, multi-head HyperConformer achieves comparable or
higher recognition performance while being more efficient than Conformer in
terms of inference speed, memory, parameter count, and available training data.
HyperConformer achieves a word error rate of 2.9% on Librispeech test-clean
with less than 8M neural parameters and a peak memory during training of 5.7GB,
hence trainable with accessible hardware. The encoder is between 38% faster on
mid-length speech and 56% faster on long speech than an equivalent Conformer.
(The HyperConformer recipe is publicly available in:
https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
Comment: Florian Mai and Juan Zuluaga-Gomez contributed equally. To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 202
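The key idea of HyperMixer, generating the token-mixing weights from the tokens themselves so that the cost grows linearly with sequence length, can be sketched in miniature. Here plain linear maps `P1` and `P2` stand in for the paper's MLP hypernetworks, and the toy dimensions are illustrative assumptions.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def hypermixer_token_mixing(X, P1, P2):
    """Simplified HyperMixer-style token mixing for X of shape N x d.
    The token-mixing weights W1 and W2 are generated from the tokens
    themselves, so the cost is O(N * d * d'), linear in the number of
    tokens N, instead of the O(N^2) of attention."""
    W1 = matmul(X, P1)                 # N x d' mixing weights
    W2 = matmul(X, P2)                 # N x d'
    H = matmul(transpose(W1), X)       # d' x d hidden mix
    H = [[gelu(v) for v in row] for row in H]
    return matmul(W2, H)               # N x d mixed output
```

Each output token depends on all input tokens through `H`, giving global interaction without a pairwise attention matrix; in HyperConformer this mechanism replaces the attention branch of the Conformer block.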
Who Wants To Be A Millionaire? (II)
The game developed and implemented in this project, which exploits state-of-the-art speech processing technologies, is called "Who Wants to Be a Millionaire?". It simulates the well-known TV quiz show broadcast in many countries. This optional project, partially carried out at EPFL and Idiap, is a follow-up to previous work done in the summer semester of 2012. In this project, we improve the speech-control capabilities of the game, which makes it more comfortable and fun for players. Furthermore, the game application was extended into an open-architecture distributed system that can be played on different platforms.