5,177 research outputs found

    The Blame Game: Performance Analysis of Speaker Diarization System Components

    Get PDF
    In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI at the NIST RT06s evaluation benchmark. The analysis that is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The lack of ability to model verlapping speech is also a large source of errors (22%) followed by the component that creates the initial system models (15%)

    Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding

    Full text link
    Abstractive community detection is an important spoken language understanding task, whose goal is to group utterances in a conversation according to whether they can be jointly summarized by a common abstractive sentence. This paper provides a novel approach to this task. We first introduce a neural contextual utterance encoder featuring three types of self-attention mechanisms. We then train it using the siamese and triplet energy-based meta-architectures. Experiments on the AMI corpus show that our system outperforms multiple energy-based and non-energy based baselines from the state-of-the-art. Code and data are publicly available.Comment: Update baseline

    Filtering the Unknown: Speech Activity Detection in Heterogeneous Video Collections

    Get PDF
    In this paper we discuss the speech activity detection system that we used for detecting speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech like music or sound effects out of the signal without the use of predefined non-speech models. Because the system trains its models on-line, it is robust for handling out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five minute TRECVID fragments was 11.5%

    Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN

    Get PDF
    This paper is concerned with the automatic recognition of dialogue acts (DAs) in multiparty conversational speech. We present a joint generative model for DA recognition in which segmentation and classification of DAs are carried out in parallel. Our approach to DA recognition is based on a switching dynamic Bayesian network (DBN) architecture. This generative approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. The switching DBN coordinates the recognition process by integrating the component models. The factored language model, which is estimated from multiple conversational data corpora, is used in conjunction with additional task-specific language models. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. We have carried out experiments on the AMI corpus of multimodal meeting recordings, using both manually transcribed speech, and the output of an automatic speech recognizer, and using different configurations of the generative model. Our results indicate that the system performs well both on reference and fully automatic transcriptions. A further significant improvement in recognition accuracy is obtained by the application of the discriminative reranking approach based on conditional random fields

    A Memory Augmented Architecture for Continuous Speaker Identification in Meetings

    Full text link
    We introduce and analyze a novel approach to the problem of speaker identification in multi-party recorded meetings. Given a speech segment and a set of available candidate profiles, we propose a novel data-driven way to model the distance relations between them, aiming at identifying the speaker label corresponding to that segment. To achieve this we employ a recurrent, memory-based architecture, since this class of neural networks has been shown to yield advanced performance in problems requiring relational reasoning. The proposed encoding of distance relations is shown to outperform traditional distance metrics, such as the cosine distance. Additional improvements are reported when the temporal continuity of the audio signals and the speaker changes is modeled in. In this paper, we have evaluated our method in two different tasks, i.e. scripted and real-world business meeting scenarios, where we report a relative reduction in speaker error rate of 39.28% and 51.84%, respectively, compared to the baseline.Comment: Submitted to ICASSP 202

    Genome-Wide Association with Select Biomarker Traits in the Framingham Heart Study

    Get PDF
    BACKGROUND: Systemic biomarkers provide insights into disease pathogenesis, diagnosis, and risk stratification. Many systemic biomarker concentrations are heritable phenotypes. Genome-wide association studies (GWAS) provide mechanisms to investigate the genetic contributions to biomarker variability unconstrained by current knowledge of physiological relations. METHODS: We examined the association of Affymetrix 100K GeneChip single nucleotide polymorphisms (SNPs) to 22 systemic biomarker concentrations in 4 biological domains: inflammation/oxidative stress; natriuretic peptides; liver function; and vitamins. Related members of the Framingham Offspring cohort (n = 1012; mean age 59 ± 10 years, 51% women) had both phenotype and genotype data (minimum-maximum per phenotype n = 507–1008). We used Generalized Estimating Equations (GEE), Family Based Association Tests (FBAT) and variance components linkage to relate SNPs to multivariable-adjusted biomarker residuals. Autosomal SNPs (n = 70,987) meeting the following criteria were studied: minor allele frequency ≥ 10%, call rate ≥ 80% and Hardy-Weinberg equilibrium p ≥ 0.001. RESULTS: With GEE, 58 SNPs had p < 10-6: the top SNPs were rs2494250 (p = 1.00*10-14) and rs4128725 (p = 3.68*10-12) for monocyte chemoattractant protein-1 (MCP1), and rs2794520 (p = 2.83*10-8) and rs2808629 (p = 3.19*10-8) for C-reactive protein (CRP) averaged from 3 examinations (over about 20 years). With FBAT, 11 SNPs had p < 10-6: the top SNPs were the same for MCP1 (rs4128725, p = 3.28*10-8, and rs2494250, p = 3.55*10-8), and also included B-type natriuretic peptide (rs437021, p = 1.01*10-6) and Vitamin K percent undercarboxylated osteocalcin (rs2052028, p = 1.07*10-6). The peak LOD (logarithm of the odds) scores were for MCP1 (4.38, chromosome 1) and CRP (3.28, chromosome 1; previously described) concentrations; of note the 1.5 support interval included the MCP1 and CRP SNPs reported above (GEE model). Previous candidate SNP associations with circulating CRP concentrations were replicated at p < 0.05; the SNPs rs2794520 and rs2808629 are in linkage disequilibrium with previously reported SNPs. GEE, FBAT and linkage results are posted at . CONCLUSION: The Framingham GWAS represents a resource to describe potentially novel genetic influences on systemic biomarker variability. The newly described associations will need to be replicated in other studies.National Heart, Lung, and Blood Institute's Framingham Heart Study (N01-HC25195); National Institutes of Health National Center for Research Resources Shared Instrumentation grant (1S10RR163736-01A1); National Institutes of Health (HL064753, HL076784, AG028321, HL71039, 2 K24HL04334, 1K23 HL083102); Doris Duke Charitable Foundation; American Diabetes Association Career Developement Award; National Center for Research Resources (GCRC M01-RR01066); US Department of Agriculture Agricultural Research Service (58-1950-001, 58-1950-401); National Institute of Aging (AG14759

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    Full text link
    The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The goal is for participants to devise a single system that can generalize across different array geometries and use cases with no a-priori information. Another departure from earlier CHiME iterations is that participants are allowed to use open-source pre-trained models and datasets. In this paper, we describe the challenge design, motivation, and fundamental research questions in detail. We also present the baseline system, which is fully array-topology agnostic and features multi-channel diarization, channel selection, guided source separation and a robust ASR model that leverages self-supervised speech representations (SSLR)

    Analysis of Unsupervised and Noise-Robust Speaker-Adaptive HMM-Based Speech Synthesis Systems toward a Unified ASR and TTS Framework

    Get PDF
    For the 2009 Blizzard Challenge we have built an unsupervised version of the HTS-2008 speaker-adaptive HMM-based speech synthesis system for English, and a noise robust version of the systems for Mandarin. They are designed from a multidisciplinary application point of view in that we attempt to integrate the components of the TTS system with other technologies such as ASR. All the average voice models are trained exclusively from recognized, publicly available, ASR databases. Multi-pass LVCSR and confidence scores calculated from confusion network are used for the unsupervised systems, and noisy data recorded in cars or public spaces is used for the noise robust system. We believe the developed systems form solid benchmarks and provide good connections to ASR fields. This paper describes the development of the systems and reports the results and analysis of their evaluation
    corecore