
    Automated speech and audio analysis for semantic access to multimedia

    The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content and, as a consequence, improve the effectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques will be presented, including the alignment of speech and text resources, large vocabulary speech recognition, keyword spotting and speaker classification. The applicability of these techniques will be discussed from a media-crossing perspective. The added value of the techniques and their potential contribution to the content value chain will be illustrated by the description of two (complementary) demonstrators for browsing broadcast news archives.

    The AMIDA 2009 Meeting Transcription System

    We present the AMIDA 2009 system for participation in the NIST RT’2009 STT evaluations. Systems for close-talking, far-field and speaker-attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

    Visual recognition of gestures in a meeting to detect when documents being talked about are missing

    Meetings frequently involve discussion of documents and can be significantly affected if a document is absent. An agent system capable of spontaneously retrieving a document at the point it is needed would have to judge whether a meeting is talking about a particular document and whether that document is already present. We report the exploratory application of agent techniques for making these two judgements. To obtain examples from which an agent system can learn, we first conducted a study of participants making these judgements with video recordings of meetings. We then show that interactions between hands and paper documents in meetings can be used to recognise when a document being talked about is not to hand. The work demonstrates the potential for multimodal agent systems using these techniques to learn to perform specific, discourse-level tasks during meetings.

    An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization

    Full text link

    Recognition and Understanding of Meetings The AMI and AMIDA Projects

    The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty meetings. Within these projects we have: developed an infrastructure for recording meetings using multiple microphones and cameras; released a 100-hour annotated corpus of meetings; developed techniques for the recognition and interpretation of meetings based primarily on speech recognition and computer vision; and developed an evaluation framework at both component and system levels. In this paper we present an overview of these projects, with an emphasis on speech recognition and content extraction.

    Multidisciplinary perspectives on automatic analysis of children's language samples: where do we go from here?

    BACKGROUND: Language sample analysis (LSA) is invaluable for describing and understanding child language use and development, both for clinical purposes and for research. Digital tools supporting LSA are available, but many of the LSA steps have not been automated. Nevertheless, programs that include automatic speech recognition (ASR), the first step of LSA, have already reached mainstream applicability. SUMMARY: To better understand the complexity, challenges, and future needs of automatic LSA from a technological perspective, including the tasks of transcribing, annotating, and analysing natural child language samples, this article takes on a multidisciplinary view. Requirements of a fully automated LSA process are characterized, features of existing LSA software tools are compared, and prior work from the disciplines of information science and computational linguistics is reviewed. KEY MESSAGES: Existing tools vary in the extent of automation provided across the process of LSA. Advances in machine learning for speech recognition and processing have potential to facilitate LSA, but the specifics of child speech and language as well as the lack of child data complicate software design. A transdisciplinary approach is recommended as feasible to support future software development for LSA.

    Acoustic Beamforming for Speaker Diarization of Meetings

    Full text link

    An Information Theoretic Approach to Speaker Diarization of Meeting Recordings

    In this thesis we investigate a non-parametric approach to speaker diarization for meeting recordings based on an information-theoretic framework. The problem is formulated using the Information Bottleneck (IB) principle. Unlike other approaches, where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. The distance between speech segments is chosen as the Jensen-Shannon divergence, as it arises from the optimization of the IB objective function. In the first part of the thesis, we explore IB-based diarization with Mel frequency cepstral coefficients (MFCC) as input features. We study issues related to IB-based speaker diarization, such as optimization of the IB objective function and criteria for inferring the number of speakers. Furthermore, we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (Rich Transcription) meeting data for speaker diarization. The IB-based system achieves a similar speaker error rate (16.8%) to a baseline HMM/GMM system (17.0%). Being non-parametric clustering, this approach performs diarization six times faster than real time, while the baseline is slower than real time. The second part of the thesis proposes a novel feature combination system in the context of IB diarization. Both the speaker clustering and speaker realignment steps are discussed. In contrast to current systems, the proposed method avoids combining features by averaging log-likelihood scores. Two different sets of features were considered: (a) a combination of MFCC features with time delay of arrival (TDOA) features; (b) a four-feature-stream combination of MFCC, TDOA, modulation spectrum and frequency domain linear prediction.
    Experiments show that the proposed system achieves a 5% absolute improvement over the baseline in the case of the two-feature combination, and 7% in the case of the four-feature combination. The increase in algorithmic complexity of the IB system with more features is minimal. The system with four input feature streams runs in real time, ten times faster than the GMM-based system.
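    The Jensen-Shannon divergence used as the segment distance above can be sketched in a few lines. This is a minimal illustrative computation on toy discrete distributions, not the thesis system; the example distributions are invented for demonstration.

    ```python
    import math

    def kl(p, q):
        """Kullback-Leibler divergence KL(p || q) for discrete distributions (nats)."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def js_divergence(p, q):
        """Jensen-Shannon divergence: symmetric and bounded by log 2 (nats)."""
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Toy posterior distributions for two hypothetical speech segments
    p = [0.7, 0.2, 0.1]
    q = [0.1, 0.3, 0.6]

    print(js_divergence(p, q))   # small positive value, same if arguments are swapped
    print(js_divergence(p, p))   # → 0.0 (identical distributions)
    ```

    Unlike KL divergence, this quantity is symmetric and always finite, which is one reason it is convenient as a clustering distance between segments.
    
    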