7 research outputs found

    Blending Sentence Optimization Weights of Unsupervised Approaches for Extractive Speech Summarization

    This paper evaluates the performance of two unsupervised approaches to speech summarization: Maximum Marginal Relevance (MMR) and a concept-based global optimization framework. Automatic summarization is a very useful technique that can help users browse large amounts of data. This study focuses on automatic extractive summarization on a multi-dialogue speech corpus. We propose improved methods that blend the two unsupervised approaches at the sentence level. Sentence-level information is leveraged to improve the linguistic quality of the selected summaries. First, sentence scores are used to filter sentences for concept extraction and concept weight computation. Second, we pre-select a subset of candidate summary sentences according to their sentence weights. Last, we extend the optimization function to a joint optimization of concept and sentence weights, covering both important concepts and important sentences. Our experimental results show that these methods improve system performance compared to the concept-based optimization baseline for both human transcripts and ASR output. The best scores are achieved by combining all three approaches, which is significantly better than the baseline system.
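    Concept-based summarization baselines of this kind are commonly formulated as an integer linear program. A minimal sketch of how a joint optimization of concept and sentence weights could be written, assuming the standard binary selection variables; the symbols w_i, u_j, lambda and A_ij are our notation, not necessarily the paper's:

```latex
% Sketch of a joint concept/sentence selection ILP (our notation).
% c_i, s_j \in \{0,1\} select concept i / sentence j; w_i, u_j are
% concept and sentence weights; A_{ij} = 1 iff sentence j contains
% concept i; l_j is the length of sentence j; L is the length budget.
\begin{align*}
  \max_{c,\,s}\;\; & \sum_i w_i\, c_i \;+\; \lambda \sum_j u_j\, s_j \\
  \text{s.t.}\;\;  & \sum_j l_j\, s_j \le L, \\
                   & s_j\, A_{ij} \le c_i \quad \forall i,\, j, \\
                   & \sum_j A_{ij}\, s_j \ge c_i \quad \forall i.
\end{align*}
```

    Setting lambda = 0 recovers a concept-only objective; the two consistency constraints make a concept count as covered exactly when some selected sentence contains it.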

    The AMIDA 2009 Meeting Transcription System

    We present the AMIDA 2009 system for participation in the NIST RT’2009 STT evaluations. Systems for the close-talking, far-field and speaker-attributed STT conditions are described. Improvements over our previous systems include: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; and automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

    Recognition and Understanding of Meetings The AMI and AMIDA Projects

    The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty meetings. Within these projects we have: developed an infrastructure for recording meetings using multiple microphones and cameras; released a 100-hour annotated corpus of meetings; developed techniques for the recognition and interpretation of meetings based primarily on speech recognition and computer vision; and developed an evaluation framework at both the component and system levels. In this paper we present an overview of these projects, with an emphasis on speech recognition and content extraction.

    Speaker normalisation for large vocabulary multiparty conversational speech recognition

    One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noises, etc.) and to the variability of speech across different speakers (i.e. due to different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis, parameterised through a warping factor.

    In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task showing great variability of the speech acoustics, both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using Maximum Likelihood seem to be context dependent: they appear to be influenced by the current conversational partner and are correlated with the behaviour of the formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch.

    We also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN. We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level, using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in word error rate across all the tasks, and that combining at the system level using ROVER brings a further significant improvement.

    Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation that is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this was also shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.
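    For readers unfamiliar with VTLN, the frequency-axis warping described above is often implemented as a piecewise-linear function. A minimal sketch in Python; the 0.85 breakpoint ratio and the exact segment form are illustrative assumptions, not details taken from the thesis:

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyquist, break_ratio=0.85):
    """Piecewise-linear VTLN frequency warping (a common variant).

    Below the breakpoint, frequencies are scaled by the speaker-specific
    warping factor `alpha`; above it, a second linear segment maps the
    Nyquist frequency onto itself so the warped axis stays in range.
    The 0.85 breakpoint ratio is an illustrative default.
    """
    freqs = np.asarray(freqs, dtype=float)
    f_break = break_ratio * f_nyquist
    upper_slope = (f_nyquist - alpha * f_break) / (f_nyquist - f_break)
    return np.where(
        freqs <= f_break,
        alpha * freqs,
        alpha * f_break + upper_slope * (freqs - f_break),
    )

# Example: warp the centre frequencies of a filterbank for one speaker.
centres = np.linspace(0.0, 8000.0, 24)
warped = vtln_warp(centres, alpha=1.08, f_nyquist=8000.0)
```

    In Maximum Likelihood VTLN, the warping factor is typically chosen per speaker by a grid search over a small range around 1.0, keeping the value that maximises the likelihood of the warped features under the acoustic model.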

    Generating automated meeting summaries

    The thesis at hand introduces a novel approach for the generation of abstractive summaries of meetings. While the automatic generation of document summaries has been studied for some decades now, the novelty of this thesis lies mainly in its application to the meeting domain (instead of text documents), as well as in the use of a lexicalized representation formalism on the basis of Frame Semantics. This allows us to generate summaries abstractively (instead of extractively). We argue that abstractive approaches are better suited than extractive ones for summarizing spontaneous spoken interactions.

    The SSPNet-Mobile Corpus: from the detection of non-verbal cues to the inference of social behaviour during mobile phone conversations

    Mobile phones are one of the main channels of communication in contemporary society. However, the effect of the mobile phone on both the process of, and the non-verbal behaviours used during, conversations mediated by this technology remains poorly understood. This thesis investigates the role of the phone in the negotiation process, as well as the automatic analysis of non-verbal behavioural cues during mobile phone conversations, following the Social Signal Processing approach. The work in this thesis includes: the collection of a corpus of 60 mobile phone conversations involving 120 subjects; the development of methods for the detection of non-verbal behavioural events (laughter, fillers, speech and silence); the inference of characteristics influencing social interactions (personality traits and conflict handling style) from speech and movements while using the mobile telephone; and the analysis of several factors that influence the outcome of decision-making processes over mobile phones (gender, age, personality, conflict handling style and caller versus receiver role). The findings show that behavioural events can be recognised at levels well above chance by employing statistical language models, and that personality traits and conflict handling styles can be partially recognised. Among the factors analysed, participant role (caller versus receiver) was the most important in determining the outcome of negotiation processes in the case of disagreement between parties. Finally, the corpus collected for the experiments (the SSPNet-Mobile Corpus) has been used in an international benchmarking campaign and constitutes a valuable resource for future research in Social Signal Processing and, more generally, in the area of human-human communication.
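    As an illustration of the kind of statistical language model mentioned above, here is a minimal sketch of a bigram model over the four behavioural-event labels from the abstract. The class name, the add-one smoothing and the API are our assumptions; the thesis's actual models may differ:

```python
from collections import Counter, defaultdict
import math

# Event inventory from the abstract; everything else here is illustrative.
EVENTS = ["laughter", "filler", "speech", "silence"]

class BigramEventModel:
    """Minimal bigram language model over behavioural-event sequences."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)  # bigrams[prev][cur] = count
        self.unigrams = Counter()            # unigrams[prev] = count

    def fit(self, sequences):
        """Count bigram transitions over training label sequences."""
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[prev][cur] += 1
                self.unigrams[prev] += 1

    def log_prob(self, seq):
        """Add-one smoothed log-probability of a label sequence."""
        lp = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = self.bigrams[prev][cur] + 1
            den = self.unigrams[prev] + len(EVENTS)
            lp += math.log(num / den)
        return lp

# Example: prefer the more plausible of two candidate label sequences.
model = BigramEventModel()
model.fit([["silence", "speech", "filler", "speech", "laughter", "silence"]])
a = ["silence", "speech", "filler", "speech"]
b = ["laughter", "laughter", "laughter", "laughter"]
best = max((a, b), key=model.log_prob)
```

    A model of this kind can score competing labellings of a conversation into event sequences, with the highest-probability sequence taken as the recogniser's output.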