61 research outputs found
Joint Learning of Correlated Sequence Labelling Tasks Using Bidirectional Recurrent Neural Networks
The stream of words produced by Automatic Speech Recognition (ASR) systems is
typically devoid of punctuation and formatting. Most natural language
processing applications expect segmented and well-formatted texts as input,
which is not available in ASR output. This paper proposes a novel technique of
jointly modeling multiple correlated tasks such as punctuation and
capitalization using bidirectional recurrent neural networks, which leads to
improved performance for each of these tasks. This method could be extended for
joint modeling of any other correlated sequence labeling tasks.
Comment: Accepted in Interspeech 201
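The joint-modeling idea above can be sketched with a toy bidirectional RNN that shares one recurrent encoder between two output heads, one per task. All dimensions, weights, and the three punctuation / two capitalization classes below are invented for illustration; the paper's actual architecture and training procedure are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim word embeddings, 16-dim hidden states,
# 3 punctuation classes (none, comma, period), 2 capitalization classes.
E, H, P, C = 8, 16, 3, 2

# Shared BiRNN parameters (random, untrained — a structural sketch only).
Wf, Uf = rng.normal(size=(H, E)), rng.normal(size=(H, H))
Wb, Ub = rng.normal(size=(H, E)), rng.normal(size=(H, H))
# One output head per correlated task, both reading the same state.
Wp = rng.normal(size=(P, 2 * H))
Wc = rng.normal(size=(C, 2 * H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_tag(embeddings):
    """Return per-token (punctuation, capitalization) distributions."""
    T = len(embeddings)
    fwd, bwd = np.zeros((T, H)), np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                    # left-to-right pass
        h = np.tanh(Wf @ embeddings[t] + Uf @ h)
        fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):          # right-to-left pass
        h = np.tanh(Wb @ embeddings[t] + Ub @ h)
        bwd[t] = h
    out = []
    for t in range(T):
        state = np.concatenate([fwd[t], bwd[t]])  # shared representation
        out.append((softmax(Wp @ state), softmax(Wc @ state)))
    return out

tags = joint_tag(rng.normal(size=(5, E)))
```

Because both heads read the same concatenated forward/backward state, a training signal from either task updates the shared encoder, which is the mechanism by which joint learning can improve each task.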
Comparing Different Methods for Disfluency Structure Detection
This paper presents a number of experiments focusing on assessing
the performance of different machine learning methods on the identification of disfluencies and their distinct structural regions over speech data. Several machine learning methods have been applied, namely Naive Bayes, Logistic Regression, Classification and Regression Trees (CARTs), J48 and Multilayer Perceptron. Our experiments show that CARTs outperform the other methods on the identification of the distinct structural disfluent regions. Reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese, containing about 32h of speech and about 7.7% of disfluencies. The set of features automatically extracted from the forced alignment corpus proved to be discriminative of the regions contained in the production of a disfluency. This work shows that
using fully automatic prosodic features, disfluency structural regions
can be reliably identified using CARTs, where the best results achieved correspond to 81.5% precision, 27.6% recall, and 41.2% F-measure. The best results concern the detection of the interregnum, followed by the detection of the interruption point.
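A CART classifier of the kind used above works by recursively splitting on feature thresholds; the smallest possible version is a single split. The feature values, threshold, and labels below are invented for illustration, not taken from the paper's corpus:

```python
# Toy prosodic observations: (pause_duration_ms, pitch_reset, label).
# A CART repeatedly picks the split that best separates the classes;
# here we hand-pick one split on pause duration as a one-node "tree".
samples = [
    (350, 1.2, "interruption_point"),
    (420, 0.9, "interruption_point"),
    (40,  0.1, "fluent"),
    (15,  0.2, "fluent"),
]

def stump_predict(pause_ms, threshold=200):
    """One-node 'tree': a long pause suggests an interruption point."""
    return "interruption_point" if pause_ms >= threshold else "fluent"

preds = [stump_predict(p) for p, _, _ in samples]
accuracy = sum(pred == label
               for pred, (_, _, label) in zip(preds, samples)) / len(samples)
```

A real CART would choose the splitting feature and threshold automatically (e.g. by Gini impurity) and grow further nodes; this stump only shows the decision structure the abstract refers to.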
Comparing different machine learning approaches for disfluency structure detection in a corpus of university lectures
This paper presents a number of experiments focusing on assessing the performance of different machine learning methods on the identification of disfluencies and their distinct structural regions over speech data. Several machine learning methods have been applied, namely Naive Bayes, Logistic Regression, Classification and Regression Trees (CARTs), J48 and Multilayer Perceptron.
Our experiments show that CARTs outperform the other methods on the identification of the distinct structural disfluent regions. Reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese, containing about 32h of speech and about 7.7% of disfluencies. The set of features automatically extracted from the forced alignment corpus proved to be discriminative of the regions contained in the production of a disfluency. This work shows that using fully automatic prosodic features, disfluency structural regions can be reliably identified using CARTs, where the best results achieved correspond to 81.5% precision, 27.6% recall, and 41.2% F-measure. The best results concern the detection of the interregnum, followed by the detection of the interruption point.
End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
The SOTA in transcription of disfluent and conversational speech has in
recent years favored two-stage models, with separate transcription and cleaning
stages. We believe that previous attempts at end-to-end disfluency removal have
fallen short because of the representational advantage that large-scale
language model pretraining has given to lexical models. Until recently, the
high dimensionality and limited availability of large audio datasets inhibited
the development of large-scale self-supervised pretraining objectives for
learning effective audio representations, giving a relative advantage to the
two-stage approach, which utilises pretrained representations for lexical
tokens. In light of recent successes in large-scale audio pretraining, we
revisit the performance comparison between two-stage and end-to-end models and
find that audio-based language models pretrained using weak self-supervised
objectives match or exceed the performance of similarly trained two-stage
models, and further, that the choice of pretraining objective substantially
affects a model's ability to be adapted to the disfluency removal task.
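The two-stage baseline the abstract contrasts against can be sketched as a pipeline: a transcription stage produces verbatim tokens, and a cleaning stage tags and removes disfluent ones. The tokens, tags, and tag scheme below are stand-ins invented for illustration; real systems would use an ASR model and a pretrained lexical tagger.

```python
# Minimal sketch of the two-stage setup: transcribe, then clean.
def transcribe(audio):
    # Stand-in for an ASR system's verbatim output (audio is unused here).
    return ["i", "uh", "i", "want", "want", "coffee"]

def tag_disfluencies(tokens):
    # Stand-in for a pretrained lexical model labelling each token:
    # "E" = filler/edit (to be removed), "F" = fluent (to be kept).
    tags = ["F", "E", "E", "E", "F", "F"]
    return list(zip(tokens, tags))

def clean(tagged):
    return [tok for tok, tag in tagged if tag == "F"]

fluent = clean(tag_disfluencies(transcribe(None)))
print(" ".join(fluent))  # "i want coffee"
```

An end-to-end model collapses both stages into one network mapping audio directly to the cleaned transcript, which is why the quality of its audio representations (and hence the pretraining objective) matters so much.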
Sentence boundary detection in chinese broadcast news using conditional random fields and prosodic features
In this paper, we explore the use of prosodic features in sentence boundary detection in Chinese broadcast news. The prosodic features include speaker turn, music, pause duration, pitch, energy and speaking rate. Specifically, considering the Chinese tonal effects on pitch trajectory, we propose to use tone-normalized pitch features. Experiments using decision trees demonstrate that the tone-normalized pitch features show superior performance in sentence boundary detection in Chinese broadcast news. Furthermore, feature combination achieves a clear performance improvement via intuitive feature-interaction rules formed in the decision tree. Pause duration and a tone-normalized pitch feature account for most of the feature usage in the best-performing decision tree. Index Terms: sentence boundary detection, sentence segmentation, speech prosody, rich transcription
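One simple form of tone normalization consistent with the idea above is to subtract, from each syllable's pitch, the mean pitch of its lexical tone category, so the residual reflects sentence-level prosody rather than lexical tone. The f0 values and the per-category-mean scheme below are invented for illustration; the paper's exact normalization is not specified here.

```python
# Toy f0 values (Hz) for syllables carrying Mandarin tones 1 and 4.
observations = [("T1", 260.0), ("T1", 250.0), ("T4", 210.0), ("T4", 190.0)]

# Mean f0 per tone category.
by_tone = {}
for tone, f0 in observations:
    by_tone.setdefault(tone, []).append(f0)
means = {tone: sum(v) / len(v) for tone, v in by_tone.items()}

# Tone-normalized pitch: residual after removing the tone-category mean.
normalized = [(tone, f0 - means[tone]) for tone, f0 in observations]
```

After this step, a drop in normalized pitch before a pause is more likely to signal sentence-final declination than a falling lexical tone, which is what makes the feature useful to the boundary detector.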