Restoring Punctuation and Capitalization in Transcribed Speech
Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
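The single-pass idea can be sketched with a toy bigram model: punctuation marks and cased word forms are treated as ordinary vocabulary items, and the decoder keeps whichever token variants the model scores highest. Everything below (the `restore` and `variants` helpers, the add-one smoothing, the tiny corpus) is an illustrative assumption, not the paper's system, which used much larger n-gram models and beam search rather than exhaustive enumeration.

```python
import math
from collections import defaultdict
from itertools import product

def train_bigram(corpus):
    """Count bigrams; punctuation marks and cased words are ordinary
    vocabulary items, so one model handles both restoration tasks."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts

def avg_logprob(counts, tokens):
    """Length-normalised, add-one-smoothed bigram score."""
    toks = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for a, b in zip(toks, toks[1:]):
        ctx = sum(counts[a].values())
        total += math.log((counts[a][b] + 1) / (ctx + 1000))
    return total / (len(toks) - 1)

def variants(word):
    """Per-word hypotheses: original or capitalised form, optionally
    followed by a comma or full stop (a deliberately tiny variant set)."""
    out = []
    for form in (word, word.capitalize()):
        for punct in ([], [","], ["."]):
            out.append([form] + punct)
    return out

def restore(counts, words):
    """Exhaustive search over all variant combinations (fine for short
    inputs; realistic systems would use beam search instead)."""
    best = max(
        (sum(combo, []) for combo in product(*map(variants, words))),
        key=lambda toks: avg_logprob(counts, toks),
    )
    return " ".join(best)

# Tiny written-form "training" corpus (cased and punctuated).
corpus = ["Hello , world .", "Hello there .", "the world is round ."]
model = train_bigram(corpus)
print(restore(model, ["hello", "world"]))  # -> Hello , world .
```

Because punctuation and case live in the same vocabulary, a single language-model pass decides both at once, which is the core of the single-pass approach described above.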
Joint Learning of Correlated Sequence Labelling Tasks Using Bidirectional Recurrent Neural Networks
The stream of words produced by Automatic Speech Recognition (ASR) systems is typically devoid of punctuation and formatting. Most natural language processing applications expect segmented and well-formatted text as input, which is not available in ASR output. This paper proposes a novel technique for jointly modeling multiple correlated tasks, such as punctuation and capitalization, using bidirectional recurrent neural networks, which leads to improved performance on each of these tasks. This method could be extended to joint modeling of any other correlated sequence labeling tasks. Comment: Accepted at Interspeech 201
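The joint-labelling setup can be made concrete by showing how training targets might be derived: each spoken-form token gets one composite tag encoding both its capitalization and the punctuation that follows it, so a single softmax over joint tags (or two heads sharing a BiRNN encoder) learns both tasks together. The `make_joint_labels` helper and tag names below are hypothetical, a minimal sketch of this label encoding rather than the paper's implementation.

```python
def make_joint_labels(written):
    """Derive a spoken-form token stream plus one joint
    (capitalisation|punctuation) label per token from written-form
    text, as a joint sequence-labelling model would train on."""
    words, labels = [], []
    tokens = written.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        cap = "CAP" if tok[:1].isupper() else "LOWER"
        punct = "O"
        # Fold a following punctuation mark into this token's label.
        if i + 1 < len(tokens) and tokens[i + 1] in {",", ".", "?"}:
            punct = tokens[i + 1]
            i += 1
        words.append(tok.lower())
        labels.append(f"{cap}|{punct}")
        i += 1
    return words, labels

words, labels = make_joint_labels("Hello , how are you ?")
print(words)   # ['hello', 'how', 'are', 'you']
print(labels)  # ['CAP|,', 'LOWER|O', 'LOWER|O', 'LOWER|?']
```

Because the two sub-labels are predicted from one shared representation, the model can exploit their correlation (e.g. capitalization is likely right after a full stop), which is what drives the joint gains reported above.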
Query recovery of short user queries: on query expansion with stopwords
User queries to search engines are observed to predominantly contain inflected content words but lack stopwords and capitalization. Thus, they often resemble natural language queries after case folding and stopword removal. Query recovery aims to generate a linguistically well-formed query from a given user query as input for natural language processing tasks and cross-language information retrieval (CLIR). The evaluation of query translation shows that translation scores (NIST and BLEU) decrease after case folding, stopword removal, and stemming. A baseline method for query recovery reconstructs capitalization and stopwords, which considerably increases translation scores and significantly increases mean average precision for a standard CLIR task.
Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition
Features such as punctuation, capitalization, and formatting of entities are important for readability, understanding, and natural language processing tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form text devoid of formatting, and tagging approaches to formatting address just one or two features at a time. In this paper, we unify spoken-to-written text conversion via a two-stage process: First, we use a single transformer tagging model to jointly produce token-level tags for inverse text normalization (ITN), punctuation, capitalization, and disfluencies. Then, we apply the tags to generate written-form text and use weighted finite state transducer (WFST) grammars to format tagged ITN entity spans. Despite joining four models into one, our unified tagging approach matches or outperforms task-specific models across all four tasks on benchmark test sets across several domains.
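The second stage, rendering written-form text from token-level tags, can be illustrated with a small sketch. The tag names here (`CAP`, `PERIOD`, `COMMA`, `DELETE`, `O`) are simplified assumptions rather than the paper's actual tag inventory, and the WFST grammars that format ITN entity spans are omitted; plain tags stand in for that step.

```python
def apply_tags(tokens, tags):
    """Render written-form text by applying per-token formatting tags
    to spoken-form tokens. Tags: CAP (capitalise), COMMA/PERIOD
    (append punctuation), DELETE (drop a disfluency), O (leave as-is).
    Multiple tags on one token are joined with '+'."""
    out = []
    for tok, tag_str in zip(tokens, tags):
        tag_set = set(tag_str.split("+"))
        if "DELETE" in tag_set:      # disfluency removal
            continue
        if "CAP" in tag_set:
            tok = tok.capitalize()
        if "COMMA" in tag_set:
            tok += ","
        if "PERIOD" in tag_set:
            tok += "."
        out.append(tok)
    return " ".join(out)

spoken = ["uh", "i", "mean", "see", "you", "monday"]
tags   = ["DELETE", "DELETE", "DELETE", "CAP", "O", "CAP+PERIOD"]
print(apply_tags(spoken, tags))  # -> See you Monday.
```

Keeping the tagger's output this simple is what lets one transformer cover all four tasks: the model only predicts labels, and deterministic rendering (plus WFST grammars for ITN spans) does the rest.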
Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach
This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament and on video subtitles, revealing that the proposed approach is suitable for performing capitalization not only in each of the domains but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved both for monolingual and multilingual data. Finally, we applied domain adaptation by fine-tuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in terms of computational cost.
Punctuation Prediction for Norwegian: Using Established Approaches for Under-Resourced Languages
Master's thesis in Information Science (INFO390MASV-INF)
Automatic punctuation restoration with BERT models
We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on TED Talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged F1-score of 79.8 in English and 82.2 in Hungarian. Our code is publicly available.
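The macro-averaged F1 reported here averages per-class F1 scores with equal weight, so rare punctuation marks (e.g. question marks) count as much as frequent commas. A minimal sketch of the metric, with illustrative class names rather than the paper's exact label set:

```python
def macro_f1(gold, pred, classes):
    """Macro-averaged F1 over punctuation classes: per-class F1 from
    token-level counts, then an unweighted mean across classes."""
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# One gold/predicted punctuation tag per token ("O" = no punctuation).
gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION"]
pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "O"]
print(round(macro_f1(gold, pred, ["COMMA", "PERIOD", "QUESTION"]), 3))
```

Here COMMA scores F1 = 2/3, PERIOD scores 1.0, and the entirely missed QUESTION class scores 0, giving a macro average of about 0.556; a frequency-weighted average would have hidden that miss.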