Search CORE

6,010 research outputs found

Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

Author: Eng-Siong Chng
Rao Abhinav
Thi-Nga Ho
Publication venue
Publication date: 10/12/2022
Field of study

This paper presents the work of restoring punctuation for ASR transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay which are three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequential labeling task, however, this work adopts a slot-filling approach that predicts the presence and type of punctuation marks at each word boundary. The approach is similar to the Masked-Language Model approach employed during the pre-training stages of BERT, but instead of predicting the masked word, our model predicts masked punctuation. Additionally, we find that using Jieba1 instead of only using the built-in SentencePiece tokenizer of XLM-R can significantly improve the performance of punctuating Mandarin transcripts. Experimental results on English and Mandarin IWSLT2022 datasets and Malay News show that the proposed approach achieved state-of-the-art results for Mandarin with 73.8% F1-score while maintaining a reasonable F1-score for English and Malay, i.e. 74.7% and 78% respectively. Our source code that allows reproducing the results and building a simple web-based application for demonstration purposes is available on Github

arXiv.org e-Print Archive

Automatic punctuation restoration with BERT models

Author: Bial Bence
Nagy Attila
Ács Judit
Publication venue
Publication date: 01/01/2021
Field of study

We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on Ted Talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged F1-score of 79.8 in English and 82.2 in Hungarian. Our code is publicly available

University of Szeged

Robust Neural Machine Translation for Clean and Noisy Speech Transcripts

Author: Alessandra Brusadin
Marcello Federico
Mattia Antonino Di Gangi
Robert Enyedi
Publication venue
Publication date: 01/01/2019
Field of study

Neural machine translation models have shown to achieve high quality when trained and fed with well structured and punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is generated by an automatic speech recognition (ASR) system. In this paper, we study how to adapt a strong NMT system to make it robust to typical ASR errors. As in our application scenarios transcripts might be post-edited by human experts, we propose adaptation strategies to train a single system that can translate either clean or noisy input with no supervision on the input type. Our experimental results on a public speech translation data set show that adapting a model on a significant amount of parallel data including ASR transcripts is beneficial with test data of the same type, but produces a small degradation when translating clean text. Adapting on both clean and noisy variants of the same data leads to the best results on both input types

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Challenges of developing a digital scribe to reduce clinical documentation burden.

Author: Berkovsky S
Coiera E
Kocaballi AB
Laranjo L
Quiroz JC
Rezazadegan D
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/05/2020
Field of study

Clinicians spend a large amount of time on clinical documentation of patient encounters, often impacting quality of care and clinician satisfaction, and causing physician burnout. Advances in artificial intelligence (AI) and machine learning (ML) open the possibility of automating clinical documentation with digital scribes, using speech recognition to eliminate manual documentation by clinicians or medical scribes. However, developing a digital scribe is fraught with problems due to the complex nature of clinical environments and clinical conversations. This paper identifies and discusses major challenges associated with developing automated speech-based documentation in clinical settings: recording high-quality audio, converting audio to transcripts using speech recognition, inducing topic structure from conversation data, extracting medical concepts, generating clinically meaningful summaries of conversations, and obtaining clinical data for AI and ML algorithms

OPUS - University of Technology Sydney

A survey on bias in machine learning research

Author: Grochowski Michał
Mikołajczyk-Bareła Agnieszka
Publication venue
Publication date: 22/08/2023
Field of study

Current research on bias in machine learning often focuses on fairness, while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap between past literature on bias in research by providing taxonomy for potential sources of bias and errors in data and models. The paper focus on bias in machine learning pipelines. Survey analyses over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for its detecting and mitigating, leading to fairer, more transparent, and more accurate ML models.Comment: Submitted to journal. arXiv admin note: substantial text overlap with arXiv:2308.0946

arXiv.org e-Print Archive

Biomedical Term Extraction: NLP Techniques in Computational Medicine

Author: Campillos Llanos Leonardo
Díaz Julia
Moreno Sandoval Antonio
Redondo Teófilo
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 14/02/2022
Field of study

Artificial Intelligence (AI) and its branch Natural Language Processing (NLP) in particular are main contributors to recent advances in classifying documentation and extracting information from assorted fields, Medicine being one that has gathered a lot of attention due to the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied for the identification of key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case, Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, with several tools for optimal exploitation of the information therewith contained in said corpus. This paper also shows how these techniques and tools have been used in a prototype

Re-UNIR

Spartan Daily, March 2, 1992

Author: San Jose State University School of Journalism and Mass Communications
Publication venue: SJSU ScholarWorks
Publication date: 02/03/1992
Field of study

Volume 98, Issue 26https://scholarworks.sjsu.edu/spartandaily/8239/thumbnail.jp

SJSU ScholarWorks

Understand Legal Documents with Contextualized Large Language Models

Author: Jin Xin
Wang Yuchen
Publication venue
Publication date: 12/06/2023
Field of study

The growth of pending legal cases in populous countries, such as India, has become a major issue. Developing effective techniques to process and understand legal documents is extremely useful in resolving this problem. In this paper, we present our systems for SemEval-2023 Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that considers the comprehensive context information in both intra- and inter-sentence levels to predict rhetorical roles (subtask A) and then train a Legal-LUKE model, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B). Our evaluations demonstrate that our designed models are more accurate than baselines, e.g., with an up to 15.0% better F1 score in subtask B. We achieved notable performance in the task leaderboard, e.g., 0.834 micro F1 score, and ranked No.5 out of 27 teams in subtask A.Comment: SemEval 202

arXiv.org e-Print Archive