Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin
This paper presents work on restoring punctuation in transcripts generated by multilingual automatic speech recognition (ASR) systems. The focus languages are English, Mandarin, and Malay, three of the most widely spoken languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as sequence labeling; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked language modeling objective used during BERT pre-training, but instead of predicting a masked word, our model predicts masked punctuation. Additionally, we find that using Jieba for word segmentation, rather than relying only on the built-in SentencePiece tokenizer of XLM-R, significantly improves performance when punctuating Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and on Malay News show that the proposed approach achieves state-of-the-art results for Mandarin with a 73.8% F1-score, while maintaining reasonable F1-scores for English and Malay, i.e., 74.7% and 78% respectively. Our source code, which allows reproducing the results and building a simple web-based application for demonstration purposes, is available on GitHub.
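The abstract's two key ideas, Jieba pre-segmentation before XLM-R subword tokenization and punctuation prediction as a per-word-boundary slot-filling task, can be illustrated with a minimal sketch. This is not the authors' released code: the label set, the linear classification head, and the "last subword of each word" boundary heuristic are assumptions made for illustration.

```python
# Sketch: Jieba word segmentation feeding XLM-R, with one punctuation slot per word boundary.
# The label set and classification head are illustrative assumptions, not the paper's design.
import jieba
import torch
from transformers import AutoTokenizer, AutoModel

PUNCT_LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
head = torch.nn.Linear(encoder.config.hidden_size, len(PUNCT_LABELS))

def punctuation_logits(raw_text: str) -> torch.Tensor:
    # Jieba yields word-level units; SentencePiece alone tends to over-fragment Mandarin.
    words = list(jieba.cut(raw_text))
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state  # (1, seq_len, hidden)
    # One punctuation slot per word boundary: score the last subword of each word.
    word_ids = enc.word_ids()
    boundaries = [
        i for i, wid in enumerate(word_ids)
        if wid is not None and (i + 1 == len(word_ids) or word_ids[i + 1] != wid)
    ]
    return head(hidden[0, boundaries])  # (num_words, num_labels)

print(punctuation_logits("你好吗我很好谢谢").shape)
```

In this framing, training reduces to a cross-entropy loss over the per-boundary logits against reference punctuation, which is what makes the task analogous to masked prediction in BERT pre-training.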
Automatic punctuation restoration with BERT models
We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on TED Talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged F1-score of 79.8 in English and 82.2 in Hungarian. Our code is publicly available.
Robust Neural Machine Translation for Clean and Noisy Speech Transcripts
Neural machine translation (NMT) models have been shown to achieve high quality when trained on and fed well-structured, punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is generated by an automatic speech recognition (ASR) system. In this paper, we study how to adapt a strong NMT system to make it robust to typical ASR errors. Since, in our application scenarios, transcripts might be post-edited by human experts, we propose adaptation strategies to train a single system that can translate either clean or noisy input with no supervision on the input type. Our experimental results on a public speech translation dataset show that adapting a model on a significant amount of parallel data including ASR transcripts is beneficial on test data of the same type, but produces a small degradation when translating clean text. Adapting on both clean and noisy variants of the same data leads to the best results on both input types.
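The "adapt on both clean and noisy variants" strategy amounts to pairing the same target sentences with both the clean source text and its ASR transcript before fine-tuning. The sketch below shows only that data-mixing step; the function name and the toy sentences are hypothetical and the actual NMT fine-tuning is out of scope here.

```python
# Illustrative data-mixing sketch (not the paper's pipeline): pair each target sentence
# with both its clean source and its ASR-transcript source, then shuffle into one set
# so a single NMT model is exposed to both input types during adaptation.
import random
from typing import List, Tuple

def build_mixed_corpus(
    clean_src: List[str],
    asr_src: List[str],
    tgt: List[str],
    seed: int = 13,
) -> List[Tuple[str, str]]:
    assert len(clean_src) == len(asr_src) == len(tgt)
    pairs = list(zip(clean_src, tgt)) + list(zip(asr_src, tgt))
    random.Random(seed).shuffle(pairs)
    return pairs

# Hypothetical toy data: the ASR variant lacks casing and punctuation and contains an error.
clean = ["We meet on Tuesday, right?"]
asr   = ["we meat on tuesday right"]
ref   = ["Wir treffen uns am Dienstag, oder?"]
print(build_mixed_corpus(clean, asr, ref))
```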
Challenges of developing a digital scribe to reduce clinical documentation burden.
Clinicians spend a large amount of time on clinical documentation of patient encounters, often impacting quality of care and clinician satisfaction and contributing to physician burnout. Advances in artificial intelligence (AI) and machine learning (ML) open the possibility of automating clinical documentation with digital scribes, using speech recognition to eliminate manual documentation by clinicians or medical scribes. However, developing a digital scribe is fraught with problems due to the complex nature of clinical environments and clinical conversations. This paper identifies and discusses the major challenges associated with developing automated speech-based documentation in clinical settings: recording high-quality audio, converting audio to transcripts using speech recognition, inducing topic structure from conversation data, extracting medical concepts, generating clinically meaningful summaries of conversations, and obtaining clinical data for AI and ML algorithms.
A survey on bias in machine learning research
Current research on bias in machine learning often focuses on fairness while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap with past literature on bias in research by providing a taxonomy of potential sources of bias and errors in data and models. The paper focuses on bias in machine learning pipelines. The survey analyzes over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for detecting and mitigating it, leading to fairer, more transparent, and more accurate ML models.
Biomedical Term Extraction: NLP Techniques in Computational Medicine
Artificial intelligence (AI), and in particular its branch natural language processing (NLP), is a main contributor to recent advances in classifying documents and extracting information across many fields. Medicine has attracted particular attention due to the volume of information generated in professional journals and other channels of communication within the medical profession. Information extraction from technical texts is typically performed with an automatic term recognition extractor. Automatic term recognition (ATR) identifies key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and on the lexical patterns of a given language; in our case, Spanish, Arabic, and Japanese. In this article, we present methods and techniques for creating a biomedical corpus of validated terms, together with several tools for exploiting the information contained in that corpus. We also show how these techniques and tools have been used in a prototype.
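Since the abstract does not detail the extraction method, the following is only a minimal sketch of one common ATR baseline: collecting frequent n-grams whose edges are not stopwords as candidate terms. The Spanish stopword subset, thresholds, and example sentences are illustrative assumptions; a real system would add POS patterns and termhood scores such as C-value.

```python
# Frequency-based candidate term extraction sketch (not the authors' pipeline):
# keep n-grams that neither start nor end with a stopword and occur often enough.
from collections import Counter
from typing import List

STOPWORDS = {"de", "la", "el", "en", "y", "del", "las", "los", "una", "un"}  # assumed tiny subset

def candidate_terms(texts: List[str], max_len: int = 3, min_freq: int = 2) -> List[str]:
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [term for term, freq in counts.most_common() if freq >= min_freq]

docs = [
    "la insuficiencia cardiaca congestiva es frecuente",
    "el tratamiento de la insuficiencia cardiaca congestiva",
]
print(candidate_terms(docs))  # multi-word medical candidates such as "insuficiencia cardiaca congestiva"
```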
Spartan Daily, March 2, 1992
Volume 98, Issue 26. https://scholarworks.sjsu.edu/spartandaily/8239/thumbnail.jp
Understand Legal Documents with Contextualized Large Language Models
The growth of pending legal cases in populous countries, such as India, has
become a major issue. Developing effective techniques to process and understand
legal documents is extremely useful in resolving this problem. In this paper,
we present our systems for SemEval-2023 Task 6: understanding legal texts (Modi
et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model, which considers comprehensive context information at both the intra- and inter-sentence levels to predict rhetorical roles (subtask A), and then train a Legal-LUKE model, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B). Our evaluations demonstrate that our models are more accurate than the baselines, e.g., with up to a 15.0% higher F1-score in subtask B. We also achieved notable performance on the task leaderboard, e.g., a 0.834 micro F1-score, ranking 5th out of 27 teams in subtask A.
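Legal-LUKE itself is not shown in this abstract, so the sketch below only illustrates a generic subtask-B-style setup: a transformer with a token-classification head over a legal entity label scheme. The label set is assumed, and "bert-base-uncased" is a placeholder checkpoint; a legal-domain or entity-aware model would normally be substituted and fine-tuned before use.

```python
# Generic legal NER sketch (not the authors' Legal-LUKE): token classification with
# an assumed BIO label set and a placeholder pretrained checkpoint, untrained head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-COURT", "I-COURT", "B-JUDGE", "I-JUDGE", "B-STATUTE", "I-STATUTE"]  # assumed

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

text = "The appeal was heard by Justice Sharma in the Supreme Court of India."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
# With an untrained head the predicted labels are arbitrary; fine-tuning on subtask-B-style
# annotations is what would make them meaningful.
print(list(zip(tokens, [LABELS[i] for i in pred_ids])))
```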