FF2: A Feature Fusion Two-Stream Framework for Punctuation Restoration
To accomplish punctuation restoration, most existing methods focus on
introducing extra information (e.g., part-of-speech tags) or addressing the
class-imbalance problem. Recently, large-scale transformer-based pre-trained
language models (PLMs) have been widely adopted and have obtained remarkable
success. However, PLMs are trained on large corpora that contain punctuation
marks, which may not transfer well to small, unpunctuated datasets, leading to
suboptimal convergence. In this study, we propose a Feature Fusion two-stream
framework (FF2) to bridge the gap. Specifically, one stream leverages a
pre-trained language model to capture semantic features, while an auxiliary
module captures features of the task at hand. We also modify the computation of
multi-head attention to encourage communication among heads. The two features,
which offer different perspectives, are then aggregated to fuse information and
enhance context awareness. Without additional data, experimental results on the
popular IWSLT benchmark demonstrate that FF2 achieves new state-of-the-art
performance, verifying that our approach is effective.
Comment: 5 pages. arXiv admin note: substantial text overlap with
arXiv:2203.1248
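The core idea of a two-stream framework, aggregating per-token features from two modules before classification, can be sketched in a few lines. This is an illustrative NumPy sketch only: the abstract does not specify FF2's fusion operator, so concatenation followed by a linear projection is an assumption, and all names and dimensions below are invented for the example.

```python
import numpy as np

def fuse_streams(semantic, auxiliary, w):
    """Concatenate per-token features from two streams and project them.

    semantic:  (T, d1) features from the pre-trained language model stream.
    auxiliary: (T, d2) features from the auxiliary module.
    w:         (d1 + d2, d_out) projection into a shared space.
    (Concatenation + projection is an assumed fusion operator, not
    necessarily the one used by FF2.)
    """
    fused = np.concatenate([semantic, auxiliary], axis=-1)  # (T, d1 + d2)
    return fused @ w                                        # (T, d_out)

rng = np.random.default_rng(0)
semantic = rng.standard_normal((5, 8))   # 5 tokens, 8-dim PLM features
auxiliary = rng.standard_normal((5, 4))  # 5 tokens, 4-dim auxiliary features
w = rng.standard_normal((12, 6))         # project 8 + 4 dims down to 6
out = fuse_streams(semantic, auxiliary, w)
```

The fused representation `out` then has one row per token, so a standard token-level classifier over punctuation labels can be applied on top of it.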
Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin
This paper presents the work of restoring punctuation for ASR transcripts
generated by multilingual ASR systems. The focus languages are English,
Mandarin, and Malay which are three of the most popular languages in Singapore.
To the best of our knowledge, this is the first system that can tackle
punctuation restoration for these three languages simultaneously. Traditional
approaches usually treat the task as a sequence labeling task; this work
instead adopts a slot-filling approach that predicts the presence and type of
punctuation mark at each word boundary. The approach is similar to the
masked-language-model objective employed during the pre-training stage of BERT,
but instead of predicting the masked word, our model predicts masked
punctuation. Additionally, we find that using Jieba instead of only the
built-in SentencePiece tokenizer of XLM-R can significantly improve performance
on Mandarin transcripts. Experimental results on the English and Mandarin
IWSLT2022 datasets and Malay News show that the proposed approach achieves
state-of-the-art results for Mandarin with a 73.8% F1-score while maintaining
reasonable F1-scores for English and Malay, i.e., 74.7% and 78%, respectively.
Our source code, which allows reproducing the results and building a simple
web-based demonstration application, is available on GitHub.
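The slot-filling view described above amounts to attaching a punctuation label (or NONE) to every word boundary. A toy Python sketch of that labeling scheme, with illustrative names and a deliberately tiny label set not taken from the paper's code:

```python
# Map punctuation tokens to boundary labels (illustrative subset only).
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def to_slots(tokens):
    """Turn a punctuated token stream into (word, boundary-label) pairs,
    where the label names the punctuation mark that follows the word."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in PUNCT:          # punctuation itself fills a slot, skip it
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        pairs.append((tok, PUNCT.get(nxt, "NONE")))
    return pairs

slots = to_slots(["hello", ",", "how", "are", "you", "?"])
# → [("hello", "COMMA"), ("how", "NONE"), ("are", "NONE"), ("you", "QUESTION")]
```

A model trained on such pairs restores punctuation on unpunctuated ASR output by predicting one label per word boundary, analogous to BERT predicting a masked token.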
Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations
A sentence is typically treated as the minimal syntactic unit used to extract valuable information from long text. However, written Thai has no explicit sentence markers. Some prior works use machine learning, but a deep learning approach has never been employed. We propose a deep learning model for sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data are scarce and annotation is difficult and time-consuming, we also investigate two techniques that allow us to utilize unlabeled data: Cross-View Training (CVT) as a semi-supervised learning technique, and a pre-trained language model (ELMo) to improve word representation. In the experiments, our model reduced the relative error by 7.4% and 18.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. Ablation studies revealed that the main contributing factor was the adoption of n-gram features; further analysis with an interpretation technique indicated that the model utilizes these features in much the same way that humans do.
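The "local representation" above embeds the word n-grams found near a candidate boundary. A minimal sketch of that windowing step, assuming a simple symmetric window (the function name, window shape, and n-gram order are illustrative, not from the paper):

```python
def ngram_window(words, i, n=2):
    """Collect the word n-grams inside a small window around position i,
    a toy stand-in for embedding word groups near a candidate sentence
    boundary. Window bounds are an assumed design choice."""
    lo = max(0, i - n + 1)
    hi = min(len(words), i + n)
    span = words[lo:hi]
    return [tuple(span[j:j + n]) for j in range(len(span) - n + 1)]

grams = ngram_window(["the", "cat", "sat", "down"], 2)
# bigrams overlapping position 2: [("cat", "sat"), ("sat", "down")]
```

In the full model each such n-gram would be looked up in an embedding table and combined with the distant, self-attention-based representation before the boundary classifier.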