23 research outputs found
Revisiting Pre-Trained Models for Chinese Natural Language Processing
Bidirectional Encoder Representations from Transformers (BERT) has shown
remarkable improvements across various NLP tasks, and successive variants have
been proposed to further improve the performance of pre-trained language
models. In this paper, we revisit Chinese pre-trained language models to
examine their effectiveness in a non-English language and release the Chinese
pre-trained language model series to the community. We also propose a
simple but effective model called MacBERT, which improves upon RoBERTa in
several ways, especially the masking strategy that adopts MLM as correction
(Mac). We carried out extensive experiments on eight Chinese NLP tasks to
revisit the existing pre-trained language models as well as the proposed
MacBERT. Experimental results show that MacBERT achieves state-of-the-art
performance on many NLP tasks, and we also ablate details with several
findings that may help future research. Resources available:
https://github.com/ymcui/MacBERT
Comment: 12 pages, to appear at Findings of EMNLP 202
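The "MLM as correction" masking strategy can be sketched roughly as follows: instead of hiding tokens behind a [MASK] symbol, similar words are swapped in and the model learns to correct them back. This is a minimal illustrative sketch, not the paper's code; the SYNONYMS table is a toy stand-in for the word-embedding-based similar-word lookup MacBERT uses.

```python
import random

# Toy similar-word table, standing in for the embedding-based
# similar-word lookup described for MacBERT (illustrative only).
SYNONYMS = {"good": "great", "fast": "quick", "city": "town"}

def mac_mask(tokens, mask_prob=0.15, rng=None):
    """Corrupt ~mask_prob of tokens by swapping in a similar word.

    Unlike vanilla MLM there is no [MASK] symbol: the training target
    at a corrupted position is the original token, so the model learns
    to *correct* the swapped words.
    """
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob and tok in SYNONYMS:
            corrupted.append(SYNONYMS[tok])   # similar-word replacement
            labels.append(tok)                # target = original token
        else:
            corrupted.append(tok)
            labels.append(None)               # position not predicted
    return corrupted, labels
```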
CBEAF-Adapting: Enhanced Continual Pretraining for Building Chinese Biomedical Language Model
Continual pretraining is a standard way of building a domain-specific
pretrained language model from a general-domain language model. However,
sequential task training may cause catastrophic forgetting, which degrades
model performance on downstream tasks. In this paper, we propose a continual
pretraining method for BERT-based models, named CBEAF-Adapting (Chinese
Biomedical Enhanced Attention-FFN Adapting). Its main idea is to introduce a
small number of attention heads and hidden units inside each self-attention
layer and feed-forward network. Using the Chinese biomedical domain as a
running example, we trained a domain-specific language model named
CBEAF-RoBERTa. We conduct experiments by applying the models to downstream
tasks. The results demonstrate that, with only about 3% of model parameters
trained, our method achieves average performance gains of about 0.5% and 2%
over the best-performing baseline model and the domain-specific model
PCL-MedBERT, respectively. We also examine the forgetting problem of
different pretraining methods; our method alleviates the problem by about 13%
compared to fine-tuning.
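The parameter-efficient idea above, adding a small number of new hidden units next to frozen pretrained weights so that only the new units train, can be illustrated with a rough NumPy sketch. Class, dimension, and weight names here are my own choices for illustration, not the paper's.

```python
import numpy as np

class ExpandedFFN:
    """A frozen pretrained feed-forward layer plus a handful of new
    hidden units, in the spirit of CBEAF-Adapting (a sketch only).

    In training, only W1_new / W2_new would receive gradients; the
    pretrained weights stay fixed, keeping the trainable fraction
    small (about 3% of parameters in the paper's setting).
    """
    def __init__(self, d_model=8, d_ff=32, d_new=2, seed=0):
        rng = np.random.default_rng(seed)
        # pretrained (frozen) weights
        self.W1 = rng.normal(size=(d_model, d_ff))
        self.W2 = rng.normal(size=(d_ff, d_model))
        # newly introduced trainable units, initialised near zero
        self.W1_new = rng.normal(size=(d_model, d_new)) * 0.01
        self.W2_new = rng.normal(size=(d_new, d_model)) * 0.01

    def __call__(self, x):
        relu = lambda h: np.maximum(h, 0.0)
        frozen = relu(x @ self.W1) @ self.W2
        added = relu(x @ self.W1_new) @ self.W2_new
        return frozen + added  # new units contribute additively

    def trainable_fraction(self):
        new = self.W1_new.size + self.W2_new.size
        total = new + self.W1.size + self.W2.size
        return new / total
```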
Adapting Machine Learning Techniques for Developing Automatic Q&A Interaction Module for Translation Robots based on NLP
This work develops an automatic question-and-answer (Q&A) interaction module for computer-based translation robots. The goal of the research is to enhance the capability of translation robots to perform more human-like interactions with users, particularly in terms of providing more efficient and accurate translations. This paper proposes Conditional Random Field Discriminative Analysis (CRFDA), a feature-extraction method for equipping translation robots with Q&A. The proposed CRFDA model combines discriminative analysis with the CRF model; the CRF model uses a bidirectional classifier to estimate the feature vector. Finally, classification is performed with a voting-based classification model over the extracted features. The performance of the CRFDA model is examined on named entities (NEs) in the TempVal 1 & 2 dataset. Extraction is based on strict and relaxed feature matching, covering exact matches and slight variations respectively. The simulation analysis shows that the proposed CRFDA model achieves a classification accuracy of 91%, significantly higher than state-of-the-art techniques.
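The voting-based classification step mentioned in the abstract can be sketched as a simple per-token majority vote over the outputs of several base classifiers. This is one plausible reading of "voting-based classification", offered as an illustration rather than the paper's actual procedure.

```python
from collections import Counter

def vote(predictions):
    """Per-token majority vote over base-classifier label sequences.

    predictions: list of label sequences, one per base classifier,
    all of equal length. Ties are broken in favour of the label seen
    first, which is Counter.most_common's insertion-order behaviour.
    """
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]
```

For example, three sequence labellers disagreeing on one token would resolve to whichever label two of them agree on.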
A Survey on Awesome Korean NLP Datasets
English-based datasets are commonly available from Kaggle, GitHub, or
recently published papers. Although benchmark tests on English datasets are
sufficient to show off the performance of new models and methods, researchers
still need to train and validate models on Korean-based datasets to produce a
technology or product suitable for Korean processing. This paper introduces
15 popular Korean-based NLP datasets with summarized details such as volume,
license, repositories, and other research results inspired by the datasets. I
also provide high-resolution instructions with samples or statistics of the
datasets. The main characteristics of the datasets are presented in a single
table to provide researchers with a rapid summary.
Comment: 11 pages, 1 horizontal page for large tabl
Research on Multilingual News Clustering Based on Cross-Language Word Embeddings
Classifying the same event reported by different countries is of significant
importance for public opinion control and intelligence gathering. Due to the
diverse types of news, relying solely on translators would be costly and
inefficient, while depending solely on translation systems would incur
considerable performance overheads in invoking translation interfaces and
storing translated texts. To address this issue, we mainly focus on the
clustering problem of cross-lingual news. To be specific, we use a combination
of sentence vector representations of news headlines in a mixed semantic space
and the topic probability distributions of news content to represent a news
article. In the training of cross-lingual models, we employ knowledge
distillation techniques to fit two semantic spaces into a mixed semantic space.
We abandon traditional static clustering methods like K-Means and AGNES in
favor of the incremental clustering algorithm Single-Pass, which we further
modify to better suit cross-lingual news clustering scenarios. Our main
contributions are as follows: (1) We adopt the English standard BERT as the
teacher model and XLM-Roberta as the student model, training a cross-lingual
model through knowledge distillation that can represent sentence-level
bilingual texts in both Chinese and English. (2) We use the LDA topic model to
represent news as a combination of cross-lingual vectors for headlines and
topic probability distributions for content, introducing concepts such as
topic similarity to address the cross-lingual issue in news content
representation. (3) We adapt the Single-Pass clustering algorithm to the news
context to make it more applicable. Our optimizations of Single-Pass include
adjusting the distance algorithm between samples and clusters, adding
cluster-merging operations, and incorporating a news time parameter.
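The core Single-Pass idea, assigning each incoming document to the nearest existing cluster if similarity clears a threshold and otherwise opening a new cluster, can be sketched as below. This minimal version omits the paper's additions (modified sample-to-cluster distance, cluster merging, and the news time parameter); the threshold value is illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold=0.8):
    """Incremental Single-Pass clustering over news vectors (a sketch).

    Each vector is compared against current cluster centroids; it joins
    the most similar cluster if similarity >= threshold, otherwise it
    seeds a new cluster. Centroids are updated as running means.
    """
    centroids, clusters = [], []
    for vec in vectors:
        if centroids:
            sims = [cosine(vec, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:
                clusters[best].append(vec)
                n = len(clusters[best])
                # running-mean centroid update
                centroids[best] = [(c * (n - 1) + x) / n
                                   for c, x in zip(centroids[best], vec)]
                continue
        centroids.append(list(vec))  # open a new cluster
        clusters.append([vec])
    return clusters
```

Because assignment is a single pass over the stream, the algorithm suits incremental news feeds where re-clustering from scratch (as K-Means or AGNES would require) is impractical.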
Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation
Multi-modal recommendation systems, which integrate diverse types of
information, have gained widespread attention in recent years. However,
compared to traditional collaborative filtering-based multi-modal
recommendation systems, research on multi-modal sequential recommendation is
still in its nascent stages. Unlike traditional sequential recommendation
models that solely rely on item identifier (ID) information and focus on
network structure design, multi-modal recommendation models need to emphasize
item representation learning and the fusion of heterogeneous data sources. This
paper investigates the impact of item representation learning on downstream
recommendation tasks and examines the disparities in information fusion at
different stages. Empirical experiments are conducted to demonstrate the need
to design a framework suitable for collaborative learning and fusion of diverse
information. Based on this, we propose a new model-agnostic framework for
multi-modal sequential recommendation tasks, called Online
Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature
interaction and mutual learning among multi-source input (ID, text, and image),
while avoiding conflicts among different features during training, thereby
improving recommendation accuracy. To be specific, we first introduce an
ID-aware Multi-modal Transformer module in the item representation learning
stage to facilitate information interaction among different features. Secondly,
we employ an online distillation training strategy in the prediction
optimization stage to make multi-source data learn from each other and improve
prediction robustness. Experimental results on a video content recommendation
dataset and three e-commerce recommendation datasets demonstrate the
effectiveness of the two proposed modules, yielding an approximately 10%
performance improvement over baseline models.
Comment: 11 pages, 7 figure
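Online distillation among the ID, text, and image branches is commonly realised by having each branch match a soft ensemble of the others' predictions. The sketch below shows this generic pattern with a mean-probability teacher and a KL term per branch; it is an assumption about the mechanism, not the exact ODMT objective.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def online_distillation_losses(branch_logits):
    """KL(ensemble || branch) for each modality branch (a sketch).

    branch_logits: list of (batch, num_items) score arrays, e.g. from
    the ID, text, and image branches. Each branch is pulled toward the
    averaged soft predictions of all branches, so the branches learn
    from each other online rather than from a fixed teacher.
    """
    probs = [softmax(z) for z in branch_logits]
    teacher = np.mean(probs, axis=0)          # soft ensemble teacher
    eps = 1e-9                                # numerical safety
    losses = []
    for p in probs:
        kl = np.sum(teacher * (np.log(teacher + eps) - np.log(p + eps)),
                    axis=-1)
        losses.append(float(kl.mean()))
    return losses
```

When all branches already agree, every distillation loss is zero; disagreement produces a nonzero pull toward the consensus, which is what mitigates conflicts among heterogeneous features during training.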
MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning
Over the past few decades, multimodal emotion recognition has made remarkable
progress with the development of deep learning. However, existing technologies
struggle to meet the demands of practical applications. To improve robustness,
we launch a Multimodal Emotion Recognition Challenge (MER 2023) to
motivate global researchers to build innovative technologies that can further
accelerate and foster research. For this year's challenge, we present three
distinct sub-challenges: (1) MER-MULTI, in which participants recognize both
discrete and dimensional emotions; (2) MER-NOISE, in which noise is added to
test videos for modality robustness evaluation; (3) MER-SEMI, which provides
large amounts of unlabeled samples for semi-supervised learning. In this paper,
we test a variety of multimodal features and provide a competitive baseline for
each sub-challenge. Our system achieves an F1 score of 77.57% and a mean
squared error (MSE) of 0.82 for MER-MULTI, an F1 score of 69.82% and an MSE
of 1.12 for MER-NOISE, and an F1 score of 86.75% for MER-SEMI. Baseline code
is available at https://github.com/zeroQiaoba/MER2023-Baseline