1 research outputs found
GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts
In the context of the rapid development of large language models, we have
meticulously trained and introduced the GujiBERT and GujiGPT language models,
which are foundational models specifically designed for intelligent information
processing of ancient texts. These models have been trained on an extensive
dataset that encompasses both simplified and traditional Chinese characters,
allowing them to effectively handle various natural language processing tasks
related to ancient books, including but not limited to automatic sentence
segmentation, punctuation, word segmentation, part-of-speech tagging, entity
recognition, and automatic translation. Notably, these models have exhibited
exceptional performance across a range of validation tasks using publicly
available datasets. Our research findings highlight the efficacy of employing
self-supervised methods to further train the models using classical text
corpora, thus enhancing their capability to tackle downstream tasks. Moreover,
it is worth emphasizing that the choice of font, the scale of the corpus, and
the initial model selection all exert significant influence over the ultimate
experimental outcomes. To cater to the diverse text processing preferences of
researchers in digital humanities and linguistics, we have developed three
distinct categories comprising a total of nine model variations. We believe
that by sharing these foundational language models specialized in the domain of
ancient texts, we can facilitate the intelligent processing and scholarly
exploration of ancient literary works and, consequently, contribute to the
global dissemination of China's rich and esteemed traditional culture in this
new era.Comment: 22pages,0 figur