1,443 research outputs found
GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts
In the context of the rapid development of large language models, we have
meticulously trained and introduced the GujiBERT and GujiGPT language models,
which are foundational models specifically designed for intelligent information
processing of ancient texts. These models have been trained on an extensive
dataset that encompasses both simplified and traditional Chinese characters,
allowing them to effectively handle various natural language processing tasks
related to ancient books, including but not limited to automatic sentence
segmentation, punctuation, word segmentation, part-of-speech tagging, entity
recognition, and automatic translation. Notably, these models have exhibited
exceptional performance across a range of validation tasks using publicly
available datasets. Our research findings highlight the efficacy of employing
self-supervised methods to further train the models using classical text
corpora, thus enhancing their capability to tackle downstream tasks. Moreover,
it is worth emphasizing that the choice of font, the scale of the corpus, and
the initial model selection all exert significant influence over the ultimate
experimental outcomes. To cater to the diverse text processing preferences of
researchers in digital humanities and linguistics, we have developed three
distinct categories comprising a total of nine model variations. We believe
that by sharing these foundational language models specialized in the domain of
ancient texts, we can facilitate the intelligent processing and scholarly
exploration of ancient literary works and, consequently, contribute to the
global dissemination of China's rich and esteemed traditional culture in this
new era.Comment: 22pages,0 figur
Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu
With the development of international information technology, we are producing
a huge amount of information all the time. The processing ability of information in
various languages is gradually replacing information and becoming a rarer resource.
How to obtain the most effective information in such a large and complex amount of
multilingual textual information is a major goal of multilingual information
processing.
Multilingual text classification helps users to break the language barrier and
accurately locate the required information and triage information. At the same time,
the rapid development of the Internet has accelerated the communication among users
of various languages, giving rise to a large number of multilingual texts, such as book
and movie reviews, online chats, product introductions and other forms, which
contain a large amount of valuable implicit information and urgently need automated
tools to categorize and process those multilingual texts.
This work describes the Natural Language Process (NLP) sub-task known as
Multilingual Text Classification (MTC) performed within the context of Baidu, a
Chinese leading AI company with a strong Internet base, whose NLP division led the
industry in deep learning technology to go online in Machine Translation (MT) and
search. Multilingual text classification is an important module in NLP machine
translation and a basic module in NLP tasks. It can be applied to many fields, such as
Fake Reviews Detection, News Headlines Categories Classification, Analysis of
positive and negative reviews and so on.
In the following work, we will first define the AI model paradigm of
'pre-training and fine-tuning' in deep learning in the Baidu NLP department. Then
investigated the application scenarios of multilingual text classification. Most of the
text classification systems currently available in the Chinese market are designed for a
single language, such as Alibaba's text classification system. If users need to classify
texts of the same category in multiple languages, they need to train multiple single
text classification systems and then classify them one by one.
However, many internationalized products do not have a single text language,
such as AliExpress cross-border e-commerce business, Airbnb B&B business, etc.
Industry needs to understand and classify users’ reviews in various languages, and
have conducted in-depth statistics and marketing strategy development, and
multilingual text classification is particularly important in this scenario.
Therefore, we focus on interpreting the methodology of multilingual text
classification model of machine translation in Baidu NLP department, and capture
sets of multilingual data of reviews, news headlines and other data for manual
classification and labeling, use the labeling results for fine-tuning of multilingual text
classification model, and output the quality evaluation data of Baidu multilingual text
classification model after fine-tuning. We will discuss if the pre-training and
fine-tuning of the large model can substantially improve the quality and performance
of multilingual text classification.
Finally, based on the machine translation-multilingual text classification model,
we derive the application method of pre-training and fine-tuning paradigm in the
current cutting-edge deep learning AI model under the NLP system and verify the
generality and cutting-edge of the pre-training and fine-tuning paradigm in the deep
learning-intelligent search field.Com o desenvolvimento da tecnologia de informação internacional, estamos
sempre a produzir uma enorme quantidade de informação e o recurso mais escasso já
não é a informação, mas a capacidade de processar informação em cada língua. A
maior parte da informação multilingue é expressa sob a forma de texto. Como obter a
informação mais eficaz numa quantidade tão considerável e complexa de informação
textual multilingue é um dos principais objetivos do processamento de informação
multilingue.
A classificação de texto multilingue ajuda os utilizadores a quebrar a barreira
linguística e a localizar com precisão a informação necessária e a classificá-la. Ao
mesmo tempo, o rápido desenvolvimento da Internet acelerou a comunicação entre
utilizadores de várias línguas, dando origem a um grande número de textos
multilingues, tais como críticas de livros e filmes, chats, introduções de produtos e
outros distintos textos, que contêm uma grande quantidade de informação implícita
valiosa e necessitam urgentemente de ferramentas automatizadas para categorizar e
processar esses textos multilingues.
Este trabalho descreve a subtarefa do Processamento de Linguagem Natural
(PNL) conhecida como Classificação de Texto Multilingue (MTC), realizada no
contexto da Baidu, uma empresa chinesa líder em IA, cuja equipa de PNL levou a
indústria em tecnologia baseada em aprendizagem neuronal a destacar-se em
Tradução Automática (MT) e pesquisa científica. A classificação multilingue de
textos é um módulo importante na tradução automática de PNL e um módulo básico
em tarefas de PNL. A MTC pode ser aplicada a muitos campos, tais como análise de
sentimentos multilingues, categorização de notícias, filtragem de conteúdos
indesejados (do inglês spam), entre outros.
Neste trabalho, iremos primeiro definir o paradigma do modelo AI de 'pré-treino
e afinação' em aprendizagem profunda no departamento de PNL da Baidu. Em
seguida, realizaremos a pesquisa sobre outros produtos no mercado com capacidade
de classificação de texto — a classificação de texto levada a cabo pela Alibaba. Após
a pesquisa, verificamos que a maioria dos sistemas de classificação de texto
atualmente disponíveis no mercado chinês são concebidos para uma única língua, tal como o sistema de classificação de texto Alibaba. Se os utilizadores precisarem de
classificar textos da mesma categoria em várias línguas, precisam de aplicar vários
sistemas de classificação de texto para cada língua e depois classificá-los um a um.
No entanto, muitos produtos internacionalizados não têm uma única língua de
texto, tais como AliExpress comércio eletrónico transfronteiriço, Airbnb B&B
business, etc. A indústria precisa compreender e classificar as revisões dos
utilizadores em várias línguas. Esta necessidade conduziu a um desenvolvimento
aprofundado de estatísticas e estratégias de marketing, e a classificação de textos
multilingues é particularmente importante neste cenário.
Desta forma, concentrar-nos-emos na interpretação da metodologia do modelo
de classificação de texto multilingue da tradução automática no departamento de PNL
Baidu. Colhemos para o efeito conjuntos de dados multilingues de comentários e
críticas, manchetes de notícias e outros dados para classificação manual, utilizamos os
resultados dessa classificação para o aperfeiçoamento do modelo de classificação de
texto multilingue e produzimos os dados de avaliação da qualidade do modelo de
classificação de texto multilingue da Baidu. Discutiremos se o pré-treino e o
aperfeiçoamento do modelo podem melhorar substancialmente a qualidade e o
desempenho da classificação de texto multilingue. Finalmente, com base no modelo
de classificação de texto multilingue de tradução automática, derivamos o método de
aplicação do paradigma de pré-formação e afinação no atual modelo de IA de
aprendizagem profunda de ponta sob o sistema de PNL, e verificamos a robustez e os
resultados positivos do paradigma de pré-treino e afinação no campo de pesquisa de
aprendizagem profunda
Natural language understanding: instructions for (Present and Future) use
In this paper I look at Natural Language Understanding, an area of Natural Language Processing aimed at making sense of text, through the lens of a visionary future: what do we expect a machine should be able to understand? and what are the key dimensions that require the attention of researchers to make this dream come true
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling
Modeling discourse -- the linguistic phenomena that go beyond individual
sentences, is a fundamental yet challenging aspect of natural language
processing (NLP). However, existing evaluation benchmarks primarily focus on
the evaluation of inter-sentence properties and overlook critical discourse
phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a
benchmark that can evaluate intra-sentence discourse properties across a
diverse set of NLP tasks, covering understanding, translation, and generation.
Disco-Bench consists of 9 document-level testsets in the literature domain,
which contain rich discourse phenomena (e.g. cohesion and coherence) in Chinese
and/or English. For linguistic analysis, we also design a diagnostic test suite
that can examine whether the target models learn discourse knowledge. We
totally evaluate 20 general-, in-domain and commercial models based on
Transformer, advanced pretraining architectures and large language models
(LLMs). Our results show (1) the challenge and necessity of our evaluation
benchmark; (2) fine-grained pretraining based on literary document-level
training data consistently improves the modeling of discourse information. We
will release the datasets, pretrained models, and leaderboard, which we hope
can significantly facilitate research in this field:
https://github.com/longyuewangdcu/Disco-Bench.Comment: Zhaopeng Tu is the corresponding autho
Can humain association norm evaluate latent semantic analysis?
This paper presents the comparison of word association norm created by a psycholinguistic experiment to association lists generated by algorithms operating on text corpora. We compare lists generated by Church and Hanks algorithm and lists generated by LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations
Teaching Both Simplified and Traditional Characters to Learners of Chinese as L2
In the acquisition of Chinese as a second language, learning Chinese characters is an essential part of the learning process. There are various approaches to how and when Chinese characters should be introduced. Some scholars claim simplified characters should be given priority, while others promote teaching traditional characters first. Yet to some, teaching traditional and simplified characters simultaneously is preferable, while others believe that learning Hanyu Pinyin alone reduces the learning load and advocate the idea that learning characters be postponed to the later stages. This study discusses reading priorities for students of L2 Chinese in an environment that promotes balanced teaching of traditional and simplified characters from scratch. Results show that most learners prefer simplified characters though the number of students who equally acquired both characters in a set is also high. Fewer students prefer traditional characters, whereas texts in Hanyu Pinyin were not students’ preferred choice. Moreover, learners’ text comprehensions were better and more accurate when texts were written in characters
Phraseology in Corpus-Based Translation Studies: A Stylistic Study of Two Contemporary Chinese Translations of Cervantes's Don Quijote
The present work sets out to investigate the stylistic profiles of two modern Chinese versions of
Cervantes’s Don Quijote (I): by Yang Jiang (1978), the first direct translation from Castilian to Chinese,
and by Liu Jingsheng (1995), which is one of the most commercially successful versions of the
Castilian literary classic. This thesis focuses on a detailed linguistic analysis carried out with the help
of the latest textual analytical tools, natural language processing applications and statistical packages.
The type of linguistic phenomenon singled out for study is four-character expressions (FCEXs), which
are a very typical category of Chinese phraseology. The work opens with the creation of a descriptive
framework for the annotation of linguistic data extracted from the parallel corpus of Don Quijote.
Subsequently, the classified and extracted data are put through several statistical tests. The results of
these tests prove to be very revealing regarding the different use of FCEXs in the two Chinese
translations. The computational modelling of the linguistic data would seem to indicate that among
other findings, while Liu’s use of archaic idioms has followed the general patterns of the original and
also of Yang’s work in the first half of Don Quijote I, noticeable variations begin to emerge in the
second half of Liu’s more recent version. Such an idiosyncratic use of archaisms by Liu, which may be
defined as style shifting or style variation, is then analyzed in quantitative terms through the application
of the proposed context-motivated theory (CMT). The results of applying the CMT-derived statistical
models show that the detected stylistic variation may well point to the internal consistency of the
translator in rendering the second half of Part I of the novel, which reflects his freer, more creative and
experimental style of translation. Through the introduction and testing of quantitative research methods
adapted from corpus linguistics and textual statistics, this thesis has made a major contribution to
methodological innovation in the study of style within the context of corpus-based translation studies
Phraseology in Corpus-based transaltion studies : stylistic study of two contempoarary Chinese translation of Cervantes's Don Quijote
The present work sets out to investigate the stylistic profiles of two modern Chinese versions of Cervantes???s Don Quijote (I): by Yang Jiang (1978), the first direct translation from Castilian to Chinese, and by Liu Jingsheng (1995), which is one of the most commercially successful versions of the Castilian literary classic. This thesis focuses on a detailed linguistic analysis carried out with the help of the latest textual analytical tools, natural language processing applications and statistical packages. The type of linguistic phenomenon singled out for study is four-character expressions (FCEXs), which are a very typical category of Chinese phraseology. The work opens with the creation of a descriptive framework for the annotation of linguistic data extracted from the parallel corpus of Don Quijote. Subsequently, the classified and extracted data are put through several statistical tests. The results of these tests prove to be very revealing regarding the different use of FCEXs in the two Chinese translations. The computational modelling of the linguistic data would seem to indicate that among other findings, while Liu???s use of archaic idioms has followed the general patterns of the original and also of Yang???s work in the first half of Don Quijote I, noticeable variations begin to emerge in the second half of Liu???s more recent version. Such an idiosyncratic use of archaisms by Liu, which may be defined as style shifting or style variation, is then analyzed in quantitative terms through the application of the proposed context-motivated theory (CMT). The results of applying the CMT-derived statistical models show that the detected stylistic variation may well point to the internal consistency of the translator in rendering the second half of Part I of the novel, which reflects his freer, more creative and experimental style of translation. Through the introduction and testing of quantitative research methods adapted from corpus linguistics and textual statistics, this thesis has made a major contribution to methodological innovation in the study of style within the context of corpus-based translation studies.Imperial Users onl
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges
- …