106 research outputs found
Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets
The availability of different pre-trained semantic models has enabled the quick
development of machine learning components for downstream applications. Despite
the availability of abundant text data for low-resource languages, only a few
semantic models are publicly available, and the publicly available pre-trained
models are usually multilingual versions that cannot fit each language well due
to context variations. In this work, we introduce different semantic models for
Amharic. After experimenting with the existing pre-trained semantic models, we
train and fine-tune nine new models on a monolingual text corpus. The models
are built using word2vec embeddings, a distributional thesaurus (DT),
contextual embeddings, and DT embeddings obtained via network embedding
algorithms. Moreover, we employ these models for different NLP tasks and
investigate their impact. We find that the newly trained models perform better
than the pre-trained multilingual models. Furthermore, models based on
contextual embeddings from RoBERTa perform better than the word2vec models.
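No implementation details are given in the abstract; as a rough illustration of one of the nine model types, the sketch below trains a word2vec embedding model on a monolingual corpus with gensim. The corpus file name, the example query word, and all hyperparameters are our own assumptions, not the authors' setup.

```python
# Minimal sketch: train a word2vec model on a monolingual Amharic corpus.
# File name and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# One whitespace-tokenized sentence per line; amharic_corpus.txt is hypothetical.
with open("amharic_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimensionality
    window=5,         # context window size
    min_count=5,      # ignore rare tokens
    sg=1,             # skip-gram variant
    workers=4,
)
model.save("amharic_word2vec.model")

# Nearest neighbours in the embedding space can then feed downstream tasks.
print(model.wv.most_similar("ኢትዮጵያ", topn=5))  # assumes the word is in vocabulary
```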
COOL, a Context Outlooker, and its Application to Question Answering and other Natural Language Processing Tasks
Vision outlookers improve the performance of vision transformers, which
implement a self-attention mechanism, by adding outlook attention, a form of
local attention.
In natural language processing, as has been the case in computer vision and
other domains, transformer-based models constitute the state of the art for
most processing tasks. In this domain, too, many authors have argued for and
demonstrated the importance of local context.
We present and evaluate an outlook attention mechanism, COOL, for natural
language processing. COOL adds, on top of the self-attention layers of a
transformer-based model, outlook attention layers that encode local syntactic
context based on word proximity and consider more pairwise constraints than the
dynamic convolution operations used by existing approaches.
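COOL's code is not reproduced here; the sketch below shows what a single-head, 1-D analogue of outlook attention might look like in PyTorch, adapted from the vision formulation. The class name, the fixed window size, and the omission of multi-head splitting are our simplifications. The defining trait is that each token generates its local attention map directly from its own embedding, rather than through query-key dot products.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention1D(nn.Module):
    """Single-head 1-D outlook attention sketch over token sequences."""

    def __init__(self, dim, window=3):
        super().__init__()
        self.window = window
        self.v = nn.Linear(dim, dim)
        # Each position predicts a full window x window attention map.
        self.attn = nn.Linear(dim, window * window)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (B, N, D)
        B, N, D = x.shape
        k, pad = self.window, self.window // 2
        v = self.v(x).transpose(1, 2).unsqueeze(-1)           # (B, D, N, 1)
        # Gather each position's local window of values: (B, D*k, N)
        v = F.unfold(v, (k, 1), padding=(pad, 0))
        v = v.reshape(B, D, k, N).permute(0, 3, 2, 1)         # (B, N, k, D)
        # Attention map generated directly from the token embedding.
        a = self.attn(x).reshape(B, N, k, k).softmax(dim=-1)  # (B, N, k, k)
        out = (a @ v).permute(0, 3, 2, 1).reshape(B, D * k, N)
        # Fold overlapping windows back onto the sequence, summing overlaps.
        out = F.fold(out, (N, 1), (k, 1), padding=(pad, 0))   # (B, D, N, 1)
        return self.proj(out.squeeze(-1).transpose(1, 2))     # (B, N, D)

x = torch.randn(2, 10, 64)
y = OutlookAttention1D(64)(x)  # same shape as x
```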
A comparative empirical evaluation of an implementation of COOL with different
transformer-based approaches confirms improvements over baselines that use the
neural language models alone on various natural language processing tasks,
including question answering. The proposed approach is competitive with
state-of-the-art methods.
AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages
There are over 7,000 languages spoken on Earth, but many of these languages suffer from a dearth of natural language processing (NLP) tools. Multilingual pretrained language models have been introduced to help alleviate this problem. However, even the largest pretrained multilingual models were trained on only hundreds of languages, a small number compared to the number of spoken languages. While these models have displayed impressive performance on several languages, including those they were not pretrained on, there is a lot of ground to be covered.
Many languages are left out because pretrained language models are assumed to require large amounts of training data, which these languages do not have. Furthermore, a major motivation behind these models is that such lower-resource languages benefit from joint training with higher-resource languages. In this thesis, we challenge both of these assumptions and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than one gigabyte of text data containing a selection of African languages.
Our model, named AfriBERTa, covers 11 African languages, including the first language model for 4 of these languages. We evaluate this model on named entity recognition and text classification spanning 10 languages. Our evaluation results show that our model is very competitive with larger multilingual models (multilingual BERT and XLM-RoBERTa) on several languages. The results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Furthermore, we present a comprehensive discussion of the implications of our findings.
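As a usage illustration, the snippet below loads an AfriBERTa checkpoint for sequence classification with Hugging Face transformers. The checkpoint name "castorini/afriberta_base" reflects the published releases as we understand them; verify the exact identifiers on the model hub. The label count and example text are placeholders.

```python
# Hedged sketch: fine-tuning-ready AfriBERTa for text classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_base")
model = AutoModelForSequenceClassification.from_pretrained(
    "castorini/afriberta_base",
    num_labels=3,  # e.g., a 3-way news topic classification task
)

inputs = tokenizer("Ẹ kú àárọ̀", return_tensors="pt")  # Yoruba example text
logits = model(**inputs).logits  # classification scores before fine-tuning
```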
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
The recently proposed massively multilingual neural machine translation (NMT)
system has been shown to be capable of translating over 100 languages to and
from English within a single model. Its improved translation performance on
low-resource languages hints at potential cross-lingual transfer capability for
downstream tasks. In this paper, we evaluate the cross-lingual effectiveness of
representations from the encoder of a massively multilingual NMT model on 5
downstream classification and sequence labeling tasks covering a diverse set of
over 50 languages. We compare against a strong baseline, multilingual BERT
(mBERT), in different cross-lingual transfer learning scenarios and show gains
in zero-shot transfer in 4 out of these 5 tasks.
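The massively multilingual NMT system evaluated in the paper is not a public artifact; the sketch below illustrates the general recipe with a public many-to-many NMT model (facebook/m2m100_418M) as a stand-in: freeze the encoder and train a light classifier on mean-pooled encoder representations, then apply it to other languages zero-shot.

```python
# Hedged sketch: probe a multilingual NMT encoder for classification.
import torch
from transformers import M2M100Model, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
encoder = M2M100Model.from_pretrained("facebook/m2m100_418M").get_encoder()
encoder.requires_grad_(False)  # encoder stays frozen for zero-shot transfer

classifier = torch.nn.Linear(encoder.config.d_model, 2)  # e.g., sentiment

def embed(texts, lang):
    tokenizer.src_lang = lang
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    states = encoder(**batch).last_hidden_state       # (B, T, d_model)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (states * mask).sum(1) / mask.sum(1)       # mean pooling

# Train the classifier on English data, then reuse it unchanged elsewhere.
logits = classifier(embed(["great movie!"], lang="en"))
```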
Language Modelling with Pixels
Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches, instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT,
further confirming the benefits of modelling language with pixels.
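The authors' rendering pipeline is not reproduced here; the sketch below only illustrates the core idea: render a string to a grayscale image and cut it into fixed-size patches of the kind PIXEL learns to reconstruct. The font file and image geometry are illustrative assumptions.

```python
# Minimal sketch of text-as-pixels input preparation (not PIXEL's actual code).
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_text(text, height=16, width=528):
    img = Image.new("L", (width, height), color=255)    # white canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("GoNotoCurrent.ttf", 12)  # any Unicode-wide font
    draw.text((2, 2), text, fill=0, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0

def to_patches(img, patch=16):
    h, w = img.shape
    return img.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)

patches = to_patches(render_text("ሰላም ዓለም"))  # any script the font covers
# Pretraining masks a subset of these patches and trains the model to
# reconstruct their pixels, instead of predicting a distribution over tokens.
```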
Entity centric neural models for natural language processing
This thesis explores how to enhance natural language understanding by incorporating entity information into neural network models. It tackles three key questions:
1. Leveraging entities for understanding tasks: This work introduces Entity-GCN, a model that performs multi-step reasoning on a graph where nodes represent entity mentions and edges represent relationships. This method achieved state-of-the-art results on a multi-document question-answering dataset.
2. Identifying and disambiguating entities using large language models: This research proposes a novel system that retrieves entities by generating their names token-by-token, overcoming limitations of traditional methods and significantly reducing the memory footprint. This approach is also extended to a multilingual setting and further optimized for efficiency.
3. Interpreting and controlling entity knowledge within models: This thesis presents a post-hoc interpretation technique to analyze how decisions are made across layers in neural models, allowing for visualization and analysis of knowledge representation. Additionally, a method for editing factual knowledge about entities is proposed, enabling correction of model predictions without costly retraining.
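As an illustration of the second idea, the toy sketch below shows how a prefix trie restricts token-by-token generation to valid entity names, the mechanism that lets a sequence-to-sequence model retrieve entities without storing a dense vector per entity. The whitespace "tokenizer" and the tiny entity set are ours, not the thesis code.

```python
# Toy prefix-trie constraint for generating entity names token-by-token.
def build_trie(names):
    trie = {}
    for name in names:
        node = trie
        for tok in name.split() + ["<eos>"]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix):
    node = trie
    for tok in prefix:
        node = node[tok]
    return list(node)  # tokens that keep the prefix a valid entity name

trie = build_trie(["Barack Obama", "Barack Obama Sr.", "Michelle Obama"])
print(allowed_next(trie, []))                   # ['Barack', 'Michelle']
print(allowed_next(trie, ["Barack", "Obama"]))  # ['<eos>', 'Sr.']
```

At each decoding step, the model's vocabulary is masked down to `allowed_next` tokens, so beam search can only ever emit names from the entity set.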
Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu
With the development of international information technology, we produce
enormous amounts of information all the time. The scarce resource is no longer
information itself but the ability to process information in each language.
How to extract the most useful information from such a large and complex body
of multilingual text is a major goal of multilingual information processing.
Multilingual text classification helps users break the language barrier,
accurately locate the information they need, and triage it. At the same time,
the rapid development of the Internet has accelerated communication among users
of different languages, giving rise to large volumes of multilingual text, such
as book and movie reviews, online chats, and product descriptions. These texts
contain a large amount of valuable implicit information and urgently need
automated tools to categorize and process them.
This work describes the Natural Language Processing (NLP) sub-task known as
Multilingual Text Classification (MTC), carried out in the context of Baidu, a
leading Chinese AI company with a strong Internet foundation, whose NLP
division led the industry in bringing deep learning technology online for
Machine Translation (MT) and search. Multilingual text classification is an
important module in NLP machine translation and a basic module in NLP tasks. It
can be applied to many fields, such as fake review detection, news headline
classification, and the analysis of positive and negative reviews.
In the following work, we first define the 'pre-training and fine-tuning'
paradigm for deep learning AI models as used in the Baidu NLP department, and
then investigate the application scenarios of multilingual text classification.
Most of the text classification systems currently available in the Chinese
market are designed for a single language, such as Alibaba's text
classification system. If users need to classify texts of the same category in
multiple languages, they must train multiple single-language text
classification systems and then classify the texts one by one.
However, many internationalized products, such as the AliExpress cross-border
e-commerce business and the Airbnb lodging business, do not deal with a single
text language. Industry needs to understand and classify user reviews in
various languages to support in-depth statistics and marketing strategy
development, and multilingual text classification is particularly important in
this scenario.
Therefore, we focus on interpreting the methodology of the multilingual text
classification model used for machine translation in the Baidu NLP department.
We collect multilingual datasets of reviews, news headlines, and other texts
for manual classification and labeling, use the labeled results to fine-tune
the multilingual text classification model, and report quality evaluation data
for the fine-tuned Baidu multilingual text classification model. We discuss
whether pre-training and fine-tuning a large model can substantially improve
the quality and performance of multilingual text classification.
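Baidu's in-house models and tooling are not public; as a stand-in, the sketch below shows the generic pre-train/fine-tune recipe with a public multilingual checkpoint (xlm-roberta-base) and the Hugging Face Trainer. The CSV file and label scheme are placeholders for the manually labeled review and headline data described above.

```python
# Hedged sketch: fine-tune a multilingual checkpoint for text classification.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # e.g., positive/negative reviews

# "reviews.csv" with "text" and "label" columns is a placeholder dataset.
data = load_dataset("csv", data_files="reviews.csv")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mtc-finetuned",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=data,
    tokenizer=tokenizer,  # enables dynamic padding during batching
)
trainer.train()  # one fine-tuned model then classifies all languages
```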
Finally, based on the machine translation multilingual text classification
model, we derive how the pre-training and fine-tuning paradigm applies to
current cutting-edge deep learning AI models within the NLP system, and we
verify the generality and state of the art of this paradigm in the field of
deep learning and intelligent search.