Machine Learning with World Knowledge: The Position and Survey
Machine learning has become pervasive in multiple domains, impacting a wide
variety of applications, such as knowledge discovery and data mining, natural
language processing, information retrieval, computer vision, social and health
informatics, ubiquitous computing, etc. Two essential problems of machine
learning are how to generate features and how to acquire labels for machines to
learn. In particular, labeling large amounts of data for each domain-specific
problem can be very time-consuming and costly, and this has become a key
obstacle to making learning protocols realistic in applications. In this paper, we will
discuss how to use the existing general-purpose world knowledge to enhance
machine learning processes, by enriching the features or reducing the labeling
work. We start from the comparison of world knowledge with domain-specific
knowledge, and then introduce three key problems in using world knowledge in
learning processes, i.e., explicit and implicit feature representation,
inference for knowledge linking and disambiguation, and learning with direct or
indirect supervision. Finally, we discuss future directions for this research
topic.
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios
Deep neural networks and huge language models are becoming omnipresent in
natural language applications. As they are known for requiring large amounts of
training data, there is a growing body of work to improve the performance in
low-resource settings. Motivated by the recent fundamental changes towards
neural models and the popular pre-train and fine-tune paradigm, we survey
promising approaches for low-resource natural language processing. After a
discussion about the different dimensions of data availability, we give a
structured overview of methods that enable learning when training data is
sparse. This includes mechanisms to create additional labeled data like data
augmentation and distant supervision as well as transfer learning settings that
reduce the need for target supervision. A goal of our survey is to explain how
these methods differ in their requirements, as understanding them is essential
for choosing a technique suited to a specific low-resource setting. Further
key aspects of this work are to highlight open issues and to outline promising
directions for future research.
Comment: Accepted at NAACL 202
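One of the surveyed mechanisms for creating additional labeled data, distant supervision, can be illustrated concretely: knowledge-base facts are matched against raw text, and any sentence containing both entities of a fact is labeled with that fact's relation. The sketch below is a minimal toy version; the knowledge base, sentences, and relation names are invented for illustration.

```python
# Minimal distant-supervision sketch: label a sentence with a relation
# whenever both entities of a known knowledge-base fact co-occur in it.
# The KB facts and sentences below are toy examples, not a real dataset.

KB = {
    ("Paris", "France"): "capital_of",
    ("Oslo", "Norway"): "capital_of",
}

def distantly_label(sentences):
    """Return (sentence, subject, object, relation) tuples.

    Noisy by design: any co-occurrence of the pair triggers the label,
    even when the sentence does not actually express the relation.
    """
    examples = []
    for sent in sentences:
        for (subj, obj), rel in KB.items():
            if subj in sent and obj in sent:
                examples.append((sent, subj, obj, rel))
    return examples

sentences = [
    "Paris is the capital of France.",
    "Oslo hosted a conference in Norway.",   # spurious match -> label noise
    "Berlin is a large city.",
]
labeled = distantly_label(sentences)
```

The second sentence shows why distant labels are noisy: the entities co-occur without expressing the relation, which is exactly the kind of label noise the surveyed methods must cope with.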
Character-level and syntax-level models for low-resource and multilingual natural language processing
There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages.
This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter.
In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs, since they are simple to extract, effective for bootstrapping the mapping of BWEs, and succeed where unsupervised methods fail. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
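The orthographic signal exploited in the first two publications can be sketched as a normalized edit-distance score between candidate word pairs. The snippet below is a simplified stand-in for the thesis's classification-based system (which also uses embeddings and transliteration features); the word pairs and the threshold are invented for illustration.

```python
# Normalized edit-distance similarity as a cheap orthographic signal for
# bilingual dictionary induction. A simplified sketch, not the thesis system.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ortho_similarity(a, b):
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def best_translation(source_word, target_vocab, threshold=0.7):
    """Pick the orthographically closest target word above a threshold."""
    best = max(target_vocab, key=lambda t: ortho_similarity(source_word, t))
    return best if ortho_similarity(source_word, best) >= threshold else None
```

For example, `best_translation("parliament", ["parlament", "haus", "fluss"])` returns `"parlament"` (similarity 0.9), while a word with no close match returns `None` — the case where embedding or transliteration features must take over.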
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
We study the open-domain named entity recognition (NER) problem under distant
supervision. Distant supervision, though it does not require large amounts of
manual annotation, yields highly incomplete and noisy labels via
external knowledge bases. To address this challenge, we propose a new
computational framework -- BOND, which leverages the power of pre-trained
language models (e.g., BERT and RoBERTa) to improve the prediction performance
of NER models. Specifically, we propose a two-stage training algorithm: In the
first stage, we adapt the pre-trained language model to the NER tasks using the
distant labels, which can significantly improve recall and precision; In
the second stage, we drop the distant labels, and propose a self-training
approach to further improve the model performance. Thorough experiments on 5
benchmark datasets demonstrate the superiority of BOND over existing distantly
supervised NER methods. The code and distantly labeled data have been released
in https://github.com/cliang1453/BOND.
Comment: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining (KDD '20)
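The two-stage recipe the abstract describes can be rendered as a toy, assuming a nearest-centroid classifier in place of the pre-trained language model. All data, thresholds, and the confidence measure below are invented for illustration; this is not the released BOND code.

```python
# Two-stage sketch: (1) fit on noisy distant labels, (2) drop them and
# self-train on the model's own confident predictions. A nearest-centroid
# classifier stands in for the pre-trained language model.

def fit_centroids(points, labels):
    """Per-class mean of the 2-D feature vectors."""
    sums, counts = {}, {}
    for (x, y), lab in zip(points, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Return (label, margin): margin = gap between the two nearest classes."""
    dists = sorted((((point[0] - cx) ** 2 + (point[1] - cy) ** 2) ** 0.5, lab)
                   for lab, (cx, cy) in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
distant = ["A", "A", "A", "A", "B", "B"]   # (5, 5) is mislabeled noise

# Stage 1: adapt the model to the task using the noisy distant labels.
stage1 = fit_centroids(points, distant)

# Stage 2: drop the distant labels; keep only confident self-predictions.
pseudo = [(p, predict(stage1, p)) for p in points]
kept = [(p, lab) for p, (lab, margin) in pseudo if margin > 1.0]
stage2 = fit_centroids([p for p, _ in kept], [lab for _, lab in kept])
```

The stage-2 model is trained only on its own confident pseudo-labels, which here correct the mislabeled point — the same intuition, at toy scale, behind dropping the distant labels and self-training.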
X-WikiRE: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension
Although the vast majority of knowledge bases (KBs) are heavily biased towards
English, Wikipedias do cover very different topics in different languages.
Exploiting this, we introduce a new multilingual dataset (X-WikiRE), framing
relation extraction as a multilingual machine reading problem. We show that by
leveraging this resource it is possible to robustly transfer models
cross-lingually and that multilingual support significantly improves
(zero-shot) relation extraction, enabling the population of low-resourced KBs
from their well-populated counterparts.
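The reframing the abstract describes can be sketched as turning each (subject, relation) pair into a templated question whose answer span is the object entity. The templates and example below are invented for illustration, not taken from the X-WikiRE data.

```python
# Sketch of casting relation extraction as machine reading: each relation
# gets a question template with a <subject> slot; answering the question
# against the context recovers the object. Templates here are invented.

TEMPLATES = {
    "capital_of": "What country is <subject> the capital of?",
    "author_of":  "Who wrote <subject>?",
}

def to_qa_example(subject, relation, context, answer=None):
    """Build a machine-comprehension instance; answer=None yields a
    'no answer' negative example, as in unanswerable-QA setups."""
    return {
        "question": TEMPLATES[relation].replace("<subject>", subject),
        "context": context,
        "answer": answer,
    }

ex = to_qa_example("Paris", "capital_of",
                   "Paris is the capital and largest city of France.",
                   answer="France")
```

Because the question templates can be written (or translated) per language while the reader stays shared, this framing is what makes the cross-lingual and zero-shot transfer in the abstract possible.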
Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
Translation alignment is an essential task in Digital Humanities and Natural
Language Processing: it links words or phrases in the source text with their
equivalents in the translation. In addition to
its importance in teaching and learning historical languages, translation
alignment builds bridges between ancient and modern languages through
which various linguistic annotations can be transferred. This thesis focuses
on word-level translation alignment applied to historical languages in general
and Ancient Greek and Latin in particular. As the title indicates, the thesis
addresses four interdisciplinary aspects of translation alignment.
The starting point was developing Ugarit, an interactive annotation tool
to perform manual alignment aiming to gather training data to train an
automatic alignment model. This effort resulted in more than 190k accurate
translation pairs that I used for supervised training later. Ugarit has been
used by many researchers and scholars, as well as in classrooms at several
institutions for teaching and learning ancient languages. This has resulted
in a large, diverse crowd-sourced aligned parallel corpus allowing us to
conduct experiments and qualitative analysis to detect recurring patterns in
annotators’ alignment practice and the generated translation pairs.
Further, I employed the recent advances in NLP and language modeling to
develop an automatic alignment model for low-resource historical languages,
experimenting with various training objectives and proposing a training
strategy for historical languages that combines supervised and unsupervised
training with mono- and multilingual texts. Then, I integrated this alignment
model into other development workflows to project cross-lingual annotations
and induce bilingual dictionaries from parallel corpora.
Evaluation is essential for assessing the quality of any model. To ensure best
practice, I reviewed the current evaluation procedure, identified its
limitations, and proposed two new evaluation metrics. Moreover, I introduced a
visual analytics framework to explore and inspect alignment gold
standard datasets and support quantitative and qualitative evaluation of
translation alignment models. In addition, I designed and implemented visual
analytics tools and reading environments for parallel texts, and proposed
various visualization approaches to support different alignment-related tasks,
employing the latest advances in information visualization and best practices.
Overall, this thesis presents a comprehensive study that includes manual and
automatic alignment techniques, evaluation methods and visual analytics
tools that aim to advance the field of translation alignment for historical
languages.
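At its simplest, the word-level alignment task at the heart of this thesis can be sketched as a greedy argmax over a source–target similarity matrix. The similarity scores below are hand-made for illustration; the actual model derives them from trained (multilingual) representations.

```python
# Greedy word alignment sketch: repeatedly link the highest-scoring
# unaligned (source, target) pair until no score clears a threshold.
# The similarity matrix is invented for illustration.

def greedy_align(sim, threshold=0.5):
    """sim[i][j] = similarity of source word i and target word j.
    Returns a list of (i, j) links; each word is used at most once."""
    pairs = sorted(((sim[i][j], i, j)
                    for i in range(len(sim))
                    for j in range(len(sim[0]))), reverse=True)
    used_src, used_tgt, links = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break                      # everything below is too unreliable
        if i not in used_src and j not in used_tgt:
            links.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(links)

# Toy 3x3 similarity matrix for a three-word sentence pair.
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
]
links = greedy_align(sim)
```

Raising the threshold trades recall for precision, which mirrors the precision/recall tension that the thesis's evaluation metrics are designed to capture.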
Verb Knowledge Injection for Multilingual Event Processing
In parallel to their overwhelming success across NLP tasks, the language
ability of deep Transformer networks pretrained via language modeling (LM)
objectives has undergone extensive scrutiny. While probing has revealed that these models
encode a range of syntactic and semantic properties of a language, they are
still prone to fall back on superficial cues and simple heuristics to solve
downstream tasks, rather than leverage deeper linguistic knowledge. In this
paper, we target one such area of their deficiency, verbal reasoning. We
investigate whether injecting explicit information on verbs' semantic-syntactic
behaviour improves the performance of LM-pretrained Transformers in event
extraction tasks -- downstream tasks for which accurate verb processing is
paramount. Concretely, we impart the verb knowledge from curated lexical
resources into dedicated adapter modules (dubbed verb adapters), allowing it to
complement, in downstream tasks, the language knowledge obtained during
LM-pretraining. We first demonstrate that injecting verb knowledge leads to
performance gains in English event extraction. We then explore the utility of
verb adapters for event extraction in other languages: we investigate (1)
zero-shot language transfer with multilingual Transformers as well as (2)
transfer via (noisy automatic) translation of English verb-based lexical
constraints. Our results show that the benefits of verb knowledge injection
indeed extend to other languages, even when verb adapters are trained on
noisily translated constraints.
Comment: 19 pages, 1 figure, 8 tables
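The adapter idea the paper builds on, a small bottleneck module inserted into a frozen Transformer layer, can be sketched in a few lines. This pure-Python version (hand-rolled matrix ops, no deep-learning framework) only illustrates the residual bottleneck pattern, not the paper's actual verb adapters; all weights below are invented.

```python
# Bottleneck adapter sketch: project the hidden state down, apply a
# nonlinearity, project back up, and add a residual connection.
# With the up-projection at zero, the adapter is an identity function,
# which is how adapters are commonly initialized before training.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter(h, W_down, W_up):
    """h: hidden vector; W_down: bottleneck x hidden; W_up: hidden x bottleneck."""
    z = relu(matvec(W_down, h))                          # down-project + nonlinearity
    return [a + b for a, b in zip(h, matvec(W_up, z))]   # up-project + residual

h = [1.0, -2.0, 0.5, 3.0]                  # hidden size 4, bottleneck size 2
W_down = [[0.1, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.2, 0.0]]
W_zero = [[0.0, 0.0] for _ in range(4)]    # zero-initialized up-projection
```

Because only the small down/up projections are trained, knowledge injected this way complements rather than overwrites what the frozen LM already encodes — which is what lets the verb adapters be swapped in per task or per language.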
Low-Resource Adaptation of Neural NLP Models
Real-world applications of natural language processing (NLP) are challenging.
NLP models rely heavily on supervised machine learning and require large
amounts of annotated data. These resources are often based on language data
available in large quantities, such as English newswire. However, in real-world
applications of NLP, the textual resources vary across several dimensions, such
as language, dialect, topic, and genre. It is challenging to find annotated
data of sufficient amount and quality. The objective of this thesis is to
investigate methods for dealing with such low-resource scenarios in information
extraction and natural language understanding. To this end, we study distant
supervision and sequential transfer learning in various low-resource settings.
We develop and adapt neural NLP models to explore a number of research
questions concerning NLP tasks with minimal or no training data.
Comment: Thesis submitted for the degree of Philosophiae Doctor. Department of
Informatics, University of Oslo.
https://www.mn.uio.no/ifi/forskning/aktuelt/arrangementer/disputaser/2020/nooralahzadeh.htm
Cold-start universal information extraction
Who? What? When? Where? Why? are fundamental questions asked when gathering knowledge about and understanding a concept, topic, or event. The answers to these questions underpin the key information conveyed in the overwhelming majority, if not all, of language-based communication. At the core of my research in Information Extraction (IE) is the desire to endow machines with the ability to automatically extract, assess, and understand text in order to answer these fundamental questions. IE has been serving as one of the most important components for many downstream natural language processing (NLP) tasks, such as knowledge base completion, machine reading comprehension, machine translation, and so on. The proliferation of the Web also intensifies the need to deal with enormous amounts of unstructured data from various sources, spanning many languages, genres, and domains.
When building an IE system, the conventional pipeline is to (1) ask expert linguists to rigorously define a target set of knowledge types we wish to extract by examining a large data set, (2) collect resources and human annotations for each type, and (3) design features and train machine learning models to extract knowledge elements. In practice, this process is very expensive as each step involves extensive human effort which is not always available; for example, to specify the knowledge types for a particular scenario, both consumers and expert linguists need to examine a lot of data from that domain and write detailed annotation guidelines for each type. Hand-crafted schemas, which define the types and complex templates of the expected knowledge elements, often provide low coverage and fail to generalize to new domains. For example, none of the traditional event extraction programs, such as ACE (Automatic Content Extraction) and TAC-KBP, include "donation" and "evacuation" in their schemas in spite of their potential relevance to natural disaster management users. Additionally, these approaches are highly dependent on linguistic resources and human labeled data tuned to pre-defined types, so they suffer from poor scalability and portability when moving to a new language, domain, or genre.
The focus of this thesis is to develop effective theories and algorithms for IE which not only yield satisfactory quality by incorporating prior linguistic and semantic knowledge, but also achieve greater portability and scalability by moving away from the high cost and narrow focus of large-scale manual annotation. This thesis opens up a new research direction called Cold-Start Universal Information Extraction, where the full extraction and analysis starts from scratch and requires little or no prior manual annotation or pre-defined type schema. In addition to this new research paradigm, we also contribute effective algorithms and models towards resolving the following three challenges:
How can machines extract knowledge without any pre-defined types or any human-annotated data? We develop an effective bottom-up, unsupervised Liberal Information Extraction framework based on the hypothesis that the meaning and underlying knowledge conveyed by linguistic expressions is usually embodied by their usages in language. This makes it possible to automatically induce a type schema from rich contextual representations of all knowledge elements, combining their symbolic and distributional semantics via unsupervised hierarchical clustering.
How can machines benefit from available resources, e.g., large-scale ontologies or existing human annotations? My research has shown that pre-defined types can also be encoded by rich contextual or structured representations, through which knowledge elements can be mapped to their appropriate types. Therefore, we design a weakly supervised Zero-shot Learning approach and a Semi-Supervised Vector Quantized Variational Auto-Encoder approach, which frame IE as a grounding problem instead of classification: knowledge elements are grounded into any types from an extensible and large-scale target ontology or induced from the corpora, with available annotations for only a few types.
How can IE approaches be extended to low-resource languages without any extra human effort? There are more than 6000 living languages in the world, while public gold-standard annotations are only available for a few dominant languages. To facilitate the adaptation of these IE frameworks to other languages, especially low-resource languages, a Multilingual Common Semantic Space is further proposed to serve as a bridge for transferring existing resources and annotated data from dominant languages to more than 300 low-resource languages. Moreover, a Multi-Level Adversarial Transfer framework is also designed to learn language-agnostic features across various languages.
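The schema-induction hypothesis behind the first challenge can be sketched with a minimal single-linkage agglomerative clustering over contextual vectors: each resulting cluster stands for an induced type. The vectors and the distance threshold below are invented toy values, and this flat clustering is a simplification of the thesis's hierarchical approach.

```python
# Unsupervised schema induction sketch: cluster contextual representations
# of knowledge elements; each resulting cluster is an induced "type".
# Single-linkage agglomerative clustering over toy 2-D vectors.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Cluster distance = distance of the closest cross-cluster pair."""
    return min(dist(a, b) for a in c1 for b in c2)

def induce_types(vectors, threshold):
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        # find the closest pair of clusters
        d, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:       # no pair is close enough: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two tight groups of "mentions" should yield two induced types.
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
types = induce_types(vectors, threshold=1.0)
```

The number of induced types falls out of the data and the threshold rather than a hand-crafted schema, which is the cold-start property the paradigm is built around.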