Similarity-Based Models of Word Cooccurrence Probabilities
In many applications of natural language processing (NLP) it is necessary to
determine the likelihood of a given word combination. For example, a speech
recognizer may need to determine which of the two word combinations ``eat a
peach'' and ``eat a beach'' is more likely. Statistical NLP methods determine
the likelihood of a word combination from its frequency in a training corpus.
However, the nature of language is such that many word combinations are
infrequent and do not occur in any given corpus. In this work we propose a
method for estimating the probability of such previously unseen word
combinations using available information on ``most similar'' words.
We describe probabilistic word association models based on distributional
word similarity, and apply them to two tasks, language modeling and pseudo-word
disambiguation. In the language modeling task, a similarity-based model is used
to improve probability estimates for unseen bigrams in a back-off language
model. The similarity-based method yields a 20% perplexity improvement in the
prediction of unseen bigrams and statistically significant reductions in
speech-recognition error.
We also compare four similarity-based estimation methods against back-off and
maximum-likelihood estimation methods on a pseudo-word sense disambiguation
task in which we controlled for both unigram and bigram frequency to avoid
giving too much weight to easy-to-disambiguate high-frequency configurations.
The similarity-based methods perform up to 40% better on this particular task.
Comment: 26 pages, 5 figures
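The core idea of the abstract above can be sketched in a few lines: when a bigram is unseen, estimate its probability from the conditional distributions of the history word's distributionally most similar neighbors. This is a minimal illustration with toy counts, a cosine similarity over successor-count vectors, and a similarity-weighted average; the data and the particular similarity function are assumptions for illustration, not the paper's exact formulation.

```python
from collections import Counter
import math

# Toy bigram counts per history word (hypothetical data for illustration).
bigrams = {
    "eat":   Counter({"a": 4, "the": 2, "some": 1}),
    "drink": Counter({"a": 3, "the": 2, "water": 2}),
    "peach": Counter({"tree": 1}),
}

def p_ml(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1)."""
    c = bigrams.get(w1, Counter())
    total = sum(c.values())
    return c[w2] / total if total else 0.0

def cosine_sim(w1a, w1b):
    """Distributional similarity: cosine over successor-count vectors."""
    a, b = bigrams.get(w1a, Counter()), bigrams.get(w1b, Counter())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def p_sim(w2, w1, k=2):
    """Similarity-based estimate for an unseen bigram: a similarity-weighted
    average of P(w2 | w1') over the k history words w1' most similar to w1."""
    neighbors = sorted(
        (w for w in bigrams if w != w1),
        key=lambda w: cosine_sim(w1, w), reverse=True)[:k]
    z = sum(cosine_sim(w1, w) for w in neighbors)
    if z == 0:
        return 0.0
    return sum(cosine_sim(w1, w) * p_ml(w2, w) for w in neighbors) / z
```

Here "eat water" never occurs, so its ML estimate is zero, but "drink" has a similar successor distribution to "eat", so the similarity-based estimate borrows probability mass from P(water | drink).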
A literature survey of active machine learning in the context of natural language processing
Active learning is a supervised machine learning technique in which the learner is in control of the data used for learning. That control is utilized by the learner to ask an oracle, typically a human with extensive knowledge of the domain at hand, about the classes of the instances for which the model learned so far makes unreliable predictions. The active learning process takes as input a set of labeled examples, as well as a larger set of unlabeled examples, and produces a classifier and a relatively small set of newly labeled data. The overall goal is to create as good a classifier as possible, without having to mark up and supply the learner with more data than necessary. The learning process aims at keeping the human annotation effort to a minimum, only asking for advice where the training utility of the result of such a query is high. Active learning has been successfully applied to a number of natural language processing tasks, such as information extraction, named entity recognition, text categorization, part-of-speech tagging, parsing, and word sense disambiguation. This report is a literature survey of active learning from the perspective of natural language processing.
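The pool-based query loop the abstract describes can be sketched compactly: train on the labeled set, ask the oracle about the pool instance the current model is least sure of, and repeat. The "classifier" below (per-class means on a 1-D feature) and the oracle's decision rule are hypothetical stand-ins chosen to keep the sketch self-contained, not any surveyed system.

```python
import random

def train(labeled):
    """Toy 'classifier': the mean feature value of each class."""
    means = {}
    for label in {y for _, y in labeled}:
        xs = [x for x, y in labeled if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def margin(model, x):
    """Gap between the two nearest class means: a small gap means the
    model's prediction for x is unreliable (high training utility)."""
    dists = sorted(abs(x - m) for m in model.values())
    return dists[1] - dists[0] if len(dists) > 1 else float("inf")

def oracle(x):
    """Stand-in for the human annotator: true concept is x >= 0.5."""
    return int(x >= 0.5)

random.seed(0)
pool = [random.random() for _ in range(100)]   # unlabeled examples
labeled = [(0.1, 0), (0.9, 1)]                 # small labeled seed set

for _ in range(10):                            # annotation budget
    model = train(labeled)
    x = min(pool, key=lambda v: margin(model, v))  # most uncertain instance
    pool.remove(x)
    labeled.append((x, oracle(x)))             # query the oracle
```

The queried points concentrate near the decision boundary, which is exactly where labels carry the most training utility per annotation.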
Play the Shannon Game With Language Models: A Human-Free Approach to Summary Evaluation
The goal of a summary is to concisely state the most important information in
a document. With this principle in mind, we introduce new reference-free
summary evaluation metrics that use a pretrained language model to estimate the
information content shared between a document and its summary. These metrics
are a modern take on the Shannon Game, a method for summary quality scoring
proposed decades ago, where we replace human annotators with language models.
We also view these metrics as an extension of BLANC, a recently proposed
approach to summary quality measurement based on the performance of a language
model with and without the help of a summary. Using transformer-based language
models, we empirically verify that our metrics achieve state-of-the-art
correlation with human judgement of the summary quality dimensions of both
coherence and relevance, as well as competitive correlation with human
judgement of consistency and fluency.
Comment: To appear at AAAI 202
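The Shannon-Game intuition above can be made concrete: score a summary by how many bits of information it saves a language model when predicting the document. A real implementation would condition a pretrained transformer on the summary; the `log_prob` function below is a deliberately crude stand-in (uniform over a vocabulary, with context words costing one bit) used only to make the scoring scheme runnable.

```python
import math

def log_prob(words, context=(), vocab_size=10_000):
    """Toy 'language model' in bits: uniform over the vocabulary, except that
    words present in the conditioning context cost only 1 bit.
    (Hypothetical stand-in for conditioning a pretrained LM.)"""
    ctx = set(context)
    return sum(-1.0 if w in ctx else -math.log2(vocab_size) for w in words)

def shannon_score(document, summary):
    """Information about the document contributed by the summary:
    log P(doc | summary) - log P(doc), in bits (higher is better)."""
    doc, summ = document.split(), summary.split()
    return log_prob(doc, context=summ) - log_prob(doc)

good = shannon_score("the cat sat on the mat", "cat on mat")
bad  = shannon_score("the cat sat on the mat", "dogs bark loudly")
```

A summary sharing content with the document makes the document cheaper to predict and scores high; an unrelated "summary" contributes nothing.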
X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity
Cross-lingual transfer (XLT) is an emergent ability of multilingual language
models that preserves their performance on a task to a significant extent when
evaluated in languages that were not included in the fine-tuning process. While
English, due to its widespread usage, is typically regarded as the primary
language for model adaptation in various tasks, recent studies have revealed that
the efficacy of XLT can be amplified by selecting the most appropriate source
languages based on specific conditions. In this work, we propose the
utilization of sub-network similarity between two languages as a proxy for
predicting the compatibility of the languages in the context of XLT. Our
approach is model-oriented, better reflecting the inner workings of foundation
models. In addition, it requires only a moderate amount of raw text from
candidate languages, distinguishing it from the majority of previous methods
that rely on external resources. In experiments, we demonstrate that our method
is more effective than baselines across diverse tasks. Specifically, it shows
proficiency in ranking candidates for zero-shot XLT, achieving an improvement
of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that
confirm the utility of sub-networks for XLT prediction.
Comment: Accepted to EMNLP 2023 (Findings)
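The NDCG@3 figure quoted above measures how well a predicted ranking of candidate source languages agrees with their actual downstream XLT quality. A minimal implementation of the metric, with hypothetical graded relevance scores standing in for real transfer results:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranking, truncated at rank k."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=3):
    """DCG of the predicted order, normalized by the DCG of the ideal order."""
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg else 0.0

# Relevance of each candidate source language, listed in the order a ranking
# method predicted them (hypothetical numbers, e.g. downstream XLT scores):
predicted_order = [3, 1, 2, 0]
score = ndcg_at_k(predicted_order, k=3)
```

A perfect ranking scores 1.0; swapping the second- and third-best candidates, as above, costs a little because the log-discount weights early ranks most heavily.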
Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora
This thesis describes the development and in-depth empirical investigation of a
method, called BootMark, for bootstrapping the marking up of named entities
in textual documents. The reason for working with documents, as opposed to
for instance sentences or phrases, is that the BootMark method is concerned
with the creation of corpora. The claim made in the thesis is that BootMark
requires a human annotator to manually annotate fewer documents in order to
produce a named entity recognizer with a given performance, than would be
needed if the documents forming the basis for the recognizer were randomly
drawn from the same corpus. The intention is then to use the created named
entity recognizer as a pre-tagger and thus eventually turn the manual annotation
process into one in which the annotator reviews system-suggested annotations
rather than creating new ones from scratch. The BootMark method consists of
three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping
– active machine learning for the purpose of selecting which document to
annotate next; (3) The remaining unannotated documents of the original corpus
are marked up using pre-tagging with revision.
Five emerging issues are identified, described and empirically investigated
in the thesis. Their common denominator is that they all depend on the
realization of the named entity recognition task, and as such, require the context
of a practical setting in order to be properly addressed. The emerging issues
are related to: (1) the characteristics of the named entity recognition task and
the base learners used in conjunction with it; (2) the constitution of the set of
documents annotated by the human annotator in phase one in order to start the
bootstrapping process; (3) the active selection of the documents to annotate in
phase two; (4) the monitoring and termination of the active learning carried out
in phase two, including a new intrinsic stopping criterion for committee-based
active learning; and (5) the applicability of the named entity recognizer created
during phase two as a pre-tagger in phase three.
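Emerging issue (4) mentions an intrinsic stopping criterion for committee-based active learning. One standard ingredient of such schemes, sketched here with hypothetical committee votes rather than BootMark's actual classifiers, is vote entropy: query the document the committee disagrees on most, and treat vanishing disagreement as a signal to stop.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of a committee's label votes on one document: 0 means full
    agreement; higher values mean more disagreement."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def select_next(doc_votes, threshold=0.5):
    """Index of the most disputed document, or None once the committee's
    maximum disagreement falls below the threshold (an intrinsic
    stopping signal, since no held-out labeled data is consulted)."""
    entropies = [vote_entropy(v) for v in doc_votes]
    best = max(range(len(entropies)), key=lambda i: entropies[i])
    return best if entropies[best] >= threshold else None

# Votes of a four-member committee on three unannotated documents
# (hypothetical named-entity labels):
doc_votes = [
    ["PER", "PER", "PER", "PER"],   # full agreement
    ["PER", "ORG", "ORG", "LOC"],   # high disagreement
    ["ORG", "ORG", "ORG", "LOC"],   # mild disagreement
]
```

With these votes the second document is selected; once every document looks like the first, `select_next` returns None and annotation can stop.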
The outcomes of the empirical investigations concerning the emerging
issues support the claim made in the thesis. The results also suggest that while
the recognizer produced in phases one and two is as useful for pre-tagging as
a recognizer created from randomly selected documents, the applicability of
the recognizer as a pre-tagger is best investigated by conducting a user study
involving real annotators working on a real named entity recognition task.