305 research outputs found
Selective Sampling for Example-based Word Sense Disambiguation
This paper proposes an efficient example sampling method for example-based
word sense disambiguation systems. To construct a database of practical size, a
considerable overhead for manual sense disambiguation (overhead for
supervision) is required. In addition, the time complexity of searching a
large-sized database poses a considerable problem (overhead for search). To
counter these problems, our method selectively samples a smaller-sized
effective subset from a given example set for use in word sense disambiguation.
Our method is characterized by the reliance on the notion of training utility:
the degree to which each example is informative for future example sampling
when used for the training of the system. The system progressively collects
examples by selecting those with greatest utility. The paper reports the
effectiveness of our method through experiments on about one thousand
sentences. Compared with other example sampling methods, our method reduced
both the overhead for supervision and the overhead for search without
degrading system performance.
Comment: 25 pages, 14 Postscript figures
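The utility-driven sampling the abstract describes can be sketched as a greedy loop that repeatedly picks the example with greatest training utility. The function names and the toy coverage-based utility below are hypothetical illustrations, not the paper's actual utility measure:

```python
from collections import Counter

def select_examples(pool, utility, budget):
    """Greedily pick up to `budget` examples with greatest training utility.

    `pool` is a list of candidate examples; `utility` scores an example
    against the already-selected set. Both are stand-ins: the paper's
    utility estimates how informative an example is for future sampling.
    """
    selected = []
    remaining = list(pool)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=lambda ex: utility(ex, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy utility: prefer examples whose context words are not yet covered,
# so the selected subset stays small but diverse.
def coverage_utility(example, selected):
    covered = Counter(w for ex in selected for w in ex)
    return sum(1 for w in example if covered[w] == 0)

pool = [("bank", "river"), ("bank", "money"), ("bank", "river")]
print(select_examples(pool, coverage_utility, 2))
```

The greedy loop makes the supervision/search trade-off concrete: a smaller selected subset means fewer examples to hand-label and a smaller database to search at disambiguation time.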
The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
We introduce the STEM (Science, Technology, Engineering, and Medicine)
Dataset for Scientific Entity Extraction, Classification, and Resolution,
version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to
provide a benchmark for the evaluation of scientific entity extraction,
classification, and resolution tasks in a domain-independent fashion. It
comprises abstracts in 10 STEM disciplines that were found to be the most
prolific ones on a major publishing platform. We describe the creation of such
a multidisciplinary corpus and highlight the obtained findings in terms of the
following features: 1) a generic conceptual formalism for scientific entities
in a multidisciplinary scientific context; 2) the feasibility of the
domain-independent human annotation of scientific entities under such a generic
formalism; 3) a performance benchmark obtainable for automatic extraction of
multidisciplinary scientific entities using BERT-based neural models; 4) a
delineated 3-step entity resolution procedure for human annotation of the
scientific entities via encyclopedic entity linking and lexicographic word
sense disambiguation; and 5) human evaluations of the encyclopedic links and
lexicographic senses returned by Babelfy for our entities. Our findings
cumulatively indicate that human annotation and automatic learning of
multidisciplinary scientific concepts, as well as their semantic
disambiguation in a wide-ranging setting such as STEM, are feasible.
Comment: Published in LREC 2020. Publication URL
https://www.aclweb.org/anthology/2020.lrec-1.268/; Dataset DOI
https://doi.org/10.25835/001754
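The split between encyclopedic entity linking and lexicographic word sense disambiguation that the entity resolution procedure relies on can be illustrated with a minimal lookup sketch. Everything here is hypothetical (the function, the sources, and the link identifiers); the paper's actual procedure is a 3-step human annotation protocol, not a dictionary lookup:

```python
def resolve_entity(mention, encyclopedia, lexicon):
    """Resolve a mention to an encyclopedic link if one exists,
    otherwise fall back to a lexicographic sense.

    A hypothetical sketch of the encyclopedic-vs-lexicographic split
    only; both `encyclopedia` and `lexicon` are toy dictionaries.
    """
    if mention in encyclopedia:
        return ("encyclopedic", encyclopedia[mention])
    if mention in lexicon:
        return ("lexicographic", lexicon[mention])
    return ("unresolved", None)

# Illustrative (made-up) link targets:
encyclopedia = {"BERT": "wikipedia:BERT_(language_model)"}
lexicon = {"model": "wiktionary:model#sense-n"}
print(resolve_entity("BERT", encyclopedia, lexicon))
print(resolve_entity("model", encyclopedia, lexicon))
```

The point of the split is that named scientific entities tend to have encyclopedic pages, while generic terms are better grounded in a lexicographic sense inventory.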
A Crowdsourced Frame Disambiguation Corpus with Ambiguity
We present a resource for the task of FrameNet semantic frame disambiguation
of over 5,000 word-sentence pairs from the Wikipedia corpus. The annotations
were collected using a novel crowdsourcing approach with multiple workers per
sentence to capture inter-annotator disagreement. In contrast to the typical
approach of attributing the best single frame to each word, we provide a list
of frames with disagreement-based scores that express the confidence with which
each frame applies to the word. This is based on the idea that inter-annotator
disagreement is at least partly caused by ambiguity that is inherent to the
text and frames. We have found many examples where the semantics of individual
frames overlap sufficiently to make them acceptable alternatives for
interpreting a sentence. We have argued that ignoring this ambiguity creates an
overly arbitrary target for training and evaluating natural language processing
systems - if humans cannot agree, why would we expect the correct answer from a
machine to be any different? To process this data we also utilized an expanded
lemma-set provided by the Framester system, which merges FN with WordNet to
enhance coverage. Our dataset includes annotations of 1,000 sentence-word pairs
whose lemmas are not part of FN. Finally, we present metrics for evaluating
frame disambiguation systems that account for ambiguity.
Comment: Accepted to NAACL-HLT201
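The disagreement-based scoring described above can be sketched minimally by treating a frame's score as the fraction of workers who chose it, so that genuine ambiguity yields several frames with non-zero confidence instead of one forced "best" frame. This is a hypothetical simplification of the paper's metric, and the frame names are illustrative:

```python
from collections import Counter

def frame_scores(annotations):
    """Turn one sentence-word pair's crowd labels into per-frame scores.

    `annotations` is a list of frame labels, one per worker. The score
    of each frame is the fraction of workers who chose it, so worker
    disagreement is preserved rather than collapsed to a single frame.
    """
    counts = Counter(annotations)
    total = len(annotations)
    return {frame: n / total for frame, n in counts.items()}

# Three workers pick one frame, one picks an overlapping alternative:
print(frame_scores(["Commerce_buy", "Commerce_buy", "Getting", "Commerce_buy"]))
```

A system evaluated against such soft targets is rewarded for ranking acceptable alternative frames highly, rather than penalized for missing an arbitrarily chosen gold frame.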
Durham - a word sense disambiguation system
Ever since the 1950s, when Machine Translation first began to be developed, word sense disambiguation (WSD) has been considered a problem for developers. In more recent times, all NLP tasks which are sensitive to lexical semantics potentially benefit from WSD, although to what extent is largely unknown. The thesis presents a novel approach to the task of WSD on a large scale. In particular, a novel knowledge source named contextual information is presented. This knowledge source adopts a sub-symbolic training mechanism to learn information from the context of a sentence which is able to aid disambiguation. The system also takes advantage of frequency information, and these two knowledge sources are combined. The system is trained and tested on SEMCOR. A novel disambiguation algorithm is also developed. The algorithm must tackle the problem of the large number of possible sense combinations in a sentence. The algorithm presented aims to strike an appropriate balance between accuracy and efficiency. This is performed by directing the search at a word level. The performance achieved on SEMCOR is reported and an analysis of the various components of the system is performed. The results achieved on this test data are pleasing, but are difficult to compare with most of the other work carried out in the field. For this reason the system took part in the SENSEVAL evaluation, which provided an excellent opportunity to extensively compare WSD systems. SENSEVAL is a small-scale WSD evaluation using the HECTOR lexicon. Despite this, few adaptations to the system were required. The performance of the system on the SENSEVAL task is reported and has also been presented in [Hawkins, 2000].
When Is Word Sense Disambiguation Difficult? A Crowdsourcing Approach
We identified features that drive differential
accuracy in word sense disambiguation
(WSD) by building regression models using
10,000 coarse-grained WSD instances which
were labeled on MTurk. Features predictive of
accuracy include properties of the target word
(word frequency, part of speech, and number
of possible senses), the example context
(length), and the Turker’s engagement with
our task. The resulting model gives insight
into which words are difficult to disambiguate.
We also show that having many Turkers label
the same instance provides at least a partial
substitute for more expensive annotation.
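The regression modelling described above can be illustrated with a one-predictor ordinary least squares fit. This is a minimal stand-in for the paper's models, which combine several predictors (word frequency, part of speech, number of senses, context length, annotator engagement); the data below are invented to show the expected direction of the effect:

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares with a single predictor: y ~ a + b*x.

    Returns the intercept `a` and slope `b` in closed form.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical data: accuracy tends to drop as the number of senses grows.
senses = [2, 3, 4, 6, 8]
accuracy = [0.95, 0.90, 0.85, 0.75, 0.65]
a, b = fit_simple_regression(senses, accuracy)
print(f"intercept={a:.3f}, slope={b:.3f}")  # a negative slope indicates harder words
```

A fitted negative slope on the number-of-senses feature is exactly the kind of signal that lets such a model flag which words will be difficult to disambiguate.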
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyzed five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.
Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal-contribution mark is missing in the published version due
to the publication policies. Please contact Prof. Erik Cambria for details.