110 research outputs found
Stress Testing BERT Anaphora Resolution Models for Reaction Extraction in Chemical Patents
The high volume of published chemical patents and the importance of timely acquisition of their information motivate the automation of information extraction from chemical patents. Anaphora resolution is an important component of comprehensive information extraction, and is critical for extracting reactions. In chemical patents, there are five anaphoric relations of interest: co-reference, transformed, reaction associated, work up, and contained. Our goal is to investigate how the performance of anaphora resolution models for reaction texts in chemical patents differs between noise-free and noisy environments, and to what extent we can improve the robustness of the model against noise.
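The noisy-environment setting described above can be simulated by injecting synthetic character-level noise into clean patent text before it reaches the model. A minimal sketch; the function name, noise operations, and rates are illustrative assumptions, not the authors' actual stress-testing protocol:

```python
import random

def add_character_noise(text, noise_rate=0.05, seed=0):
    """Randomly delete, duplicate, or swap adjacent characters to
    simulate OCR-style corruption in patent text."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < noise_rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1          # drop this character
                continue
            if op == "duplicate":
                out.append(chars[i])  # emit it twice
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose with the next char
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

clean = "The mixture was stirred at 80 C for 2 hours."
noisy = add_character_noise(clean, noise_rate=0.1, seed=42)
```

A model's robustness can then be estimated by comparing its extraction scores on `clean` versus `noisy` inputs at increasing noise rates.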
Enhancing Coreference Clustering
Proceedings of the Second Workshop on Anaphora Resolution (WAR II).
Editor: Christer Johansson.
NEALT Proceedings Series, Vol. 2 (2008), 31-40.
© 2008 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/7129
Coherence in Machine Translation
Coherence ensures that individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in the other. In Machine Translation, however, for reasons of modelling and computational complexity, sentences are pieced together from words or phrases based on short context windows, with no access to extra-sentential context.
In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as
evidenced via discourse relations.
For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. To address these issues, I adapt a state-of-the-art syntax model, which also improves performance on the monolingual task. The results clearly indicate how much more difficult the new task is than detecting shuffled texts. I propose a new coherence model that explores the crosslingual transfer of discourse relations in machine translation; it is novel in that it measures the correctness of a discourse relation by comparison to the source text rather than to a reference translation. I identify patterns of incoherence common across different language pairs, and create a corpus of machine-translated output annotated with coherence errors for evaluation purposes. I then examine lexical coherence in a multilingual context as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality, and suggest that evaluation in machine translation would benefit from a coherence component that assesses the translation output with respect to the source text.
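Entity-based coherence, the first dimension above, is commonly operationalised via entity grids that track which entities recur across sentences. A toy sketch under simplifying assumptions (presence/absence only, no syntactic roles, no learned weights; the models evaluated in the thesis are far richer):

```python
def entity_grid(sentences):
    """Entity grid: one row per entity, 'X' if it is mentioned in a
    sentence, '-' otherwise. (Full grids also distinguish
    subject/object/other roles.)"""
    entities = sorted(set().union(*sentences))
    return {e: ["X" if e in s else "-" for s in sentences] for e in entities}

def coherence_score(sentences):
    """Fraction of X->X transitions: rewards entities that persist
    across adjacent sentences."""
    grid = entity_grid(sentences)
    trans = [(r[i], r[i + 1]) for r in grid.values() for i in range(len(r) - 1)]
    if not trans:
        return 0.0
    return sum(1 for a, b in trans if a == b == "X") / len(trans)

# Sentences represented as sets of entity mentions.
doc = [{"john", "book"}, {"john"}, {"john", "store"}]
score = coherence_score(doc)
```

A document whose sentences share no entities at all scores 0.0 under this metric, while the example above, where "john" persists throughout, scores higher.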
Improved Coreference Resolution Using Cognitive Insights
Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of reference expression usage in natural language and the difficulty in encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve highly competitive performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves statistically significant improvements, and these sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our and further cognitive insights into more complex frameworks, since this has the potential both to improve the performance of computational models and to deepen our understanding of the mechanisms underpinning human reference resolution.
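The shift-reduce analogy can be pictured as incremental clustering: each incoming mention either starts a new entity (SHIFT) or attaches to an existing cluster (REDUCE). A hypothetical sketch, not LIMERIC's actual algorithm, with a toy string-match scorer standing in for a learned model:

```python
def shift_reduce_coref(mentions, score, threshold=0.5):
    """Process mentions left to right: each either starts a new entity
    (SHIFT) or merges into the best-scoring existing cluster (REDUCE)."""
    clusters = []
    for m in mentions:
        best, best_score = None, threshold
        for c in clusters:
            s = score(m, c)
            if s > best_score:
                best, best_score = c, s
        if best is None:
            clusters.append([m])   # SHIFT: open a new entity
        else:
            best.append(m)         # REDUCE: attach to an existing entity
    return clusters

def toy_score(mention, cluster):
    """Toy scorer: exact (case-insensitive) string match."""
    return 1.0 if any(mention.lower() == m.lower() for m in cluster) else 0.0

out = shift_reduce_coref(["Obama", "the president", "Obama", "Clinton"], toy_score)
```

The single left-to-right pass with entity-level decisions is what makes this style of decoding a plausible mirror of incremental human discourse processing.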
A Compositional Vector Space Model of Ellipsis and Anaphora.
PhD Thesis.
This thesis discusses research in compositional distributional semantics: if words
are defined by their use in language and represented as high-dimensional vectors
reflecting their co-occurrence behaviour in textual corpora, how should words be
composed to produce a similar numerical representation for sentences, paragraphs
and documents? Neural methods learn a task-dependent composition by generalising
over large datasets, whereas type-driven approaches stipulate that composition
is given by a functional view on words, leaving open the question of what those
functions should do, concretely.
We take on the type-driven approach to compositional distributional semantics
and focus on the categorical framework of Coecke, Grefenstette, and Sadrzadeh
[CGS13], which models composition as an interpretation of syntactic structures as
linear maps on vector spaces using the language of category theory, as well as the
two-step approach of Muskens and Sadrzadeh [MS16], where syntactic structures
map to lambda logical forms that are instantiated by a concrete composition model.
We develop the theory behind these approaches to cover phenomena not dealt with
in previous work, evaluate the models in sentence-level tasks, and implement a tensor
learning method that generalises to arbitrary sentences.
This thesis reports three main contributions. The first, theoretical in nature, discusses
the ability of categorical and lambda-based models of compositional distributional
semantics to model ellipsis, anaphora, and parasitic gaps, phenomena that
challenge the linearity of previous compositional models. Secondly, we perform an
evaluation study on verb phrase ellipsis where we introduce three novel sentence
evaluation datasets and compare algebraic, neural, and tensor-based composition
models to show that models that resolve ellipsis achieve higher correlation with humans.
Finally, we generalise the skipgram model [Mik+13] to a tensor-based setting
and implement it for transitive verbs, showing that neural methods to learn tensor
representations for words can outperform previous tensor-based methods on compositional
tasks.
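The tensor-based treatment of transitive verbs can be illustrated schematically: the verb is an order-3 tensor contracted with the subject and object vectors to yield a sentence vector. A simplified sketch in which random vectors stand in for trained embeddings (the models in the thesis are learned and more structured):

```python
import numpy as np

def compose_svo(subj, verb_tensor, obj):
    """Compose a subject-verb-object sentence vector:
    sentence_i = sum_jk V_ijk * subj_j * obj_k."""
    return np.einsum("ijk,j,k->i", verb_tensor, subj, obj)

rng = np.random.default_rng(0)
d = 4                                   # toy embedding dimension
subj = rng.standard_normal(d)           # e.g. "dogs"
obj = rng.standard_normal(d)            # e.g. "cats"
verb = rng.standard_normal((d, d, d))   # e.g. "chase", an order-3 tensor
sent = compose_svo(subj, verb, obj)     # sentence vector, shape (d,)
```

Note that the composition is multilinear in its arguments, which is exactly the linearity that ellipsis and anaphora challenge, motivating the thesis's non-linear extensions.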
Advances in automatic terminology processing: methodology and applications in focus
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
The information and knowledge era in which we are living creates challenges in many fields, and terminology is no exception. The challenges include an exponential growth in the number of specialised documents in which terms are presented, and in the number of newly introduced concepts and terms, which is already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures that enable individuals and/or small groups to efficiently build high-quality terminologies from their own resources, closely reflecting their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often treat terms as independent lexical units satisfying certain criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and by the review of existing approaches in ATP presented in this thesis. To overcome the aforementioned weakness, we propose a novel ATP methodology that extracts a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we consider valuable but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations, and their descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system.
The study also aims to discuss applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiple-choice test item generation. The successful integration of the system shows that ATP is a viable technology and should be exploited more by other NLP applications.
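The knowledge-pattern idea can be illustrated with a toy extractor: a glossary-style pattern such as "<term> is a <hypernym> that <description>" applied to raw text. The pattern and example below are hypothetical stand-ins, not the patterns actually mined from glossaries in the thesis:

```python
import re

# One hypothetical glossary-style knowledge pattern:
# "(A|An|The) <Term> is a/an <hypernym> that/which <description>."
PATTERNS = [
    re.compile(
        r"(?:A|An|The)\s+(?P<term>[A-Z]\w+(?: \w+)*?)\s+is (?:a|an) "
        r"(?P<hypernym>[\w ]+?) (?:that|which) (?P<desc>[^.]+)\."
    ),
]

def extract_terms(text):
    """Apply knowledge patterns to raw text, returning
    (term, hypernym, description) triples."""
    triples = []
    for pat in PATTERNS:
        for m in pat.finditer(text):
            triples.append(
                (m.group("term").strip(),
                 m.group("hypernym").strip(),
                 m.group("desc").strip())
            )
    return triples

doc = "A Tokenizer is a component that splits text into lexical units."
```

Triples of this shape give both the term inventory and the relations between terms, which is what allows the extracted terminology to be assembled as a coherent system rather than a flat term list.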
Harnessing Collective Intelligence on Social Networks
Crowdsourcing is an approach to replace the work traditionally done by a single person with the collective action of a group of people via the Internet. It has established itself in the mainstream of research methodology in recent years, using a variety of approaches to engage humans in solving problems that computers, as yet, cannot solve. Several common approaches to crowdsourcing have been successful, including peer production (in which the participants are inherently interested in contributing), microworking (in which participants are paid small amounts of money per task) and games or gamification (in which the participants are entertained as they complete the tasks). An alternative approach to crowdsourcing using social networks is proposed here. Social networks offer access to large user communities through integrated software applications and, as they mature, are utilised in different ways, with decentralised and unevenly-distributed organisation of content. This research investigates whether collective intelligence systems are facilitated better on social networks and how the contributed human effort can be optimised. These questions are investigated using two case studies of problem solving: anaphoric coreference in text documents and classifying images in the marine biology domain. Social networks themselves can be considered inherent, self-organised problem solving systems, an approach defined here as ‘groupsourcing’, sharing common features with other crowdsourcing approaches; however, the benefits are tempered by the many challenges this approach presents. In comparison to other methods of crowdsourcing, harnessing collective intelligence on social networks offers a high-accuracy, data-driven and low-cost approach.