11 research outputs found

    The cycloaddition of diethyl chlorophosphite with norbornadiene: Synthesis and crystal structure of the cycloadduct

    No full text
    In the presence of Lewis acids, diethyl phosphorochloridite reacts with norbornadiene to afford the tetracyclic phosphinate ester 3. The reaction is believed to involve formation of phosphenium ions stabilized only by alkoxy substituent. The cycloadduct of the reaction was characterized by its spectral data, by single crystal diffraction analysis of the parent acid, and 31P and 27Al nmr experiments

    DOCREP: Document Representation for Natural Language Processing

    No full text
    The field of natural language processing (NLP) revolves around the computational interpretation and generation of natural language. The language typically processed in NLP occurs in paragraphs or documents rather than in single isolated sentences. Despite this, most NLP tools operate over one sentence at a time, not utilising the context outside of the sentence nor any of the metadata associated with the underlying document. One pragmatic reason for this disparity is that representing documents and their annotations through an NLP pipeline is difficult with existing infrastructure. Representing linguistic annotations for a text document using a plain text markupbased format is not sufficient to capture arbitrarily nested and overlapping annotations. Despite this, most linguistic text corpora and NLP tools still operate in this fashion. A document representation framework (DRF) supports the creation of linguistic annotations stored separately to the original document, overcoming this nesting and overlapping annotations problem. Despite the prevalence of pipelines in NLP, there is little published work on, or implementations of, DRFs. The main DRFs, GATE and UIMA, exhibit usability issues which have limited their uptake by the NLP community. This thesis aims to solve this problem through a novel, modern DRF, DOCREP; a portmanteau of document representation. DOCREP is designed to be efficient, programming language and environment agnostic, and most importantly, easy to use. We want DOCREP to be powerful and simple enough to use that NLP researchers and language technology application developers would even use it in their own small projects instead of developing their own ad hoc solution. This thesis begins by presenting the design criteria for our new DRF, extending upon existing requirements from the literature with additional usability and efficiency requirements that should lead to greater use of DRFs. We outline how our new DRF, DOCREP, differs from existing DRFs in terms of the data model, serialisation strategy, developer interactions, support for rapid prototyping, and the expected runtime and environment requirements. We then describe our provided implementations of DOCREP in Python, C++, and Java, the most common languages in NLP; outlining their efficiency, idiomaticity, and the ways in which these implementations satisfy our design requirements. We then present two different evaluations of DOCREP. First, we evaluate its ability to model complex linguistic corpora through the conversion of the OntoNotes 5 corpus to DOCREP and UIMA, outlining the differences in modelling approaches required and efficiency when using these two DRFs. Second, we evaluate DOCREP against our usability requirements from the perspective of a computational linguist who is new to DOCREP. We walk through a number of common use cases for working with text corpora and contrast traditional approaches again their DOCREP counterpart. These two evaluations conclude that DOCREP satisfies our outlined design requirements and outperforms existing DRFs in terms of efficiency, and most importantly, usability. With DOCREP designed and evaluated, we then show how NLP applications can harness document structure. We present a novel document structureaware tokenization framework for the first stage of fullstack NLP systems. We then present a new structureaware NER system which achieves stateoftheart results on multiple standard NER evaluations. The tokenization framework produces its tokenization, sentence boundary, and document structure annotations as native DOCREP annotations. The NER system consumes DOCREP annotations and utilises many components of the DOCREP runtime. We believe that the adoption of DOCREP throughout the NLP community will assist in the reproducibility of results, substitutability of components, and overall quality assurance of NLP systems and corpora, all of which are problematic areas within NLP research and applications. This adoption will make developing and combining NLP components into applications faster, more efficient, and more reliable
    corecore