30 research outputs found

    Novel database design for extreme scale corpus analysis

    This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted which examines their usage in corpus linguistics as well as their underlying architectures. From this survey, it is identified that existing systems are designed primarily to be vertically scalable (i.e. scalable through the usage of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to query extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are evaluated, demonstrating that the architecture is comparably scalable to two modern NoSQL database management systems and outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.

    Unfinished Business: Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard from 1803 to the present day

    Creating, curating and maintaining modern political corpora is becoming an ever more involved task. As interest from various social bodies and the general public in political discourse grows, so too does the need to enrich such datasets with metadata and linguistic annotations. Beyond this, such corpora must be easy to browse and search for linguists, social scientists, digital humanists and the general public. We present our efforts to compile a linguistically annotated and semantically tagged version of the Hansard corpus from 1803 right up to the present day. This involves combining multiple sources of documents and transcripts. We describe our toolchain for tagging, using several existing tools that provide tokenisation, part-of-speech tagging and semantic annotations. We also provide an overview of our bespoke web-based search interface built on LexiDB. In conclusion, we examine the completed corpus by looking at four case studies making use of semantic categories made available by our toolchain.

    LexiDB: Patterns & Methods for Corpus Linguistic Database Management

    LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets.

    Infrastructure for Semantic Annotation in the Genomics Domain

    We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

    The ParlaMint corpora of parliamentary proceedings

    This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich meta-data about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.

    An Orally Bioavailable, Indole-3-glyoxylamide Based Series of Tubulin Polymerization Inhibitors Showing Tumor Growth Inhibition in a Mouse Xenograft Model of Head and Neck Cancer.

    A number of indole-3-glyoxylamides have previously been reported as tubulin polymerization inhibitors, although none has yet been successfully developed clinically. We report here a new series of related compounds, modified according to a strategy of reducing aromatic ring count and introducing a greater degree of saturation, which retain potent tubulin polymerization activity but with a distinct SAR from previously documented libraries. A subset of active compounds from the reported series is shown to interact with tubulin at the colchicine binding site, disrupt the cellular microtubule network, and exert a cytotoxic effect against multiple cancer cell lines. Two compounds demonstrated significant tumor growth inhibition in a mouse xenograft model of head and neck cancer, a type of the disease which often proves resistant to chemotherapy, supporting further development of the current series as potential new therapeutics.

    Exploring the Suitability of Transformer Models to Analyse Mental Health Peer Support Forum Data for a Realist Evaluation

    Mental health peer support forums have become widely used in recent years. The emerging mental health crisis and the COVID-19 pandemic have meant that finding a place online for support and advice when dealing with mental health issues is more critical than ever. The need to examine, understand and find ways to improve the support provided by mental health forums is vital in the current climate. As part of this, we present our initial explorations in using modern transformer models to detect four key concepts (connectedness, lived experience, empathy and gratitude), which we believe are essential to understanding how people use mental health forums and will serve as a basis for testing more expansive realist theories about mental health forums in the future. As part of this work, we also replicate previously published results on empathy utilising an existing annotated dataset and test the other concepts on our manually annotated mental health forum posts dataset. These results serve as a basis for future research examining peer support forums.