
30 research outputs found

Novel database design for extreme scale corpus analysis

Author: Coole Matthew
Publication venue: Lancaster University
Publication date: 01/01/2021

This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted, which examines their usage in corpus linguistics as well as their underlying architectures. This survey identifies that existing systems are designed primarily to be vertically scalable (i.e. scalable through the usage of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to query extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are evaluated, demonstrating that the architecture is comparably scalable to two modern NoSQL database management systems and outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.

Lancaster E-Prints

Unfinished Business: Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard from 1803 to the present day

Author: Coole Matthew
Mariani John
Rayson Paul
Publication venue: European Language Resources Association (ELRA)
Publication date: 11/05/2020

Creating, curating and maintaining modern political corpora is becoming an ever more involved task. As interest from various social bodies and the general public in political discourse grows, so too does the need to enrich such datasets with metadata and linguistic annotations. Beyond this, such corpora must be easy to browse and search for linguists, social scientists, digital humanists and the general public. We present our efforts to compile a linguistically annotated and semantically tagged version of the Hansard corpus from 1803 right up to the present day. This involves combining multiple sources of documents and transcripts. We describe our toolchain for tagging, using several existing tools that provide tokenisation, part-of-speech tagging and semantic annotations. We also provide an overview of our bespoke web-based search interface built on LexiDB. In conclusion, we examine the completed corpus by looking at four case studies making use of semantic categories made available by our toolchain.

Lancaster E-Prints

LexiDB: Patterns & Methods for Corpus Linguistic Database Management

Author: Coole Matthew
Mariani John
Rayson Paul
Publication venue: European Language Resources Association (ELRA)
Publication date: 11/05/2020

LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets.

Lancaster E-Prints

iPOF: Improving Peer Online Forums Update

Author: Coole Matthew
Lobban Fiona
Marshall Paul
Rayson Paul
Publication date: 15/06/2023

Lancaster E-Prints

Infrastructure for Semantic Annotation in the Genomics Domain

Author: Coole Matthew
El-Haj Mahmoud
Ezeani Ignatius
Ide Nancy
Knight Jo
Mariani John
Piao Scott
Prentice Sheryl
Rayson Paul
Rutherford Nathan
Suderman Keith
Publication venue: European Language Resources Association (ELRA)
Publication date: 11/05/2020

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

Lancaster E-Prints

The ParlaMint corpora of parliamentary proceedings

Author: Agnoloni Tommaso
Barkarson Starkaður
Coole Matthew
Darģis Roberts
de Does Jesse
de Macedo Luciana D.
Depuydt Katrien
Erjavec Tomaž
Fišer Darja
Kopp Matyáš
Krilavičius Tomas
Ljubešić Nikola
Luxardo Giancarlo
Marx Maarten
Morkevičius Vaidas
Navarretta Costanza
Ogrodniczuk Maciej
Osenova Petya
Pančur Andrej
Pérez María Calzada
Rayson Paul
Ring Orsolya
Rudolf Michał
Simov Kiril
Steingrímsson Steinþór
van Heusden Ruben
Venturi Giulia
Çöltekin Çağrı
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022

This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.

PubMed Central

Copenhagen University Research Information System

Repositori Institucional de la Universitat Jaume I

Lancaster E-Prints

International Migration, Integration and Social Cohesion online publications

UvA-DARE

An Orally Bioavailable, Indole-3-glyoxylamide Based Series of Tubulin Polymerization Inhibitors Showing Tumor Growth Inhibition in a Mouse Xenograft Model of Head and Neck Cancer.

Author: Alexander G. Dossetter
Bacher G.
Bonne D.
Chen G.
Cortese F.
Daniel P. Mason
Dennis Norman
Desai A.
Edward J. Griffen
Forastiere A. A.
Fürst R.
Harker W. G.
Helen E. Colley
Huang T.-H.
Jacobs C.
Joanne Harrison
Lucinda V. Jackson
Luke R. Jennings
Lynne Williams
Mark J. Thompson
Matthew L. Brett
Melanie Wong
Munitta Muthana
Peter M. Lockey
Sarah J. Danson
Sean F. Coole
Vamshi Tulasi
Publication venue: American Chemical Society (ACS)
Publication date: 01/12/2015

A number of indole-3-glyoxylamides have previously been reported as tubulin polymerization inhibitors, although none has yet been successfully developed clinically. We report here a new series of related compounds, modified according to a strategy of reducing aromatic ring count and introducing a greater degree of saturation, which retain potent tubulin polymerization activity but with a distinct SAR from previously documented libraries. A subset of active compounds from the reported series is shown to interact with tubulin at the colchicine binding site, disrupt the cellular microtubule network, and exert a cytotoxic effect against multiple cancer cell lines. Two compounds demonstrated significant tumor growth inhibition in a mouse xenograft model of head and neck cancer, a type of the disease which often proves resistant to chemotherapy, supporting further development of the current series as potential new therapeutics.

Crossref

White Rose Research Online

iPOF: Improving Peer Online Forums

Author: Coole Matthew
Lobban Fiona
Rayson Paul
Publication date: 15/06/2022

Lancaster E-Prints

Exploring the Suitability of Transformer Models to Analyse Mental Health Peer Support Forum Data for a Realist Evaluation

Author: Coole Matthew
Glossop Zoe
Lobban Fiona
Marshall Paul
Rayson Paul
Vidler John
Publication venue: ELRA and ICCL
Publication date: 01/05/2024

Mental health peer support forums have become widely used in recent years. The emerging mental health crisis and the COVID-19 pandemic have meant that finding a place online for support and advice when dealing with mental health issues is more critical than ever. The need to examine, understand and find ways to improve the support provided by mental health forums is vital in the current climate. As part of this, we present our initial explorations in using modern transformer models to detect four key concepts (connectedness, lived experience, empathy and gratitude), which we believe are essential to understanding how people use mental health forums and will serve as a basis for testing more expansive realist theories about mental health forums in the future. As part of this work, we also replicate previously published results on empathy utilising an existing annotated dataset and test the other concepts on our manually annotated mental health forum posts dataset. These results serve as a basis for future research examining peer support forums.

Lancaster E-Prints

De novo assembly of highly diverse viral populations

Author: Charlebois Patrick
Coole Matthew G
Gnerre Sante
Henn Matthew R
Lennon Niall J
Levin Joshua Z
Qu James
Ryan Elizabeth M
Yang Xiao
Zody Michael C
Publication venue: BMC
Publication date: 01/01/2012

Background: Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants is crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage. Results: We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. The VICUNA program is publicly available at: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/viral-genomics-analysis-software. Conclusions: We developed VICUNA, a publicly available software tool that enables consensus assembly of ultra-deep sequence data derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals