771 research outputs found
Building a semantically annotated corpus of clinical texts
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for the development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, and its value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.
A review of energy law education in the UK
This article focuses on reviewing energy law education in the UK. For such a fast-growing discipline it is important to reflect on the features that give cohesiveness to its curriculum development: how it is taught; who is teaching it and where it is being taught; and what content the curriculum covers. Is it, for example, national in focus or international, or both? A recent review of the state of energy law education in the US demonstrates the scale and ambition of energy law education in that country. This article complements that exercise by providing a review of energy law education in the UK as at 2016. By comparing and contrasting the two approaches, we can glean some distinctive features of the UK approach. More research is needed on energy law education, but from this article it is clear that energy law has taken a foothold in legal education in the UK.
Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach
Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to their ambiguous nature. This project uses data from an anonymised mental health electronic health records database. The data are used to train a machine-learning-based classification algorithm to classify sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs for further studies on pain and mental health. 1,985 documents were manually triple-annotated to create gold-standard training data, which was used to train three commonly used classification algorithms. The best-performing model achieved an F1-score of 0.98 (95% CI 0.98–0.99).
Comment: 5 pages, 2 tables, submitted to the MEDINFO 2023 conference
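The abstract does not name the three classifiers that were trained on the gold-standard data. As a minimal, self-contained illustration of sentence-level pain classification, a multinomial Naive Bayes model over a toy bag-of-words might look like the sketch below (all sentences and labels are invented, not drawn from the actual records):

```python
from collections import Counter
import math

def train_nb(sentences, labels):
    """Train a multinomial Naive Bayes model for binary labels {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for sent, y in zip(sentences, labels):
        word_counts[y].update(sent.lower().split())
    vocab = set(w for c in word_counts.values() for w in c)
    return word_counts, class_counts, vocab

def predict_nb(model, sentence):
    """Pick the class with the highest add-one-smoothed log posterior."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y in class_counts:
        lp = math.log(class_counts[y] / total)            # class prior
        denom = sum(word_counts[y].values()) + len(vocab)
        for w in sentence.lower().split():
            if w in vocab:                                # skip unseen words
                lp += math.log((word_counts[y][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Invented toy sentences: 1 = discusses patient pain, 0 = does not.
sents = ["patient reports severe back pain",
         "chronic pain in the left knee",
         "no complaints today mood stable",
         "attended appointment mood stable"]
labels = [1, 1, 0, 0]
model = train_nb(sents, labels)
print(predict_nb(model, "ongoing pain in the back"))   # → 1
```

A real pipeline would add held-out evaluation and confidence intervals, as the study reports.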
Normalisation of imprecise temporal expressions extracted from text
Information extraction systems and techniques have been widely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations the meaning of temporal expressions (timexes) is imprecise, such as in “less than 2 years” and “several weeks”, and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are not designed to be applied to specific scenarios and are difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex, and a weighted F1-score (F1_3D) as a complementary metric to identify relevant differences when comparing two normalisation models. We apply the proposed methodology to three distinct classes of imprecise timexes, and the resulting models give distinct insights into the way each kind of temporal expression is interpreted.
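A trapezoidal fuzzy membership function of the kind the abstract describes can be written in a few lines. The parameters below for "less than 2 years" are hypothetical; the actual models in the paper were fitted to human survey responses:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership: rises over [a, b], equals 1 on [b, c],
    falls over [c, d], and is 0 outside (a, d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical membership model for "less than 2 years", in months.
less_than_2_years = lambda m: trapezoid(m, 0, 1, 18, 24)
print(less_than_2_years(12))   # → 1.0 (clearly within the expression)
print(less_than_2_years(21))   # → 0.5 (partially compatible)
```

The hexagonal variant mentioned in the abstract simply adds two more breakpoints on each slope.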
Development of a Knowledge Graph Embeddings Model for Pain
Pain is a complex concept that can interconnect with other concepts, such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations as an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain that are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way can be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records and combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
Comment: Accepted at AMIA 2023, New Orleans
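The abstract does not specify which embedding model was used; TransE is one common choice for subject-object link prediction, scoring a triple (h, r, t) by how close h + r lands to t in the vector space. The toy 2-d embeddings and concept names below are hand-set for illustration (a trained model would learn them from the graph):

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: higher (less negative) means h + r ≈ t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hand-set toy embeddings; names are illustrative, not SNOMED CT codes.
entities = {"chest pain": [0.0, 0.0],
            "aspirin":    [1.0, 1.0],
            "migraine":   [2.0, 0.0]}
relations = {"relieved_by": [1.0, 1.0]}

# Link prediction: rank candidate objects for (chest pain, relieved_by, ?).
h = entities["chest pain"]
r = relations["relieved_by"]
ranked = sorted(entities,
                key=lambda e: transe_score(h, r, entities[e]),
                reverse=True)
print(ranked[0])   # → aspirin
```

Evaluation on a link prediction task then checks how highly the true object is ranked (e.g. hits@k, mean reciprocal rank).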
Initial 4D seismic results after CO2 injection start-up at the Aquistore storage site
The first post-CO2-injection 3D time-lapse seismic survey was conducted at the Aquistore CO2 storage site in February 2016 using the same permanent array of buried geophones used for acquisition of three previous pre-CO2-injection surveys from March 2012 to November 2013. By February 2016, 36 kilotons of CO2 had been injected within the reservoir between 3170 and 3370 m depth. We have developed time-lapse results from analysis of the first post-CO2-injection data set and the three pre-CO2-injection data sets. The objective of our analysis was to evaluate the ability of the permanent array to detect the injected CO2. A “4D-friendly simultaneous” processing flow was applied to the data in an effort to maximize repeatability between the pre- and post-CO2-injection volumes while optimizing the final subsurface image, including the reservoir. Excellent repeatability was achieved among all surveys, with global normalized root-mean-square (Gnrms) values of 1.13–1.19 for the raw prestack data relative to the baseline data, which decreased during processing to approximately 0.10 for the final cross-equalized migrated data volumes. A zone of high normalized root-mean-square (nrms) values (0.11–0.25, compared with background values of 0.05–0.10) is identified within the upper Deadwood unit of the storage reservoir, which likely corresponds to approximately 18 kilotons of CO2. No significant nrms anomalies are observed within the other reservoir units, due to a combination of reduced seismic sensitivity, higher background nrms values, and/or small quantities of CO2 residing within these zones.
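The nrms values quoted above follow the standard repeatability metric for time-lapse seismic: twice the RMS of the trace difference, divided by the sum of the two traces' RMS amplitudes (0 for identical traces, 2 for opposite polarity). A minimal sketch, with invented sample traces rather than Aquistore data:

```python
import math

def rms(trace):
    """Root-mean-square amplitude of a trace."""
    return math.sqrt(sum(s * s for s in trace) / len(trace))

def nrms(a, b):
    """Normalized RMS difference: 0 = identical traces, ~1.4 = uncorrelated,
    2 = identical but opposite polarity."""
    diff = [x - y for x, y in zip(a, b)]
    return 2.0 * rms(diff) / (rms(a) + rms(b))

# Invented baseline and monitor traces with a small amplitude change.
base    = [0.0, 1.0, 0.5, -1.0, -0.3]
monitor = [0.0, 0.9, 0.55, -1.05, -0.25]
print(round(nrms(base, monitor), 3))   # → 0.087, i.e. highly repeatable
```

Values in the 0.05–0.10 background range reported in the abstract indicate very good repeatability by this metric.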
Sample Size in Natural Language Processing within Healthcare Research
Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code in the database, diabetes mellitus without mention of complication.
Smaller sample sizes gave better results with a K-nearest-neighbours classifier, whereas larger sample sizes gave better results with support vector machines and BERT models. Overall, a sample size larger than 1,000 was sufficient to provide decent performance metrics.
The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be adapted for sample size estimation with other datasets.
Comment: Submitted to the Journal of Biomedical Informatics
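The simulation loop the abstract describes (train at several sample sizes, record test performance) has a simple general shape. The sketch below uses deterministic synthetic one-feature "documents" and a nearest-centroid stand-in for the classifiers benchmarked in the study, purely to show the loop structure:

```python
def nearest_centroid_fit(xs, ys):
    """Fit a one-feature nearest-centroid classifier (a toy stand-in for
    KNN / SVM / BERT used in the study)."""
    c0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    c1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return c0, c1

def predict(model, x):
    c0, c1 = model
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def simulate(n):
    """Train on n synthetic documents per class; return test accuracy."""
    xs = [i / n for i in range(n)] + [2 + i / n for i in range(n)]
    ys = [0] * n + [1] * n
    model = nearest_centroid_fit(xs, ys)
    test = [(0.4, 0), (0.9, 0), (2.1, 1), (2.8, 1)]
    return sum(predict(model, x) == y for x, y in test) / len(test)

# Repeat the experiment across sample sizes, as in the paper's simulations.
for n in (10, 100, 1000):
    print(n, simulate(n))
```

A real replication would substitute MIMIC-III text features, the study's classifiers, and F1 with confidence intervals for the accuracy used here.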
Identifying encephalopathy in patients admitted to an intensive care unit: Going beyond structured information using natural language processing
Background: Encephalopathy is a severe co-morbid condition in critically ill patients that encompasses different clinical constellations of neurological symptoms. However, even its most recognised form, delirium, is rarely recorded in the structured fields of electronic health records, precluding large and unbiased retrospective studies. We aimed to identify patients with encephalopathy using a machine-learning-based approach over clinical notes in electronic health records.
Methods: We used a list of ICD-9 codes and clinical concepts related to encephalopathy to define a cohort of patients from the MIMIC-III dataset. Clinical notes were annotated with MedCAT and vectorised with a bag-of-words approach or word embeddings, using clinical concepts normalised to standard nomenclatures as features. Machine learning algorithms (support vector machines and random forests) trained on clinical notes from patients who had a diagnosis of encephalopathy (defined by ICD-9 codes) were used to classify patients with clinical concepts related to encephalopathy in their clinical notes but without any relevant ICD-9 code. A random selection of 50 patients was reviewed by a clinical expert for model validation.
Results: Among 46,520 different patients, 7.5% had encephalopathy-related ICD-9 codes in all their admissions (group 1, definite encephalopathy), 45% had clinical concepts related to encephalopathy only in their clinical notes (group 2, possible encephalopathy), and 38% had encephalopathy-related concepts neither in structured fields nor in clinical notes (group 3, non-encephalopathy). Length of stay, mortality rate and number of co-morbid conditions were higher in groups 1 and 2 than in group 3. The best model for classifying patients from group 2 as having encephalopathy (an SVM using embeddings) had an F1 of 85% and predicted 31% of patients from group 2 as having encephalopathy with a probability >90%. Validation on new cases found a precision ranging from 92% to 98% depending on the criteria considered.
Conclusions: Natural language processing techniques can leverage relevant clinical information that might help to identify patients with under-recognised clinical disorders such as encephalopathy. In the MIMIC dataset, this approach identifies, with high probability, thousands of patients who did not have a formal diagnosis in the structured information of the EHR.
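The bag-of-words step in the methods (notes annotated with concepts, then vectorised over a fixed vocabulary) can be sketched generically. The concept names below are illustrative placeholders; the study used concepts normalised to standard nomenclatures by MedCAT:

```python
from collections import Counter

def bag_of_concepts(notes, vocab):
    """Turn each note (a list of normalised concept codes, e.g. produced by
    an annotation tool) into a count vector over a fixed vocabulary."""
    vectors = []
    for note in notes:
        counts = Counter(note)
        vectors.append([counts.get(c, 0) for c in vocab])
    return vectors

# Illustrative concept vocabulary; a real pipeline would use nomenclature ids.
vocab = ["delirium", "confusion", "sepsis"]
notes = [["delirium", "confusion", "delirium"],   # encephalopathy-like note
         ["sepsis"]]                               # unrelated note
print(bag_of_concepts(notes, vocab))   # → [[2, 1, 0], [0, 0, 1]]
```

These vectors (or word embeddings, the study's alternative) then feed the SVM or random forest classifiers.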
Monitoring young associations and open clusters with Kepler in two-wheel mode
We outline a proposal to use the Kepler spacecraft in two-wheel mode to
monitor a handful of young associations and open clusters, for a few weeks
each. Judging from the experience of similar projects using ground-based
telescopes and the CoRoT spacecraft, this program would transform our
understanding of early stellar evolution through the study of pulsations,
rotation, activity, the detection and characterisation of eclipsing binaries,
and the possible detection of transiting exoplanets. Importantly, Kepler's wide
field-of-view would enable key spatially extended, nearby regions to be
monitored in their entirety for the first time, and the proposed observations
would exploit unique synergies with the GAIA ESO spectroscopic survey and, in
the longer term, the GAIA mission itself. We also outline possible strategies
for optimising the photometric performance of Kepler in two-wheel mode by
modelling pixel sensitivity variations and other systematics.Comment: 10 pages, 6 figures, white paper submitted in response to NASA call
for community input for alternative science investigations for the Kepler
spacecraf