771 research outputs found
Building a semantically annotated corpus of clinical texts
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for the development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, and its value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.
A review of energy law education in the UK
This article focuses on reviewing energy law education in the UK. For such a fast-growing discipline it is important to reflect on the features that give cohesiveness to its curriculum development: how it is taught; who is teaching it and where it is being taught; and what content the curriculum covers. Is it, for example, national in focus or international, or both? A recent review of the state of energy law education in the US demonstrates the scale and ambition of energy law education in that country. This article complements that exercise by providing a review of energy law education in the UK as at 2016. By comparing and contrasting the two approaches, we can glean some distinctive features of the UK approach. More research is needed on energy law education, but from this article it is clear that energy law has taken a foothold in legal education in the UK.
Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach
Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to their ambiguous nature. This project uses data from an anonymised mental health electronic health records database. The data are used to train a machine-learning-based classification algorithm to classify sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs for further studies on pain and mental health. 1,985 documents were manually triple-annotated to create gold-standard training data, which was used to train three commonly used classification algorithms. The best-performing model achieved an F1-score of 0.98 (95% CI 0.98–0.99).
Comment: 5 pages, 2 tables, submitted to the MEDINFO 2023 conference
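The abstract does not name the three classifiers that were trained on the gold-standard data. As a minimal, self-contained illustration of sentence-level pain classification, a multinomial Naive Bayes model over a toy bag-of-words might look like the sketch below (all sentences and labels are invented, not drawn from the actual records):

```python
from collections import Counter
import math

def train_nb(sentences, labels):
    """Train a multinomial Naive Bayes model for binary labels {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for sent, y in zip(sentences, labels):
        word_counts[y].update(sent.lower().split())
    vocab = set(w for c in word_counts.values() for w in c)
    return word_counts, class_counts, vocab

def predict_nb(model, sentence):
    """Pick the class with the highest add-one-smoothed log posterior."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y in class_counts:
        lp = math.log(class_counts[y] / total)            # class prior
        denom = sum(word_counts[y].values()) + len(vocab)
        for w in sentence.lower().split():
            if w in vocab:                                # skip unseen words
                lp += math.log((word_counts[y][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Invented toy sentences: 1 = discusses patient pain, 0 = does not.
sents = ["patient reports severe back pain",
         "chronic pain in the left knee",
         "no complaints today mood stable",
         "attended appointment mood stable"]
labels = [1, 1, 0, 0]
model = train_nb(sents, labels)
print(predict_nb(model, "ongoing pain in the back"))   # → 1
```

A real pipeline would add held-out evaluation and confidence intervals, as the study reports.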
Normalisation of imprecise temporal expressions extracted from text
Information extraction systems and techniques have been widely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations the meaning of temporal expressions (timexes) is imprecise, such as in “less than 2 years” and “several weeks”, and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are not designed to be applied to specific scenarios and are difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex, and a weighted F1-score (F1_3D) as a complementary metric to identify relevant differences when comparing two normalisation models. We apply the proposed methodology to three distinct classes of imprecise timexes, and the resulting models give distinct insights into the way each kind of temporal expression is interpreted.
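A trapezoidal fuzzy membership function of the kind the abstract describes can be written in a few lines. The parameters below for "less than 2 years" are hypothetical; the actual models in the paper were fitted to human survey responses:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership: rises over [a, b], equals 1 on [b, c],
    falls over [c, d], and is 0 outside (a, d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical membership model for "less than 2 years", in months.
less_than_2_years = lambda m: trapezoid(m, 0, 1, 18, 24)
print(less_than_2_years(12))   # → 1.0 (clearly within the expression)
print(less_than_2_years(21))   # → 0.5 (partially compatible)
```

The hexagonal variant mentioned in the abstract simply adds two more breakpoints on each slope.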
Development of a Knowledge Graph Embeddings Model for Pain
Pain is a complex concept that can interconnect with other concepts, such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations as an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain that are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way can be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records and combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
Comment: Accepted at AMIA 2023, New Orleans
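The abstract does not specify which embedding model was used; TransE is one common choice for subject-object link prediction, scoring a triple (h, r, t) by how close h + r lands to t in the vector space. The toy 2-d embeddings and concept names below are hand-set for illustration (a trained model would learn them from the graph):

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: higher (less negative) means h + r ≈ t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hand-set toy embeddings; names are illustrative, not SNOMED CT codes.
entities = {"chest pain": [0.0, 0.0],
            "aspirin":    [1.0, 1.0],
            "migraine":   [2.0, 0.0]}
relations = {"relieved_by": [1.0, 1.0]}

# Link prediction: rank candidate objects for (chest pain, relieved_by, ?).
h = entities["chest pain"]
r = relations["relieved_by"]
ranked = sorted(entities,
                key=lambda e: transe_score(h, r, entities[e]),
                reverse=True)
print(ranked[0])   # → aspirin
```

Evaluation on a link prediction task then checks how highly the true object is ranked (e.g. hits@k, mean reciprocal rank).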
Initial 4D seismic results after CO2 injection start-up at the Aquistore storage site
The first post-CO2-injection 3D time-lapse seismic survey was conducted at the Aquistore CO2 storage site in February 2016 using the same permanent array of buried geophones used for acquisition of three previous pre-CO2-injection surveys from March 2012 to November 2013. By February 2016, 36 kilotons of CO2 had been injected within the reservoir between 3170 and 3370 m depth. We have developed time-lapse results from analysis of the first post-CO2-injection data set and the three pre-CO2-injection data sets. The objective of our analysis was to evaluate the ability of the permanent array to detect the injected CO2. A “4D-friendly simultaneous” processing flow was applied to the data in an effort to maximize repeatability between the pre- and post-CO2-injection volumes while optimizing the final subsurface image, including the reservoir. Excellent repeatability was achieved among all surveys, with global normalized root-mean-square (Gnrms) values of 1.13–1.19 for the raw prestack data relative to the baseline data, which decreased during processing to approximately 0.10 for the final cross-equalized migrated data volumes. A zone of high normalized root-mean-square (nrms) values (0.11–0.25, compared with background values of 0.05–0.10) is identified within the upper Deadwood unit of the storage reservoir, which likely corresponds to approximately 18 kilotons of CO2. No significant nrms anomalies are observed within the other reservoir units, due to a combination of reduced seismic sensitivity, higher background nrms values, and/or small quantities of CO2 residing within these zones.
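The nrms values quoted above follow the standard repeatability metric for time-lapse seismic: twice the RMS of the trace difference, divided by the sum of the two traces' RMS amplitudes (0 for identical traces, 2 for opposite polarity). A minimal sketch, with invented sample traces rather than Aquistore data:

```python
import math

def rms(trace):
    """Root-mean-square amplitude of a trace."""
    return math.sqrt(sum(s * s for s in trace) / len(trace))

def nrms(a, b):
    """Normalized RMS difference: 0 = identical traces, ~1.4 = uncorrelated,
    2 = identical but opposite polarity."""
    diff = [x - y for x, y in zip(a, b)]
    return 2.0 * rms(diff) / (rms(a) + rms(b))

# Invented baseline and monitor traces with a small amplitude change.
base    = [0.0, 1.0, 0.5, -1.0, -0.3]
monitor = [0.0, 0.9, 0.55, -1.05, -0.25]
print(round(nrms(base, monitor), 3))   # → 0.087, i.e. highly repeatable
```

Values in the 0.05–0.10 background range reported in the abstract indicate very good repeatability by this metric.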
Sample Size in Natural Language Processing within Healthcare Research
Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code in the database, diabetes mellitus without mention of complication.
Smaller sample sizes gave better results with a K-nearest-neighbours classifier, whereas larger sample sizes gave better results with support vector machines and BERT models. Overall, a sample size larger than 1,000 was sufficient to provide decent performance metrics.
The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be adapted for sample size estimation with other datasets.
Comment: Submitted to the Journal of Biomedical Informatics
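The simulation loop the abstract describes (train at several sample sizes, record test performance) has a simple general shape. The sketch below uses deterministic synthetic one-feature "documents" and a nearest-centroid stand-in for the classifiers benchmarked in the study, purely to show the loop structure:

```python
def nearest_centroid_fit(xs, ys):
    """Fit a one-feature nearest-centroid classifier (a toy stand-in for
    KNN / SVM / BERT used in the study)."""
    c0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    c1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return c0, c1

def predict(model, x):
    c0, c1 = model
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def simulate(n):
    """Train on n synthetic documents per class; return test accuracy."""
    xs = [i / n for i in range(n)] + [2 + i / n for i in range(n)]
    ys = [0] * n + [1] * n
    model = nearest_centroid_fit(xs, ys)
    test = [(0.4, 0), (0.9, 0), (2.1, 1), (2.8, 1)]
    return sum(predict(model, x) == y for x, y in test) / len(test)

# Repeat the experiment across sample sizes, as in the paper's simulations.
for n in (10, 100, 1000):
    print(n, simulate(n))
```

A real replication would substitute MIMIC-III text features, the study's classifiers, and F1 with confidence intervals for the accuracy used here.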
Identifying encephalopathy in patients admitted to an intensive care unit: Going beyond structured information using natural language processing
Background: Encephalopathy is a severe co-morbid condition in critically ill patients that encompasses different clinical constellations of neurological symptoms. However, even its most recognised form, delirium, is rarely recorded in the structured fields of electronic health records, precluding large and unbiased retrospective studies. We aimed to identify patients with encephalopathy using a machine-learning-based approach over clinical notes in electronic health records.
Methods: We used a list of ICD-9 codes and clinical concepts related to encephalopathy to define a cohort of patients from the MIMIC-III dataset. Clinical notes were annotated with MedCAT and vectorised with a bag-of-words approach or word embeddings, using clinical concepts normalised to standard nomenclatures as features. Machine learning algorithms (support vector machines and random forests) trained on clinical notes from patients who had a diagnosis of encephalopathy (defined by ICD-9 codes) were used to classify patients with clinical concepts related to encephalopathy in their clinical notes but without any relevant ICD-9 code. A random selection of 50 patients was reviewed by a clinical expert for model validation.
Results: Among 46,520 different patients, 7.5% had encephalopathy-related ICD-9 codes in all their admissions (group 1, definite encephalopathy), 45% had clinical concepts related to encephalopathy only in their clinical notes (group 2, possible encephalopathy), and 38% had encephalopathy-related concepts neither in structured fields nor in clinical notes (group 3, non-encephalopathy). Length of stay, mortality rate and number of co-morbid conditions were higher in groups 1 and 2 than in group 3. The best model for classifying patients from group 2 as having encephalopathy (an SVM using embeddings) had an F1 of 85% and predicted 31% of patients from group 2 as having encephalopathy with a probability >90%. Validation on new cases found a precision ranging from 92% to 98% depending on the criteria considered.
Conclusions: Natural language processing techniques can leverage relevant clinical information that might help to identify patients with under-recognised clinical disorders such as encephalopathy. In the MIMIC dataset, this approach identifies, with high probability, thousands of patients who did not have a formal diagnosis in the structured information of the EHR.
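The bag-of-words step in the methods (notes annotated with concepts, then vectorised over a fixed vocabulary) can be sketched generically. The concept names below are illustrative placeholders; the study used concepts normalised to standard nomenclatures by MedCAT:

```python
from collections import Counter

def bag_of_concepts(notes, vocab):
    """Turn each note (a list of normalised concept codes, e.g. produced by
    an annotation tool) into a count vector over a fixed vocabulary."""
    vectors = []
    for note in notes:
        counts = Counter(note)
        vectors.append([counts.get(c, 0) for c in vocab])
    return vectors

# Illustrative concept vocabulary; a real pipeline would use nomenclature ids.
vocab = ["delirium", "confusion", "sepsis"]
notes = [["delirium", "confusion", "delirium"],   # encephalopathy-like note
         ["sepsis"]]                               # unrelated note
print(bag_of_concepts(notes, vocab))   # → [[2, 1, 0], [0, 0, 1]]
```

These vectors (or word embeddings, the study's alternative) then feed the SVM or random forest classifiers.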
Monitoring young associations and open clusters with Kepler in two-wheel mode
We outline a proposal to use the Kepler spacecraft in two-wheel mode to
monitor a handful of young associations and open clusters, for a few weeks
each. Judging from the experience of similar projects using ground-based
telescopes and the CoRoT spacecraft, this program would transform our
understanding of early stellar evolution through the study of pulsations,
rotation, activity, the detection and characterisation of eclipsing binaries,
and the possible detection of transiting exoplanets. Importantly, Kepler's wide
field-of-view would enable key spatially extended, nearby regions to be
monitored in their entirety for the first time, and the proposed observations
would exploit unique synergies with the GAIA ESO spectroscopic survey and, in
the longer term, the GAIA mission itself. We also outline possible strategies
for optimising the photometric performance of Kepler in two-wheel mode by
modelling pixel sensitivity variations and other systematics.Comment: 10 pages, 6 figures, white paper submitted in response to NASA call
for community input for alternative science investigations for the Kepler
spacecraf