267 research outputs found
Recommended from our members
Chromatin signature of widespread monoallelic expression
In mammals, numerous autosomal genes are subject to mitotically stable monoallelic expression (MAE), including genes that play critical roles in a variety of human diseases. Due to challenges posed by the clonal nature of MAE, very little is known about its regulation; in particular, no molecular features have been specifically linked to MAE. In this study, we report an approach that distinguishes MAE genes in human cells with great accuracy: a chromatin signature consisting of chromatin marks associated with active transcription (H3K36me3) and silencing (H3K27me3) simultaneously occurring in the gene body. The MAE signature is present in ∼20% of ubiquitously expressed genes and over 30% of tissue-specific genes across cell types. Notably, it is enriched among key developmental genes that have bivalent chromatin structure in pluripotent cells. Our results open a new approach to the study of MAE that is independent of polymorphisms, and suggest that MAE is linked to cell differentiation. DOI: http://dx.doi.org/10.7554/eLife.01256.00
Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
Chromatin Signature Identifies Monoallelic Gene Expression Across Mammalian Cell Types
Monoallelic expression of autosomal genes (MAE) is a widespread epigenetic phenomenon which is poorly understood, due in part to current limitations of genome-wide approaches for assessing it. Recently, we reported that a specific histone modification signature is strongly associated with MAE and demonstrated that it can serve as a proxy of MAE in human lymphoblastoid cells. Here, we use murine cells to establish that this chromatin signature is conserved between mouse and human and is associated with MAE in multiple cell types. Our analyses reveal extensive conservation in the identity of MAE genes between the two species. By analyzing MAE chromatin signature in a large number of cell and tissue types, we show that it remains consistent during terminal cell differentiation and is predominant among cell-type specific genes, suggesting a link between MAE and specification of cell identity.
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Radiotherapy (RT) toxicities can impair survival and quality-of-life, yet
remain under-studied. Real-world evidence holds potential to improve our
understanding of toxicities, but toxicity information is often only in clinical
notes. We developed natural language processing (NLP) models to identify the
presence and severity of esophagitis from notes of patients treated with
thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for
three esophagitis classification tasks: Task 1) presence of esophagitis, Task
2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade
2-3. Transferability was tested on 345 notes from patients with esophageal
cancer undergoing RT.
Fine-tuning PubmedBERT yielded the best performance. The best macro-F1 was
0.92, 0.82, and 0.74 for Task 1, 2, and 3, respectively. Selecting the most
informative note sections during fine-tuning improved macro-F1 by over 2% for
all tasks. Silver-labeled data improved the macro-F1 by over 3% across all
tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and
0.65 for Task 1, 2, and 3, respectively, without additional fine-tuning.
To our knowledge, this is the first effort to automatically extract
esophagitis toxicity severity according to CTCAE guidelines from clinic notes.
The promising performance provides proof-of-concept for NLP-based automated
detailed toxicity monitoring in expanded domains.
Comment: 17 pages, 6 tables, 1 figure, submitting to JCO-CCI for review
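The macro-F1 figures reported in this abstract are unweighted means of per-class F1 scores. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `macro_f1` and the label strings are hypothetical and chosen only to mirror the three-class Task 3 setup.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one label")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only (not data from the study)
gold = ["none", "none", "grade1", "grade1", "grade2-3", "grade2-3"]
pred = ["none", "none", "grade1", "grade2-3", "grade2-3", "grade2-3"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a rare class such as severe esophagitis influences macro-F1 as much as a common one.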
Automatic Prediction of Rheumatoid Arthritis Disease Activity from the Electronic Medical Records
Objective: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. Materials and Methods: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values. Results: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers. Conclusion: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
Extracting information from the text of electronic medical records to improve case detection: a systematic review
Background: Electronic medical records (EMRs) are revolutionizing health-related research. One key issue for study quality is the accurate identification of patients with the condition of interest. Information in EMRs can be entered as structured codes or unstructured free text. The majority of research studies have used only coded parts of EMRs for case-detection, which may bias findings, miss cases, and reduce study quality. This review examines whether incorporating information from text into case-detection algorithms can improve research quality.
Methods: A systematic search returned 9659 papers, 67 of which reported on the extraction of information from free text of EMRs with the stated purpose of detecting cases of a named clinical condition. Methods for extracting information from text and the technical accuracy of case-detection algorithms were reviewed.
Results: Studies mainly used US hospital-based EMRs, and extracted information from text for 41 conditions using keyword searches, rule-based algorithms, and machine learning methods. There was no clear difference in case-detection algorithm accuracy between rule-based and machine learning methods of extraction. Inclusion of information from text resulted in a significant improvement in algorithm sensitivity and area under the receiver operating characteristic in comparison to codes alone (median sensitivity 78% (codes + text) vs 62% (codes), P = .03; median area under the receiver operating characteristic 95% (codes + text) vs 88% (codes), P = .025).
Conclusions: Text in EMRs is accessible, especially with open source information extraction algorithms, and significantly improves case detection when combined with codes. More harmonization of reporting within EMR studies is needed, particularly standardized reporting of algorithm accuracy metrics like positive predictive value (precision) and sensitivity (recall).
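The two metrics named in this abstract, sensitivity (recall) and positive predictive value (precision), can be sketched as follows. This is an illustrative sketch only: the function name `sensitivity_and_ppv` and the patient identifiers are hypothetical and do not come from the review.

```python
def sensitivity_and_ppv(true_cases, detected_cases):
    """Compare a case-detection result against a reference standard.

    Both arguments are sets of patient identifiers.
    Returns (sensitivity, positive predictive value).
    """
    tp = len(true_cases & detected_cases)   # correctly detected cases
    fn = len(true_cases - detected_cases)   # missed cases
    fp = len(detected_cases - true_cases)   # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Hypothetical example: adding text-derived cases to code-derived cases
reference = {1, 2, 3, 4, 5}
codes_only = {1, 2, 3, 9}
codes_plus_text = codes_only | {4}

sens_codes, ppv_codes = sensitivity_and_ppv(reference, codes_only)
sens_both, ppv_both = sensitivity_and_ppv(reference, codes_plus_text)
```

In this toy example, adding one text-derived case raises sensitivity because a previously missed patient is now detected.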
Association of Diabetic Ketoacidosis and HbA1c at Onset with Year-Three HbA1c in Children and Adolescents with Type 1 Diabetes: Data from the International SWEET Registry
Objective: To establish whether diabetic ketoacidosis (DKA) or HbA1c at onset is associated with year-three HbA1c in children with type 1 diabetes (T1D).
Methods: Children with T1D from the SWEET registry, diagnosed <18 years, with documented clinical presentation, HbA1c at onset and follow-up were included. Participants were categorized according to T1D onset: (a) DKA (DKA with coma, DKA without coma, no DKA); (b) HbA1c at onset (low [<10%], medium [10 to <12%], high [≥12%]). To adjust for demographics, linear regression was applied with interaction terms for DKA and HbA1c at onset groups (adjusted means with 95% CI). Association between year-three HbA1c and both HbA1c and presentation at onset was analyzed (Vuong test).
Results: Among 1420 children (54% males; median age at onset 9.1 years [Q1;Q3: 5.8;12.2]), 6% of children experienced DKA with coma, 37% DKA without coma, and 57% no DKA. Year-three HbA1c was lower in the low compared to high HbA1c at onset group, both in the DKA without coma (7.1% [6.8;7.4] vs 7.6% [7.5;7.8], P = .03) and in the no DKA group (7.4% [7.2;7.5] vs 7.8% [7.6;7.9], P = .01), without differences between low and medium HbA1c at onset groups. Year-three HbA1c did not differ among HbA1c at onset groups in the DKA with coma group. HbA1c at onset as an explanatory variable was more closely associated with year-three HbA1c compared to presentation at onset groups (P = .02).
Conclusions: Year-three HbA1c is more closely related to HbA1c than to DKA at onset; earlier hyperglycemia detection might be crucial to improving year-three HbA1c.
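The HbA1c-at-onset grouping described in the Methods (low [<10%], medium [10 to <12%], high [≥12%]) can be written as a small helper. This is a sketch based only on the thresholds stated in the abstract; the function name `hba1c_onset_group` is hypothetical.

```python
def hba1c_onset_group(hba1c_percent):
    """Map an HbA1c value at onset (in %) to the group used in the abstract."""
    if hba1c_percent < 10:
        return "low"
    if hba1c_percent < 12:
        return "medium"
    return "high"


groups = [hba1c_onset_group(v) for v in (9.9, 10.0, 11.9, 12.0)]
```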
Clinical narrative analytics challenges
Precision medicine, or evidence-based medicine, relies on
the extraction of knowledge from medical records to provide individuals
with the appropriate treatment at the appropriate moment according to
the patient's features. Despite efforts to use clinical narratives for
clinical decision support, many challenges remain today,
such as multilinguality, diversity of terms and formats across services,
acronyms, and negation, to name but a few. The same problems arise
when analyzing narratives in the literature, whose analysis would
provide physicians and researchers with highlights. In this talk we
discuss challenges, solutions, and open problems, and review several
frameworks and tools that perform NLP over free text to
extract medical entities by means of Named Entity Recognition.
We also present a framework we have developed to extract and validate
medical terms. In particular, we present two use cases: (i) extraction
of medical entities from a set of infectious disease description texts provided
by MedlinePlus and (ii) identification of stroke scales in clinical
narratives written in Spanish.