68 research outputs found

    Unsupervised grammar induction of clinical report sublanguage

    Get PDF
    BACKGROUND: Clinical reports are written using a subset of natural language while employing many domain-specific terms; such a language is also known as a sublanguage for a scientific or a technical domain. Different genres of clinical reports use different sublaguages, and in addition, different medical facilities use different medical language conventions. This makes supervised training of a parser for clinical sentences very difficult as it would require expensive annotation effort to adapt to every type of clinical text. METHODS: In this paper, we present an unsupervised method which automatically induces a grammar and a parser for the sublanguage of a given genre of clinical reports from a corpus with no annotations. In order to capture sentence structures specific to clinical domains, the grammar is induced in terms of semantic classes of clinical terms in addition to part-of-speech tags. Our method induces grammar by minimizing the combined encoding cost of the grammar and the corresponding sentence derivations. The probabilities for the productions of the induced grammar are then learned from the unannotated corpus using an instance of the expectation-maximization algorithm. RESULTS: Our experiments show that the induced grammar is able to parse novel sentences. Using a dataset of discharge summary sentences with no annotations, our method obtains 60.5% F-measure for parse-bracketing on sentences of maximum length 10. By varying a parameter, the method can induce a range of grammars, from very specific to very general, and obtains the best performance in between the two extremes

    Clinical Term Normalization Using Learned Edit Patterns and Subconcept Matching: System Development and Evaluation

    Get PDF
    Background: Clinical terms mentioned in clinical text are often not in their standardized forms as listed in clinical terminologies because of linguistic and stylistic variations. However, many automated downstream applications require clinical terms mapped to their corresponding concepts in clinical terminologies, thus necessitating the task of clinical term normalization. Objective: In this paper, a system for clinical term normalization is presented that utilizes edit patterns to convert clinical terms into their normalized forms. Methods: The edit patterns are automatically learned from the Unified Medical Language System (UMLS) Metathesaurus as well as from the given training data. The edit patterns are generalized sequences of edits that are derived from edit distance computations. The edit patterns are both character based as well as word based and are learned separately for different semantic types. In addition to these edit patterns, the system also normalizes clinical terms through the subconcepts mentioned within them. Results: The system was evaluated as part of the 2019 n2c2 Track 3 shared task of clinical term normalization. It obtained 80.79% accuracy on the standard test data. This paper includes ablation studies to evaluate the contributions of different components of the system. A challenging part of the task was disambiguation when a clinical term could be normalized to multiple concepts. Conclusions: The learned edit patterns led the system to perform well on the normalization task. Given that the system is based on patterns, it is human interpretable and is also capable of giving insights about common variations of clinical terms mentioned in clinical text that are different from their standardized forms

    Learning for clinical named entity recognition without manual annotations

    Get PDF
    Background: Named entity recognition (NER) systems are commonly built using supervised methods that use machine learning to learn from corpora manually annotated with named entities. However, manually annotating corpora is very expensive and laborious. Materials and methods: In this paper, a novel method is presented for training clinical NER systems that does not require any manual annotations. It only requires a raw text corpus and a resource like UMLS that can give a list of named entities along with their semantic types. Using these two resources, annotations are automatically obtained to train machine learning methods. The method was evaluated on the NER shared-task datasets of i2b2 2010 and SemEval 2014. Results: On the SemEval 2014 dataset for recognizing diseases and disorders, the method obtained F-measure of 0.693 for exact matching and of 0.773 allowing overlaps. This is comparable to many supervised systems in the past that had used manual annotations for training. On the i2b2 2010 dataset for recognizing problems, tests and treatments, the method obtained F-measures of 0.451, 0.338 and 0.204 respectively for exact matching and of 0.721, 0.587 and 0.475 respectively allowing overlaps. These results are better than an existing unsupervised method. Conclusions: Experiments on standard datasets showed that the new method performed well. The method is general and could be applied to recognize entities of other types on other genres of text without needing manual annotations

    Genie: A Generator of Natural Language Semantic Parsers for Virtual Assistant Commands

    Full text link
    To understand diverse natural language commands, virtual assistants today are trained with numerous labor-intensive, manually annotated sentences. This paper presents a methodology and the Genie toolkit that can handle new compound commands with significantly less manual effort. We advocate formalizing the capability of virtual assistants with a Virtual Assistant Programming Language (VAPL) and using a neural semantic parser to translate natural language into VAPL code. Genie needs only a small realistic set of input sentences for validating the neural model. Developers write templates to synthesize data; Genie uses crowdsourced paraphrases and data augmentation, along with the synthesized data, to train a semantic parser. We also propose design principles that make VAPL languages amenable to natural language translation. We apply these principles to revise ThingTalk, the language used by the Almond virtual assistant. We use Genie to build the first semantic parser that can support compound virtual assistants commands with unquoted free-form parameters. Genie achieves a 62% accuracy on realistic user inputs. We demonstrate Genie's generality by showing a 19% and 31% improvement over the previous state of the art on a music skill, aggregate functions, and access control.Comment: To appear in PLDI 201

    NEUROlogical Prognosis After Cardiac Arrest in Kids (NEUROPACK) study: protocol for a prospective multicentre clinical prediction model derivation and validation study in children after cardiac arrest

    Get PDF
    Introduction Currently, we are unable to accurately predict mortality or neurological morbidity following resuscitation after paediatric out of hospital (OHCA) or in-hospital (IHCA) cardiac arrest. A clinical prediction model may improve communication with parents and families and risk stratification of patients for appropriate postcardiac arrest care. This study aims to the derive and validate a clinical prediction model to predict, within 1 hour of admission to the paediatric intensive care unit (PICU), neurodevelopmental outcome at 3 months after paediatric cardiac arrest. Methods and analysis A prospective study of children (age: >24 hours and <16 years), admitted to 1 of the 24 participating PICUs in the UK and Ireland, following an OHCA or IHCA. Patients are included if requiring more than 1 min of cardiopulmonary resuscitation and mechanical ventilation at PICU admission Children who had cardiac arrests in PICU or neonatal intensive care unit will be excluded. Candidate variables will be identified from data submitted to the Paediatric Intensive Care Audit Network registry. Primary outcome is neurodevelopmental status, assessed at 3 months by telephone interview using the Vineland Adaptive Behavioural Score II questionnaire. A clinical prediction model will be derived using logistic regression with model performance and accuracy assessment. External validation will be performed using the Therapeutic Hypothermia After Paediatric Cardiac Arrest trial dataset. We aim to identify 370 patients, with successful consent and follow-up of 150 patients. Patient inclusion started 1 January 2018 and inclusion will continue over 18 months. Ethics and dissemination Ethical review of this protocol was completed by 27 September 2017 at the Wales Research Ethics Committee 5, 17/WA/0306. The results of this study will be published in peer-reviewed journals and presented in conferences. Trial registration number NCT03574025

    Chromosomal microarray testing in adults with intellectual disability presenting with comorbid psychiatric disorders.

    Get PDF
    Chromosomal copy-number variations (CNVs) are a class of genetic variants highly implicated in the aetiology of neurodevelopmental disorders, including intellectual disabilities (ID), schizophrenia and autism spectrum disorders (ASD). Yet the majority of adults with idiopathic ID presenting to psychiatric services have not been tested for CNVs. We undertook genome-wide chromosomal microarray analysis (CMA) of 202 adults with idiopathic ID recruited from community and in-patient ID psychiatry services across England. CNV pathogenicity was assessed using standard clinical diagnostic methods and participants underwent comprehensive medical and psychiatric phenotyping. We found an 11% yield of likely pathogenic CNVs (22/202). CNVs at recurrent loci, including the 15q11-q13 and 16p11.2-p13.11 regions were most frequently observed. We observed an increased frequency of 16p11.2 duplications compared with those reported in single-disorder cohorts. CNVs were also identified in genes known to effect neurodevelopment, namely NRXN1 and GRIN2B. Furthermore deletions at 2q13, 12q21.2-21.31 and 19q13.32, and duplications at 4p16.3, 13q32.3-33.3 and Xq24-25 were observed. Routine CMA in ID psychiatry could uncover ~11% new genetic diagnoses with potential implications for patient management. We advocate greater consideration of CMA in the assessment of adults with idiopathic ID presenting to psychiatry services

    Learning an Executable Neural Semantic Parser

    Get PDF
    This paper describes a neural semantic parser that maps natural language utterances onto logical forms which can be executed against a task-specific environment, such as a knowledge base or a database, to produce a response. The parser generates tree-structured logical forms with a transition-based approach which combines a generic tree-generation algorithm with domain-general operations defined by the logical language. The generation process is modeled by structured recurrent neural networks, which provide a rich encoding of the sentential context and generation history for making predictions. To tackle mismatches between natural language and logical form tokens, various attention mechanisms are explored. Finally, we consider different training settings for the neural semantic parser, including a fully supervised training where annotated logical forms are given, weakly-supervised training where denotations are provided, and distant supervision where only unlabeled sentences and a knowledge base are available. Experiments across a wide range of datasets demonstrate the effectiveness of our parser.Comment: In Journal of Computational Linguistic

    Phenotypic Characterization of EIF2AK4 Mutation Carriers in a Large Cohort of Patients Diagnosed Clinically With Pulmonary Arterial Hypertension.

    Get PDF
    BACKGROUND: Pulmonary arterial hypertension (PAH) is a rare disease with an emerging genetic basis. Heterozygous mutations in the gene encoding the bone morphogenetic protein receptor type 2 (BMPR2) are the commonest genetic cause of PAH, whereas biallelic mutations in the eukaryotic translation initiation factor 2 alpha kinase 4 gene (EIF2AK4) are described in pulmonary veno-occlusive disease/pulmonary capillary hemangiomatosis. Here, we determine the frequency of these mutations and define the genotype-phenotype characteristics in a large cohort of patients diagnosed clinically with PAH. METHODS: Whole-genome sequencing was performed on DNA from patients with idiopathic and heritable PAH and with pulmonary veno-occlusive disease/pulmonary capillary hemangiomatosis recruited to the National Institute of Health Research BioResource-Rare Diseases study. Heterozygous variants in BMPR2 and biallelic EIF2AK4 variants with a minor allele frequency of <1:10 000 in control data sets and predicted to be deleterious (by combined annotation-dependent depletion, PolyPhen-2, and sorting intolerant from tolerant predictions) were identified as potentially causal. Phenotype data from the time of diagnosis were also captured. RESULTS: Eight hundred sixty-four patients with idiopathic or heritable PAH and 16 with pulmonary veno-occlusive disease/pulmonary capillary hemangiomatosis were recruited. Mutations in BMPR2 were identified in 130 patients (14.8%). Biallelic mutations in EIF2AK4 were identified in 5 patients with a clinical diagnosis of pulmonary veno-occlusive disease/pulmonary capillary hemangiomatosis. Furthermore, 9 patients with a clinical diagnosis of PAH carried biallelic EIF2AK4 mutations. These patients had a reduced transfer coefficient for carbon monoxide (Kco; 33% [interquartile range, 30%-35%] predicted) and younger age at diagnosis (29 years; interquartile range, 23-38 years) and more interlobular septal thickening and mediastinal lymphadenopathy on computed tomography of the chest compared with patients with PAH without EIF2AK4 mutations. However, radiological assessment alone could not accurately identify biallelic EIF2AK4 mutation carriers. Patients with PAH with biallelic EIF2AK4 mutations had a shorter survival. CONCLUSIONS: Biallelic EIF2AK4 mutations are found in patients classified clinically as having idiopathic and heritable PAH. These patients cannot be identified reliably by computed tomography, but a low Kco and a young age at diagnosis suggests the underlying molecular diagnosis. Genetic testing can identify these misclassified patients, allowing appropriate management and early referral for lung transplantation

    Telomerecat: A ploidy-agnostic method for estimating telomere length from whole genome sequencing data.

    Get PDF
    Telomere length is a risk factor in disease and the dynamics of telomere length are crucial to our understanding of cell replication and vitality. The proliferation of whole genome sequencing represents an unprecedented opportunity to glean new insights into telomere biology on a previously unimaginable scale. To this end, a number of approaches for estimating telomere length from whole-genome sequencing data have been proposed. Here we present Telomerecat, a novel approach to the estimation of telomere length. Previous methods have been dependent on the number of telomeres present in a cell being known, which may be problematic when analysing aneuploid cancer data and non-human samples. Telomerecat is designed to be agnostic to the number of telomeres present, making it suited for the purpose of estimating telomere length in cancer studies. Telomerecat also accounts for interstitial telomeric reads and presents a novel approach to dealing with sequencing errors. We show that Telomerecat performs well at telomere length estimation when compared to leading experimental and computational methods. Furthermore, we show that it detects expected patterns in longitudinal data, repeated measurements, and cross-species comparisons. We also apply the method to a cancer cell data, uncovering an interesting relationship with the underlying telomerase genotype
    • …
    corecore