10 research outputs found

    Cohort selection for clinical trials from longitudinal patient records: text mining approach

    Background: Clinical trials are an important step in introducing new interventions into clinical practice by generating data on their safety and efficacy. Clinical trials need to ensure that participants are similar so that the findings can be attributed to the interventions studied and not some other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that must be shared by the participants. Unfortunately, the complexities of eligibility criteria may not allow them to be translated directly into readily executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time-consuming, thus negatively affecting the timeliness of the recruitment process. Objective: Track 1 of the 2018 National NLP Clinical Challenge (n2c2) focused on the task of cohort selection for clinical trials with the aim of answering the following question: 'Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials?' The task required the participating systems to analyze longitudinal patient records to determine if the corresponding patients met the given eligibility criteria. This article describes a system developed to address this task. Methods: Our system consists of 13 classifiers, one for each eligibility criterion. All classifiers use a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such a representation, a pattern matching approach is used to extract context-sensitive features. They are embedded back into the text as lexically distinguishable tokens, which are consequently featured in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances were available to learn from.
A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria. Results: The system was evaluated using micro-averaged F-measure. Four machine learning algorithms, including support vector machine, logistic regression, naïve Bayes classifier, and gradient tree boosting, were evaluated on the training data using 10-fold cross-validation. Overall, gradient tree boosting demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. Final evaluation was performed on previously unseen test data. On average, our F-measure of 89.04% was comparable to three of the top-ranked performances in the shared task (91.11%, 90.28% and 90.21%). With an F-measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50% and 70.81%) in identifying patients with advanced coronary artery disease. Conclusions: The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when trained on a relatively small dataset.
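The feature-embedding step described in the Methods section can be sketched as follows. This is a minimal illustration, not the authors' actual system: the patterns and feature token names are hypothetical, and a real pipeline would use a far larger rule set before feeding the counts to a classifier.

```python
import re
from collections import Counter

# Hypothetical context-sensitive patterns; the actual rules used by the
# system are not given in the abstract.
PATTERNS = {
    r"\bno (?:history|evidence) of myocardial infarction\b": "FEAT_NEG_MI",
    r"\bejection fraction of [1-3]\d%\b": "FEAT_LOW_EF",
}

def embed_features(text):
    """Replace pattern matches with lexically distinguishable tokens
    so the context they capture survives the bag-of-words step."""
    for pattern, token in PATTERNS.items():
        text = re.sub(pattern, token, text, flags=re.IGNORECASE)
    return text

def bag_of_words(text):
    """Tokenize the feature-embedded text into a term-count vector."""
    return Counter(re.findall(r"\w+", embed_features(text).lower()))

note = "Patient has no history of myocardial infarction."
print(bag_of_words(note))
```

Because the injected tokens are ordinary (if unusual) words, any off-the-shelf bag-of-words classifier can pick them up without modification; the negation context that plain unigrams would lose is preserved as a single feature.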

    Information Retrieval Systems Adapted to the Biomedical Domain

    The terminology used in biomedicine shows lexical peculiarities that have required the elaboration of terminological resources and information retrieval systems with specific functionalities. The main characteristics are the high rates of synonymy and homonymy, due to phenomena such as the proliferation of polysemic acronyms and their interaction with common language. Information retrieval systems in the biomedical domain use techniques oriented to the treatment of these lexical peculiarities. In this paper we review some of the techniques used in this domain, such as the application of Natural Language Processing (BioNLP), the incorporation of lexical-semantic resources, and the application of Named Entity Recognition (BioNER). Finally, we present the evaluation methods adopted to assess the suitability of these techniques for retrieving biomedical resources. (Comment: 6 pages, 4 tables)
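The synonymy and acronym ambiguity described above are typically handled by linking surface mentions to concept identifiers in a terminological resource. The following sketch uses a toy lexicon to illustrate the idea; the surface forms, the concept ID, and the dictionary-lookup approach are illustrative stand-ins for real resources such as the UMLS Metathesaurus, not part of the reviewed paper.

```python
import re

# Toy lexicon: many surface forms (synonyms, acronyms) map to one
# concept identifier, illustrating the synonymy/homonymy problem.
LEXICON = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "mi": "C_MI",  # ambiguous acronym; real systems disambiguate by context
}

def link_entities(text):
    """Naive dictionary-based entity linking: find lexicon entries as
    whole-word matches and return (mention, concept_id) pairs."""
    lowered = text.lower()
    found = []
    # Longest surface forms first so multi-word mentions win over acronyms.
    for surface in sorted(LEXICON, key=len, reverse=True):
        if re.search(rf"\b{re.escape(surface)}\b", lowered):
            found.append((surface, LEXICON[surface]))
    return found

print(link_entities("Patient admitted with heart attack last year."))
```

The word-boundary match matters: without it, the acronym "mi" would falsely fire inside words such as "admitted", which is exactly the interaction between acronyms and common language that the paper highlights.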

    Protecting patient privacy in distributed collaborative healthcare environments by retaining access control of shared information

    Access control and privacy policies change during the course of collaboration. Information is often shared with collaborators outside of the traditional "perimeterized" organizational computer network. At this point the information owner (in the legal data protection sense) loses persistent control over their information. They cannot modify the policy that controls who accesses it, and have that policy enforced on the information wherever it resides. However, if patient consent is withdrawn, or if the collaboration comes to an end naturally or prematurely, the owner may be required to withdraw further access to their information. This paper presents a system that enhances the way access control technology is currently deployed so that information owners retain control of their access control and privacy policies, even after information has been shared.
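The core idea of owner-retained control can be sketched as a "sticky" policy: the shared record carries a reference to a policy object that the owner can still modify, and every access is checked against the policy's current state. All names and structure below are illustrative assumptions, not the paper's actual design, and the sketch omits the cryptographic enforcement a real deployment would need.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Owner-controlled access policy; the owner keeps a live handle."""
    allowed: set = field(default_factory=set)

    def grant(self, user):
        self.allowed.add(user)

    def revoke(self, user):
        self.allowed.discard(user)

@dataclass
class SharedRecord:
    """A shared record that consults its policy on every read."""
    payload: bytes
    policy: Policy

    def read(self, user):
        if user not in self.policy.allowed:
            raise PermissionError(f"{user} is no longer authorised")
        return self.payload  # decryption/enforcement omitted in this sketch

policy = Policy()
policy.grant("collaborator@partner.org")
record = SharedRecord(b"patient data", policy)
record.read("collaborator@partner.org")    # succeeds while access is granted
policy.revoke("collaborator@partner.org")  # e.g. patient consent withdrawn
# record.read("collaborator@partner.org") would now raise PermissionError
```

The design point is that the policy check happens at access time against the owner's current policy, rather than at sharing time, so revocation takes effect wherever the record resides.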

    100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report.

    BACKGROUND: The U.K. 100,000 Genomes Project is in the process of investigating the role of genome sequencing in patients with undiagnosed rare diseases after usual care and the alignment of this research with health care implementation in the U.K. National Health Service. Other parts of this project focus on patients with cancer and infection. METHODS: We conducted a pilot study involving 4660 participants from 2183 families, among whom 161 disorders covering a broad spectrum of rare diseases were present. We collected data on clinical features with the use of Human Phenotype Ontology terms, undertook genome sequencing, applied automated variant prioritization on the basis of applied virtual gene panels and phenotypes, and identified novel pathogenic variants through research analysis. RESULTS: Diagnostic yields varied among family structures and were highest in family trios (both parents and a proband) and families with larger pedigrees. Diagnostic yields were much higher for disorders likely to have a monogenic cause (35%) than for disorders likely to have a complex cause (11%). Diagnostic yields for intellectual disability, hearing disorders, and vision disorders ranged from 40 to 55%. We made genetic diagnoses in 25% of the probands. A total of 14% of the diagnoses were made by means of the combination of research and automated approaches, which was critical for cases in which we found etiologic noncoding, structural, and mitochondrial genome variants and coding variants poorly covered by exome sequencing. Cohortwide burden testing across 57,000 genomes enabled the discovery of three new disease genes and 19 new associations. Of the genetic diagnoses that we made, 25% had immediate ramifications for clinical decision making for the patients or their relatives. CONCLUSIONS: Our pilot study of genome sequencing in a national health care system showed an increase in diagnostic yield across a range of rare diseases. 
(Funded by the National Institute for Health Research and others.)
