Desiderata for the development of next-generation electronic health record phenotype libraries
Background
High-quality phenotype definitions are desirable to enable the extraction of patient cohorts from large electronic health record repositories and are characterized by properties such as portability, reproducibility, and validity. Phenotype libraries, where definitions are stored, have the potential to contribute significantly to the quality of the definitions they host. In this work, we present a set of desiderata for the design of a next-generation phenotype library that is able to ensure the quality of hosted definitions by combining the functionality currently offered by disparate tooling.
Methods
A group of researchers examined work to date on phenotype models, implementation, and validation, as well as contemporary phenotype libraries developed as a part of their own phenomics communities. Existing phenotype frameworks were also examined. This work was translated and refined by all the authors into a set of best practices.
Results
We present 14 library desiderata that promote high-quality phenotype definitions, in the areas of modelling, logging, validation, and sharing and warehousing.
Conclusions
There are a number of choices to be made when constructing phenotype libraries. Our considerations distil the best practices in the field and include pointers towards their further development to support portable, reproducible, and clinically valid phenotype design. The provision of high-quality phenotype definitions enables electronic health record data to be more effectively used in medical domains.
Generating Reliable and Responsive Observational Evidence: Reducing Pre-analysis Bias
A growing body of evidence generated from observational data has demonstrated the potential to influence decision-making and improve patient outcomes. For observational evidence to be actionable, however, it must be generated reliably and in a timely manner. Large distributed observational data networks enable research on diverse patient populations at scale and support the development of sound new methods to improve the reproducibility and robustness of real-world evidence. Nevertheless, the problems of generalizability, portability and scalability persist and compound. As analytical methods only partially address bias, reliable observational research (especially in networks) must address bias at the design stage (i.e., pre-analysis bias), including the strategies for identifying patients of interest and defining comparators.
This thesis synthesizes and enumerates a set of challenges to addressing pre-analysis bias in observational studies and presents mixed-methods approaches and informatics solutions for overcoming a number of those obstacles. We develop frameworks, methods and tools for scalable and reliable phenotyping including data source granularity estimation, comprehensive concept set selection, index date specification, and structured data-based patient review for phenotype evaluation. We cover the research on potential bias in the unexposed comparator definition including systematic background rates estimation and interpretation, and definition and evaluation of the unexposed comparator.
We propose that the use of standardized approaches and methods as described in this thesis not only improves reliability but also increases responsiveness of observational evidence. To test this hypothesis, we designed and piloted a Data Consult Service, a service that generates new on-demand evidence at the bedside. We demonstrate that it is feasible to generate reliable evidence to address clinicians’ information needs in a robust and timely fashion and provide our analysis of the current limitations and future steps needed to scale such a service.
Cohort Identification Using Semantic Web Technologies: Ontologies and Triplestores as Engines for Complex Computable Phenotyping
Electronic health record (EHR)-based computable phenotypes are algorithms used to identify individuals or populations with clinical conditions or events of interest within a clinical data repository. Due to a lack of EHR data standardization, computable phenotypes can be semantically ambiguous and difficult to share across institutions. In this research, I propose a new computable phenotyping methodological framework based on semantic web technologies, specifically ontologies, the Resource Description Framework (RDF) data format, triplestores, and Web Ontology Language (OWL) reasoning. My hypothesis is that storing and analyzing clinical data using these technologies can begin to address the critical issues of semantic ambiguity and lack of interoperability in the context of computable phenotyping. To test this hypothesis, I compared the performance of two variants of two computable phenotypes (for depression and rheumatoid arthritis, respectively). The first variant of each phenotype used a list of ICD-10-CM codes to define the condition; the second variant used ontology concepts from SNOMED and the Human Phenotype Ontology (HPO). After executing each variant of each phenotype against a clinical data repository, I compared the patients matched in each case to see where the different variants overlapped and diverged. Both the ontologies and the clinical data were stored in an RDF triplestore to allow me to assess the interoperability advantages of the RDF format for clinical data. All tested methods successfully identified cohorts in the data store, with differing rates of overlap and divergence between variants. Depending on the phenotyping use case, SNOMED and HPO’s ability to more broadly define many conditions due to complex relationships between their concepts may be seen as an advantage or a disadvantage. 
I also found that RDF triplestores do indeed provide interoperability advantages, despite being far less commonly used in clinical data applications than relational databases. Although these methods and technologies are not “one-size-fits-all,” the experimental results are encouraging enough for them to (1) be put into practice in combination with existing phenotyping methods or (2) be used on their own for particularly well-suited use cases.
Doctor of Philosoph
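As a rough illustration of the comparison described in this abstract, the sketch below contrasts a cohort defined by a flat code list with one defined by an ontology concept plus its descendants, and then computes overlap and divergence. The hierarchy, concept names, and patient records are hypothetical toy data; they are not the thesis's SNOMED/HPO content or its triplestore queries.

```python
# Toy comparison of two phenotype variants: flat code list vs. ontology subtree.
# All concept names, the hierarchy, and the patient records are hypothetical.

def descendants(concept, children):
    """Return every concept reachable below `concept` in a parent -> children map."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def cohort(patients, accepted):
    """IDs of patients with at least one recorded code in `accepted`."""
    return {pid for pid, codes in patients.items() if codes & accepted}


# Hypothetical ontology fragment (parent -> children).
children = {
    "DepressiveDisorder": ["MajorDepression", "Dysthymia"],
    "MajorDepression": ["RecurrentMajorDepression"],
}

# Hypothetical patient records: each patient has a set of recorded codes/concepts.
patients = {
    "p1": {"F32.9", "MajorDepression"},
    "p2": {"Dysthymia"},
    "p3": {"F33.1"},
    "p4": {"Hypertension"},
}

# Variant 1: flat list of diagnosis codes.
code_list_variant = cohort(patients, {"F32.9", "F33.1"})

# Variant 2: one ontology concept expanded to all of its descendants.
ontology_terms = {"DepressiveDisorder"} | descendants("DepressiveDisorder", children)
ontology_variant = cohort(patients, ontology_terms)

overlap = code_list_variant & ontology_variant
only_code_list = code_list_variant - ontology_variant
only_ontology = ontology_variant - code_list_variant
```

In this toy data the ontology variant picks up a patient recorded only under a child concept, which is the kind of broader matching the abstract says may be an advantage or a disadvantage depending on the use case.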
COHORT IDENTIFICATION FROM FREE-TEXT CLINICAL NOTES USING SNOMED CT’S SEMANTIC RELATIONS
In this paper, a new cohort identification framework that exploits the semantic hierarchy of SNOMED CT is proposed to overcome the limitations of supervised machine learning-based approaches. Eligibility criteria descriptions and free-text clinical notes from the 2018 National NLP Clinical Challenge (n2c2) were processed to map to relevant SNOMED CT concepts and to measure semantic similarity between the eligibility criteria and patients. A patient was deemed eligible if the patient had a similarity score higher than a threshold cut-off value, which was set where the best F1 score could be achieved. The performance of the proposed system was evaluated for three eligibility criteria. The current framework’s macro-average F1 score across three eligibility criteria was higher than the previously reported results of the 2018 n2c2 (0.933 vs. 0.889). This study demonstrated that SNOMED CT alone can be leveraged for cohort identification tasks without referring to external textual sources for training.
Doctor of Philosoph
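The threshold step described above can be sketched as a simple search for the cut-off that maximizes F1. This is a minimal illustration under stated assumptions: the scores and labels are invented, and taking candidate cut-offs as midpoints between adjacent distinct scores is a choice made here, not a detail given in the abstract.

```python
# Pick the similarity cut-off that maximizes F1 on labelled examples.
# Scores and labels below are invented for illustration only.

def f1_score(labels, preds):
    """F1 for binary labels/predictions; returns 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1, where a patient is eligible if score > threshold."""
    unique = sorted(set(scores))
    # Candidate cut-offs: midpoints between adjacent distinct scores (an assumption).
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [s > t for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]          # hypothetical patient-criterion similarities
labels = [True, True, False, True, False, False]  # hypothetical gold-standard eligibility
threshold, f1 = best_threshold(scores, labels)
```

A macro-average, as reported in the abstract, would then be the unweighted mean of the per-criterion F1 values obtained this way.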
Health systems data interoperability and implementation
Objective The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable and exchangeable between healthcare providers.
Data sources Both structured and unstructured data were used in the experiments. The data were collected from two disparate sources, MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data in these systems were not recorded to the same standards and were therefore not comparable: some values conflicted, one system stored an abbreviation of a clinical concept where the other stored the full concept name, and some attributes had missing information. These issues make the data a good candidate for this study. From the identified data sources, laboratory, physical examination, vital signs, and behavioural data were used.
Methods This research used the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for structured data and one for unstructured data. In the first experiment, edit distance, TF-IDF and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and one not coded. Similar sets of data were classified as matches and dissimilar sets as non-matches. The Soundex indexing method was used to reduce the number of potential comparisons. Three classification algorithms were then trained and tested, and the performance of each was evaluated using the ROC curve. The second experiment aimed to extract patients’ smoking status from a clinical corpus. A sequence-oriented classification algorithm, conditional random fields (CRF), was used to learn related concepts from the corpus, with word embedding, random indexing, and word shape features used to capture meaning in the corpus.
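To make the first experiment more concrete, the sketch below shows Soundex blocking followed by an edit-distance similarity, so that only labels in the same phonetic block are compared. It is an illustration under assumptions: the label lists are invented, the similarity is normalized by the longer string's length, and it stands in for, rather than reproduces, the study's trained classifiers and its TF-IDF and Jaro-Winkler weights.

```python
# Soundex blocking + normalized edit-distance similarity for label matching.
# Label lists are invented; the study's actual features and classifiers differ.
from collections import defaultdict

_SOUNDEX = {ch: d for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                     ("L", "4"), ("MN", "5"), ("R", "6")) for ch in letters}


def soundex(word):
    """Standard four-character American Soundex code (empty string for no letters)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    code = word[0]
    prev = _SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code after a vowel counts
    return (code + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as edits grow (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


def candidate_pairs(local_labels, coded_labels):
    """Compare only labels that share a Soundex block; return (local, coded, similarity)."""
    blocks = defaultdict(list)
    for label in coded_labels:
        blocks[soundex(label)].append(label)
    return [(loc, cod, similarity(loc, cod))
            for loc in local_labels for cod in blocks.get(soundex(loc), [])]


local_labels = ["Haemoglobin", "Glucos", "Sodium"]     # hypothetical uncoded labels
coded_labels = ["Hemoglobin", "Glucose", "Potassium"]  # hypothetical LOINC-coded labels
pairs = candidate_pairs(local_labels, coded_labels)
```

With blocking, the three-by-three comparison grid shrinks to the two pairs that share a Soundex code, which is the reduction in potential comparisons the methods describe.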
Results Having optimized all the model’s parameters through the v-fold cross validation on a sampled training set of structured data ( ), out of 24 features, only ( 8) were selected for a classification task. RapidMiner was used to train and test all the classification algorithms. In the final classification run, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% when the and parameters were set to and . These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. The unstructured data, on the other hand, were annotated with the UIMA Ruta scripting language and then used for training with CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the “nonsmoker” class, 83.0% for “currentsmoker”, and 65.7% for “pastsmoker”. It was observed that as more relevant data was added, the performance of the classifier improved. The results show that there is a need for the use of FHIR resources for exchanging clinical data between healthcare institutions. FHIR is free and uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML and Turtle to represent messages. Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration.
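For readers unfamiliar with FHIR's JSON representation, the sketch below builds a minimal Observation resource carrying a LOINC-coded laboratory value and serializes it as JSON. The patient reference and the measured value are invented, and the resource is a bare illustration that has not been validated against any FHIR profile or used in the study.

```python
# Minimal FHIR-style Observation with a LOINC code, serialized as JSON.
# The patient reference and value are invented for illustration.
import json

observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "718-7",  # LOINC: Hemoglobin [Mass/volume] in Blood
            "display": "Hemoglobin [Mass/volume] in Blood",
        }]
    },
    "subject": {"reference": "Patient/example"},  # hypothetical patient id
    "valueQuantity": {
        "value": 13.2,  # hypothetical measurement
        "unit": "g/dL",
        "system": "http://unitsofmeasure.org",
        "code": "g/dL",
    },
}

# A JSON document like this could be sent over a RESTful API or stored as-is
# in a document database.
payload = json.dumps(observation, indent=2)
restored = json.loads(payload)
```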
Conclusion This study has provided a method by which a computer algorithm learns a clinical coding standard and then applies that learned standard to unstandardized data, so that the data become exchangeable, comparable and searchable and data interoperability is ultimately achieved. Even though this study was applied on a limited scale, future work would explore the standardization of patients’ long-lived data from multiple sources using the SHARPn open-source tools and data scaling platforms.
Information Science
M. Sc. (Computing
Front-Line Physicians' Satisfaction with Information Systems in Hospitals
Day-to-day operations management in hospital units is difficult due to continuously varying situations, several actors involved and a vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with existing information systems needed to support the day-to-day operations management in hospitals. A cross-sectional survey was used and data chosen with stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65 % (n = 111). The physicians reported that information systems support their decision making to some extent, but they do not improve access to information nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer one information system to access important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision making process.
Peer reviewe