
    Data quality considerations for evaluating COVID-19 treatments using real world data: learnings from the National COVID Cohort Collaborative (N3C)

    Background: Multi-institution electronic health records (EHR) are a rich source of real-world data (RWD) for generating real-world evidence (RWE) about the utilization, benefits, and harms of medical interventions. They provide access to clinical data from large pooled patient populations, including laboratory measurements unavailable in insurance claims-based data. However, secondary use of these data for research requires specialized knowledge and careful evaluation of data quality and completeness. We discuss data quality assessments undertaken during prep-to-research, focusing on the investigation of treatment safety and effectiveness.

    Methods: Using the National COVID Cohort Collaborative (N3C) enclave, we defined a patient population using criteria typical of non-interventional inpatient drug effectiveness studies. We present the challenges encountered when constructing this dataset, beginning with an examination of data quality across data partners. We then discuss the methods and best practices used to operationalize several important study elements: exposure to treatment, baseline health comorbidities, and key outcomes of interest.

    Results: We share our experiences and lessons learned from working with heterogeneous EHR data from more than 65 healthcare institutions and four common data models. We discuss six key areas of data variability and quality: (1) the specific EHR data elements captured by a site can vary depending on the source data model and local practice; (2) data missingness remains a significant issue; (3) drug exposures can be recorded at different levels and may not contain route of administration or dosage information; (4) reconstruction of continuous drug exposure intervals may not always be possible; (5) EHR discontinuity is a major concern for capturing history of prior treatment and comorbidities; and (6) access to EHR data alone limits the potential outcomes that can be used in studies.

    Conclusions: The creation of large-scale, centralized, multi-site EHR databases such as N3C enables a wide range of research aimed at better understanding treatments and the health impacts of many conditions, including COVID-19. As with all observational research, it is important that research teams engage with appropriate domain experts to understand the data, in order to define research questions that are both clinically important and feasible to address using these real-world data.
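
    The reconstruction of continuous drug exposure intervals mentioned in the Results can be illustrated with a minimal sketch. This is not the N3C study's code: the column names, data layout, and one-day gap tolerance are assumptions chosen for illustration only.

        # Hypothetical sketch: collapsing per-record drug exposures into continuous
        # exposure episodes, allowing a small gap between records. Column names
        # (person_id, drug_start, drug_end) and the 1-day gap are assumptions,
        # not taken from the N3C study.
        from datetime import timedelta
        import pandas as pd

        def collapse_exposures(df: pd.DataFrame, max_gap_days: int = 1) -> pd.DataFrame:
            """Merge overlapping or near-adjacent exposure records per patient."""
            episodes = []
            for person_id, grp in df.sort_values("drug_start").groupby("person_id"):
                start, end = None, None
                for _, row in grp.iterrows():
                    if start is None:
                        start, end = row["drug_start"], row["drug_end"]
                    elif row["drug_start"] <= end + timedelta(days=max_gap_days):
                        end = max(end, row["drug_end"])  # extend the current episode
                    else:
                        episodes.append((person_id, start, end))
                        start, end = row["drug_start"], row["drug_end"]
                if start is not None:
                    episodes.append((person_id, start, end))
            return pd.DataFrame(episodes, columns=["person_id", "episode_start", "episode_end"])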

    A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative

    Healthcare datasets obtained from Electronic Health Records have proven extremely useful for assessing associations between patient predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, and removing those cases may introduce severe bias. Several multiple imputation algorithms have been proposed to recover the missing information under an assumed missingness mechanism. Each algorithm has strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, selecting each algorithm's parameters and the related data-modeling choices is both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on COVID-19-related outcomes. Our analysis included classic multiple imputation techniques as well as simple complete-case inverse probability weighted models. Extensive experiments show that our approach can effectively highlight the most promising and best-performing missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of how the different models behave and how that behavior changes as their parameters are modified. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
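
    As an illustration of the general evaluation idea (not the paper's exact framework), one can artificially mask a fraction of observed entries, apply several imputation strategies, and compare how well each recovers the hidden values. The dataset, masking fraction, imputer choices, and error metric below are placeholders; the paper evaluates strategies against downstream statistical analyses rather than raw reconstruction error.

        # Minimal sketch, under assumed placeholder data: mask observed entries,
        # impute with several strategies, and score recovery of the masked values.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

        rng = np.random.default_rng(0)
        X_true = rng.normal(size=(500, 6))        # placeholder complete data
        mask = rng.random(X_true.shape) < 0.2     # hide ~20% of entries
        X_miss = np.where(mask, np.nan, X_true)

        imputers = {
            "mean": SimpleImputer(strategy="mean"),
            "knn": KNNImputer(n_neighbors=5),
            "iterative": IterativeImputer(random_state=0),
        }
        for name, imputer in imputers.items():
            X_hat = imputer.fit_transform(X_miss)
            rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
            print(f"{name}: RMSE on masked entries = {rmse:.3f}")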

    Predicting cyanobacteria bloom occurrence in lakes and reservoirs before blooms occur

    With increased global warming, cyanobacteria are blooming more frequently in lakes and reservoirs, severely damaging the health and stability of aquatic ecosystems and threatening drinking water safety and human health. There is an urgent demand for the effective prediction and prevention of cyanobacterial blooms. However, it is difficult to reduce the risks and losses caused by cyanobacterial blooms because most methods cannot predict them successfully. Therefore, in this study we propose a new method for predicting cyanobacterial bloom occurrence, analyzing the probability of blooms and their driving factors for effective prevention and control. Dominant cyanobacterial species capable of forming blooms were first determined using a dominant species identification model, and the principal driving factors of the dominant species were then analyzed using canonical correspondence analysis (CCA). Cyanobacterial bloom probability was calculated using a newly developed model, after which the probable change (mutation) points were identified and thresholds for the principal driving factors of cyanobacterial blooms were predicted. A total of 141 phytoplankton data sets from 90 stations, collected during six large-scale integrated hydrology, water quality, and ecology field surveys in Jinan City, China, in 2014–2015, were used for model application and verification. The results showed that there were six dominant cyanobacterial species in the study area, and that the principal driving factors were water temperature, pH, total phosphorus, ammonia nitrogen, chemical oxygen demand, and dissolved oxygen. Cyanobacterial blooms corresponded to threshold ranges of water temperature, pH, total phosphorus (TP), ammonia nitrogen, chemical oxygen demand, and dissolved oxygen of 19.5–32.5 °C, 7.0–9.38, 0.13–0.22 mg L−1, 0.38–0.63 mg L−1, 10.5–17.5 mg L−1, and 4.97–8.28 mg L−1, respectively. Comparison with research results from other global regions further supported these thresholds, indicating that the method could be used in habitats beyond China. We found that a cyanobacterial bloom probability of 0.75 was a critical point for prevention and control: when this critical point was exceeded, cyanobacteria could proliferate rapidly, increasing the risk of blooms. Changes in the driving factors then need to be controlled rapidly, based on these thresholds, to prevent cyanobacterial blooms. Temporal and spatial scales were critical factors potentially affecting the selection of driving factors. This method is versatile and can help determine the risk of cyanobacterial blooms and the thresholds of the principal driving factors. It can effectively predict and help prevent cyanobacterial blooms, reducing their probability of occurrence globally, protecting the health and stability of water ecosystems, ensuring drinking water safety, and protecting human health.
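
    One common way to operationalize a dominant-species screen is the McNaughton dominance index, Y = (n_i / N) × f_i, with species flagged as dominant when Y ≥ 0.02. The study's own identification model may differ, and the data layout in the sketch below is an assumption.

        # Hedged sketch of a dominance-index screen (McNaughton index); not
        # necessarily the study's exact dominant species identification model.
        import pandas as pd

        def dominant_species(counts: pd.DataFrame, threshold: float = 0.02) -> pd.Series:
            """counts: rows = sampling stations, columns = species cell densities."""
            rel_abundance = counts.sum(axis=0) / counts.values.sum()   # n_i / N
            occurrence_freq = (counts > 0).mean(axis=0)                # f_i
            dominance = rel_abundance * occurrence_freq                # Y
            return dominance[dominance >= threshold].sort_values(ascending=False)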

    Scaling of divertor power footprint width in RF-heated type-III ELMy H-mode on the EAST superconducting tokamak

    Dedicated experiments on the scaling of the divertor power footprint width have been performed in the ITER-relevant radio-frequency (RF)-heated H-mode scheme under lower single null, double null and upper single null divertor configurations in the Experimental Advanced Superconducting Tokamak (EAST) with lithium wall-coating conditioning. A strong inverse scaling of the edge-localized-mode (ELM)-averaged power fall-off width with the plasma current (equivalently the poloidal field), $\lambda_q \propto I_{\rm p}^{-1.05}$, has been demonstrated for the attached type-III ELMy H-mode by various heat flux diagnostics, including the divertor Langmuir probes (LPs), infrared (IR) thermography and reciprocating LPs on the low-field side. The IR camera and divertor LP measurements show that $\lambda_{q,{\rm IR}} \approx \lambda_{q,{\rm div\text{-}LPs}}/1.3 = 1.15\,B_{\rm p,omp}^{-1.25}$, in good agreement with the multi-machine scaling trend for the inter-ELM phase between type-I ELMs or the ELM-free enhanced Dα (EDA) H-mode. However, the magnitude is nearly doubled, which may be attributed to the different operation scenarios or heating schemes in EAST, i.e., dominated by electron heating. It is also shown that type-III ELMs broaden the power fall-off width only slightly, so the ELM-averaged width is representative of the inter-ELM period. Furthermore, the inverse $I_{\rm p}$ ($B_{\rm p}$) scaling appears to be independent of the divertor configuration in EAST. The divertor power footprint integral width, fall-off width and dissipation width derived from EAST IR camera measurements follow the relation $\lambda_{\rm int} \cong \lambda_q + 1.64S$, yielding $\lambda_{\rm int}^{\rm EAST} = (1.39 \pm 0.03)\lambda_q^{\rm EAST} + (0.97 \pm 0.35)$ mm. Detailed analysis of these three characteristic widths was carried out to shed more light on their extrapolation to ITER.
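
    The inverse current scaling reported above is typically obtained from a power-law fit to paired measurements; the sketch below shows only the generic log-log regression step, using synthetic placeholder values rather than EAST data.

        # Illustrative sketch: extract an exponent alpha from lambda_q ∝ I_p^alpha
        # via a log-log linear fit. Values below are synthetic placeholders.
        import numpy as np

        I_p = np.array([0.25, 0.30, 0.40, 0.45, 0.50, 0.60])   # plasma current [MA]
        lam_q = np.array([11.8, 10.2, 7.4, 6.8, 6.1, 5.0])      # fall-off width [mm]

        alpha, log_c = np.polyfit(np.log(I_p), np.log(lam_q), deg=1)
        print(f"lambda_q ≈ {np.exp(log_c):.2f} * I_p^{alpha:.2f}")  # exponent near -1 here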