11 research outputs found

    Counties with Lower Insurance Coverage and Housing Problems Are Associated with Both Slower Vaccine Rollout and Higher COVID-19 Incidence

    Equitable vaccination distribution is a priority for outcompeting the transmission of COVID-19. Here, the impact of demographic, socioeconomic, and environmental factors on county-level vaccination rates and COVID-19 incidence changes is assessed. In particular, using data from 3142 US counties with over 328 million individuals, correlations were computed between cumulative vaccination rate and change in COVID-19 incidence from 1 December 2020 to 6 June 2021, with 44 different demographic, environmental, and socioeconomic factors. This correlation analysis was also performed using multivariate linear regression to adjust for age as a potential confounding variable. These correlation analyses demonstrated that counties with high levels of uninsured individuals have significantly lower COVID-19 vaccination rates (Spearman correlation: −0.460, p-value: <0.001). In addition, severe housing problems and high housing costs were strongly correlated with increased COVID-19 incidence (Spearman correlations: 0.335, 0.314, p-values: <0.001, <0.001). This study shows that socioeconomic factors are strongly correlated with both COVID-19 vaccination rates and incidence rates, underscoring the need to improve COVID-19 vaccination campaigns in marginalized communities.
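
The abstract above reports Spearman correlations between county-level factors and vaccination rates. As a rough illustration of how such a rank correlation is computed, here is a minimal sketch in plain Python. The function names and the toy numbers are assumptions made for this example; they are not the study's code or data, and the sketch does not compute p-values or the age-adjusted regression.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: percent uninsured vs. percent vaccinated per county.
uninsured = [5.0, 8.0, 12.0, 20.0]
vaccinated = [60.0, 55.0, 48.0, 30.0]
rho = spearman(uninsured, vaccinated)  # strictly decreasing relationship
```

In practice a statistics library would normally be used instead, since it also reports the p-value.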

    High diversity in Delta variant across countries revealed by genome‐wide analysis of SARS‐CoV‐2 beyond the Spike protein

    The highly contagious Delta variant of SARS‐CoV‐2 has become a prevalent strain globally and poses a public health challenge around the world. While there has been extensive focus on understanding the amino acid mutations in the Delta variant’s Spike protein, the mutational landscape of the rest of the SARS‐CoV‐2 proteome (25 proteins) remains poorly understood. To this end, we performed a systematic analysis of mutations in all the SARS‐CoV‐2 proteins from nearly 2 million SARS‐CoV‐2 genomes from 176 countries/territories. Six highly prevalent missense mutations in the viral life cycle‐associated Membrane (I82T), Nucleocapsid (R203M, D377Y), NS3 (S26L), and NS7a (V82A, T120I) proteins are almost exclusive to the Delta variant compared to other variants of concern (mean prevalence across genomes: Delta = 99.74%, Alpha = 0.06%, Beta = 0.09%, and Gamma = 0.22%). Furthermore, we find that the Delta variant harbors a more diverse repertoire of mutations across countries compared to the previously dominant Alpha variant. Overall, our study underscores the high diversity of the Delta variant between countries and identifies a list of amino acid mutations in the Delta variant’s proteome for probing the mechanistic basis of pathogenic features such as high viral loads, high transmissibility, and reduced susceptibility against neutralization by vaccines.

    On the Origins of Omicron’s Unique Spike Gene Insertion

    The emergence of a heavily mutated SARS-CoV-2 variant (Omicron; Pango lineage B.1.1.529 and BA sublineages) and its rapid spread to over 75 countries raised a global public health alarm. Characterizing the mutational profile of Omicron is necessary to interpret its clinical phenotypes which are shared with or distinctive from those of other SARS-CoV-2 variants. We compared the mutations of the initially circulating Omicron variant (now known as BA.1) with prior variants of concern (Alpha, Beta, Gamma, and Delta), variants of interest (Lambda, Mu, Eta, Iota, and Kappa), and ~1500 SARS-CoV-2 lineages constituting ~5.8 million SARS-CoV-2 genomes. Omicron’s Spike protein harbors 26 amino acid mutations (23 substitutions, 2 deletions, and 1 insertion) that are distinct compared to other variants of concern. While the substitution and deletion mutations appeared in previous SARS-CoV-2 lineages, the insertion mutation (ins214EPE) was not previously observed in any other SARS-CoV-2 lineage. Here, we consider and discuss various mechanisms through which the nucleotide sequence encoding for ins214EPE could have been acquired, including local duplication, polymerase slippage, and template switching. Although we are not able to definitively determine the mechanism, we highlight the plausibility of template switching. Analysis of the homology of the inserted nucleotide sequence and flanking regions suggests that this template-switching event could have involved the genomes of SARS-CoV-2 variants (e.g., the B.1.1 strain), other human coronaviruses that infect the same host cells as SARS-CoV-2 (e.g., HCoV-OC43 or HCoV-229E), or a human transcript expressed in a host cell that was infected by the Omicron precursor

    Building a best-in-class automated de-identification tool for electronic health records through ensemble learning

    Summary: The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries. The bigger picture: Clinical notes in electronic health records convey rich historical information regarding disease and treatment progression. However, this unstructured text often contains personally identifiable information such as names, phone numbers, or residential addresses of patients, thereby limiting its dissemination for research purposes. The removal of patient identifiers, through the process of de-identification, enables sharing of clinical data while preserving patient privacy. Here, we present a best-in-class approach to de-identification, which automatically detects identifiers and substitutes them with fabricated ones. Our approach enables de-identification of patient data at the scale required to harness the unstructured, context-rich information in electronic health records to aid in medical research and advancement
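
The recall and precision figures quoted above summarize how many true identifiers a de-identification system finds and how many of its detections are correct. The sketch below shows one simple way to compute both metrics from sets of detected spans. Exact-match scoring on character spans, the function name, and the toy spans are all assumptions for illustration; the abstract does not state the matching criterion the authors used.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one identifier


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted spans that are real identifiers.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of real identifiers that were detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for a single note.
gold = {(0, 8), (15, 27), (40, 50), (60, 72)}
predicted = {(0, 8), (15, 27), (40, 50), (80, 85)}
p, r = precision_recall(predicted, gold)
```

For de-identification, recall is usually the more critical of the two, because a missed identifier leaks patient information while a false detection only removes some harmless text.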

    A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets

    Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann Whitney p = 6.15 × 10^−76, r = 0.24; Cohen’s d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.

    Mapping each pre-existing condition’s association to short-term and long-term COVID-19 complications

    Understanding the relationships between pre-existing conditions and complications of COVID-19 infection is critical to identifying which patients will develop severe disease. Here, we leverage ~1.1 million clinical notes from 1803 hospitalized COVID-19 patients and deep neural network models to characterize associations between 21 pre-existing conditions and the development of 20 complications (e.g. respiratory, cardiovascular, renal, and hematologic) of COVID-19 infection throughout the course of infection (i.e. 0–30 days, 31–60 days, and 61–90 days). Pleural effusion was the most frequent complication of early COVID-19 infection (89/1803 patients, 4.9%) followed by cardiac arrhythmia (45/1803 patients, 2.5%). Notably, hypertension was the most significant risk factor associated with 10 different complications including acute respiratory distress syndrome, cardiac arrhythmia, and anemia. The onset of new complications after 30 days is rare and most commonly involves pleural effusion (31–60 days: 11 patients, 61–90 days: 9 patients). Lastly, comparing the rates of complications with a propensity-matched COVID-negative hospitalized population confirmed the importance of hypertension as a risk factor for early-onset complications. Overall, the associations between pre-COVID conditions and COVID-associated complications presented here may form the basis for the development of risk assessment scores to guide clinical care pathways.