75 research outputs found

    Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence

    The high level of sparsity in methylome profiles obtained using whole-genome bisulfite sequencing when only small amounts of biological material are available limits its value in the study of systems in which large samples are difficult to assemble, such as mammalian preimplantation embryonic development. Recently developed computational methods that address the sparsity by imputing missing values have their limits when the required minimum data coverage or profiles of the same tissue in other modalities are not available. In this study, we explored the use of transfer learning together with Kullback-Leibler (KL) divergence to train predictive models for completing methylome profiles with very low coverage (below 2%). Transfer learning was used to leverage less sparse profiles that are typically available for different tissues of the same species, while KL divergence was employed to maximize the usage of information carried in the input data. A deep neural network was adopted to extract both DNA sequence and local methylation patterns for imputation. Our study of training models for completing methylome profiles of bovine oocytes and early embryos demonstrates the effectiveness of transfer learning and KL divergence, with individual increases of 29.98% and 29.43%, respectively, in prediction performance and a 38.70% increase when the two were used together. The drastically increased data coverage (43.80–73.6%) after imputation powers downstream analyses involving methylomes that cannot be effectively done using the very low coverage profiles (0.06–1.47%) before imputation.

    VIGAN: Missing View Imputation with Generative Adversarial Networks

    In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples lack an entire view of data, the missing view problem arises. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view contains no information on which to base imputation for such samples. The commonly used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science. Comment: 10 pages, 8 figures, conferenc

    Analysis of an Existing Method in Refinement of Protein Structure Predictions using Cryo-EM Images

    Protein structure prediction produces atomic models of a protein from its amino acid sequence. Three-dimensional structures are important for understanding the functional mechanisms of proteins. Knowing the structure of a given protein is crucial in drug development and the design of novel enzymes. AlphaFold2 is a protein structure prediction tool with good performance in recent CASP competitions. Phenix is a tool for determining a protein structure from a high-resolution 3D molecular image. Recent development of Phenix shows that it is capable of refining predicted models from AlphaFold2, specifically the poorly predicted regions, by incorporating information from the 3D image of the protein. The goal of this project is to understand the strengths and weaknesses of the approach that combines Phenix and AlphaFold2 using broader data. This analysis may provide insights for enhancement of the approach.

    An Approach to Developing Benchmark Datasets for Protein Secondary Structure Segmentation from Cryo-EM Density Maps

    More and more deep learning approaches have been proposed to segment secondary structures from cryo-electron density maps in the medium resolution range (5–10 Å). Although the deep learning approaches show great potential, only a few small experimental data sets have been used to test the approaches. There is limited understanding of potential factors in the data that affect the performance of segmentation. We propose an approach to generate data sets with desired specifications in three potential factors: the protein sequence identity, structural contents, and data quality. The approach was implemented and has generated a test set and various training sets to study the effect of secondary structure content and data quality on the performance of DeepSSETracer, a deep learning method that segments regions of protein secondary structures from cryo-EM map components. Results show that various content levels in the secondary structure and data quality influence the performance of segmentation for DeepSSETracer.

    Refinement of AlphaFold2 Models Against Experimental and Hybrid Cryo-EM Density Maps

    Recent breakthroughs in deep learning-based protein structure prediction show that it is possible to obtain highly accurate models for a wide range of difficult protein targets for which only the amino acid sequence is known. The availability of accurately predicted models from sequences can potentially revolutionise many modelling approaches in structural biology, including the interpretation of cryo-EM density maps. Although atomic structures can be readily solved from cryo-EM maps of better than 4 Å resolution, it is still challenging to determine accurate models from lower-resolution density maps. Here, we report on the benefits of models predicted by AlphaFold2 (the best-performing structure prediction method at CASP14) on cryo-EM refinement using the Phenix refinement suite for AlphaFold2 models. To study the robustness of model refinement at a lower resolution of interest, we introduced hybrid maps, i.e., experimental cryo-EM maps filtered to lower resolutions by real-space convolution. The AlphaFold2 models were refined to attain good accuracies, with TM scores above 0.8, for 9 of the 13 cryo-EM maps. TM scores improved for AlphaFold2 models refined against all 13 cryo-EM maps of better than 4.5 Å resolution, 8 hybrid maps of 6 Å resolution, and 3 hybrid maps of 8 Å resolution. The results show that it is possible (at least with the Phenix protocol) to extend the refinement success below 4.5 Å resolution. We even found isolated cases in which resolution lowering was slightly beneficial for refinement, suggesting that high-resolution cryo-EM maps might sometimes trap AlphaFold2 models in local optima.

    A Genome-Wide Association Study of Cocaine Use Disorder Accounting for Phenotypic Heterogeneity and Gene–Environment Interaction

    Background: Phenotypic heterogeneity and complicated gene-environment interplay in etiology are among the primary factors that hinder the identification of genetic variants associated with cocaine use disorder. Methods: To detect novel genetic variants associated with cocaine use disorder, we derived disease traits with reduced phenotypic heterogeneity using cluster analysis of a study sample (n = 9965). We then used these traits in genome-wide association tests, performed separately for 2070 African Americans and 1570 European Americans, using a new mixed model that accounted for the moderating effects of 5 childhood environmental factors. We used an independent sample (918 African Americans, 1382 European Americans) for replication. Results: The cluster analysis yielded 5 cocaine use disorder subtypes, of which subtypes 4 (n = 3258) and 5 (n = 1916) comprised heavy cocaine users, had high heritability estimates (h² = 0.66 and 0.64, respectively) and were used in association tests. Seven of the 13 genetic loci identified in the discovery phase were available in the replication sample. In African Americans, rs114492924 (discovery p = 1.23 × 10⁻⁸), a single nucleotide polymorphism in LINC01411, was replicated in the replication sample (p = 3.63 × 10⁻³). In a meta-analysis that combined the discovery and replication results, 3 loci in African Americans were significant genome-wide: rs10188036 in TRAK2 (p = 2.95 × 10⁻⁸), del 1:15511771 in TMEM51 (p = 9.11 × 10⁻¹⁰) and rs149843442 near LPHN2 (p = 3.50 × 10⁻⁸). Limitations: Lack of data prevented us from replicating 6 of the 13 identified loci. Conclusion: Our results demonstrate the importance of considering phenotypic heterogeneity and gene-environment interplay in detecting genetic variations that contribute to cocaine use disorder, because new genetic loci have been identified using our novel analytic method.

    Intergenic Transcription in In Vivo Developed Bovine Oocytes and Pre-Implantation Embryos

    Background: Intergenic transcription, either failure to terminate at the transcription end site (TES) or transcription initiation at other intergenic regions, is present in cultured cells and enhanced in the presence of stressors such as viral infection. Transcription termination failure has not been characterized in natural biological samples such as pre-implantation embryos, which express more than 10,000 genes and undergo drastic changes in DNA methylation. Results: Using Automatic Readthrough Transcription Detection (ARTDeco) and data of in vivo developed bovine oocytes and embryos, we found abundant intergenic transcripts that we termed read-outs (transcribed from 5 to 15 kb after TES) and read-ins (transcribed 1 kb up-stream of reference genes, extending up to 15 kb up-stream). Read-throughs (continued transcription from TES of expressed reference genes, 4–15 kb in length), however, were much fewer. For example, the numbers of read-outs and read-ins ranged from 3,084 to 6,565 or 33.36–66.67% of expressed reference genes at different stages of embryo development. The less copious read-throughs were at an average of 10% and significantly correlated with reference gene expression (P < 0.05). Interestingly, intergenic transcription did not seem to be random because many intergenic transcripts (1,504 read-outs, 1,045 read-ins, and 1,021 read-throughs) were associated with common reference genes across all stages of pre-implantation development. Their expression also seemed to be regulated by developmental stages because many were differentially expressed (log2 fold change ≥ 2, P < 0.05). Additionally, while gradual but un-patterned decreases in DNA methylation densities 10 kb both up- and down-stream of the intergenic transcribed regions were observed, the correlation between intergenic transcription and DNA methylation was insignificant. Finally, transcription factor binding motifs and polyadenylation signals were found in 27.2% and 12.15% of intergenic transcripts, respectively, suggesting considerable novel transcription initiation and RNA processing. Conclusion: In summary, in vivo developed oocytes and pre-implantation embryos express large numbers of intergenic transcripts, which are not related to the overall DNA methylation profiles either up- or down-stream.

    A Tool for Segmentation of Secondary Structures in 3D Cryo-EM Density Map Components Using Deep Convolutional Neural Networks

    Although cryo-electron microscopy (cryo-EM) has been successfully used to derive atomic structures for many proteins, it is still challenging to derive atomic structures when the resolution of cryo-EM density maps is in the medium resolution range, such as 5–10 Å. Detection of protein secondary structures, such as helices and β-sheets, from cryo-EM density maps provides constraints for deriving atomic structures from such maps. As more deep learning methodologies are being developed for solving various molecular problems, effective tools are needed for users to access them. We have developed an effective software bundle, DeepSSETracer, for the detection of protein secondary structure from cryo-EM component maps in medium resolution. The bundle contains the network architecture and a U-Net model trained with a curriculum and gradient of episodic memory (GEM). The bundle integrates the deep neural network with the visualization capacity provided in ChimeraX. Using a Linux server that is remotely accessed by Windows users, it takes about 6 s on one CPU and one GPU for the trained deep neural network to detect secondary structures in a cryo-EM component map containing 446 amino acids. A test using 28 chain components of cryo-EM maps shows overall residue-level F1 scores of 0.72 and 0.65 to detect helices and β-sheets, respectively. Although deep learning applications are built on software frameworks, such as PyTorch and TensorFlow, our pioneering work here shows that integration of deep learning applications with ChimeraX is a promising and effective approach. Our experiments show that the F1 score measured at the residue level is an effective evaluation of secondary structure detection for individual classes. The test using 28 cryo-EM component maps shows that DeepSSETracer detects β-sheets more accurately than Emap2sec+, with weighted average residue-level F1 scores of 0.65 and 0.42, respectively. It also shows that Emap2sec+ detects helices more accurately than DeepSSETracer, with weighted average residue-level F1 scores of 0.77 and 0.72, respectively.
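
    As an illustration of the residue-level F1 score mentioned in this abstract, the following is a minimal sketch and not the authors' evaluation code. The label alphabet ("H" for helix, "E" for β-sheet, "C" for other), the function names, and the choice of residue counts as weights for the weighted average are all assumptions made for the example.

```python
def residue_f1(true_labels, pred_labels, target):
    """F1 score for one class, counted per residue.

    true_labels / pred_labels: equal-length sequences of per-residue labels.
    target: the class to score (assumed here: "H" helix, "E" beta-sheet).
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    if tp == 0:
        return 0.0  # no correct detections: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_average_f1(per_chain):
    """Average F1 over chains, weighted by an assumed per-chain residue count.

    per_chain: list of (f1, n_residues) tuples.
    """
    total = sum(n for _, n in per_chain)
    return sum(f1 * n for f1, n in per_chain) / total


# Hypothetical labels for an 8-residue chain.
truth = "HHHHEECC"
guess = "HHHCEEEC"
helix_f1 = residue_f1(truth, guess, "H")
sheet_f1 = residue_f1(truth, guess, "E")
```

    Scoring each class separately, as above, is what lets a comparison report that one tool is better on helices while another is better on β-sheets.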

    Multi-View Cluster Analysis With Incomplete Data to Understand Treatment Effects

    Multi-view cluster analysis, as a popular granular computing method, aims to partition sample subjects into consistent clusters across different views in which the subjects are characterized. Frequently, data entries can be missing from some of the views. The latest multi-view co-clustering methods cannot effectively deal with incomplete data, especially when there are mixed patterns of missing values. We propose an enhanced formulation for a family of multi-view co-clustering methods to cope with the missing data problem by introducing an indicator matrix whose elements indicate which data entries are observed and assessing cluster validity only on observed entries. In comparison with the simple strategy of removing subjects with missing values, our approach can use all available data in cluster analysis. In comparison with common methods that impute missing data in order to use regular multi-view analytics, our approach is less sensitive to imputation uncertainty. In comparison with other state-of-the-art multi-view incomplete clustering methods, our approach is sensible in the cases of missing any value in a view or missing the entire view, the most common scenario in practice. We first validated the proposed strategy in simulations, and then applied it to a treatment study of heroin dependence, which would have been impossible with previous methods due to a number of missing-data patterns. Patients in a treatment study were naturally assessed in different feature spaces, such as in the pre-, during-, and post-treatment time windows. Our algorithm was able to identify subgroups where patients in each group showed similarities in all three time windows, thus leading to the recognition of pre-treatment (baseline) features predictive of post-treatment outcomes.
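
    The idea of an indicator matrix that restricts evaluation to observed entries can be sketched as follows. This is only an illustration of the masking principle: the abstract does not give the paper's co-clustering objective, so the squared-error fit term, the function name, and the toy values here are assumptions.

```python
def masked_squared_error(data, approx, observed):
    """Sum of squared differences over observed entries only.

    data:     matrix (list of rows); missing entries may hold None.
    approx:   matrix of the same shape, e.g. a cluster-based reconstruction.
    observed: indicator matrix, 1 where the entry of `data` is observed, else 0.
    """
    total = 0.0
    for row_x, row_a, row_m in zip(data, approx, observed):
        for x, a, m in zip(row_x, row_a, row_m):
            if m:  # unobserved entries contribute nothing to the fit
                total += (x - a) ** 2
    return total


# Hypothetical 2x2 view with one missing entry.
view = [[1.0, None], [3.0, 4.0]]
fit = [[1.5, 9.0], [3.0, 2.0]]
mask = [[1, 0], [1, 1]]
error = masked_squared_error(view, fit, mask)
```

    Because the missing entry is skipped rather than filled in, no imputed value can influence the fit, which is the property the abstract contrasts with impute-then-cluster approaches.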

    Gabapentin Drug Misuse Signals: A Pharmacovigilance Assessment Using the FDA Adverse Event Reporting System

    Background: Although there have been increasing reports of intentional gabapentin misuse, epidemiological evidence for the phenomenon is limited. The purpose of this study was to determine whether there are pharmacovigilance abuse signals for gabapentin. Methods: Using FDA Adverse Event Reporting System reports from January 1, 2005 to December 31, 2015, we calculated pharmacovigilance signal measures (i.e., reporting odds ratio, proportional reporting ratio, information component, and empirical Bayes geometric mean) for abuse-related adverse event (AR-AE)-gabapentin pairs. Loglinear modeling assessed the frequency of concurrent reporting of abuse-related and abuse-specific AEs (AS-AEs) associated with gabapentin. Findings were compared to a positive (pregabalin) and negative (duloxetine) control. Results: From 2005-2015 there were 5,951,229 unique AE reports submitted to the FDA including 99,977 for gabapentin, 73,977 for duloxetine, and 97,813 for pregabalin. Significant drug-AR-AE pair signals involving gabapentin included: drug abuser, multiple drug overdose, and substance-induced psychotic disorder. Significant drug-AR-AE signals involving gabapentin and pregabalin, but not duloxetine, were: ataxia, dependence, drug abuse, increased drug tolerance, and overdose. Compared to duloxetine, gabapentin had significantly greater odds of a coreport for an AS-AE with drug withdrawal syndrome (OR: 6.55), auditory hallucinations (OR: 4.57), delusions (OR: 2.36), euphoric mood (OR: 5.45), ataxia (OR: 2.85), drug abuser (OR: 3.01), aggression (OR: 1.98), psychotic disorder (OR: 1.96), and feeling abnormal (OR: 1.31). Conclusions: We identified abuse-related signals for gabapentin and highlighted several CNS effects that may be associated with its abuse. Gabapentin prescribers should be aware of the drug's abuse liability and effects that may accompany its use.
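
    Two of the signal measures named in this abstract, the reporting odds ratio (ROR) and the proportional reporting ratio (PRR), are computed from a 2x2 table of report counts. The sketch below uses the standard textbook definitions and a Wald-type 95% interval for the ROR; the counts are hypothetical and are not taken from the study, and the function names are chosen for the example.

```python
import math


def reporting_odds_ratio(a, b, c, d):
    """ROR from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    return (a * d) / (b * c)


def proportional_reporting_ratio(a, b, c, d):
    """PRR: event proportion among the drug's reports vs. all other reports."""
    return (a / (a + b)) / (c / (c + d))


def ror_confidence_interval(a, b, c, d, z=1.96):
    """Wald-type interval on the log scale (z = 1.96 gives about 95%)."""
    log_ror = math.log(reporting_odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_ror - z * se), math.exp(log_ror + z * se)


# Hypothetical counts, for illustration only.
a, b, c, d = 20, 80, 10, 90
ror = reporting_odds_ratio(a, b, c, d)
prr = proportional_reporting_ratio(a, b, c, d)
lower, upper = ror_confidence_interval(a, b, c, d)
```

    A common convention treats a drug-event pair as a signal when the lower bound of the interval exceeds 1; whether the study used that threshold is not stated in the abstract.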