18 research outputs found

    Predicting gene expression from histone modifications with self-attention based neural networks and transfer learning

    It is well known that histone modifications play an important part in various chromatin-dependent processes such as DNA replication, repair, and transcription. Using computational models to predict gene expression from histone modifications has been studied intensively. However, the accuracy of existing models still has room for improvement, especially for cross-cell-line gene expression prediction. In this work, we propose a new deep learning model, TransferChrome, to predict gene expression from histone modifications. The model uses a densely connected convolutional network to capture features of the histone modification data and self-attention layers to aggregate global features of the data. For cross-cell-line gene expression prediction, TransferChrome adopts transfer learning to improve prediction accuracy. We trained and tested our model on 56 cell lines from the REMC database. The experimental results show that our model achieved an average Area Under the Curve (AUC) score of 84.79%. Compared to three state-of-the-art models, TransferChrome improves prediction performance on most cell lines. Experiments on cross-cell-line gene expression prediction show that TransferChrome performs best and is an efficient model for this task.
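    As a rough illustration of the self-attention aggregation step described above (the bin count, feature width and projection shapes below are illustrative assumptions, not TransferChrome's actual architecture):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over genomic bins.

    X: (bins, features) histone-modification feature map.
    Wq, Wk, Wv: (features, d) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each bin attends to all bins
    return A @ V                                  # globally aggregated features
```

    Because every bin attends to every other bin, the output row for one bin mixes information from the whole region, which is what lets such layers capture global structure of the histone signal.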

    Accurate HLA type inference using a weighted similarity graph

    Abstract Background The human leukocyte antigen (HLA) system contains many highly variable genes. HLA genes play an important role in the human immune system, and HLA gene matching is crucial for the success of human organ transplantations. Numerous studies have demonstrated that variation in HLA genes is associated with many autoimmune, inflammatory and infectious diseases. However, typing HLA genes by serology or PCR is time consuming and expensive, which limits large-scale studies involving HLA genes. Since it is much easier and cheaper to obtain single nucleotide polymorphism (SNP) genotype data, accurate computational algorithms to infer HLA gene types from SNP genotype data are needed. To infer HLA types from SNP genotypes, the first step is to infer SNP haplotypes from genotypes. However, for the same SNP genotype data set, the haplotype configurations inferred by different methods are usually inconsistent, and it is often difficult to decide which one is correct. Results In this paper, we design an accurate HLA gene type inference algorithm by utilizing SNP genotype data from pedigrees, known HLA gene types of some individuals, and the relationship between inferred SNP haplotypes and HLA gene types. Given a set of haplotypes inferred from the genotypes of a population consisting of many pedigrees, the algorithm first constructs a weighted similarity graph based on a new haplotype similarity measure and derives constraint edges from known HLA gene types. Based on the principle that different HLA gene alleles should have different background haplotypes, the algorithm searches for an optimal labeling of all the haplotypes with unknown HLA gene types such that the total weight among the same HLA gene types is maximized. To deal with ambiguous haplotype solutions, we use a genetic algorithm to select haplotype configurations that tend to maximize the same optimization criterion. Our experiments on a previously typed subset of the HapMap data show that the algorithm is highly accurate, achieving an accuracy of 96% for gene HLA-A, 95% for HLA-B, 97% for HLA-C, 84% for HLA-DRB1, 98% for HLA-DQA1 and 97% for HLA-DQB1 in a leave-one-out test. Conclusions Our algorithm can accurately infer HLA gene types from neighboring SNP genotype data. Compared with a recent approach on the same input data, our algorithm achieved a higher accuracy. The code of our algorithm is available to the public for free upon request to the corresponding authors.
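    The abstract does not spell out the haplotype similarity measure, so the sketch below substitutes a simple illustrative one: the length of the longest run of matching SNP alleles that contains the HLA locus. The function names and data layout are assumptions, not the paper's definitions.

```python
def hap_similarity(h1, h2, locus):
    """Illustrative similarity between two SNP haplotypes (sequences of
    alleles): length of the longest run of identical alleles containing
    the HLA locus position. Zero if they differ at the locus itself."""
    n = len(h1)
    if h1[locus] != h2[locus]:
        return 0
    left = locus
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = locus
    while right < n - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1

def similarity_graph(haps, locus):
    """Weighted similarity graph: one node per haplotype, an edge between
    every pair with nonzero similarity around the HLA locus."""
    edges = {}
    for i in range(len(haps)):
        for j in range(i + 1, len(haps)):
            w = hap_similarity(haps[i], haps[j], locus)
            if w > 0:
                edges[(i, j)] = w
    return edges
```

    On such a graph, labeling unknown haplotypes so that heavy edges connect same-labeled nodes corresponds to the paper's principle that identical HLA alleles share similar background haplotypes.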

    Graph regularized non-negative matrix factorization with prior knowledge consistency constraint for drug–target interactions prediction

    Abstract Background Identifying drug–target interactions (DTIs) plays a key role in drug development. Traditional wet experiments to identify DTIs are expensive and time consuming. Effective computational methods to predict DTIs are useful to narrow the search scope of potential drugs and speed up the process of drug discovery. There are a variety of non-negative matrix factorization based methods to predict DTIs, but the convergence of the algorithms used in the matrix factorization is often overlooked and the results can be further improved. Results In order to predict DTIs more accurately and quickly, we propose an alternating direction algorithm to solve graph regularized non-negative matrix factorization with a prior knowledge consistency constraint (ADA-GRMFC). Based on known DTIs, drug chemical structures and target sequences, ADA-GRMFC first constructs a DTI matrix, a drug similarity matrix and a target similarity matrix. Then DTI prediction is modeled as the non-negative factorization of the DTI matrix with graph dual regularization terms and a prior knowledge consistency constraint. The graph dual regularization terms integrate the information from the drug similarity matrix and the target similarity matrix, and the prior knowledge consistency constraint ensures that the matrix decomposition result is consistent with the prior knowledge of known DTIs. Finally, an alternating direction algorithm is used to solve the matrix factorization. Furthermore, we prove that the algorithm converges to a stationary point. Extensive 10-fold cross-validation results show that ADA-GRMFC performs better than other state-of-the-art methods. In a case study, ADA-GRMFC was also used to predict the targets interacting with the drug olanzapine, and all of the 10 highest-scoring targets were accurately predicted. In predicting drugs interacting with the target estrogen receptor alpha, 17 of the 20 highest-scoring drugs have been validated.
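    ADA-GRMFC's alternating direction algorithm itself is not reproduced here; the sketch below instead uses standard multiplicative updates for graph regularized non-negative factorization to convey the model's structure. The prior knowledge consistency constraint is omitted, and every name and parameter is illustrative.

```python
import numpy as np

def grmf(Y, Sd, St, rank=4, lam=0.1, iters=200, seed=0):
    """Graph regularized NMF sketch: Y ~ W @ H.T, where Y is the DTI
    matrix, Sd a drug similarity matrix and St a target similarity
    matrix. The Laplacian penalties tr(W'LdW) + tr(H'LtH) pull similar
    drugs (targets) toward similar latent rows."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    W = rng.random((n, rank))
    H = rng.random((m, rank))
    Dd = np.diag(Sd.sum(axis=1))   # degree matrices of the two graphs
    Dt = np.diag(St.sum(axis=1))
    eps = 1e-9
    for _ in range(iters):
        # multiplicative updates keep W and H non-negative
        W *= (Y @ H + lam * Sd @ W) / (W @ (H.T @ H) + lam * Dd @ W + eps)
        H *= (Y.T @ W + lam * St @ H) / (H @ (W.T @ W) + lam * Dt @ H + eps)
    return W, H
```

    After fitting, the score for an unobserved drug-target pair is the corresponding entry of W @ H.T; high-scoring unknown pairs are the predicted interactions.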

    Graph regularized non-negative matrix factorization with L2,1 norm regularization terms for drug–target interactions prediction

    Abstract Background Identifying drug–target interactions (DTIs) plays a key role in drug development. Traditional wet experiments to identify DTIs are costly and time consuming. Effective computational methods to predict DTIs are useful to speed up the process of drug discovery. A variety of non-negative matrix factorization based methods have been proposed to predict DTIs, but most of them overlook the sparsity of the feature matrices and the convergence of the adopted matrix factorization algorithms, so their performance can be further improved. Results In order to predict DTIs more accurately, we propose a novel method, iPALM-DLMF. iPALM-DLMF models DTI prediction as a problem of non-negative matrix factorization with graph dual regularization terms and L2,1 norm regularization terms. The graph dual regularization terms integrate the information from the drug similarity matrix and the target similarity matrix, and the L2,1 norm regularization terms ensure the sparsity of the feature matrices obtained by the non-negative matrix factorization. To solve the model, iPALM-DLMF adopts non-negative double singular value decomposition to initialize the factorization and an inertial proximal alternating linearized minimization iteration, which has been proved to converge to a KKT point, to obtain the final result. Extensive experimental results show that iPALM-DLMF performs better than other state-of-the-art methods. In case studies, 46 of the 50 highest-scoring proteins predicted by iPALM-DLMF to be targeted by the drug gabapentin have been validated, and 47 of the 50 highest-scoring drugs predicted to target prostaglandin-endoperoxide synthase 2 have been validated.
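    To make the role of the L2,1 penalty concrete, here is a minimal sketch of the norm and of its proximal operator, the row-shrinkage step a proximal alternating scheme would apply. This is generic textbook material, not iPALM-DLMF's actual code.

```python
import numpy as np

def l21_norm(M):
    """L2,1 norm: sum of the Euclidean norms of the rows of M.
    Penalizing it drives whole rows to zero, i.e. row sparsity."""
    return float(np.sqrt((M ** 2).sum(axis=1)).sum())

def l21_prox(M, t):
    """Proximal operator of t * L2,1: shrink each row toward zero by t
    in norm, and zero out rows whose norm is already below t."""
    norms = np.sqrt((M ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return M * scale
```

    Rows with small norm are eliminated entirely, which is why an L2,1 penalty yields sparse feature matrices rather than merely small entries.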

    A fast and accurate algorithm for single individual haplotyping

    Abstract Background Due to the difficulty of separating the two (paternal and maternal) copies of a chromosome, most published human genome sequences only provide genotype information, i.e., the mixed information of the two underlying haplotypes. However, phased haplotype information is needed to completely understand complex genetic polymorphisms and to increase the power of genome-wide association studies for complex diseases. With the rapid development of DNA sequencing technologies, reconstructing a pair of haplotypes from an individual's aligned DNA fragments by computer algorithms (i.e., single individual haplotyping) has become a practical haplotyping approach. Results In this paper, we combine two measures, "errors corrected" and "fragments cut", and propose a new optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping. The model generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and reduces to either one under extreme parameter values. To solve the model, we design a heuristic dynamic programming algorithm, H-BOP. By limiting the number of intermediate solutions at each iteration to an appropriately chosen small integer k, H-BOP solves the model efficiently. Conclusions Extensive experimental results on simulated and real data show that when k = 8, H-BOP is generally faster and more accurate in haplotype reconstruction than ReFHap, a recent state-of-the-art algorithm. The running time of H-BOP depends linearly on the key parameters controlling the input size, so H-BOP scales well to large inputs. The code of H-BOP is available to the public for free upon request to the corresponding author.
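    A minimal sketch of the beam idea: keep only the k best partial assignments of fragments to the two haplotype groups at each step. Here candidates are scored by the MEC part of the objective only; the fragments-cut term and all other details of H-BOP are omitted, and the data layout is an assumption.

```python
def mec(frags, assign):
    """Minimum-error-correction cost of a two-group fragment partition:
    within each group, per SNP column, count minority alleles (the reads
    that would have to be 'corrected' to reach a consensus)."""
    cols = {}
    for frag, g in zip(frags, assign):
        for pos, allele in frag.items():     # frag: {snp_position: 0/1}
            cols.setdefault((g, pos), []).append(allele)
    cost = 0
    for alleles in cols.values():
        ones = sum(alleles)
        cost += min(ones, len(alleles) - ones)
    return cost

def beam_partition(frags, k=8):
    """Assign fragments to two groups, keeping the k lowest-cost partial
    assignments after each fragment (the pruning idea behind H-BOP)."""
    beam = [[]]
    for i in range(len(frags)):
        cand = [a + [g] for a in beam for g in (0, 1)]
        cand.sort(key=lambda a: mec(frags[: i + 1], a))
        beam = cand[:k]
    best = beam[0]
    return best, mec(frags, best)
```

    With small k the search stays fast, at the risk of pruning the path to the optimum; the abstract's experiments suggest k = 8 is a good trade-off for the full model.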

    H-PoP and H-PoPG: heuristic partitioning algorithms for single individual haplotyping of polyploids

    Motivation: Some economically important plants, including wheat and cotton, have more than two copies of each chromosome. With the decreasing cost and increasing read length of next-generation sequencing technologies, reconstructing the multiple haplotypes of a polyploid genome from its sequence reads becomes practical. However, the computational challenge in polyploid haplotyping is much greater than that in diploid haplotyping, and there are few related methods. Results: This article models the polyploid haplotyping problem as an optimal poly-partition problem of the reads, called the Polyploid Balanced Optimal Partition model. For the reads sequenced from a k-ploid genome, the model tries to divide the reads into k groups such that the difference between the reads of the same group is minimized while the difference between the reads of different groups is maximized. When genotype information is available, the model is extended to the Polyploid Balanced Optimal Partition with Genotype constraint problem. These models are all NP-hard. We propose two heuristic algorithms, H-PoP and H-PoPG, based on dynamic programming and a strategy of limiting the number of intermediate solutions at each iteration, to solve the two models, respectively. Extensive experimental results on simulated and real data show that our algorithms can solve the models effectively, and are much faster and more accurate than the recent state-of-the-art polyploid haplotyping algorithms. The experiments also show that our algorithms can deal with long reads and deep read coverage effectively and accurately. Furthermore, H-PoP might be applied to help determine the ploidy of an organism. Availability and implementation: https://github.com/MinzhuXie/H-PoPG Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
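    The k-group scoring behind such a partition model can be sketched as follows. This simplified cost counts only within-group disagreements with each group's majority allele; it stands in for, but is not, the actual Polyploid Balanced Optimal Partition objective, which also rewards between-group differences.

```python
from collections import Counter

def poly_score(frags, assign):
    """Within-group disagreement cost for a k-group read partition:
    for each SNP column of each group, count reads whose allele differs
    from the group's majority allele. Alleles may be multi-valued."""
    cols = {}
    for frag, g in zip(frags, assign):       # frag: {snp_position: allele}
        for pos, allele in frag.items():
            cols.setdefault((g, pos), []).append(allele)
    return sum(len(a) - max(Counter(a).values()) for a in cols.values())
```

    A heuristic search over assignments of reads to the k groups, pruned to a bounded number of intermediate solutions per step, is the dynamic-programming strategy the abstract describes.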

    LGH: A Fast and Accurate Algorithm for Single Individual Haplotyping Based on a Two-Locus Linkage Graph
