78 research outputs found
Identification of Novel Cancer-Related Genes with a Prognostic Role Using Gene Expression and Protein-Protein Interaction Network Data
Early cancer diagnosis and prognosis prediction are necessary for cancer patients. Effective identification of cancer-related genes and biomarkers, together with survival prediction, would facilitate personalized treatment of cancer patients. This study aimed to investigate a method for integrating gene expression data and protein-protein interaction networks to identify cancer-related prognostic genes via the random walk with restart algorithm and survival analysis. Known cancer-related genes in protein-protein interaction networks were considered seed genes, and the random walk algorithm was used to identify candidate cancer-related genes. Thereafter, using the univariate Cox regression model, gene expression data were screened to identify survival-related genes. Furthermore, candidate genes and survival-related genes were screened to identify cancer-related prognostic genes. Finally, the effectiveness of the method was verified through gene function analysis and survival prediction. The results indicate that the cancer-related genes can be considered prognostic cancer biomarkers and provide a basis for cancer diagnosis.
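The seed-based random walk with restart described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy network, the node names, and the restart probability of 0.3 are all hypothetical, since the abstract does not give these details.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    adjacency: dict mapping node -> set of neighbours (undirected network)
    seeds: nodes the walker restarts from (e.g. known disease genes)
    restart: probability of jumping back to a seed at each step (assumed value)
    """
    nodes = list(adjacency)
    seeds = set(seeds)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            degree = len(adjacency[n])
            if degree == 0:
                # An isolated node keeps its own walking mass.
                nxt[n] += (1.0 - restart) * p[n]
                continue
            share = (1.0 - restart) * p[n] / degree
            for neighbour in adjacency[n]:
                nxt[neighbour] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a simple path A - B - C - D - E with seed A.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = random_walk_with_restart(toy_network, seeds=["A"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes closer to the seed receive higher scores, so `ranking` orders candidate genes by network proximity to the known genes; a real pipeline would then intersect the top candidates with the survival-related genes.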
Ranking-based Deep Cross-modal Hashing
Cross-modal hashing has been receiving increasing interest for its low
storage cost and fast query speed in multi-modal data retrievals. However, most
existing hashing methods are based on hand-crafted or raw level features of
objects, which may not be optimally compatible with the coding process.
Besides, these hashing methods are mainly designed to handle simple pairwise
similarity. The complex multilevel ranking semantic structure of instances
associated with multiple labels has not been well explored yet. In this paper,
we propose a ranking-based deep cross-modal hashing approach (RDCMH). RDCMH
first uses the feature and label information of data to derive a
semi-supervised semantic ranking list. Next, to expand the semantic
representation power of hand-crafted features, RDCMH integrates the semantic
ranking information into deep cross-modal hashing and jointly optimizes the
compatible parameters of deep feature representations and of hashing functions.
Experiments on real multi-modal datasets show that RDCMH outperforms other
competitive baselines and achieves state-of-the-art performance in
cross-modal retrieval applications.
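The low storage cost and fast query speed mentioned above come from comparing short binary codes by Hamming distance. The sketch below shows only that generic retrieval step, assuming codes have already been produced by some hashing model; the item names and 8-bit codes are hypothetical, and nothing here reproduces RDCMH itself.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database):
    """Return item ids sorted by Hamming distance to the query (ties broken by id).

    database: dict mapping item id -> integer-encoded binary hash code
    """
    return sorted(
        database,
        key=lambda item: (hamming_distance(query_code, database[item]), item),
    )


# Hypothetical 8-bit codes for three images, queried with a text code.
image_codes = {
    "img_a": 0b10110010,
    "img_b": 0b10110011,
    "img_c": 0b01001101,
}
text_query_code = 0b10110010
retrieved = rank_by_hamming(text_query_code, image_codes)
```

Because each comparison is one XOR and a bit count, the ranking stays cheap even for large databases, which is the practical appeal of hashing-based retrieval.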
Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits
Genome-wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to consider confounding effects, such as those due to population structure, and to take groups of SNPs into account simultaneously because of the "polygenic" attribute of complex quantitative traits. In this paper, we propose a new approach, SGL-LMM, that puts together sparse group lasso (SGL) and linear mixed model (LMM) for multivariate associations of quantitative traits. LMM, which is often used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets a fixed zero effect to learn the parameters of random effects using LMM, and then estimates fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.
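To make the sparse group lasso regularization concrete, the sketch below implements the standard proximal operator of the penalty lam_l1 * ||b||_1 + lam_group * sum_g ||b_g||_2. It is an illustration only: the abstract does not say which solver SGL-LMM uses, the group weights are assumed to be 1 (implementations often scale them by the square root of the group size), and the coefficients and groups are hypothetical.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t, returning zero when |x| <= t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prox_sparse_group_lasso(beta, groups, lam_l1, lam_group):
    """Proximal operator of the (unweighted) sparse group lasso penalty.

    beta: list of coefficients (e.g. SNP effects)
    groups: list of index lists that partition range(len(beta))
    """
    out = [0.0] * len(beta)
    for idx in groups:
        # Step 1: elementwise soft-thresholding gives within-group sparsity.
        shrunk = [soft_threshold(beta[i], lam_l1) for i in idx]
        # Step 2: group-wise shrinkage can zero out an entire group.
        norm = math.sqrt(sum(v * v for v in shrunk))
        scale = max(1.0 - lam_group / norm, 0.0) if norm > 0 else 0.0
        for i, v in zip(idx, shrunk):
            out[i] = scale * v
    return out


# Hypothetical coefficients: one strong group and one weak group.
beta = [3.0, -4.0, 0.5, 0.2]
groups = [[0, 1], [2, 3]]
shrunk_beta = prox_sparse_group_lasso(beta, groups, lam_l1=1.0, lam_group=1.0)
```

In this toy case the weak group is removed entirely while the strong group is kept with reduced magnitude, which is the two-level sparsity that lets prior biological groupings (for example SNPs in the same gene) enter the model.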
MaturePred: Efficient Identification of MicroRNAs within Novel Plant Pre-miRNAs
MicroRNAs (miRNAs) are a set of short (19~24 nt) non-coding RNAs that play significant roles as posttranscriptional regulators in animals and plants. Ab initio prediction methods show excellent performance for discovering new pre-miRNAs. While most of these methods can distinguish real pre-miRNAs from pseudo pre-miRNAs, few can predict the positions of miRNAs. Among the existing methods that can also predict the miRNA positions, most are designed for mammalian miRNAs, including human and mouse. Only a minority of methods can predict the positions of plant miRNAs. Accurate prediction of the miRNA positions remains a challenge, especially for plant miRNAs. This motivates us to develop MaturePred, a machine learning method based on support vector machine, to predict the positions of plant miRNAs for new plant pre-miRNA candidates. A miRNA:miRNA* duplex is regarded as a whole to capture the binding characteristics of miRNAs. We extract the position-specific features, the energy-related features, the structure-related features, and the stability-related features from real/pseudo miRNA:miRNA* duplexes. A set of informative features is selected to improve the prediction accuracy. A two-stage sample selection algorithm is proposed to combat the serious imbalance problem between real and pseudo miRNA:miRNA* duplexes. The prediction method, MaturePred, can accurately predict plant miRNAs and achieves higher prediction accuracy than the existing methods. Further, we trained a prediction model with animal data to predict animal miRNAs. The model also achieves higher prediction performance, which further confirms the efficiency of our miRNA prediction method. The superior performance of the proposed prediction model can be attributed to the extracted features of plant miRNAs and miRNA*s, the selected training dataset, and the carefully selected features.
The web service of MaturePred, the training datasets, the testing datasets, and the selected features are freely available at http://nclab.hit.edu.cn/maturepred/
Reinforcement Causal Structure Learning on Order Graph
Learning a directed acyclic graph (DAG) that describes the causality of
observed data is a very challenging but important task. Due to the limited
quantity and quality of observed data, and the non-identifiability of causal graphs,
it is almost impossible to infer a single precise DAG. Some methods approximate
the posterior distribution of DAGs to explore the DAG space via Markov chain
Monte Carlo (MCMC), but because the DAG space grows super-exponentially,
accurately characterizing the whole distribution over DAGs is
intractable. In this paper, we propose Reinforcement Causal Structure Learning
on Order Graph (RCL-OG), which uses an order graph instead of MCMC to model
different DAG topological orderings and to reduce the problem size. RCL-OG
first defines reinforcement learning with a new reward mechanism to approximate
the posterior distribution of orderings in an efficient way, and uses deep
Q-learning to update and transfer rewards between nodes. Next, it obtains the
probability transition model of nodes on the order graph, and computes the
posterior probability of different orderings. In this way, we can sample from
this model to obtain orderings with high probability. Experiments on
synthetic and benchmark datasets show that RCL-OG provides accurate posterior
probability approximation and achieves better results than competitive causal
discovery algorithms. Comment: Accepted by the Thirty-Seventh AAAI Conference on Artificial
Intelligence (AAAI 2023).
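The super-exponential growth of the DAG space, and the reduction gained by working with orderings, can be checked numerically. The sketch below counts labeled DAGs with Robinson's recurrence and compares the result with the number of topological orderings (n!). It only illustrates the size argument made in the abstract; it is not part of RCL-OG, and the function names are chosen here for illustration.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


def count_orderings(n):
    """Number of possible node orderings (permutations) on n nodes."""
    return factorial(n)


# Compare the two search spaces for small graphs.
dag_counts = [count_dags(n) for n in range(1, 6)]
ordering_counts = [count_orderings(n) for n in range(1, 6)]
```

For five nodes there are already 29281 DAGs but only 120 orderings, which is why modeling the posterior over orderings is a much smaller problem than modeling it over DAGs directly.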
Sentence Bag Graph Formulation for Biomedical Distant Supervision Relation Extraction
We introduce a novel graph-based framework for alleviating key challenges in
distantly-supervised relation extraction and demonstrate its effectiveness in
the challenging and important domain of biomedical data. Specifically, we
propose a graph view of sentence bags referring to an entity pair, which
enables message-passing based aggregation of information related to the entity
pair over the sentence bag. The proposed framework alleviates the common
problem of noisy labeling in distantly supervised relation extraction and also
effectively incorporates inter-dependencies between sentences within a bag.
Extensive experiments on two large-scale biomedical relation datasets and the
widely utilized NYT dataset demonstrate that our proposed framework
significantly outperforms the state-of-the-art methods for biomedical distant
supervision relation extraction while also providing excellent performance for
relation extraction in the general text mining domain
A least square method based model for identifying protein complexes in protein-protein interaction network
Protein complexes, formed by groups of physically interacting proteins, play a crucial role in cell activities. Great effort has been made to computationally identify protein complexes from protein-protein interaction (PPI) networks. However, the accuracy of the prediction is still far from satisfactory, because the topological structures of protein complexes in the PPI network are too complicated. This paper proposes a novel optimization framework, named PLSMC, to detect complexes from PPI networks. The method is based on the fact that if two proteins are in a common complex, they are likely to be interacting. PLSMC employs this relation to determine complexes by a penalized least squares method. PLSMC is applied to several public yeast PPI networks and compared with several state-of-the-art methods. The results indicate that PLSMC outperforms other methods. In particular, complexes predicted by PLSMC match known complexes with higher accuracy than those of other methods. Furthermore, the predicted complexes have high functional homogeneity.
Mining disease genes using integrated protein–protein interaction and gene–gene co-regulation information
In humans, despite the rapid increase in disease-associated gene discovery, a large proportion of disease-associated genes are still unknown. Many network-based approaches have been used to prioritize disease genes. Many networks, such as the protein–protein interaction (PPI), KEGG, and gene co-expression networks, have been used. Expression quantitative trait loci (eQTLs) have been successfully applied for the determination of genes associated with several diseases. In this study, we constructed an eQTL-based gene–gene co-regulation network (GGCRN) and used it to mine for disease genes. We adopted the random walk with restart (RWR) algorithm to mine for genes associated with Alzheimer disease. Compared to the Human Protein Reference Database (HPRD) PPI network alone, the integrated HPRD PPI and GGCRN networks provided faster convergence and revealed new disease-related genes. Therefore, using the RWR algorithm for integrated PPI and GGCRN is an effective method for disease-associated gene mining.
Construction of Complex Features for Computational Predicting ncRNA-Protein Interaction
Non-coding RNA (ncRNA) plays important roles in many critical regulation processes. Many ncRNAs perform their regulatory functions in the form of RNA-protein complexes. Therefore, identifying the interaction between ncRNA and protein is fundamental to understanding the functions of ncRNA. Given the high cost of experimental techniques, developing an accurate computational predictive model has become an indispensable way to identify ncRNA-protein interactions. A powerful predictive model of ncRNA-protein interaction needs a good feature set for characterizing the interaction. In this paper, a novel method (named CFRP) is put forward to generate complex features for characterizing ncRNA-protein interaction. To obtain a comprehensive description of ncRNA-protein interaction, complex features are generated by non-linear transformations from the traditional k-mer features of ncRNA and protein sequences. To further reduce the dimensions of the complex features, a group of discriminative features is selected by random forest. To validate the performance of the proposed method, a series of experiments is carried out on several widely used public datasets. Compared with the traditional k-mer features, the CFRP complex features can boost the performance of the ncRNA-protein interaction prediction model. Meanwhile, the CFRP-based prediction model is compared with several state-of-the-art methods, and the results show that the proposed method achieves better performance than the others in terms of the evaluation metrics. In conclusion, the complex features generated by CFRP are beneficial for building a powerful predictive model of ncRNA-protein interaction.
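The traditional k-mer features that CFRP starts from can be sketched as a normalized frequency vector over all length-k substrings. This is a generic illustration under stated assumptions: the RNA alphabet, the example sequence, and the function name are hypothetical, and the non-linear transformations that CFRP applies afterwards are not specified in the abstract, so they are not shown.

```python
from itertools import product


def kmer_frequencies(sequence, k, alphabet="ACGU"):
    """Return the normalized k-mer frequency vector of a sequence.

    The vector has len(alphabet) ** k entries, ordered as produced by
    itertools.product over the alphabet (AA, AC, AG, ... for k = 2).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing unknown symbols
            counts[window] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Hypothetical short RNA fragment; real inputs would be full ncRNA sequences.
features = kmer_frequencies("ACGUACGU", k=2)
```

The same function applies to protein sequences by passing an amino-acid alphabet; the resulting vectors for the ncRNA and the protein would then be the raw inputs to any further feature construction and selection.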