Identification of disease-causing genes using microarray data mining and gene ontology
Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes.
Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results.
Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth.
Conclusions: The proposed method addresses the weaknesses of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
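The filtering stage mentioned in the Methods can be illustrated with a small sketch. The abstract does not give the exact formula the authors use, so the code below assumes the common two-class Fisher score, (mean1 - mean0)^2 / (var1 + var0), and the names `fisher_score` and `rank_genes` are hypothetical. The SVMRFE, redundancy reduction and Gene Ontology stages are not shown.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one gene (assumed form, not taken from the paper).

    values: expression values of the gene, one per sample
    labels: class label per sample, 0 or 1
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    diff = mean(pos) - mean(neg)
    denom = pvariance(pos) + pvariance(neg)
    if denom == 0:
        # No within-class spread: perfect separation if the means differ.
        return float("inf") if diff else 0.0
    return diff ** 2 / denom


def rank_genes(expression, labels):
    """Return gene names sorted from most to least discriminative.

    expression: dict mapping gene name -> list of expression values
    """
    scores = {gene: fisher_score(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data with hypothetical gene names; four samples, two per class.
    toy = {"g1": [1, 2, 8, 9], "g2": [5, 5, 5, 6]}
    print(rank_genes(toy, [0, 0, 1, 1]))
```

A filter like this scores each gene independently, which is why a later redundancy reduction step is still needed: two highly correlated genes receive similar scores and would both be kept.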
Subcellular location prediction of proteins using support vector machines with alignment of block sequences utilizing amino acid composition
Background: Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. The task is to predict which part of a cell a given protein is transported to, given the amino acid sequence of the protein as input. This problem is becoming more important because information on subcellular location is helpful for annotation of proteins and genes, and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracy.
Results: In this paper, we propose a novel and general prediction method that combines techniques for sequence alignment with feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. In fivefold cross-validation tests, the overall accuracy and average MCC were 0.9096 and 0.8655, respectively. We also applied our method to other datasets, including that of WoLF PSORT.
Conclusion: Although there is a predictor that uses gene ontology information and yields higher accuracy than ours, our accuracies are higher than those of existing predictors that use only sequence information. Since information such as gene ontology can be obtained only for known proteins, our predictor is considered useful for subcellular location prediction of newly discovered proteins. Furthermore, the idea of combining alignment and amino acid frequency is novel and general, so it may be applied to other problems in bioinformatics. Our method for plants is also implemented as a web system, available at http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html
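Two quantities named in this abstract have standard definitions that are easy to show in code: the amino acid composition feature vector and the Matthews correlation coefficient (MCC). The sketch below is only an illustration under those standard definitions. It does not reproduce the authors' block alignment step or their SVM setup, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return a 20-dimensional vector of amino acid frequencies.

    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal is empty
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    features = aa_composition("MKTAYIAKQR")  # arbitrary example sequence
    print(len(features), sum(features))
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

A composition vector has a fixed length regardless of sequence length, which is what makes it usable as input to a support vector machine; the price is that it discards residue order, which is presumably why the abstract pairs it with alignment.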
LipocalinPred: a SVM-based method for prediction of lipocalins
Background: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families sharing extremely low sequence identity, as for lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures.
Results: In the present study we propose an SVM-based method for identification of lipocalin protein sequences. The SVM models were trained with input features generated using amino acid, dipeptide and secondary structure compositions as well as PSSM profiles. The model derived using both PSSM and secondary structure emerged as the best model in the study. Apart from achieving a high prediction accuracy (>90% in leave-one-out), LipocalinPred correctly differentiates closely related fatty acid-binding proteins and triabins as non-lipocalins.
Conclusion: The method offers a promising approach as a lipocalin prediction tool, complementing PROSITE, Pfam and homology modelling methods.
Inferring Pathway Activity toward Precise Disease Classification
The advent of microarray technology has made it possible to classify disease states based on gene expression profiles of patients. Typically, marker genes are selected by measuring the power of their expression profiles to discriminate among patients of different disease states. However, expression-based classification can be challenging in complex diseases due to factors such as cellular heterogeneity within a tissue sample and genetic heterogeneity across patients. A promising technique for coping with these challenges is to incorporate pathway information into the disease classification procedure in order to classify disease based on the activity of entire signaling pathways or protein complexes rather than on the expression levels of individual genes or proteins. We propose a new classification method based on pathway activities inferred for each patient. For each pathway, an activity level is summarized from the gene expression levels of its condition-responsive genes (CORGs), defined as the subset of genes in the pathway whose combined expression delivers optimal discriminative power for the disease phenotype. We show that classifiers using pathway activity achieve better performance than classifiers based on individual gene expression, for both simple and complex case-control studies including differentiation of perturbed from non-perturbed cells and subtyping of several different kinds of cancer. Moreover, the new method outperforms several previous approaches that use a static (i.e., non-conditional) definition of pathways. Within a pathway, the identified CORGs may facilitate the development of better diagnostic markers and the discovery of core alterations in human disease
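The idea of summarizing a pathway by a discriminative subset of its genes can be sketched as a greedy search. The abstract does not specify the scoring function or the search procedure, so everything below is an assumption made for illustration: expression is z-normalized per gene, activity is the sum of z-scores divided by the square root of the subset size, and genes are added while an absolute Welch t-statistic between the two classes keeps improving. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev, variance


def zscores(values):
    """Z-normalize one gene across samples (all zeros if it is constant)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def t_score(activity, labels):
    """Absolute Welch t-statistic between class 1 and class 0.

    Needs at least two samples per class.
    """
    a = [x for x, y in zip(activity, labels) if y == 1]
    b = [x for x, y in zip(activity, labels) if y == 0]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se else 0.0


def combine(z_rows):
    """Per-sample activity: sum of z-scores over genes, scaled by sqrt(k)."""
    k = len(z_rows)
    return [sum(col) / sqrt(k) for col in zip(*z_rows)]


def greedy_activity(pathway_expr, labels):
    """Greedily pick genes whose combined activity best separates the classes.

    pathway_expr: dict mapping gene name -> expression values per sample
    Returns (selected gene names, per-sample activity of the selection).
    """
    z = {g: zscores(v) for g, v in pathway_expr.items()}
    order = sorted(z, key=lambda g: t_score(z[g], labels), reverse=True)
    chosen, best = [], 0.0
    for gene in order:
        score = t_score(combine([z[h] for h in chosen + [gene]]), labels)
        if score <= best:
            break  # adding this gene no longer improves discrimination
        chosen, best = chosen + [gene], score
    return chosen, combine([z[g] for g in chosen])


if __name__ == "__main__":
    # Toy pathway with hypothetical gene names; two samples per class.
    toy = {"g1": [1, 2, 8, 9], "g2": [5, 5, 5, 6], "g3": [3, 3, 3, 3]}
    genes, activity = greedy_activity(toy, [0, 0, 1, 1])
    print(genes, activity)
```

Because the subset is chosen using the class labels, any such selection has to be done inside the training folds of a cross-validation, otherwise the reported classification performance would be optimistically biased.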
Time to Recurrence and Survival in Serous Ovarian Tumors Predicted from Integrated Genomic Profiles
Serous ovarian cancer (SeOvCa) is an aggressive disease with differential and often inadequate therapeutic outcome after standard treatment. The Cancer Genome Atlas (TCGA) has provided rich molecular and genetic profiles from hundreds of primary surgical samples. These profiles confirm mutations of TP53 in ∼100% of patients and an extraordinarily complex profile of DNA copy number changes with considerable patient-to-patient diversity. This raises the joint challenge of exploiting all new available datasets and reducing their confounding complexity for the purpose of predicting clinical outcomes and identifying disease-relevant pathway alterations. We therefore set out to use multi-data type genomic profiles (mRNA, DNA methylation, DNA copy-number alteration and microRNA) available from TCGA to identify prognostic signatures for the prediction of progression-free survival (PFS) and overall survival (OS). We applied a prediction algorithm to two datasets integrated from the four genomic data types. We (1) selected features through cross-validation; (2) generated a prognostic index for patient risk stratification; and (3) directly predicted continuous clinical outcome measures, that is, the time to recurrence and survival time. We used Kaplan-Meier p-values, hazard ratios (HR), and concordance probability estimates (CPE) to assess prediction performance, comparing separate and integrated datasets. Data integration resulted in the best PFS signature (withheld data: p-value = 0.008; HR = 2.83; CPE = 0.72). We provide a prediction tool that inputs genomic profiles of primary surgical samples and generates patient-specific predictions for the time to recurrence and survival, along with outcome risk predictions. Using integrated genomic profiles resulted in information gain for prediction of outcomes. Pathway analysis provided potential insights into functional changes affecting disease progression.
The prognostic signatures, if prospectively validated, may be useful for interpreting therapeutic outcomes for clinical trials that aim to improve the therapy for SeOvCa patients.
ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases.
Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help identify new disease genes. It is freely available at http://cbio.ensmp.fr/prodige
Dormancy within Staphylococcus epidermidis biofilms : a transcriptomic analysis by RNA-seq
The proportion of dormant bacteria within Staphylococcus epidermidis biofilms may determine its inflammatory profile. Previously, we have shown that S. epidermidis biofilms with higher proportions of dormant bacteria have reduced activation of murine macrophages. RNA-sequencing was used to identify the major transcriptomic differences between S. epidermidis biofilms with different proportions of dormant bacteria. To accomplish this goal, we used an in vitro model where magnesium allowed modulation of the proportion of dormant bacteria within S. epidermidis biofilms. Significant differences were found in the expression of 147 genes. A detailed analysis of the results was performed based on direct and functional gene interactions. Biological processes among the differentially expressed genes were mainly related to oxidation-reduction processes and acetyl-CoA metabolic processes. Gene set enrichment revealed that the translation process is related to the proportion of dormant bacteria. Transcription of mRNAs involved in oxidation-reduction processes was associated with higher proportions of dormant bacteria within S. epidermidis biofilm. Moreover, the pH of the culture medium did not change after the addition of magnesium, and genes related to magnesium transport did not seem to impact entrance of bacterial cells into dormancy. The authors thank Stephen Lorry at Harvard Medical School for providing CLC Genomics software. This work was funded by Fundacao para a Ciencia e a Tecnologia (FCT) and COMPETE grants PTDC/BIA-MIC/113450/2009, FCOMP-01-0124-FEDER-014309, FCOMP-01-0124-FEDER-022718 (FCT PEst-C/SAU/LA0002/2011), QOPNA research unit (project PEst-C/QUI/UI0062/2011), and CENTRO-07-ST24-FEDER-002034. The following authors had an individual FCT fellowship: VC (SFRH/BD/78235/2011) and AF (2SFRH/BD/62359/2009)
The CRE1 carbon catabolite repressor of the fungus Trichoderma reesei: a master regulator of carbon assimilation
Background: The identification and characterization of the transcriptional regulatory networks governing the physiology and adaptation of microbial cells is a key step in understanding their behaviour. One such wide-domain regulatory circuit, essential to all cells, is carbon catabolite repression (CCR): it allows the cell to prefer some carbon sources, whose assimilation is of high nutritional value, over less profitable ones. In lower multicellular fungi, the C2H2 zinc finger CreA/CRE1 protein has been shown to act as the transcriptional repressor in this process. However, the complete list of its gene targets is not known.
Results: Here, we deciphered the CRE1 regulatory range in the model cellulose- and hemicellulose-degrading fungus Trichoderma reesei (anamorph of Hypocrea jecorina) by profiling transcription in a wild-type and a delta-cre1 mutant strain on glucose at constant growth rates known to repress and de-repress CCR-affected genes. Analysis of genome-wide microarrays revealed that 2.8% of transcripts were regulated in at least one of the four experimental conditions; of these, 47.3% were repressed by CRE1, 29.0% were actually induced by CRE1, and 17.2% were affected only by the growth rate, independently of CRE1. Among CRE1-repressed transcripts, genes encoding unknown proteins and transport proteins were overrepresented. In addition, we found CRE1 repression of the uptake of nitrogenous substances, of components of chromatin remodeling and the transcriptional mediator complex, and of developmental processes.
Conclusions: Our study provides the first global insight into the molecular physiological response of a multicellular fungus to carbon catabolite regulation and identifies several previously unknown targets in a growth-controlled environment.
Large-scale integration of cancer microarray data identifies a robust common cancer signature
Background: There is a continuing need to develop molecular diagnostic tools which complement histopathologic examination to increase the accuracy of cancer diagnosis. DNA microarrays provide a means for measuring gene expression signatures which can then be used as components of genomic-based diagnostic tests to determine the presence of cancer.
Results: In this study, we collect and integrate ~1500 microarray gene expression profiles from 26 published cancer data sets across 21 major human cancer types. We then apply a statistical method, referred to as the Top-Scoring Pair of Groups (TSPG) classifier, and a repeated random sampling strategy to the integrated training data sets and identify a common cancer signature consisting of 46 genes. These 46 genes are naturally divided into two distinct groups; those in one group are typically expressed less than those in the other group for cancer tissues. Given a new expression profile, the classifier discriminates cancer from normal tissues by ranking the expression values of the 46 genes in the cancer signature and comparing the average ranks of the two groups. This signature is then validated by applying this decision rule to independent test data.
Conclusion: By combining the TSPG method and repeated random sampling, a robust common cancer signature has been identified from large-scale microarray data integration. Upon further validation, this signature may be useful as a robust and objective diagnostic test for cancer.
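The decision rule described in the Results (rank the signature genes within one profile, then compare the average ranks of the two groups) can be sketched in a few lines. This is an illustration of the rule as the abstract states it, not the authors' implementation: the gene names are placeholders, ties are broken by list order rather than averaged, and how the two groups are learned from training data is not shown.

```python
def average_rank(ranks, group):
    """Mean rank of the genes in one group."""
    return sum(ranks[g] for g in group) / len(group)


def classify_profile(profile, low_in_cancer, high_in_cancer):
    """Classify one expression profile with a rank-comparison rule.

    profile: dict mapping gene name -> expression value for one sample
    low_in_cancer: genes assumed to be expressed less in cancer tissue
    high_in_cancer: genes assumed to be expressed more in cancer tissue
    """
    genes = list(low_in_cancer) + list(high_in_cancer)
    # Rank 1 is the lowest expression value among the signature genes.
    ordered = sorted(genes, key=lambda g: profile[g])
    ranks = {g: i + 1 for i, g in enumerate(ordered)}
    if average_rank(ranks, low_in_cancer) < average_rank(ranks, high_in_cancer):
        return "cancer"
    return "normal"


if __name__ == "__main__":
    # Hypothetical four-gene signature; the real signature has 46 genes.
    sample = {"geneA": 1.2, "geneB": 2.5, "geneC": 8.1, "geneD": 9.4}
    print(classify_profile(sample, ["geneA", "geneB"], ["geneC", "geneD"]))
```

Because only the within-sample ordering of expression values is used, a rule of this kind is unaffected by any monotone rescaling of a profile, which is one plausible reason it suits data pooled from many platforms and studies.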
A taxonomy of epithelial human cancer and their metastases
Background: Microarray technology has made it possible to characterize many different cancer sites at the molecular level. This technology has the potential to individualize therapy and to discover new drug targets. However, due to technological differences and issues in standardized sample collection, no study has evaluated the molecular profile of epithelial human cancer in a large number of samples and tissues. Additionally, it has not yet been extensively investigated whether metastases resemble their tissue of origin or tissue of destination.
Methods: We studied the expression profiles of a series of 1566 primary tumors and 178 metastases by unsupervised hierarchical clustering. The clustering profile was subsequently investigated and correlated with clinico-pathological data. Statistical enrichment of clinico-pathological annotations of groups of samples was investigated using Fisher's exact test. Gene set enrichment analysis (GSEA) and DAVID functional enrichment analysis were used to investigate the molecular pathways. Kaplan-Meier survival analysis and log-rank tests were used to investigate the prognostic significance of gene signatures.
Results: Large clusters corresponding to breast, gastrointestinal, ovarian and kidney primary tissues emerged from the data. Chromophobe renal cell carcinoma clustered together with follicular differentiated thyroid carcinoma, which supports recent morphological descriptions of thyroid follicular carcinoma-like tumors in the kidney and suggests that they represent a subtype of chromophobe carcinoma. We also found an expression signature identifying primary tumors of squamous cell histology in multiple tissues. Next, a subset of ovarian tumors enriched with endometrioid histology clustered together with endometrium tumors, confirming that they share their etiopathogenesis, which strongly differs from that of serous ovarian tumors. In addition, the clustering of colon and breast tumors correlated with clinico-pathological characteristics. Moreover, a signature was developed based on our unsupervised clustering of breast tumors, and this was predictive for disease-specific survival in three independent studies. Next, the metastases from ovary, breast, lung and vulva clustered with their tissue of origin, while metastases from colon showed a bimodal distribution: a significant part clustered with the tissue of origin, while the remaining tumors clustered with the tissue of destination.
Conclusion: Our molecular taxonomy of epithelial human cancer indicates surprising correlations across tissues. This may have a significant impact on the classification of many cancer sites and may guide pathologists, both in research and daily practice. Moreover, these results based on unsupervised analysis yielded a signature predictive of clinical outcome in breast cancer. Additionally, we hypothesize that metastases of gastrointestinal origin either remember their tissue of origin or adapt to the tissue of destination. More specifically, colon metastases in the liver show strong evidence for such a bimodal tissue-specific profile.