
    A Short Survey on Data Clustering Algorithms

    With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains, for instance bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set into subsets that maximize the intra-subset similarity and the inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, state-of-the-art clustering algorithms are reviewed from design concept to methodology, and different clustering paradigms are discussed. Advanced clustering algorithms are also discussed. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end.
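The formal objective stated in this abstract (high similarity within subsets, high dissimilarity between them) can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure; the function name `partition_quality` and the toy points are hypothetical and do not come from the survey.

```python
from itertools import combinations
from math import dist


def partition_quality(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    A good partition has a small first value and a large second value.
    Euclidean distance is an assumption; any dissimilarity could be used.
    """
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        # Pairs sharing a label count as intra-cluster, all others as inter-cluster.
        (intra if label_p == label_q else inter).append(dist(p, q))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Toy data: two visually obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = partition_quality(points, [0, 0, 1, 1])  # groups neighbours together
bad = partition_quality(points, [0, 1, 0, 1])   # splits neighbours apart
```

Comparing `good` and `bad` shows the intended behaviour: the natural grouping has the lower intra-cluster distance and the higher inter-cluster distance.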

    Identification of disease-causing genes using microarray data mining and gene ontology

    Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One shortcoming of microarray data is that they provide a small number of samples relative to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced by the small sample size, by noisy data, and by the fact that the method does not remove redundant genes. Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. Specifically, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology, which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. Conclusions: The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
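As a rough illustration of the filtering stage only, the sketch below ranks genes by a two-class Fisher score, taken here as the squared difference of class means divided by the sum of class variances. This is one common form of the score and an assumption about the abstract's "Fisher method"; the SVMRFE, redundancy reduction and Gene Ontology stages are not reproduced. The gene names and expression values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score: (mean0 - mean1)^2 / (var0 + var1)."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    numerator = (mean(class0) - mean(class1)) ** 2
    denominator = pvariance(class0) + pvariance(class1)
    if denominator > 0:
        return numerator / denominator
    # Zero variance in both classes: perfectly separating or uninformative.
    return float("inf") if numerator > 0 else 0.0


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names ordered by decreasing Fisher score."""
    ranked = sorted(expression,
                    key=lambda gene: fisher_score(expression[gene], labels),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical expression values: three samples per class.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # strongly separates the classes
    "geneB": [2.0, 3.0, 2.5, 2.4, 2.9, 2.1],  # nearly uninformative
    "geneC": [0.5, 0.7, 0.6, 1.5, 1.4, 1.6],  # separates, smaller margin
}
selected = rank_genes(expression, labels, top_k=2)
```

A filter like this is cheap because it scores each gene independently, which is also why it cannot detect redundant genes on its own.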

    Identification of an Efficient Gene Expression Panel for Glioblastoma Classification.

    We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu

    CarGene: Characterisation of sets of genes based on metabolic pathways analysis

    The great amount of biological information provides scientists with an incomparable framework for testing the results of new algorithms. Several tools have been developed for analysing gene enrichment, and most of them are Gene Ontology-based. We developed a Kyoto Encyclopedia of Genes and Genomes (KEGG)-based tool that provides a friendly graphical environment for analysing gene enrichment. The tool integrates two statistical corrections and simultaneously analyses information about many groups of genes in both a visual and a textual manner. We tested the usefulness of our approach on a previous analysis (Huttenshower et al.). Furthermore, our tool is freely available (http://www.upo.es/eps/bigs/cargene.html). Funding: Ministerio de Ciencia y Tecnología TIN2007-68084-C02-00; Ministerio de Ciencia e Innovación PCI2006-A7-0575; Junta de Andalucía P07-TIC-02611; Junta de Andalucía TIC-20

    Decorin protein core affects the global gene expression profile of the tumor microenvironment in a triple-negative orthotopic breast carcinoma xenograft model

    Decorin, a member of the small leucine-rich proteoglycan gene family, exists and functions wholly within the tumor microenvironment to suppress tumorigenesis by directly targeting and antagonizing multiple receptor tyrosine kinases, such as the EGFR and Met. This leads to potent and sustained signal attenuation, growth arrest, and angiostasis. We thus sought to evaluate the tumoricidal benefits of systemic decorin on a triple-negative orthotopic breast carcinoma xenograft model. To this end, we employed a novel high-density mixed expression array capable of differentiating and simultaneously measuring gene signatures of both Mus musculus (stromal) and Homo sapiens (epithelial) tissue origins. We found that decorin protein core modulated the differential expression of 374 genes within the stromal compartment of the tumor xenograft. Further, our top gene ontology classes strongly suggest an unexpected and preferential role for decorin protein core in inhibiting genes necessary for immunomodulatory responses while simultaneously inducing expression of those possessing cellular adhesion and tumor suppressive properties. Rigorous verification of the top scoring candidates led to the discovery of three genes heretofore unlinked to malignant breast cancer that were reproducibly found to be induced in several models of tumor stroma. Collectively, our data provide highly novel and unexpected stromal gene signatures as a direct function of systemic administration of decorin protein core and reveal a fundamental basis of action for decorin to modulate the tumor stroma as a biological mechanism for the ascribed anti-tumorigenic properties.

    Genome-wide expression patterns associated with oncogenesis and sarcomatous transdifferentation of cholangiocarcinoma

    BACKGROUND: The molecular mechanisms of CC (cholangiocarcinoma) oncogenesis and progression are poorly understood. This study aimed to determine the genome-wide expression of genes related to CC oncogenesis and sarcomatous transdifferentiation. METHODS: Genes that were differentially expressed between CC cell lines or tissues and cultured normal biliary epithelial (NBE) cells were identified using DNA microarray technology. Expressions were validated in human CC tissues and cells. RESULTS: Using unsupervised hierarchical clustering analysis of the cell line and tissue samples, we identified a set of 342 commonly regulated (>2-fold change) genes. Of these, 53, including tumor-related genes, were upregulated, and 289, including tumor suppressor genes, were downregulated (<0.5 fold change). Expression of SPP1, EFNB2, E2F2, IRX3, PTTG1, PPARγ, KRT17, UCHL1, IGFBP7 and SPARC proteins was immunohistochemically verified in human and hamster CC tissues. Additional unsupervised hierarchical clustering analysis of sarcomatoid CC cells compared to three adenocarcinomatous CC cell lines revealed 292 differentially upregulated genes (>4-fold change), and 267 differentially downregulated genes (<0.25 fold change). The expression of 12 proteins was validated in the CC cell lines by immunoblot analysis and immunohistochemical staining. Of the proteins analyzed, we found upregulation of the expression of the epithelial-mesenchymal transition (EMT)-related proteins VIM and TWIST1, and restoration of the methylation-silenced proteins LDHB, BNIP3, UCHL1, and NPTX2 during sarcomatoid transdifferentiation of CC. CONCLUSION: The deregulation of oncogenes, tumor suppressor genes, and methylation-related genes may be useful in identifying molecular targets for CC diagnosis and prognosis
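The fold-change cutoffs quoted in this abstract (above 2-fold for upregulation, below 0.5-fold for downregulation) amount to a simple ratio test. The sketch below is a hypothetical illustration of that thresholding only: the function name, the use of a plain case/control ratio, and the example values are assumptions, and the authors' actual normalization and clustering steps are not shown.

```python
def classify_fold_change(case, control, up=2.0, down=0.5):
    """Label a gene by the ratio of case to control expression.

    Default cutoffs mirror the >2-fold and <0.5-fold thresholds in the
    abstract; pass up=4.0, down=0.25 for the stricter sarcomatoid comparison.
    """
    if control <= 0:
        raise ValueError("control expression must be positive")
    ratio = case / control
    if ratio > up:
        return "up"
    if ratio < down:
        return "down"
    return "unchanged"


# Hypothetical intensity values, not data from the study.
calls = [
    classify_fold_change(8.0, 2.0),   # ratio 4.0
    classify_fold_change(1.0, 4.0),   # ratio 0.25
    classify_fold_change(3.0, 2.0),   # ratio 1.5
]
```

Note that the two cutoffs are reciprocal, so a gene must change by the same factor in either direction to be called.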

    Computational Methods for Knowledge Integration in the Analysis of Large-scale Biological Networks

    As we migrate into an era of personalized medicine, understanding how bio-molecules interact with one another to form cellular systems is one of the key focus areas of systems biology. Several challenges such as the dynamic nature of cellular systems, uncertainty due to environmental influences, and the heterogeneity between individual patients render this a difficult task. In the last decade, several algorithms have been proposed to elucidate cellular systems from data, resulting in numerous data-driven hypotheses. However, due to the large number of variables involved in the process, many of which are unknown or not measurable, such computational approaches often lead to a high proportion of false positives. This renders interpretation of the data-driven hypotheses extremely difficult. Consequently, a dismal proportion of these hypotheses are subject to further experimental validation, eventually limiting their potential to augment existing biological knowledge. This dissertation develops a framework of computational methods for the analysis of such data-driven hypotheses leveraging existing biological knowledge. Specifically, I show how biological knowledge can be mapped onto these hypotheses and subsequently augmented through novel hypotheses. Biological hypotheses are learnt in three levels of abstraction (individual interactions, functional modules and relationships between pathways), corresponding to three complementary aspects of biological systems. The computational methods developed in this dissertation are applied to high throughput cancer data, resulting in novel hypotheses with potentially significant biological impact. Dissertation/Thesis, Ph.D. Computer Science 201

    Prospecting for Genes involved in transcriptional regulation of plant defenses, a bioinformatics approach

    Background: In order to comprehend the mechanisms of induced plant defense, knowledge of the biosynthesis and signaling pathways mediated by salicylic acid (SA), jasmonic acid (JA) and ethylene (ET) is essential. Potentially, many transcription factors could be involved in the regulation of these pathways, although finding them is a difficult endeavor. Here we report the use of publicly available Arabidopsis microarray datasets to generate gene co-expression networks. Results: Using 372 publicly available microarray data sets, a network was constructed in which Arabidopsis genes for known components of SA, JA and ET pathways together with the genes of over 1400 transcription factors were assayed for co-expression. After determining the Pearson Correlation Coefficient cutoff to obtain the most probable biologically relevant co-expressed genes, the resulting network confirmed the presence of many genes previously reported in literature to be relevant for stress responses and connections that fit current models of stress gene regulation, indicating the potential of our approach. In addition, the derived network suggested new candidate genes and associations that are potentially interesting for future research to further unravel their involvement in responses to stress. Conclusions: In this study large sets of stress related microarrays were used to reveal co-expression networks of transcription factors and signaling pathway components. These networks will benefit further characterization of the signal transduction pathways involved in plant defense.
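The idea of building a co-expression network from a Pearson Correlation Coefficient cutoff can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names, expression profiles and the cutoff of 0.8 are hypothetical, only positive correlation is treated as co-expression, and the abstract does not say how the authors chose their cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return covariance / (spread_x * spread_y)


def coexpression_edges(profiles, cutoff):
    """Return (gene1, gene2, r) for every gene pair with r >= cutoff."""
    edges = []
    for gene1, gene2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene1], profiles[gene2])
        if r >= cutoff:
            edges.append((gene1, gene2, r))
    return edges


# Hypothetical expression profiles across four arrays.
profiles = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "SA_gene": [2.0, 4.0, 6.0, 8.0],  # rises together with TF1
    "JA_gene": [4.0, 1.0, 3.0, 2.0],  # weakly anti-correlated with TF1
}
edges = coexpression_edges(profiles, cutoff=0.8)
```

With real data the pairwise loop grows quadratically in the number of genes, so a vectorized correlation matrix would be the practical choice; the loop is kept here for readability.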

    Ranked prediction of p53 targets using hidden variable dynamic modeling

    Full exploitation of microarray data requires hidden information that cannot be extracted using current analysis methodologies. We present a new approach, hidden variable dynamic modeling (HVDM), which derives the hidden profile of a transcription factor from time series microarray data, and generates a ranked list of predicted targets. We applied HVDM to the p53 network, validating predictions experimentally using small interfering RNA. HVDM can be applied in many systems biology contexts to predict regulation of gene activity quantitatively