842 research outputs found

    Multiobjective optimization of cluster measures in Microarray Cancer data using Genetic Algorithm Based Fuzzy Clustering

    Get PDF
    The field of biological and biomedical research has been changed rapidly with the invention of microarray technology, which facilitates simultaneously monitoring of large number of genes across different experimental conditions. In this report a multi objective genetic algorithm technique called Non-Dominated Sorting Genetic Algorithm (NSGA) - II based approach has been proposed for fuzzy clustering of microarray cancer expression dataset that encodes the cluster modes and simultaneously optimizes the two factors called fuzzy compactness and fuzzy separation of the clusters. The multiobjective technique produces a set of non-dominated solutions. This approach identifies the solution i.e. the individual chromosome which gives the optimal value of the parameters

    A Multiobjective Evolutionary Conceptual Clustering Methodology for Gene Annotation Within Structural Databases: A Case of Study on the Gene Ontology Database

    Get PDF
    Current tools and techniques devoted to examine the content of large databases are often hampered by their inability to support searches based on criteria that are meaningful to their users. These shortcomings are particularly evident in data banks storing representations of structural data such as biological networks. Conceptual clustering techniques have demonstrated to be appropriate for uncovering relationships between features that characterize objects in structural data. However, typical con ceptual clustering approaches normally recover the most obvious relations, but fail to discover the lessfrequent but more informative underlying data associations. The combination of evolutionary algorithms with multiobjective and multimodal optimization techniques constitutes a suitable tool for solving this problem. We propose a novel conceptual clustering methodology termed evolutionary multiobjective conceptual clustering (EMO-CC), re lying on the NSGA-II multiobjective (MO) genetic algorithm. We apply this methodology to identify conceptual models in struc tural databases generated from gene ontologies. These models can explain and predict phenotypes in the immunoinflammatory response problem, similar to those provided by gene expression or other genetic markers. The analysis of these results reveals that our approach uncovers cohesive clusters, even those comprising a small number of observations explained by several features, which allows describing objects and their interactions from different perspectives and at different levels of detail.Ministerio de Ciencia y Tecnología TIC-2003-00877Ministerio de Ciencia y Tecnología BIO2004-0270EMinisterio de Ciencia y Tecnología TIN2006-1287

    Machine Learning Approaches for Cancer Analysis

    Get PDF
    In addition, we propose many machine learning models that serve as contributions to solve a biological problem. First, we present Zseq, a linear time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors, such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we reveal that the combination of proper clustering, distance function and Index validation for clusters are suitable in identifying outlier transcripts, which show different trending than the majority of the transcripts, the trending of the transcript is the abundance throughout different stages of prostate cancer. We compare this model with standard hierarchical time-series clustering method based on Euclidean distance. Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease

    Algorithms to Explore the Structure and Evolution of Biological Networks

    Get PDF
    High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks. We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single solution approach would miss. To facilitate the study of ancient networks, we introduce a framework called ``network archaeology'') for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared. Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end

    Revealing the vectors of cellular identity with single-cell genomics

    Get PDF
    Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges-from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities.National Institutes of Health (U.S.) (grant P50 HG006193)BRAIN Initiative (grant U01 MH105979)National Institutes of Health (U.S.) (BRAIN grant 1U01MH105960-01)National Cancer Institute (U.S.) (grant 1U24CA180922)National Institute of Allergy and Infectious Diseases (U.S.) (grant 1U24AI118672-01
    corecore