53 research outputs found

    A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis

    Get PDF
    The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data has become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze the data for novel, non-trivial, and important knowledge. The high-performance computing platform, and algorithms that utilize its embedded computing capacity, has emerged as a leading technology that can handle such data-intensive knowledge discovery applications. In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically-significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application. The proposed paradigm consists of the following key computational steps: (a) preprocess, normalize the gene expression data; (b) discretize the data for knowledge mining application; (c) partition the data using two proposed methods: partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data-spaces for association rule discovery; (e) integrate association rules from partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are demonstrated to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological physiological discovery from cancer genomics datasets, while suggesting some directions for future research

    Inferring interactions, expression programs and regulatory networks from high throughput biological data

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.Includes bibliographical references (leaves 171-180).(cont.) For the networks level I present an algorithm that efficiently combines complementary large-scale expression and protein-DNA binding data to discover co-regulated modules of genes. This algorithm is extended so that it can infer sub-networks for specific systems in the cell. Finally, I present an algorithm which combines some of the above methods to automatically infer a dynamic sub-network for the cell cycle system.In this thesis I present algorithms for analyzing high throughput biological datasets. These algorithms work on a number of different analysis levels to infer interactions between genes, determine gene expression programs and model complex biological networks. Recent advances in high-throughput experimental methods in molecular biology hold great promise. DNA microarray technologies enable researchers to measure the expression levels of thousands of genes simultaneously. Time series expression data offers particularly rich opportunities for understanding the dynamics of biological processes. In addition to measuring expression data, microarrays have been recently exploited to measure genome-wide protein-DNA binding events. While these types of data are revolutionizing biology, they also present many computational challenges. Principled computational methods are required in order to make full use of each of these datasets, and to combine them to infer interactions and discover networks for modeling different systems in the cell. The algorithms presented in this thesis address three different analysis levels of high throughput biological data: Recovering individual gene values, pattern recognition and networks. For time series expression data, I present algorithms that permit the principled estimation of unobserved time-points, alignment and the identification of differentially expressed genes. For pattern recognition, I present algorithms for clustering continuous data, and for ordering the leaves of a clustering tree to infer expression programs.by Ziv Bar-Joseph.Ph.D

    Genetic algorithm-neural network: feature extraction for bioinformatics data.

    Get PDF
    With the advance of gene expression data in the bioinformatics field, the questions which frequently arise, for both computer and medical scientists, are which genes are significantly involved in discriminating cancer classes and which genes are significant with respect to a specific cancer pathology. Numerous computational analysis models have been developed to identify informative genes from the microarray data, however, the integrity of the reported genes is still uncertain. This is mainly due to the misconception of the objectives of microarray study. Furthermore, the application of various preprocessing techniques in the microarray data has jeopardised the quality of the microarray data. As a result, the integrity of the findings has been compromised by the improper use of techniques and the ill-conceived objectives of the study. This research proposes an innovative hybridised model based on genetic algorithms (GAs) and artificial neural networks (ANNs), to extract the highly differentially expressed genes for a specific cancer pathology. The proposed method can efficiently extract the informative genes from the original data set and this has reduced the gene variability errors incurred by the preprocessing techniques. The novelty of the research comes from two perspectives. Firstly, the research emphasises on extracting informative features from a high dimensional and highly complex data set, rather than to improve classification results. Secondly, the use of ANN to compute the fitness function of GA which is rare in the context of feature extraction. Two benchmark microarray data have been taken to research the prominent genes expressed in the tumour development and the results show that the genes respond to different stages of tumourigenesis (i.e. different fitness precision levels) which may be useful for early malignancy detection. The extraction ability of the proposed model is validated based on the expected results in the synthetic data sets. In addition, two bioassay data have been used to examine the efficiency of the proposed model to extract significant features from the large, imbalanced and multiple data representation bioassay data

    Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

    Get PDF
    The explosive growth in the rate of data generation in recent years threatens to outpace the growth in computer power, motivating the need for new, scalable algorithms and big data analytic techniques. No field may be more emblematic of this data deluge than the life sciences, where technologies such as high-throughput mRNA arrays and next generation genome sequencing are routinely used to generate datasets of extreme scale. Data from experiments in genomics, transcriptomics, metabolomics and proteomics are continuously being added to existing repositories. A goal of exploratory analysis of such omics data is to illuminate the functions and relationships of biomolecules within an organism. This dissertation describes the design, implementation and application of graph algorithms, with the goal of seeking dense structure in data derived from omics experiments in order to detect latent associations between often heterogeneous entities, such as genes, diseases and phenotypes. Exact combinatorial solutions are developed and implemented, rather than relying on approximations or heuristics, even when problems are exceedingly large and/or difficult. Datasets on which the algorithms are applied include time series transcriptomic data from an experiment on the developing mouse cerebellum, gene expression data measuring acute ethanol response in the prefrontal cortex, and the analysis of a predicted protein-protein interaction network. A bipartite graph model is used to integrate heterogeneous data types, such as genes with phenotypes and microbes with mouse strains. The techniques are then extended to a multipartite algorithm to enumerate dense substructure in multipartite graphs, constructed using data from three or more heterogeneous sources, with applications to functional genomics. Several new theoretical results are given regarding multipartite graphs and the multipartite enumeration algorithm. In all cases, practical implementations are demonstrated to expand the frontier of computational feasibility

    DETECTING CANCER-RELATED GENES AND GENE-GENE INTERACTIONS BY MACHINE LEARNING METHODS

    Get PDF
    To understand the underlying molecular mechanisms of cancer and therefore to improve pathogenesis, prevention, diagnosis and treatment of cancer, it is necessary to explore the activities of cancer-related genes and the interactions among these genes. In this dissertation, I use machine learning and computational methods to identify differential gene relations and detect gene-gene interactions. To identify gene pairs that have different relationships in normal versus cancer tissues, I develop an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets. The experimental results demonstrate that my method can find meaningful alterations in gene relations. For gene-gene interaction detection, I propose to use two Bayesian Network based methods: DASSO-MB (Detection of ASSOciations using Markov Blanket) and EpiBN (Epistatic interaction detection using Bayesian Network model) to address the two critical challenges: searching and scoring. DASSO-MB is based on the concept of Markov Blanket in Bayesian Networks. In EpiBN, I develop a new scoring function, which can reflect higher-order gene-gene interactions and detect the true number of disease markers, and apply a fast Branch-and-Bound (B&B) algorithm to learn the structure of Bayesian Network. Both DASSO-MB and EpiBN outperform some other commonly-used methods and are scalable to genome-wide data

    CAD Tools for DNA Micro-Array Design, Manufacture and Application

    Get PDF
    Motivation: As the human genome project progresses and some microbial and eukaryotic genomes are recognized, numerous biotechnological processes have attracted increasing number of biologists, bioengineers and computer scientists recently. Biotechnological processes profoundly involve production and analysis of highthroughput experimental data. Numerous sequence libraries of DNA and protein structures of a large number of micro-organisms and a variety of other databases related to biology and chemistry are available. For example, microarray technology, a novel biotechnology, promises to monitor the whole genome at once, so that researchers can study the whole genome on the global level and have a better picture of the expressions among millions of genes simultaneously. Today, it is widely used in many fields- disease diagnosis, gene classification, gene regulatory network, and drug discovery. For example, designing organism specific microarray and analysis of experimental data require combining heterogeneous computational tools that usually differ in the data format; such as, GeneMark for ORF extraction, Promide for DNA probe selection, Chip for probe placement on microarray chip, BLAST to compare sequences, MEGA for phylogenetic analysis, and ClustalX for multiple alignments. Solution: Surprisingly enough, despite huge research efforts invested in DNA array applications, very few works are devoted to computer-aided optimization of DNA array design and manufacturing. Current design practices are dominated by ad-hoc heuristics incorporated in proprietary tools with unknown suboptimality. This will soon become a bottleneck for the new generation of high-density arrays, such as the ones currently being designed at Perlegen [109]. The goal of the already accomplished research was to develop highly scalable tools, with predictable runtime and quality, for cost-effective, computer-aided design and manufacturing of DNA probe arrays. We illustrate the utility of our approach by taking a concrete example of combining the design tools of microarray technology for Harpes B virus DNA data

    Genetic algorithm-neural network : feature extraction for bioinformatics data

    Get PDF
    With the advance of gene expression data in the bioinformatics field, the questions which frequently arise, for both computer and medical scientists, are which genes are significantly involved in discriminating cancer classes and which genes are significant with respect to a specific cancer pathology. Numerous computational analysis models have been developed to identify informative genes from the microarray data, however, the integrity of the reported genes is still uncertain. This is mainly due to the misconception of the objectives of microarray study. Furthermore, the application of various preprocessing techniques in the microarray data has jeopardised the quality of the microarray data. As a result, the integrity of the findings has been compromised by the improper use of techniques and the ill-conceived objectives of the study. This research proposes an innovative hybridised model based on genetic algorithms (GAs) and artificial neural networks (ANNs), to extract the highly differentially expressed genes for a specific cancer pathology. The proposed method can efficiently extract the informative genes from the original data set and this has reduced the gene variability errors incurred by the preprocessing techniques. The novelty of the research comes from two perspectives. Firstly, the research emphasises on extracting informative features from a high dimensional and highly complex data set, rather than to improve classification results. Secondly, the use of ANN to compute the fitness function of GA which is rare in the context of feature extraction. Two benchmark microarray data have been taken to research the prominent genes expressed in the tumour development and the results show that the genes respond to different stages of tumourigenesis (i.e. different fitness precision levels) which may be useful for early malignancy detection. The extraction ability of the proposed model is validated based on the expected results in the synthetic data sets. In addition, two bioassay data have been used to examine the efficiency of the proposed model to extract significant features from the large, imbalanced and multiple data representation bioassay data.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Bioinformatic Investigations Into the Genetic Architecture of Renal Disorders

    Get PDF
    Modern genomic analysis has a significant bioinformatic component due to the high volume of complex data that is involved. During investigations into the genetic components of two renal diseases, we developed two software tools. // Genome-Wide Association Studies (GWAS) datasets may be genotyped on different microarrays and subject to different annotation, leading to a mosaic case-control cohort that has inherent errors, primarily due to strand mismatching. Our software REMEDY seeks to detect and correct strand designation of input datasets, as well as filtering for common sources of noise such as structural and multi-allelic variants. We performed a GWAS on a large cohort of Steroid-sensitive nephrotic syndrome samples; the mosaic input datasets were pre-processed with REMEDY prior to merging and analysis. Our results show that REMEDY significantly reduced noise in GWAS output results. REMEDY outperforms existing software as it has significantly more features available such as auto-strand designation detection, comprehensive variant filtering and high-speed variant matching to dbSNP. // The second tool supported the analysis of a newly characterised rare renal disorder: Polycystic kidney disease with hyperinsulinemic hypoglycemia (HIPKD). Identification of the underlying genetic cause led to the hypothesis that a change in chromatin looping at a specific locus affected the aetiology of the disease. We developed LOOPER, a software suite capable of predicting chromatin loops from ChIP-Seq data to explore the possible conformations of chromatin architecture in the HIPKD genomic region. LOOPER predicted several interesting functional and structural loops that supported our hypothesis. We then extended LOOPER to visualise ChIA-PET and ChIP-Seq data as a force-directed graph to show experimental structural and functional chromatin interactions. Next, we re-analysed the HIPKD region with LOOPER to show experimentally validated chromatin interactions. We first confirmed our original predicted loops and subsequently discovered that the local genomic region has many more chromatin features than first thought

    Random Forests Based Rule Learning And Feature Elimination

    Get PDF
    Much research combines data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. We propose an efficient approach, combining rule extraction and feature elimination, based on 1-norm regularized random forests. This approach simultaneously extracts a small number of rules generated by random forests and selects important features. To evaluate this approach, we have applied it to several drug activity prediction data sets, microarray data sets, a seacoast chemical sensors data set, a Stockori flowering time data set, and three data sets from the UCI repository. This approach performs well compared to state-of-the-art prediction algorithms like random forests in terms of predictive performance and generates only a small number of decision rules. Some of the decision rules extracted are significant in solving the problem being studied. It demonstrates high potential in terms of prediction performance and interpretation on studying real applications
    • …
    corecore