129 research outputs found

    Microarray Analysis in Drug Discovery and Biomarker Identification

    A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis

    The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data have become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze the data for novel, non-trivial, and important knowledge. High-performance computing platforms, and algorithms that utilize their embedded computing capacity, have emerged as a leading technology that can handle such data-intensive knowledge discovery applications. In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically-significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application.
The proposed paradigm consists of the following key computational steps: (a) preprocess and normalize the gene expression data; (b) discretize the data for knowledge mining application; (c) partition the data using two proposed methods: partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data-spaces for association rule discovery; (e) integrate association rules from partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) perform post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are demonstrated to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological and physiological discovery from cancer genomics datasets, while suggesting some directions for future research.

    Microarray image processing : a novel neural network framework

    Due to the vast success of bioengineering techniques, a series of large-scale analysis tools has been developed to discover the functional organization of cells. Among them, cDNA microarray has emerged as a powerful technology that enables biologists to study thousands of genes simultaneously within an entire organism, and thus obtain a better understanding of the gene interaction and regulation mechanisms involved. Although microarray technology has been developed so as to offer high tolerances, there exists high signal irregularity across the surface of the microarray image. Imperfections in the microarray image generation process cause noise of many types, which contaminates the resulting image. These errors and noise propagate through, and can significantly affect, all subsequent processing and analysis. Therefore, to realize the potential of such technology it is crucial to obtain high-quality image data that truly reflect the underlying biology in the samples. One of the key steps in extracting information from a microarray image is segmentation: identifying which pixels within an image represent which gene. This area of spotted microarray image analysis has received relatively little attention compared with the advances in subsequent analysis stages. However, the lack of advanced image analysis, including segmentation, results in sub-optimal data being used in all downstream analysis methods. Although much recent research on microarray image analysis has proposed many methods, some methods produce better results than others. In general, the most effective approaches require considerable run time (processing power) to process an entire image.
Furthermore, there has been little progress on developing sufficiently fast yet efficient and effective algorithms for the segmentation of the microarray image by using a highly sophisticated framework such as Cellular Neural Networks (CNNs). It is, therefore, the aim of this thesis to investigate and develop novel methods for processing microarray images. The goal is to produce results that outperform the currently available approaches in terms of PSNR, k-means and ICC measurements. (EThOS - Electronic Theses Online Service; Aleppo University, Syria; GB, United Kingdom)

    Optimization based clustering and classification algorithms in analysis of microarray gene expression data sets

    Doctor of Philosophy. Bioinformatics and computational biology are relatively new areas that draw on different disciplines, including computer science, informatics, biochemistry and applied mathematics, to solve biological problems. In recent years the development of new molecular genetics technologies, such as DNA microarrays, has led to the simultaneous measurement of expression levels of thousands and even tens of thousands of genes. Microarray gene expression technology has facilitated the study of genomic structure and the investigation of biological systems. The numerical output of this technology takes the form of microarray gene expression data sets. These data sets contain a very large number of genes and a relatively small number of samples, and their precise analysis requires robust and suitable computer software. Because of this, only a few existing algorithms are applicable to them, so more efficient methods for solving clustering, gene selection and classification problems of gene expression data sets are required, and those methods need to be computationally applicable and less expensive. The aim of this thesis is to develop new algorithms for solving clustering, gene selection and data classification problems on gene expression data sets. Clustering in gene expression data sets is a challenging problem. The increasing use of DNA microarray-based tumour gene expression profiles for cancer diagnosis requires more efficient methods to solve clustering problems of these profiles. Different algorithms for clustering of genes have been proposed; however, few algorithms can be applied to the clustering of samples. The k-means algorithm is among the very few clustering algorithms applicable to microarray gene expression data sets; however, it is not efficient for solving clustering problems when the number of genes is in the thousands, and it is very sensitive to the choice of a starting point.
Additionally, when the number of clusters is relatively large, this algorithm gives local minima which can differ significantly from the global solution. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm. One of them is the global k-means algorithm; however, this algorithm is not efficient when data are sparse. In this thesis we developed a new version of the global k-means algorithm, the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. In a microarray gene expression data set, in many cases only a small fraction of genes are informative, whereas most of them are non-informative and contribute noise. Therefore the development of gene selection algorithms that allow us to remove as many non-informative genes as possible is very important. In this thesis we developed a new overlapping gene selection algorithm. This algorithm is based on calculating overlaps of different genes. It considerably reduces the number of genes and is efficient in finding a subset of informative genes. Over the last decade different approaches have been proposed to solve supervised data classification problems in gene expression data sets. In this thesis we developed a new approach which is based on the so-called max-min separability and is compared with the other approaches. The max-min separability algorithm is an equivalent of piecewise linear separability. An incremental algorithm is presented to compute piecewise linear functions separating two sets. This algorithm is applied along with a special gene selection algorithm. In this thesis, all new algorithms have been tested on 10 publicly available gene expression data sets, and our numerical results demonstrate the efficiency of the new algorithms that were developed in the framework of this research.

    Unique networks: a method to identify disease-specific regulatory networks from microarray data

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The survival of any organism is determined by the mechanisms triggered in response to the inputs received. Underlying mechanisms are described by graphical networks that can be inferred from different types of data such as microarrays. Deriving robust and reliable networks can be complicated due to the structure of microarray data, which is characterized by bias, noise, and a discrepancy of several orders of magnitude between the number of genes and the number of samples. Researchers overcome this problem by integrating independent data together and deriving the common mechanisms through consensus network analysis. Different conditions generate different inputs to the organism, which reacts by triggering different mechanisms with similarities and differences. Much effort has been spent on identifying the commonalities under different conditions. Highlighting similarities may overshadow the differences, which often identify the main characteristics of the triggered mechanisms. In this thesis we introduce the concept of study-specific mechanism. We develop a pipeline to semiautomatically identify study-specific networks, called unique-networks, through a combination of consensus approach, graphical similarities and network analysis. The main pipeline, called UNIP (Unique Networks Identification Pipeline), takes a set of independent studies, builds gene regulatory networks for each of them, calculates an adaptation of the sensitivity measure based on the networks' graphical similarities, applies clustering to group the studies that generate the most similar networks into study-clusters, and derives the consensus networks. Once each study-cluster is associated with a consensus-network, we identify the links that appear only in the consensus network under consideration but not in the others (unique-connections).
Considering the genes involved in the unique-connections, we build Bayesian networks to derive the unique-networks. Finally, we exploit the inference tool to calculate each gene's prediction-accuracy across all studies to further refine the unique-networks. Biological validation through different software and the literature is explored to validate our method. UNIP is first applied to a set of synthetic data perturbed with different levels of noise to study the performance and verify its reliability. Then, wheat under stress conditions and different types of cancer are explored. Finally, we develop a user-friendly interface to combine the set of studies by using AND and NOT logic operators. Based on the findings, UNIP is a robust and reliable method to analyse large sets of transcriptomic data. It easily detects the main complex relationships between transcriptional expression of genes specific for different conditions and also highlights structures and nodes that could be potential targets for further research.

    Bayesian learning in bioinformatics

    Life sciences research is advancing in breadth and scope, affecting many areas of life including medical care and government policy. The field of Bioinformatics, in particular, is growing very rapidly with the help of computer science, statistics, applied mathematics, and engineering. New high-throughput technologies are making it possible to measure genomic variation across phenotypes in organisms at costs that were once inconceivable. In conjunction, and partly as a consequence, massive amounts of information about the genomes of many organisms are becoming accessible in the public domain. One of the important and exciting questions in the post-genomics era is how to integrate all of the information available from diverse sources. Learning in complex systems biology requires that information be shared in a natural and interpretable way, to integrate knowledge and data. The statistical sciences can support the advancement of learning in Bioinformatics in many ways, not the least of which is by developing methodologies that can support the synchronization of efforts across sciences, offering real-time learning tools that can be shared across many fields from basic science to clinical applications. This research is an introduction to several current research problems in Bioinformatics that address integration of information, and discusses statistical methodologies from the Bayesian school of thought that may be applied. Bayesian statistical methodologies are proposed to integrate biological knowledge and improve statistical inference for three relevant Bioinformatics applications: gene expression arrays, BAC and aCGH arrays, and real-time gene expression experiments. A unified Bayesian model is proposed to perform detection of genes and gene classes, defined from historical pathways, with gene expression arrays.
A novel Bayesian statistical method is proposed to infer chromosomal copy number aberrations in clinical populations with BAC or aCGH experiments. A theoretical model is proposed, motivated by historical work in mathematical biology, for inference with real-time gene expression experiments, and fit with Bayesian methods. Simulation and case studies indicate that Bayesian methodologies show great promise to improve the way we learn with high-throughput Bioinformatics experiments.

    Microarray tools and analysis methods to better characterize biological networks

    To accurately model a biological system (e.g. cell), we first need to characterize each of its distinct networks. While omics data has given us unprecedented insight into the structure and dynamics of these networks, the associated analysis routines are more involved, and the accuracy and precision of the experimental technologies have not been sufficiently examined. The main focus of our research has been to develop methods and tools to better manage and interpret microarray data. How can we improve methods to store and retrieve microarray data from a relational database? What experimental and biological factors most influence our interpretation of a microarray's measurements? By accounting for these factors, can we improve the accuracy and precision of microarray measurements? It is essential to address these last two questions before using omics data for downstream analyses, such as inferring transcription regulatory networks from microarray data. While answers to such questions are vital to microarray research in particular, they are equally relevant to systems biology in general. We developed three studies to investigate aspects of these questions when using Affymetrix expression arrays. In the first study, we develop the Data-FATE framework to improve the handling of large scientific data sets. In the next two studies, we developed methods and tools that allow us to examine the impact of physical and technical factors known or suspected to dramatically alter the interpretation of a microarray experiment. In the second study, we develop ArrayInitiative -- a tool that simplifies the process of creating custom CDFs -- so that we can easily re-design the array specifications for Affymetrix 3' IVT expression arrays. This tool is essential for testing the impact of the various factors, and for making the framework easy to communicate and re-use. We then use ArrayInitiative in a case study to illustrate the impact of several factors known to distort microarray signals.
In the third study, we systematically and exhaustively examine the effect of physical and technical factors -- both generally accepted and novel -- on our interpretation of dozens of experiments using hundreds of E. coli Affymetrix microarrays.