208 research outputs found

    Building a robust clinical diagnosis support system for childhood cancer using data mining methods

    University of Technology Sydney. Faculty of Engineering and Information Technology. Progress in understanding core pathways and processes of cancer requires thorough analysis of many coding and noncoding regions of the genome. Data mining and knowledge discovery have been applied to datasets across many industries, including bioinformatics. However, data mining faces a major challenge in its application to bioinformatics: the diversity and dimensionality of biomedical data. The term ‘big data’ was applied to the clinical domain by Yoo et al. (2014), specifically referring to single nucleotide polymorphism (SNP) and gene expression data. This research thesis focuses on three different types of data: gene annotations, gene expression and single nucleotide polymorphisms. Genetic association studies have led to the discovery of single genetic variants associated with common diseases. However, complex diseases are not caused by a single gene acting alone but are the result of complex linear and non-linear interactions among different types of microarray data. In this scenario, a single gene can have a small effect on disease but cannot be the major cause of the disease. For this reason there is a critical need to implement new approaches which take into account linear and non-linear gene-gene and patient-patient interactions that can eventually help in the diagnosis and prognosis of complex diseases. Several computational methods have been developed to deal with gene annotation, gene expression and SNP data of complex diseases. However, analysing every gene expression and SNP profile, and finding gene-to-gene relationships, is computationally infeasible because of the high dimensionality of the data. In addition, many computational methods have problems with scaling to large datasets, and with overfitting. Therefore, there is growing interest in applying data mining and machine learning approaches to understand different types of microarray data.
Cancer is the disease that kills the most children in Australia (Torre et al., 2015). Within this thesis, the focus is on childhood Acute Lymphoblastic Leukaemia (ALL). ALL is the most common childhood malignancy, accounting for 24% of all new cancers occurring in children within Australia (Coates et al., 2001). According to the American Cancer Society (2016), a total of 6,590 cases of ALL have been diagnosed across all age groups in the USA, and 1,430 deaths are expected in 2016. The project uses different data mining and visualisation methods applied to different types of biological data: gene annotations, gene expression and SNPs. This thesis focuses on three main issues in genomic and transcriptomic data studies: (i) Proposing, implementing and evaluating a novel framework to find functional relationships between genes from gene-annotation data. (ii) Identifying an optimal dimensionality reduction method to classify between relapsed and non-relapsed ALL patients using gene expression. (iii) Proposing, implementing and evaluating a novel feature selection approach to identify related metabolic pathways in ALL. This thesis proposes, implements and validates an efficient framework to find functional relationships between genes based on gene-annotation data. The framework is built on a binary matrix and a proximity matrix, where the binary matrix contains information related to genes and their functionality, while the proximity matrix shows similarity between different features. The framework retrieves gene functionality information from Gene Ontology (GO), a publicly available database, and visualises the functionally related genes using singular value decomposition (SVD). From a simple list of gene annotations, this thesis retrieves features (i.e. Gene Ontology terms) related to each gene and calculates a similarity measure based on the distance between terms in the GO hierarchy.
The distance measures are based on the hierarchical structure of Gene Ontology and are referred to as similarity measures. In this framework, two different similarity measures are applied: (i) A hop-based similarity measure where the distance is calculated based on the number of links between two terms. (ii) An information-content similarity measure where the similarity between terms is based on the probability of GO terms in the gene dataset. This framework also identifies which of these two similarity measures performs better at identifying functional relationships between genes. The singular value decomposition method is used for visualisation, having the advantage that multiple types of relationships can be visualised simultaneously (gene-to-gene, term-to-term and gene-to-term). In this thesis a novel framework is developed for visualising patient-to-patient relationships using gene expression values. The framework builds on the random forest feature selection method to filter gene expression values and then applies different linear and non-linear machine learning methods to them. The methods used in this framework are Principal Component Analysis (PCA), Kernel Principal Component Analysis (kPCA), Local Linear Embedding (LLE), Stochastic Neighbour Embedding (SNE) and Diffusion Maps. The framework compares these different machine learning methods by tuning different parameters to find the optimal method among them. Area under the curve (AUC) is used to rank the results and SVM is used to classify between relapsed and non-relapsed patients. The final section of the thesis proposes, implements and validates a framework to find active metabolic pathways in ALL using single nucleotide polymorphism (SNP) profiles. The framework is based on the random forest feature selection method. A dataset of ALL patients and healthy controls is constructed, and random forest is then applied with different parameters to find highly ranked SNPs.
The credibility of the model is assessed based on the error rate of the confusion matrix and kappa values. Selected highly ranked SNPs are used to retrieve metabolic pathways related to ALL from the KEGG metabolic pathways database. The methodologies and approaches presented in this thesis emphasise the critical role that different types of microarray data play in understanding complex diseases like ALL. The availability of flexible frameworks for the task of disease diagnosis and prognosis, as proposed in this thesis, will play an important role in understanding the genetic basis of common complex diseases. This thesis contributes to knowledge in two ways: (i) Providing novel data mining and visualisation frameworks to handle biological data. (ii) Providing novel visualisations for microarray data to increase understanding of disease.
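
The abstract above mentions assessing a model by the error rate of its confusion matrix and by kappa values. As a rough illustration only, the sketch below computes both quantities for a square confusion matrix, assuming that "kappa" refers to Cohen's kappa and that rows hold true classes and columns hold predicted classes. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
from typing import Sequence, Tuple


def error_rate_and_kappa(confusion: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (error rate, Cohen's kappa) for a square confusion matrix.

    Assumption: rows are true classes, columns are predicted classes.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total

    # Agreement expected by chance, from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    if expected == 1.0:
        # Every sample falls in a single class: kappa is undefined.
        kappa = float("nan")
    else:
        kappa = (observed - expected) / (1.0 - expected)

    return 1.0 - observed, kappa


if __name__ == "__main__":
    # Hypothetical counts for a two-class problem (e.g. patients vs controls).
    example = [[40, 10],
               [5, 45]]
    err, kappa = error_rate_and_kappa(example)
    print(f"error rate = {err:.3f}, kappa = {kappa:.3f}")
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it is often reported alongside the raw error rate when class sizes are unbalanced.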

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: Rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models are optimized in addition to the predictive accuracy and robustness.
The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results: (1) The identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining). (2) The prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre). (3) The prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs). (4) The discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre). In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single, easy-to-use software package and has provided new biological insights in a wide variety of practical settings.

    An Integrated Transcriptomic and Meta-Analysis of Hepatoma Cells Reveals Factors That Influence Susceptibility to HCV Infection

    Hepatitis C virus (HCV) is a global problem. To better understand HCV infection, researchers employ in vitro HCV cell-culture (HCVcc) systems that use Huh-7 derived hepatoma cells that are particularly permissive to HCV infection. A variety of hyper-permissive cells have been subcloned for this purpose. In addition, subclones of Huh-7 which have evolved resistance to HCV are available. However, the mechanisms of susceptibility or resistance to infection among these cells have not been fully determined. In order to elucidate mechanisms by which hepatoma cells are susceptible or resistant to HCV infection, we performed genome-wide expression analyses of six Huh-7 derived cell cultures that have different levels of permissiveness to infection. A great number of genes, representing a wide spectrum of functions, are differentially expressed between cells. To focus our investigation, we identify host proteins from HCV replicase complexes, perform gene expression analysis of three HCV-infected cells and conduct a detailed analysis of differentially expressed host factors by integrating a variety of data sources. Our results demonstrate that changes relating to susceptibility to HCV infection in hepatoma cells are linked to the innate immune response, secreted signal peptides and host factors that have a role in virus entry and replication. This work identifies both known and novel host factors that may influence HCV infection. Our findings build upon current knowledge of the complex interplay between HCV and the host cell, which could aid development of new antiviral strategies.

    Development and application of analysis modules in MADIBA, a Web-based toolkit for the interpretation of microarray data

    Microarray technology makes it possible to identify changes in gene expression of an organism under various conditions. The challenge for researchers who employ microarray expression profiling is to derive biological meaning from the data once pre-processing is completed and a cluster of co-expressed genes has been obtained. Data mining is thus essential for deducing significant biological information such as the identification of new biological mechanisms or putative drug targets. While many algorithms and software tools have been developed for analysing gene expression, the extraction of relevant information from experimental data is still a substantial challenge, requiring significant time and skill. MADIBA (MicroArray Data Interface for Biological Annotation) facilitates the assignment of biological meaning to gene expression clusters by automating the post-processing stage. A relational database has been designed to store the data from gene to pathway for Plasmodium falciparum, Oryza sativa (rice), Arabidopsis thaliana, and Pectobacterium atrosepticum (Pba). As input, the user submits a cluster of genes, either the gene identifiers or the gene sequences. Tools within the web interface allow rapid analyses for the identification of the Gene Ontology terms relevant to each cluster; visualising the metabolic pathways where the gene products are implicated, their genomic localisations, putative common transcriptional regulatory elements in the upstream sequences, and an analysis specific to the organism being studied. The user has the option of outputting selected results of the analyses, either in PDF or plain text formats. MADIBA is an integrated, online tool that will assist researchers in interpreting their results and understanding the meaning of the co-expression of a cluster of genes.
The functionality of MADIBA was used to analyse a number of gene clusters from several experiments: expression profiling of the Plasmodium falciparum life cycle, a Ralstonia solanacearum infection of Arabidopsis thaliana, a rice treatment with BTH, a millet SA- and MeJ-treatment experiment, and an expI mutant experiment in Pectobacterium atrosepticum. Data from the Plasmodium falciparum and rice experiments were used to illustrate MADIBA’s functionality. For the A. thaliana analyses, the DRASTIC database was implemented to identify how genes respond to various treatments. In addition, a method named PCA Experiment Comparer was developed, which compares the expression values of the numerous experiments in NASCArrays. Using the A. thaliana-R. solanacearum interaction data, several related experiments matched in both the susceptible and resistant interactions. In the millet analyses, besides defence-related genes being identified, several genes also involved in photosynthesis were found, possibly suggesting a relation between light and defence signalling. The Pba data identified genes involved in quorum sensing, as well as some associated genes with no known function that may also be related to this regulatory process. With the advent of whole genome microarray chips and an increasing number of organisms being sequenced, tools such as MADIBA will become even more significant in understanding the underlying biology. MADIBA provides access to several genomic data sources and analyses, allowing users to quickly annotate and visualise the results. MADIBA is freely available and can be accessed at http://www.bi.up.ac.za/MADIBA/. Dissertation (MSc), University of Pretoria, 2009. Biochemistry.

    Improving reproducibility and reuse of modelling results in the life sciences

    Research results are complex and include a variety of heterogeneous data. This entails major computational challenges: (i) managing simulation studies, (ii) ensuring model exchangeability, stability and validity, and (iii) fostering communication between partners. I describe techniques to improve the reproducibility and reuse of modelling results. First, I introduce a method to characterise differences in computational models. Second, I present approaches to obtain shareable and reproducible research results. Altogether, my methods and tools foster exchange and reuse of modelling results. The distributed development of complex simulation studies poses a large number of information-technology challenges: (i) models must be managed; (ii) the reproducibility, stability and validity of results must be ensured; and (iii) communication between partners must be improved. I present techniques to improve the reproducibility and reusability of modelling results. My implementations have been successfully integrated into international applications and promote the sharing of scientific results.

    The mechanisms of evolutionary flexibility in earthworm genomes

    Many individual organisms have latent phenotypic potentials which are never realised within their lifespans. This potential can include a huge diversity of dormant adaptations across the tree of life, such as the ability to tolerate radical changes in temperature, survive restricted nutrient availability, and resist toxins and parasites. Preceding unrealised phenotypic potentials, necessarily, are information potentials that also reside in a dormant state. This thesis investigates the systematic interactions of facultative morphologies and atavistic adaptivity with the evolutionary systems which propagate them. As models for these purposes, earthworms are an almost archetypal form of high-latent-potential organism. Examples abound of their thriving as peregrine species with near-global ranges.