71 research outputs found

Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

Author: Xie Juan
Publication venue: Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange
Publication date: 01/01/2018

Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can detect local patterns. This feature makes biclustering very useful for big gene expression data, since genes that participate in a cellular process are active only in specific conditions and are thus usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly because they lack (i) a consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the transcriptional regulation signals underlying the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis.
Key novelties of the algorithm are that it (i) uses a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) adopts a Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) uses a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) provides a statistical framework to evaluate the significance of all identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package achieves an average efficiency improvement of 82% over the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.

Public Research Access Institutional Repository and Information Exchange

Integrating biclustering techniques with de novo gene regulatory network discovery using RNA-seq from skeletal tissues

Author: Ovens Katie 1990-
Publication venue: 'University of Saskatchewan Library'
Publication date: 01/11/2016

To improve stem cell therapy for osteoarthritis, it is necessary to understand the molecular and cellular processes behind bone development and how they differ from cartilage formation. Further elucidating these processes would provide a means to analyze the relatedness of bone and cartilage tissue by determining the genes that are expressed and regulated as stem cells differentiate into skeletal tissues. It would also contribute to the classification of differences between normal skeletogenesis and degenerative conditions involving these tissues. The three predominant skeletal tissues of interest are bone, immature cartilage and mature cartilage. The transcriptomes of these skeletal tissues, obtained with RNA-seq technology, were analyzed using differential expression, clustering and biclustering algorithms to detect similarly expressed genes, which provide evidence for genes potentially interacting to produce a particular phenotype. Identifying key regulators in the gene regulatory networks (GRNs) driving cartilage and bone development, and the differences in the GRNs they drive, will make it possible to compare the tissues at the transcriptomic level. Because only a small number of gene expression samples are available for bone, immature cartilage and mature cartilage, it is necessary to determine how the number of samples influences the ability to make accurate GRN predictions. Machine learning techniques for GRN prediction that can incorporate multiple data types have not been well evaluated for complex organisms, nor has RNA-seq data often been used to evaluate these methods. Therefore, techniques identified to work well with microarray data were applied to RNA-seq data from mouse embryonic stem cells, for which more samples are available for evaluation than for the skeletal tissues.
The RNA-seq data were combined with ChIP-seq data to determine whether the machine learning methods outperform simple, correlation-based methods that have been evaluated using RNA-seq data alone. Two of the best performing GRN prediction algorithms from previous large-scale evaluations, which cannot incorporate data beyond expression data, were used as a baseline to determine whether the addition of multiple data types could help reduce the number of gene expression samples required. It was also necessary to find a biclustering algorithm that could identify potentially biologically relevant modules. Publicly available ChIP-seq and RNA-seq samples from embryonic stem cells were used to measure the performance and consistency of each method, as there is a well-established network in mouse embryonic stem cells against which to compare results. The methods were then compared to cMonkey2, a biclustering method used in conjunction with ChIP-seq for two important transcription factors in the embryonic stem cell network. This was done to determine whether any of these GRN prediction methods could use the small number of skeletal tissue samples available to determine transcription factors orchestrating the expression of other genes driving cartilage and bone formation. Using the embryonic stem cell RNA-seq samples, it was found that sample size, if above 10, does not have a significant impact on the number of true positives in the top predicted interactions. Random forest methods outperformed correlation-based methods on RNA-seq data when evaluated by area under the ROC curve (AUROC), but the number of true positive interactions predicted relative to a literature network was similar when a strict cut-off was used. A limited set of ChIP-seq data did not improve the confidence in the transcription factor interactions and had no obvious effect on biclustering results.
Correlation-based methods are likely the safest option when judged by the consistency of results over multiple runs, but the challenge of determining an appropriate cut-off for the predictions remains. To predict the skeletal tissue GRNs, cMonkey was used as an initial feature selection method to identify important genes in skeletal tissues and was compared with other biclustering methods that do not use ChIP-seq. The predicted skeletal tissue GRNs will be utilized in future analyses of skeletal tissues, focussing on the evolutionary relationship between the GRNs driving skeletal tissue development.

eCommons@USASK

University of Saskatchewan Research Archive

Utilizing gene co-expression networks for comparative transcriptomic analyses

Author: Ovens Katie
Publication venue: 'University of Saskatchewan Library'
Publication date: 15/12/2020

The development of high-throughput technologies such as microarrays and next-generation RNA sequencing (RNA-seq) has generated large volumes of transcriptomic data that can be used for comparative transcriptomics studies. Transcriptomes obtained from different species can reveal differentially expressed genes that underlie species-specific traits. They also have the potential to identify genes with conserved expression patterns. However, differential expression alone does not provide information about how the genes relate to each other in terms of gene expression or whether groups of genes are correlated in similar ways across species, tissues, etc. This makes gene expression networks, such as co-expression networks, valuable for finding similarities or differences between genes based on their relationships with other genes. The goal of this research was to develop methods for comparative transcriptomics, specifically for comparing gene co-expression networks (GCNs), either within or between any set of organisms. These networks represent genes as nodes, and pairs of genes may be connected by an edge representing the strength of the relationship between them. We begin with a review of currently available techniques that can be used or adapted to compare gene co-expression networks. We also work to systematically determine the number of samples needed to construct reproducible gene co-expression networks for comparison purposes. To systematically compare these replicate networks, software to visualize the relationship between replicate networks was created to determine when the consistency of the networks begins to plateau and whether this is affected by factors such as tissue type and sample size.
Finally, we developed a tool called Juxtapose that uses gene embedding to functionally interpret the commonalities and differences between a given set of co-expression networks constructed from transcriptome datasets of various organisms. Transcriptome datasets were obtained from publicly available sources as well as from collaborators. GTEx and Gene Expression Omnibus (GEO) RNA-seq datasets were used to evaluate the techniques proposed in this research. Skeletal cell datasets of closely related species and more evolutionarily distant organisms were also analyzed to investigate the evolutionary relationships of several skeletal cell types. We found evidence that data characteristics such as tissue origin, as well as the method used to construct gene co-expression networks, can substantially impact the number of samples required to generate reproducible networks. In particular, if a threshold is used to construct a gene co-expression network for downstream analyses, the number of samples is an important consideration, as many samples may be required to generate networks that have a reproducible edge order when sorted by edge weight. We also demonstrated the capabilities of Juxtapose, our proposed method for comparing GCNs, showing that it consistently matches up genes in identical networks and that it reflects the similarity between different networks using cosine distance as a measure of gene similarity. Finally, we applied our proposed method to skeletal cell networks and found evidence of conserved gene relationships within skeletal GCNs from the same species, and identified modules of genes with similar embeddings across species that are enriched for biological processes involved in cartilage and osteoblast development. Furthermore, smaller sub-networks of genes reflect the phylogenetic relationships of the species analyzed using our gene embedding strategy to compare the GCNs.
This research has produced methodologies and tools that can be used for evolutionary studies and generalized to scenarios other than cross-species comparisons, including co-expression network comparisons across tissues or conditions within the same species.
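
As a rough illustration of the cosine-distance comparison mentioned in this abstract, the sketch below computes the cosine distance between two gene embedding vectors. It is not taken from Juxtapose: the function name, the variable names, and the three-dimensional toy embeddings are assumptions chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings of the same gene learned from two networks.
gene_in_network_a = [0.2, 0.7, 0.1]
gene_in_network_b = [0.4, 1.4, 0.2]  # same direction, twice the length

print(cosine_distance(gene_in_network_a, gene_in_network_b))
```

A distance close to 0 means the two vectors point in the same direction regardless of their length, while orthogonal vectors give a distance of 1.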

University of Saskatchewan Research Archive

Mining Biological Networks towards Protein complex Detection and Gene-Disease Association

Author: Hanna Eileen Marie
Publication venue: Scholarworks@UAEU
Publication date: 01/09/2015

Large amounts of biological data are now generated continuously, thanks to advances in high-throughput experimental techniques. Mining valuable knowledge from such data still motivates the design of suitable computational methods to complement experimental work, which is often bound by considerable time and cost requirements. Protein complexes, or groups of interacting proteins, are key players in most cellular events. The identification of complexes not only improves the understanding of normal biological processes but also helps uncover disease-triggering malfunctions. Ultimately, findings in this research branch can greatly enhance the design of effective medical treatments. The aim of this research is to detect protein complexes in protein-protein interaction networks and to associate the detected entities with diseases. The work is divided into three main objectives: first, develop a suitable method for the identification of protein complexes in static interaction networks; second, model the dynamic aspect of protein interaction networks and detect complexes accordingly; and third, design a learning model to link proteins, and subsequently protein complexes, to diseases. In response to these objectives, we present ProRank+, a novel complex-detection approach based on a ranking algorithm and a merging procedure. Then, we introduce DyCluster, which uses gene expression data to model the dynamics of the interaction networks, and we adapt the detection algorithm accordingly. Finally, we integrate network topology attributes and several biological features of proteins to form a classification model for gene-disease association. The reliability of the proposed methods is supported by various experimental studies conducted to compare them with existing approaches. ProRank+ detects more protein complexes than other state-of-the-art methods. DyCluster goes a step further and achieves better performance than similar techniques.
Then, our learning model shows that combining topological and biological features can greatly enhance the gene-disease association process. Finally, we present a comprehensive case study of breast cancer in which we pinpoint disease genes using our learning model; subsequently, we detect favorable groupings of those genes in a protein interaction network using the ProRank+ algorithm.

United Arab Emirates University: Scholarworks@UAEU / جامعة الامارات

Pathway and Network Analysis of Transcriptomic and Genomic Data

Author: Yoon Sora
Publication venue: Graduate School of UNIST
Publication date: 01/02/2019

Department of Biological Sciences
The development of high-throughput technologies has enabled the production of omics data and facilitated the systemic analysis of biomolecules in cells. In addition, thanks to the vast amount of knowledge in molecular biology accumulated over decades, numerous biological pathways have been categorized as gene-sets. Using these omics data and pre-defined gene-sets, pathway analysis identifies genes that are collectively altered at the gene-set level under a phenotype. It helps the biological interpretation of the phenotype and finds phenotype-related genes that are not detected by single-gene-based approaches. Besides, high-throughput technologies have contributed to the construction of various biological networks such as protein-protein interaction (PPI), metabolic/cell signaling, gene-regulatory and gene co-expression networks. Using these networks, we can visualize the relationships among gene-set members and find the hub genes, or infer new biological regulatory modules. Overall, this thesis describes three approaches to enhance the performance of pathway and/or network analysis of transcriptomic and genomic data. First, a simple but effective method that improves the gene-permuting gene-set enrichment analysis (GSEA) of RNA-sequencing data is presented, which is especially useful for data with few replicates. By using absolute statistics, it greatly reduces the false positive rate caused by inter-gene correlation within gene-sets and improves the overall discriminatory ability of gene-permuting GSEA. Next, a powerful competitive gene-set analysis tool for GWAS summary data, named GSA-SNP2, is introduced. The z-score method applied with adjusted gene scores greatly improved sensitivity compared to existing competitive gene-set analysis methods while exhibiting decent false positive control. The performance was validated using both simulated and real data.
In addition, GSA-SNP2 visualizes protein interaction networks within and across the significant pathways so that the user can prioritize the core subnetworks for further mechanistic study. Finally, a novel approach to predict condition-specific miRNA target networks by biclustering a large collection of mRNA fold-change data for sequence-specific targets is introduced. The bicluster targets exhibited on average 17.0% (median 19.4%) improved gain in certainty (sensitivity + specificity). The net gain was further increased up to 32.0% (median 33.2%) by filtering them using functional network information. The analysis of cancer-related biclusters revealed that the PI3K/Akt signaling pathway is strongly enriched in targets of a few miRNAs in breast cancer and diffuse large B-cell lymphoma. Among them, five independent prognostic miRNAs were identified, and repression of bicluster targets and pathway activity by mir-29 was experimentally validated. The BiMIR database provides a useful resource to search for miRNA regulation modules for 459 human miRNAs.

ScholarWorks@UNIST

From Classical to Modern Computational Approaches to Identify Key Genetic Regulatory Components in Plant Biology

Author: Acién Juan Manuel
Arbona Vicent
Candela Héctor
Cañizares Ramos Eva
Gonzalez-Guzman Miguel
Publication venue: 'MDPI AG'
Publication date: 01/01/2023

The selection of plant genotypes with improved productivity and tolerance to environmental constraints has always been a major concern in plant breeding. Classical approaches based on the generation of variability and selection of better phenotypes from large variant collections have improved their efficacy and processivity due to the implementation of molecular biology techniques, particularly genomics, Next Generation Sequencing and other omics such as proteomics and metabolomics. In this regard, the identification of interesting variants with molecular markers, before they develop the phenotypic trait of interest, has advanced the breeding of new varieties. Moreover, the correlation of phenotypic or biochemical traits with gene expression or protein abundance has boosted the identification of potential new regulators of the traits of interest, using a relatively low number of variants. These important breakthrough technologies, built on top of classical approaches, will be improved in the future by including the spatial variable, allowing the identification of gene(s) involved in key processes at the tissue and cell levels.

Directory of Open Access Journals

Repositori Institucional de la Universitat Jaume I

Genomic integrative analysis to improve fusion transcript detection, liquid association and biclustering

Author: Liu Shuchang
Publication date: 12/05/2017

More data provide more possibilities. The growing volume of genomic data provides new perspectives for understanding complex biological problems. Many single-study algorithms have been developed; however, their results are unstable for small sample sizes or are overwhelmed by study-specific signals. Taking advantage of high-throughput genomic data from multiple cohorts, in this dissertation we detect novel fusion transcripts, explore complex gene regulation and discover disease subtypes within an integrative analysis framework. In the first project, we evaluated 15 fusion transcript detection tools for paired-end RNA-seq data. Though no single method clearly outperformed the others, several top tools were selected according to their F-measures. We further developed a fusion meta-caller algorithm that combines the top methods to re-prioritize candidate fusion transcripts. The results showed that our meta-caller balances precision and recall better than any single fusion detection tool. In the second project, we extended liquid association to two meta-analytic frameworks (MetaLA and MetaMLA). Liquid association is the dynamic gene-gene correlation depending on the expression level of a third gene. MetaLA and MetaMLA provided stronger detection signals and more consistent and stable results than single-study analysis. When we applied our method to five yeast datasets related to environmental changes, genes in the top triplets were highly enriched in fundamental biological processes corresponding to environmental changes. In the third project, we extended the plaid model from single-study analysis to multiple cohorts for bicluster detection. Our meta-biclustering algorithm discovers biclusters with higher Jaccard accuracy under high noise and small sample sizes. We also introduced the concept of the gap statistic for pruning parameter estimation.
In addition, biclusters detected from five breast cancer mRNA expression cohorts select genes highly associated with many breast cancer related pathways and split samples into groups with significantly different survival behaviors. In conclusion, we improved fusion transcript detection, liquid association analysis and bicluster discovery through integrative-analysis frameworks. These results provide strong evidence of gene fusion structure variation, three-way gene regulation and disease subtype detection, and thus ultimately contribute to a better understanding of complex disease mechanisms.
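
The abstract reports bicluster recovery in terms of Jaccard accuracy. One common way to score agreement between a detected and a reference bicluster is the Jaccard index over the matrix cells each one covers. The sketch below assumes that definition, which may differ from the one used in the dissertation; the function names and the toy biclusters are hypothetical.

```python
from itertools import product


def bicluster_cells(rows, cols):
    """Return the set of (row, column) cells covered by a bicluster."""
    return set(product(rows, cols))


def jaccard_index(bicluster_a, bicluster_b):
    """Jaccard index of two biclusters, each given as a (rows, cols) pair."""
    cells_a = bicluster_cells(*bicluster_a)
    cells_b = bicluster_cells(*bicluster_b)
    union = cells_a | cells_b
    if not union:
        return 0.0  # convention chosen here for two empty biclusters
    return len(cells_a & cells_b) / len(union)


# Toy example: genes g1..g4 and samples s1, s2 (made-up labels).
detected = ({"g1", "g2", "g3"}, {"s1", "s2"})   # covers 6 cells
reference = ({"g2", "g3", "g4"}, {"s1", "s2"})  # covers 6 cells, 4 shared

print(jaccard_index(detected, reference))  # 4 shared / 8 total = 0.5
```

Scoring over cells rather than over genes alone penalizes a bicluster that finds the right genes but the wrong samples, which matters when both dimensions are clustered at once.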

D-Scholarship@Pitt

Identifying gene regulatory networks common to multiple plant stress responses

Author: Rhodes Johanna

Stress responses in plants can be defined as a change that affects the homeostasis of pathways, resulting in a phenotype that may or may not be visible to the human eye and that affects the fitness of the plant. Crosstalk is believed to be the sharing of components between pathways of networks, and is widespread in plants, as shown by examples of crosstalk between transcriptional regulation pathways and hormone signalling. Crosstalk between stress responses is believed to exist, particularly within the responses to biotic stress and within the responses to abiotic stress. Certain hormone pathways are known to be involved in the crosstalk between the responses to both biotic and abiotic stresses, and can confer immunity or tolerance of Arabidopsis thaliana to these stresses. Transcriptional regulation has also been identified as an important factor in controlling tolerance and resistance to stresses. In this thesis, networks of regulation mediating the response to multiple stresses are studied. Firstly, co-regulation was predicted for genes differentially expressed in two or more stresses by development of a novel multi-clustering approach, Wigwams Identifies Genes Working Across Multiple Stresses (Wigwams). This approach finds groups of genes whose expression is correlated within stresses, but also identifies a strong statistical link between subsets of stresses. Wigwams identifies the known co-expression of genes encoding enzymes of metabolic and flavonoid biosynthesis pathways, and predicts novel clusters of co-expressed genes. On the hypothesis that co-expressed genes may also be co-regulated, promoter motif analysis and modelling provide information on potential upstream regulators. The context-free regulation of groups of co-expressed genes, or potential regulons, was explored using models generated by modelling techniques, in order to generate a quantitative model of transcriptional regulation during the response to B. cinerea, P. syringae pv. tomato DC3000 and senescence.
This model was subsequently validated and extended by experimental techniques, using Yeast 1-Hybrid to investigate the protein-DNA interactions, and also microarrays. Analysis of mutants and plants overexpressing a predicted regulator, Rap2.6L, by gene expression analysis identified a number of potential regulon members as downstream targets. Rap2.6L was identified as an indirect regulator of the transcription factor members of three potential regulons co-expressed in the stresses B. cinerea, P. syringae pv. tomato DC3000 and long day senescence, allowing the confirmation of a predicted gene regulatory network operating in multiple stress responses.

Warwick Research Archives Portal Repository

Finding network modules and motifs regulating plant stress responses : integration and modelling across multiple data sets

Author: Polański Krzysztof

In spite of constant technological advancements, world hunger remains a major challenge due to exponential population growth and the loss of effectiveness of crop treatments such as pesticides. As such, comprehending the plant response to stress is of great importance in breeding more resilient crops. Whilst different stresses elicit distinct responses from the plant, a core set of regulatory interactions are conserved across multiple responses and operate as networks. In this thesis, computational approaches were used to elucidate such regulatory interactions from time course expression datasets, predominantly through identification of genes co-expressed across multiple stimuli responses as a footprint of shared network co-regulation. The identification of such network footprints was tackled through Wigwams, a data mining algorithm capable of detecting groups of genes co-regulated across multiple datasets. In contrast to other algorithms, Wigwams assesses whether the co-expression it detects is likely to reflect co-regulation. The modules it found were significantly enriched in functionality and cis-regulatory elements, indicating actual co-regulation. Wigwams and other computational approaches were applied to time course expression data capturing the Arabidopsis thaliana response to Pseudomonas syringae pv. tomato DC3000. The presence of a virulent and an avirulent strain in the experiment allowed for the temporal deconstruction of the regulatory events underlying the virulent strain's attempts to overcome plant defence through effector action. This analysis led to the detection of a number of effector-specific transcription changes stifling the defence response and manipulating the host's gene and protein expression. A transcription factor-only regulatory network model was proposed to explain the detected network footprints.
The inference of causal regulatory networks from expression data is a daunting task, and transcription factor-only models are a good computational compromise because they capture the key regulatory events taking place. However, they lack the target genes that carry out the functionality induced by the signalling, making functional assessment difficult. Wigwams was used to introduce the network footprint components into the corresponding transcription factor-only models, resulting in enhanced network models carrying information about downstream regulated genes. This allows functional assessment to be used to identify nodes of interest within the network and to propose concise follow-up experiments.

Warwick Research Archives Portal Repository

Semantic systems biology of prokaryotes : heterogeneous data integration to understand bacterial metabolism

Author: Dam Jesse C.J., van
Publication venue: 'Wageningen University and Research'
Publication date: 01/01/2019

The goal of this thesis is to improve the prediction of genotype to phenotype associations with a focus on metabolic phenotypes of prokaryotes. This goal is achieved through data integration, which in turn required the development of supporting solutions based on semantic web technologies. Chapter 1 provides an introduction to the challenges associated with data integration. Semantic web technologies provide solutions to some of these challenges and the basics of these technologies are explained in the Introduction. Furthermore, the basics of constraint based metabolic modeling and construction of genome scale models (GEM) are also provided. The chapters in the thesis are separated into three related topics: chapters 2, 3 and 4 focus on data integration based on heterogeneous networks and their application to the human pathogen M. tuberculosis; chapters 5, 6, 7, 8 and 9 focus on the semantic web based solutions to genome annotation and applications thereof; and chapter 10 focuses on the final goal to associate genotypes to phenotypes using GEMs. Chapter 2 provides the prototype of a workflow to efficiently analyze information generated by different inference and prediction methods. This method relies on providing the user the means to simultaneously visualize and analyze the coexisting networks generated by different algorithms, heterogeneous data sets, and a suite of analysis tools. As a showcase, we have analyzed the gene co-expression networks of M. tuberculosis generated using over 600 expression experiments. Hereby we gained new knowledge about the regulation of the DNA repair, dormancy, iron uptake and zinc uptake systems. Furthermore, it enabled us to develop a pipeline to integrate ChIP-seq data and a tool to uncover multiple regulatory layers. In chapter 3 the prototype presented in chapter 2 is further developed into the Synchronous Network Data Integration (SyNDI) framework, which is based on Cytoscape and Galaxy.
The functionality and usability of the framework are highlighted with three biological examples. We analyzed the distinct connectivity of plasma metabolites in networks associated with high or low latent cardiovascular disease risk. We obtained deeper insights from a few similar inflammatory response pathways in Staphylococcus aureus infection common to human and mouse. We identified not yet reported regulatory motifs associated with transcriptional adaptations of M. tuberculosis. In chapter 4 we present a review providing a systems level overview of the molecular and cellular components involved in divalent metal homeostasis and their role in regulating the three main virulence strategies of M. tuberculosis: immune modulation, dormancy and phagosome escape. With the use of the tools presented in chapters 2 and 3 we identified a single regulatory cascade for these three virulence strategies that responds to limited availability of divalent metals in the phagosome. The tools presented in chapters 2 and 3 achieve data integration through the use of multiple similarity, coexistence, coexpression and interaction gene and protein networks. However, the presented tools cannot store additional (genome) annotations. Therefore, we applied semantic web technologies to store and integrate heterogeneous annotation data sets. An increasing number of widely used biological resources are already available in the RDF data model. There are, however, no tools available that provide structural overviews of these resources. Such structural overviews are essential to efficiently query these resources and to assess their structural integrity and design. Therefore, in chapter 5, I present RDF2Graph, a tool that automatically recovers the structure of an RDF resource. The generated overview enables users to create complex queries on these resources and to structurally validate newly created resources.
Direct functional comparisons support genotype-to-phenotype predictions. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the standard structured formats used by the public sequence databases to present genome annotations provide limited support for data mining, hampering comparative analyses at large scale. To enable interoperability of genome annotations for data mining applications, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack), which are presented in chapter 6. GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. GBOL is modular in design, extensible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology. Based on GBOL, we developed the genome annotation pipeline SAPP (Semantic Annotation Platform with Provenance), presented in chapter 7. SAPP automatically predicts, tracks and stores structural and functional annotations and associated dataset- and element-wise provenance in a Linked Data format, thereby enabling information mining and retrieval with Semantic Web technologies. This greatly reduces the administrative burden of handling multiple analysis tools and versions thereof and facilitates multi-level large-scale comparative analysis. In turn, this can be used to make genotype-to-phenotype predictions. GBOL and SAPP were developed simultaneously. During the development we realized that we had to constantly validate the data exported to RDF to ensure coherence with the ontology. This was an extremely time-consuming and error-prone process; therefore, we developed the Empusa code generator. 
Empusa is presented in chapter 8. SAPP has been successfully used to annotate 432 sequenced Pseudomonas strains and integrate the resulting annotation in a large-scale functional comparison using protein domains. This comparison is presented in chapter 9. Additionally, data from six metabolic models, nearly a thousand transcriptome measurements and four large-scale transposon mutagenesis experiments were integrated with the genome annotations. In this way, we linked gene essentiality, persistence and expression variability. This gave us insight into the diversity, versatility and evolutionary history of the Pseudomonas genus, which contains some important pathogens as well as some species useful for bioengineering and bioremediation purposes. Genome annotation can be used to create GEMs, which can be used to better link genotypes to phenotypes. Bio-Growmatch, presented in chapter 10, is a tool that can automatically suggest modifications to improve a GEM based on phenotype data, thereby integrating growth data into the complete process of modelling the metabolism of an organism. Chapter 11 presents a general discussion on how the chapters contributed to the central goal, after which I discuss provenance requirements for data reuse and integration. I further discuss how this can be used to further improve knowledge generation. The acquired knowledge could, in turn, be used to design new experiments. The principles of the dry-lab cycle and how semantic technologies can contribute to establishing these cycles are discussed in chapter 11. Finally, a discussion is presented on how to apply these principles to improve the creation and usability of GEMs.

Wageningen University & Research Publications