    eXamine: a Cytoscape app for exploring annotated modules in networks

    Background. Biological networks are increasingly important for the interpretation of high-throughput "omics" data. Statistical and combinatorial methods make it possible to obtain mechanistic insights by extracting smaller subnetwork modules, and subsequent enrichment analyses provide set-based annotations of these modules. Results. We present eXamine, a set-oriented visual analysis approach for annotated modules that displays set membership as contours on top of a node-link layout. Our approach extends self-organizing maps to simultaneously lay out nodes, links, and set contours. Conclusions. We implemented eXamine as a freely available Cytoscape app. Using eXamine, we study a module that is activated by the virally encoded G-protein-coupled receptor US28 and formulate a novel hypothesis about its function.
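
    As a rough illustration of the self-organizing-map machinery that eXamine builds on, the sketch below (hypothetical code of ours; the actual eXamine algorithm extends the scheme to place nodes, links, and set contours jointly) trains a classic 2D SOM: each grid cell holds a weight vector, and every training step pulls the best-matching cell and its grid neighbours toward a sampled data point.

    ```python
    import numpy as np

    def train_som(data, grid=(10, 10), iters=1000, lr0=0.5, sigma0=3.0, seed=0):
        """Minimal 2D self-organizing map (illustrative only)."""
        rng = np.random.default_rng(seed)
        h, w = grid
        weights = rng.random((h, w, data.shape[1]))
        gy, gx = np.mgrid[0:h, 0:w]              # grid coordinates of each cell
        for t in range(iters):
            frac = t / iters
            lr = lr0 * (1 - frac)                # decaying learning rate
            sigma = sigma0 * (1 - frac) + 0.5    # shrinking neighbourhood radius
            x = data[rng.integers(len(data))]    # present one random sample
            # Best-matching unit: the cell whose weight vector is closest to x
            d = ((weights - x) ** 2).sum(axis=2)
            by, bx = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighbourhood on the grid around the BMU
            g = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
        return weights
    ```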

    Interactive visualization of metabolic networks using virtual reality

    A combination of graph layouts in 3D space, interactive computer graphics, and virtual reality (VR) can increase the size of networks that remain comprehensible in metabolic network visualization. Two models, the directed graph and the compound graph, were used to represent a metabolic network. The directed graph, or nonhierarchical visualization, considers only adjacency relationships; for this visualization, the weighted GEM-3D layout was adopted to emphasize the reactions among metabolite nodes. The compound graph, or hierarchical visualization, explicitly takes hierarchical relationships such as the pathway-molecule or compartment-molecule hierarchy into consideration to improve performance and perception. An algorithm combining the hierarchical force model with simulated annealing was designed to efficiently generate an effective layout for the compound graph, and a detail-on-demand method improved the rendering performance and perception of the hierarchical visualization. The directed graph was also used to represent sub-networks composed of reactions of interest (ROIs), i.e., reactions involving a specific node. A fan layout was proposed for ROIs focusing on a metabolite node, and a radial layout was adopted for ROIs focusing on a gene node. Graphics scenes were constructed for the network: the shapes and material properties of geometric objects (colors, transparencies, and textures) can encode biological properties such as node identities and reaction edge types, while animations like color morphs, shape morphs, and edge vibration were used to superimpose gene expression profiling data onto the network. Interactions for effective visualization were defined and implemented using VR interfaces. A pilot usability study and some qualitative comparisons were conducted to show potential advantages of stereoscopic VR for metabolic network visualization.
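
    The combination of a force model with simulated annealing can be sketched compactly. The toy 3D layout below (our own illustrative code under simple spring and repulsion assumptions, not the thesis's hierarchical algorithm) perturbs one node at a time and accepts energy-increasing moves with a temperature-controlled probability.

    ```python
    import math, random

    def anneal_layout(nodes, edges, iters=5000, t0=1.0, cooling=0.999):
        """Toy simulated-annealing graph layout in 3D space."""
        nodes = list(nodes)
        pos = {v: [random.uniform(-1, 1) for _ in range(3)] for v in nodes}

        def energy():
            e = 0.0
            for u, v in edges:                   # springs pull edges to unit length
                e += (math.dist(pos[u], pos[v]) - 1.0) ** 2
            for i in range(len(nodes)):          # weak all-pairs repulsion
                for j in range(i + 1, len(nodes)):
                    e += 0.1 / (math.dist(pos[nodes[i]], pos[nodes[j]]) + 1e-9)
            return e

        t, cur = t0, energy()
        for _ in range(iters):
            v = random.choice(nodes)
            old = pos[v][:]
            pos[v] = [c + random.gauss(0, 0.05) for c in old]
            new = energy()
            # Metropolis rule: keep a worse layout with probability e^(-dE/t)
            if new > cur and random.random() > math.exp((cur - new) / t):
                pos[v] = old                     # reject the move
            else:
                cur = new
            t *= cooling
        return pos
    ```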

    Algorithmic Techniques in Gene Expression Processing. From Imputation to Visualization

    The amount of biological data has grown exponentially in recent decades. Modern biotechnologies, such as microarrays and next-generation sequencing, can produce massive amounts of biomedical data in a single experiment. As the amount of data grows rapidly, there is an urgent need for reliable computational methods for analyzing and visualizing it. This thesis addresses this need by studying how to efficiently and reliably analyze and visualize high-dimensional data, especially data obtained from gene expression microarray experiments. First, we study ways to improve the quality of microarray data by replacing (imputing) missing data entries with estimated values. Missing value imputation is a commonly used method for making incomplete data complete, and thus easier to analyze with statistical and computational methods. Our novel approach was to use curated external biological information as a guide for the missing value imputation. Secondly, we studied the effect of missing value imputation on downstream data analysis methods such as clustering. We compared multiple recent imputation algorithms on 8 publicly available microarray data sets. It was observed that missing value imputation is indeed a rational way to improve the quality of biological data, and the research revealed differences between the clustering results obtained with different imputation methods. On most data sets the simple and fast k-NN imputation was good enough, but some called for more advanced methods, such as Bayesian Principal Component Analysis (BPCA). Finally, we studied the visualization of biological network data. Biological interaction networks are an example of the outcome of multiple biological experiments, such as gene microarray studies. Such networks are typically very large and highly connected, so fast algorithms are needed to produce visually pleasing layouts. A computationally efficient way to produce layouts of large biological interaction networks was developed; the algorithm uses multilevel optimization within the regular force-directed graph layout algorithm.
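
    The k-NN imputation mentioned above admits a compact illustration. The sketch below (function and variable names are ours, not the thesis's code) fills each missing entry by averaging the k complete rows closest to the incomplete one, with distances measured only over the columns both rows have observed; it assumes at least one fully observed row exists.

    ```python
    import numpy as np

    def knn_impute(X, k=5):
        """Toy k-NN missing-value imputation for a genes-x-samples matrix."""
        X = X.astype(float).copy()
        complete = X[~np.isnan(X).any(axis=1)]   # fully observed donor rows
        for i in np.where(np.isnan(X).any(axis=1))[0]:
            row = X[i]                           # view into X, edited in place
            observed = ~np.isnan(row)
            # Euclidean distance to every donor row over the observed columns
            d = np.sqrt(((complete[:, observed] - row[observed]) ** 2).sum(axis=1))
            nearest = complete[np.argsort(d)[:k]]
            row[~observed] = nearest[:, ~observed].mean(axis=0)
        return X
    ```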

    Interactive graphics, graphical user interfaces and software interfaces for the analysis of biological experimental data and networks

    Biologists need to analyze and comprehend increasingly large and complex multivariate experimental data. Biological experiments often produce multiple data sets, each describing one aspect of the system, such as the transcriptome recorded by a microarray or the metabolome recorded using gas chromatography-mass spectrometry (GC-MS). A biochemical network model provides a conceptual system-level framework for integrating data from different sources.

    Effective use of graphics enhances the comprehension of data, and interactive graphics permit the analyst to actively explore data, check its integrity, satisfy curiosity, and reveal the unexpected. Interactive graphics have not been widely applied as a means for understanding data from biological experiments.

    This thesis addresses these needs by providing new methods and software that apply interactive graphics, in coordination with numerical methods, to the analysis of biological data in a manner that is accessible to biologists.

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: A rapid technological development in the biosciences and in computer science over the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, such as small sample sizes relative to the number of features, high noise levels, and outliers, pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models that lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches for combining algorithms and large-scale biological datasets, including novel approaches for integrating analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned so that the compactness and interpretability of the resulting models are optimized in addition to predictive accuracy and robustness. The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results:
    (1) the identification of a novel tumour marker gene, in collaboration with the Queen's Medical Centre in Nottingham, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining);
    (2) the prediction of novel candidate disease genes for Alzheimer's disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand; collaboration with the Spanish National Cancer Centre);
    (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs);
    (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA; collaboration with the Spanish National Cancer Centre).
    In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software suite and has provided new biological insights in a wide variety of practical settings.
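
    The ensemble and consensus principle underlying the framework can be illustrated with a small consensus-clustering sketch (our own hypothetical code, not ArrayMining's implementation): repeated k-means runs vote on which samples belong together, and the resulting co-association matrix is then cut hierarchically into the final groups.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def consensus_cluster(X, k, runs=50, seed=0):
        """Consensus clustering via a co-association matrix."""
        n = X.shape[0]
        coassoc = np.zeros((n, n))
        for r in range(runs):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed + r).fit_predict(X)
            coassoc += labels[:, None] == labels[None, :]  # same-cluster votes
        coassoc /= runs
        dist = 1.0 - coassoc                               # disagreement as distance
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")
        return fcluster(Z, t=k, criterion="maxclust")
    ```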

    Combinatorial optimization for affinity proteomics

    Biochemical test development can benefit significantly from combinatorial optimization. Multiplex assays require complex planning decisions during implementation and subsequent validation, and given the increasing complexity of setups and limited resources, working efficiently is a key element for the success of biochemical research and test development. The first problem approached was to systematically pool samples in order to create a multi-positive control sample. We could show that pooled samples exhibit a predictable serological profile, and by using this prediction a pooled sample with the desired properties can be designed. For serological assay validation it must be shown that low, medium, and high analyte levels can be reliably measured; it is shown how to optimally choose a few samples to meet these requirements. Finally, the latter methods were merged to validate multiplexed assays using a set of pooled samples, and a novel algorithm combining fast enumeration with a set cover formulation is introduced. The major part of the thesis deals with optimization and data analysis for Triple X Proteomics (TXP): immunoaffinity assays using antibodies that bind short linear, terminal epitopes of peptides. It is shown that the problem of choosing a minimal set of epitopes for TXP setups, which combine mass spectrometry with immunoaffinity enrichment, is equivalent to the well-known set cover problem. TXP sandwich immunoassays capture and detect peptides by combining C-terminal and N-terminal binders. A greedy heuristic and a meta-heuristic using local search are presented, which prove more efficient than pure ILP formulations. All models were implemented in the novel Java framework SCPSolver, which is applicable to many problems that can be formulated as integer programs. While the main design goal of the software was usability, it also provides a basic modelling language, easy deployment, and platform independence. One question arising when analyzing TXP data was: how likely is it to observe multiple peptides sharing the same terminus? The algorithms TXP-TEA and MATERICS were able to identify binding characteristics of TXP antibodies from data obtained in immunoaffinity MS experiments, reducing the cost of such analyses. A multinomial statistical model explains the distributions of short sequences observed in protein databases, which allows deducing the average optimal length of the targeted epitope. Further, a closed-form scoring function for epitope enrichment in sequence lists is derived.
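
    The reduction of epitope selection to set cover makes the classical greedy heuristic directly applicable: repeatedly pick the epitope that captures the most still-uncovered peptides. The sketch below (hypothetical names; the thesis's SCPSolver framework solves such instances as integer programs) illustrates this greedy step, which achieves the well-known ln(n) approximation guarantee.

    ```python
    def greedy_set_cover(universe, candidates):
        """Greedy set-cover heuristic.

        universe:   set of peptides that must be covered
        candidates: dict mapping an epitope to the set of peptides it captures
        """
        uncovered = set(universe)
        chosen = []
        while uncovered:
            # Epitope covering the largest number of still-uncovered peptides
            best = max(candidates, key=lambda e: len(candidates[e] & uncovered))
            if not candidates[best] & uncovered:
                raise ValueError("candidates cannot cover the universe")
            chosen.append(best)
            uncovered -= candidates[best]
        return chosen
    ```

    For example, with C-terminal tripeptides as candidate epitopes, candidates might map "GKR" to the set of tryptic peptides ending in GKR; the heuristic then returns a small panel of terminal binders covering all target peptides.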