76 research outputs found

    Computational methods for the discovery of molecular signatures from Omics Data

    Molecular biomarkers derived from high-throughput technologies are the foundation of "next-generation" precision medicine. Despite a decade of intense effort and investment, the number of clinically valid biomarkers remains modest. Indeed, the "big-data" nature of omics data poses new challenges that require better strategies for data analysis and interpretation. In this thesis, two themes are proposed, both aimed at improving the statistical and computational methodology of signature discovery. The first work aims to identify serum miRNAs for use as diagnostic biomarkers of ovarian cancer. In particular, a guideline and an ad hoc microarray normalization strategy for the analysis of circulating miRNAs are proposed. In the second work, a new approach for identifying functional molecular signatures based on Gaussian graphical models is presented. The model can explore the topological information contained in biological pathways and highlight potential sources of differential behavior between two experimental conditions.

    Revealing cancer subtypes with higher-order correlations applied to imaging and omics data

    Figure S9. Screenshot of the interactive Tumor Map visualization, showing HOCUS applied to the TCGA Pancan-12 mutation data. Each point is one tumor sample, color-coded by tissue type. A dotted box highlights the cluster of samples that have both PIK3CA and TP53 mutations, which are usually mutually exclusive. (EPS 751 kb)

    Relevance of systems biological approach in the differential diagnosis of invasive lobular carcinoma & invasive ductal carcinoma

    Breast cancer is a malignant neoplasm originating from breast tissue, most commonly from the inner lining of milk ducts or the lobules that supply the ducts with milk. ILCs and IDCs differ from each other in various histological, biological and clinical features. Remarkably, ductal tumors tend to form glandular structures, whereas lobular tumors are less cohesive and tend to invade in single file. The high degree of similarity in the prognoses of IDC and ILC makes it beneficial to develop a differential diagnostic protocol to classify the two conditions. The main goal of the study is to construct the genetic regulatory network from the microarray data using biological knowledge and constraint-based inference, in order to explore potentially significant gene regulatory networks that can differentiate IDC from ILC and thereby to understand the complex interactions that are influenced by the genetic networks. Out of the 54676 genes present on the GPL570 platform, 29 genes exhibited 4-fold upregulation in IDC and 22 in ILC. The ductal and lobular tumors displayed a striking difference in the expression of genes associated with cell adhesion, protein folding, protein phosphorylation and invasion. Constructing separate gene regulation networks for IDC and ILC on the basis of gene expression alteration can help in understanding the possible mechanisms that underlie the pathological differences between the two, which can be exploited in identifying diagnostic or therapeutic targets.

    NOVEL COMPUTATIONAL METHODS FOR CANCER GENOMICS DATA ANALYSIS

    Cancer is a genetic disease responsible for one in eight deaths worldwide. The advancement of next-generation sequencing (NGS) technology has revolutionized cancer research, allowing comprehensive profiling of the cancer genome at high resolution. Large-scale cancer genomics research has sparked the need for efficient and accurate bioinformatics methods to analyze the data. The research presented in this dissertation focuses on three areas in cancer genomics: cancer somatic mutation detection, cancer driver gene identification, and transcriptome profiling at the single-cell level. NGS data analysis involves a series of complicated data transformations that convert raw sequencing data into information that is interpretable by cancer researchers. The first project in the dissertation established a robust, reproducible and scalable workflow management system for cancer genomics data analysis that automates the best-practice mutation calling pipelines to detect somatic single nucleotide polymorphisms, insertions, deletions and copy number variations from NGS data. It integrates mutation annotation, clinically actionable therapy prediction and data visualization, streamlining the sequence-to-report data transformation. In order to distinguish the driver mutations buried among a vast pool of passenger mutations from a somatic mutation calling project, we developed MEScan in the second project, a novel method that enables genome-scale identification of driver mutations based on a mutual exclusivity test using cancer somatic mutation data. MEScan implements an efficient statistical framework to screen de novo for mutually exclusive patterns while taking into account the patient-specific and gene-specific background mutation rates and adjusting for the heterogeneous mutation frequency. It outperforms several existing methods in simulation studies and on real-world datasets. 
Genome-wide screening using existing TCGA somatic mutation data discovers novel cancer-specific and pan-cancer mutually exclusive patterns. Bulk RNA sequencing (RNA-Seq) has become one of the most commonly used techniques for transcriptome profiling in a wide spectrum of biomedical and biological research. Analyzing bulk RNA-Seq reads to quantify expression at each gene locus is the first step towards identifying differentially expressed genes for downstream biological interpretation. Recent advances in single-cell RNA-seq (scRNA-seq) technology allow cancer biologists to profile gene expression at the higher resolution of individual cells. Preprocessing scRNA-seq data to quantify UMI-based gene counts is the key to characterizing intra-tumor cellular heterogeneity and identifying rare cells that govern tumor progression, metastasis and treatment resistance. Despite its popularity, summarizing gene counts from raw sequencing reads remains one of the most time-consuming steps with existing tools. Current pipelines do not balance efficiency and accuracy in large-scale gene count summarization in both bulk and scRNA-seq experiments. In the third project, we developed a lightweight k-mer based gene counting algorithm, FastCount, to accurately and efficiently quantify gene-level abundance using bulk RNA-seq or UMI-based scRNA-seq data. It achieves at least an order-of-magnitude speed improvement over the current gold standard pipelines while providing competitive accuracy.
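
    The idea of k-mer based gene counting mentioned in the abstract above can be illustrated with a minimal sketch. This is not the FastCount algorithm, whose details the abstract does not give; it only assumes that each read is assigned to the gene sharing the most k-mers with it, and that ambiguous reads are dropped. The function names, the toy sequences and the choice k=4 are all hypothetical, and UMI handling is left out.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(gene_seqs, k):
    """Map each k-mer to the set of genes whose sequence contains it."""
    index = defaultdict(set)
    for gene, seq in gene_seqs.items():
        for km in kmers(seq, k):
            index[km].add(gene)
    return index


def count_genes(reads, index, k):
    """Assign each read to the gene matched by most of its k-mers.

    Reads with no matching k-mer, or with a tie between the two
    best-scoring genes, are skipped (an assumption of this sketch).
    """
    counts = Counter()
    for read in reads:
        votes = Counter()
        for km in kmers(read, k):
            for gene in index.get(km, ()):
                votes[gene] += 1
        top = votes.most_common(2)
        if not top:
            continue  # no k-mer of this read occurs in any gene
        if len(top) == 2 and top[0][1] == top[1][1]:
            continue  # ambiguous read
        counts[top[0][0]] += 1
    return counts


# Toy, made-up sequences purely for illustration.
gene_seqs = {"geneA": "ACGTACGTGA", "geneB": "TTGCAATTGC"}
reads = ["ACGTACG", "GCAATTG", "GGGGGGG"]
index = build_index(gene_seqs, k=4)
counts = count_genes(reads, index, k=4)
```

    Looking k-mers up in a hash index avoids a full alignment of each read, which is the general reason k-mer methods tend to be fast. A real tool would also have to handle sequencing errors, strand orientation and k-mers shared between genes.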

    Coexpression analysis of large cancer datasets provides insight into the cellular phenotypes of the tumour microenvironment

    Background: Biopsies taken from individual tumours exhibit extensive differences in their cellular composition due to the inherent heterogeneity of cancers and the vagaries of sample collection. As a result, genes expressed in specific cell types, or associated with certain biological processes, are detected at widely variable levels across samples in transcriptomic analyses. This heterogeneity also means that the expression level of genes expressed specifically in a given cell type or process will vary in line with the number of those cells within samples or the activity of the pathway, and these genes will therefore be correlated in their expression. Results: Using a novel 3D network-based approach we have analysed six large human cancer microarray datasets derived from more than 1,000 individuals. Based upon this analysis, and without needing to isolate the individual cells, we have defined a broad spectrum of cell-type and pathway-specific gene signatures present in cancer expression data, which were also found to be largely conserved in a number of independent datasets. Conclusions: The conserved signature of the tumour-associated macrophage is shown to be largely independent of tumour cell type. All stromal cell signatures have some degree of correlation with each other, since they must all be inversely correlated with the tumour component. However, viewed in the context of established tumours, the interactions between stromal components appear to be multifactorial, given that the level of one component, e.g. vasculature, does not correlate tightly with another, such as the macrophage.
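
    The reasoning in the abstract above (genes specific to one cell type rise and fall together across samples) can be sketched as a small coexpression network. This is only an illustration under stated assumptions: Pearson correlation with a fixed cutoff stands in for the authors' 3D network-based approach, connected components stand in for their clustering step, and the expression values and the 0.9 threshold are invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expr, threshold):
    """Return gene pairs whose expression correlation is >= threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= threshold
    ]


def modules(genes, edges):
    """Group genes into connected components of the coexpression graph."""
    neighbours = {g: set() for g in genes}
    for g1, g2 in edges:
        neighbours[g1].add(g2)
        neighbours[g2].add(g1)
    seen, result = set(), []
    for gene in sorted(genes):
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        result.append(component)
    return result


# Made-up expression values across five samples (illustration only).
expr = {
    "CD68": [1, 2, 3, 4, 5],
    "CD163": [2, 4, 6, 8, 11],
    "PECAM1": [5, 1, 4, 2, 3],
    "VWF": [10, 2, 8, 5, 6],
}
edges = coexpression_edges(expr, threshold=0.9)
gene_modules = modules(expr, edges)
```

    In this toy data the two macrophage-associated genes form one module and the two endothelial-associated genes another, without any cell sorting, which is the effect the abstract describes.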

    BAYESIAN FRAMEWORKS FOR PARSIMONIOUS MODELING OF MOLECULAR CANCER DATA

    In this era of precision medicine, clinicians and researchers critically need the assistance of computational models that can accurately predict various clinical events and outcomes (e.g., diagnosis of disease, determining the stage of the disease, or molecular subtyping). Typically, statistics and machine learning are applied to ‘omic’ datasets, yielding computational models that can be used for prediction. In cancer research there is still a critical need for computational models that have high classification performance but are also parsimonious in the number of variables they use. Some models are very good at performing their intended classification task, but are too complex for human researchers and clinicians to understand, due to the large number of variables they use. In contrast, some models are specifically built with a small number of variables, but may lack excellent predictive performance. This dissertation proposes a novel framework, called Junction to Knowledge (J2K), for the construction of parsimonious computational models. The J2K framework consists of four steps: filtering (discretization and variable selection), Bayesian network generation, junction tree generation, and clique evaluation. The outcome of applying J2K to a particular dataset is a parsimonious Bayesian network model that has high predictive performance and is composed of a small number of variables. Not only does J2K find parsimonious gene cliques, but it also provides the ability to create multi-omic models that can further improve classification performance. These multi-omic models have the potential to accelerate biomedical discovery, followed by translation of their results into clinical practice.

    The integration of gene and miRNA expression using pathway topology: a case study on Epithelial Ovarian Cancer

    Pathways are formal descriptions of the biological processes involving finely regulated structures by which a cell converts molecules or processes signals. The study of gene expression in terms of pathways is defined as pathway analysis and aims at identifying groups of functionally related genes that show coordinated expression changes. Recently, pathway analysis has moved from algorithms using merely gene lists to ones exploiting the topology that defines gene connections. A crucial, and unfortunately limiting, step for these novel methods is the availability of the pathways as gene networks in which nodes are genes and edges are the relations between two elements. To this aim, we developed a pathway data interpreter, called graphite, able to uniformly store, process and convert pathway information into gene networks. graphite has been made publicly available as an R package within the Bioconductor platform. In the field of topological pathway analysis, graphite fills the existing gap between technical and methodological aspects. graphite i) allows more informative analyses of omics data and ii) allows the development of new methods based on the increased accessibility of biological knowledge. However, the pathways of the four main public resources integrated into graphite (KEGG, Reactome, Biocarta and PID) still lack crucial interactors: the microRNAs. MicroRNAs are small non-coding RNAs that post-transcriptionally regulate gene expression; their function on the messenger target is repressive, but their effect on transcription depends on the topology of the pathway in which the miRNA is involved. In the last decade, many targets have been discovered and experimentally validated, and dedicated databases providing this information are available. 
Thus, I worked on an extension of the graphite package able to integrate microRNAs into pathway topology, i) linking the non-coding RNAs to their validated target genes and ii) providing integrated networks suitable for topological pathway analyses. The feasibility of this approach has been validated in a specific biological context, the early stage of Epithelial Ovarian Cancer (EOC). EOC has long been considered a single disease. The emerging opinion, however, sees ovarian cancer as a general term that encompasses a group of histopathological subtypes sharing a common anatomic location. In collaboration with the Mario Negri Institute, 257 stage I EOC tumour biopsies were collected and stratified into training and validation sets. miRNA microarray data were used to generate the most highly reproducible signatures for each histotype through a dedicated resampling inferential strategy. qRT-PCR was used to validate the results in both the training and validation sets. The results indicate that the clear cell histotype is characterized by high expression levels of miR-30a and miR-30a*, while mucinous patients are characterized by high levels of miR-192 and miR-194, as, interestingly, are mucinous non-ovarian tissues. Then, the integrative approach that combines mRNA and miRNA profiles using graphite was applied to identify the mucinous-specific regulatory circuits. Taken together, our findings demonstrate that EOC histotypes have discriminant regulatory circuits that drive the differentiation of the tumour environment. Our approach successfully guides us towards important biological results with interesting therapeutic implications in EOC.
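
    The integration step described in the abstract above (linking miRNAs to their validated targets within a pathway's gene network) can be sketched roughly as follows. This is not the graphite API, which is an R package; the function name, the edge representation and the placeholder identifiers (GENE_A, miR-X and so on) are hypothetical and carry no biological meaning.

```python
def integrate_mirnas(pathway_edges, mirna_targets):
    """Add miRNA -> target edges to a pathway gene network.

    pathway_edges: set of (source, target, relation) tuples.
    mirna_targets: dict mapping a miRNA name to its validated target genes.
    Only targets already present as nodes in the pathway are linked, so the
    pathway gains regulators without gaining unrelated genes.
    """
    nodes = {node for source, target, _ in pathway_edges
             for node in (source, target)}
    edges = set(pathway_edges)
    for mirna, targets in mirna_targets.items():
        for gene in targets:
            if gene in nodes:
                # miRNA action on its messenger target is repressive
                edges.add((mirna, gene, "repression"))
    return edges


# Placeholder pathway and target lists (hypothetical identifiers).
pathway_edges = {
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
}
mirna_targets = {
    "miR-X": ["GENE_B", "GENE_Z"],  # GENE_Z is not in the pathway
    "miR-Y": ["GENE_Q"],            # no target in the pathway
}
integrated = integrate_mirnas(pathway_edges, mirna_targets)
```

    Restricting the new edges to genes already in the pathway keeps the original topology intact, so a topological analysis sees the same network with the miRNA added as an upstream repressor.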

    Integrative computational biology for cancer research

    Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed clinical cancer research. They have revealed novel molecular markers of cancer subtypes, metastasis, and drug sensitivity and resistance. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and great volumes of data are accumulating at a rapid pace. Here we discuss these challenges, and show how integrative computational biology can help diminish them by creating new software tools, analytical methods, and data standards.

    Graph-Theoretical Tools for the Analysis of Complex Networks

    We are currently experiencing an explosive growth in data collection technology that threatens to dwarf the commensurate gains in computational power predicted by Moore’s Law. At the same time, researchers across numerous domain sciences are finding success using network models to represent their data. Graph algorithms are then applied to study the topological structure and tease out latent relationships between variables. Unfortunately, the problems of interest, such as finding dense subgraphs, are often the most difficult to solve from a computational point of view. Together, these issues motivate the need for novel algorithmic techniques in the study of graphs derived from large, complex data sources. This dissertation describes the development and application of graph-theoretic tools for the study of complex networks. Algorithms are presented that leverage efficient, exact solutions to difficult combinatorial problems for epigenetic biomarker detection and disease subtyping based on gene expression signatures. Extensive testing on publicly available data is presented, supporting the efficacy of these approaches. To address efficient algorithm design, the two core tenets of fixed-parameter tractability (branching and kernelization) are studied in the context of a parallel implementation of vertex cover. Results of testing on a wide variety of graphs derived from both real and synthetic data are presented. The relative success of kernelization versus branching is shown to depend largely on the degree distribution of the graph. Throughout, an emphasis is placed upon the practicality of the resulting implementations to advance the limits of effective computation.
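
    The two fixed-parameter techniques named in the abstract above, kernelization and branching, can be shown on the decision version of vertex cover (is there a cover of at most k vertices?). The sketch below is a textbook, sequential version and not the dissertation's parallel implementation; it assumes a simple graph with each edge listed once and no self-loops, and all function names are illustrative.

```python
def degrees(edges):
    """Count how many edges touch each vertex."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def kernelize(edges, k):
    """High-degree rule: a vertex with more than k incident edges must
    belong to every cover of size <= k, so take it and shrink the budget.

    Returns (reduced_edges, forced_vertices, remaining_k), or None when
    no cover of size <= k can exist.
    """
    edges = set(edges)
    forced = set()
    while True:
        if k < 0:
            return None
        deg = degrees(edges)
        high = next((v for v, d in deg.items() if d > k), None)
        if high is None:
            break
        forced.add(high)
        edges = {e for e in edges if high not in e}
        k -= 1
    # Every remaining vertex covers at most k edges, so k vertices
    # can cover at most k * k edges.
    if len(edges) > k * k:
        return None
    return edges, forced, k


def branch(edges, k):
    """Bounded search tree: any cover must contain an endpoint of each edge,
    so try both endpoints of one uncovered edge."""
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = next(iter(edges))
    for pick in (u, v):
        rest = {e for e in edges if pick not in e}
        sub = branch(rest, k - 1)
        if sub is not None:
            return sub | {pick}
    return None


def vertex_cover(edges, k):
    """Return a vertex cover of size <= k, or None if none exists."""
    reduced = kernelize(edges, k)
    if reduced is None:
        return None
    kernel, forced, k_left = reduced
    sub = branch(kernel, k_left)
    return None if sub is None else forced | sub


# A star centred on vertex 0 plus one separate edge.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)]
cover = vertex_cover(edges, 2)
```

    On this graph the high-degree rule alone picks the hub and leaves a single edge for the search tree. On graphs without such hubs the rule never fires and branching does all the work, a small-scale view of the degree-distribution dependence that the abstract reports.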