thesis

Integrative and Comparative Analysis of Retinoblastoma and Osteosarcoma

Abstract

In the last decade and a half, the widespread adoption of high-throughput methods in molecular biology has generated a vast number of datasets, which have revealed the previously unfathomed complexity of the cell's regulatory mechanisms. The recently published results of the ENCODE project (ENCODE Project Consortium et al., 2012) demonstrated the extent of these mechanisms in the human genome, and more regulatory mechanisms will certainly be discovered in the future. Already, this complexity within a single cell - without taking into account cell-cell interactions or micro-environment influences - cannot be grasped by the human mind. However, understanding it is the key to devising treatments adapted to genetic diseases or disorders, cancer among them. In mathematics, such complex problems are addressed using methods that reduce their complexity, so that they can be modeled in a solvable manner. In biology, this led researchers to develop the concept of systems biology as a means to abstract the complexity of the cell's regulatory network. To date, most published studies using high-throughput technologies focus on only one kind of regulatory mechanism and hence cannot be used as such to investigate the interactions between these mechanisms. Moreover, distinguishing causative from confounding factors within such studies is difficult. These were my original motivations for developing analytical and statistical methods that control for the effects of confounding factors and allow the integrative and comparative analysis of different kinds of datasets. Ultimately, three different tools were developed to achieve this goal. First, "customCDF": a tool to redefine the Chip Description File (CDF) of Affymetrix GeneChips. It increases the sensitivity of downstream analyses, as these benefit from the constantly evolving human genome reference and annotations. Second, "aSim": a tool to simulate microarray data, which was required to benchmark the developed algorithms. 
Third, "crossChip": an R package bundling a set of combined statistical methods for the integrative analysis and a modification of that approach for the comparative analysis. The "customCDF" and "aSim" tools were first validated on independent datasets. The developed analytical methods ("crossChip") were first validated on "aSim" simulated data and publicly available datasets, and then used to answer two biological questions. First, using two retinoblastoma datasets, the effect of genomic copy number variations on gene expression was investigated. Then, motivated by the fact that retinoblastoma patients have a higher risk of developing osteosarcoma later in life than the general population, datasets of both tumors were comparatively analyzed to assess their similarities and differences. Despite the rather limited number of samples within the selected datasets, the developed approaches, with their higher sensitivity and specificity, were successful and laid the groundwork for larger-scale analyses. Indeed, the integrative analysis applied to retinoblastoma revealed the high importance of the chromosome 6 gain at a later stage of the disease, indicating that many genes on that chromosome are beneficial to carcinogenesis. Moreover, in comparison to standard microarray analyses, it proved effective at detecting the interplay of regulatory mechanisms: examples of positive and negative compensation of gene expression in lost and gained regions, respectively, as well as examples of antisense transcription and of pseudogene and snRNA regulation, were identified in this dataset. The comparative analysis, on the other hand, revealed the high similarity of the retinoblastoma and osteosarcoma tumors, while at the same time showing that each of them takes advantage of its distinct micro-environment and consequently appears to make use of different signaling pathways: PKC/calmodulin in retinoblastoma and GPCR/RAS in osteosarcoma. 
The developed tools and statistical methods have demonstrated their validity and utility by giving sensible answers to the two biological questions addressed. Moreover, they generated a large number of interesting hypotheses that warrant further investigation. Because they are not limited to microarray analysis but can be applied to any data generated by high-throughput methods, they also demonstrate the usefulness of "systems biology" approaches for studying carcinogenesis.
