3,081 research outputs found

    A statistical approach for array CGH data analysis

    BACKGROUND: Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile. RESULTS: We demonstrate that existing methods for estimating the number of segments are not well adapted to array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performance of this method is discussed based on simulations and publicly available data sets. Then we discuss the choice of modeling for array CGH data and show that the model with a homogeneous variance is well suited to this context. CONCLUSIONS: Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome.
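The segmentation idea in the abstract above (a Gaussian profile whose mean changes abruptly, with a homogeneous variance) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `segment_means` is hypothetical, the number of segments `k` is assumed to be given, and the adaptive criterion for choosing `k` that the abstract proposes is not reproduced. Under these assumptions, maximizing the Gaussian likelihood for a fixed `k` amounts to minimizing the within-segment sum of squares, which dynamic programming solves exactly.

```python
import numpy as np

def segment_means(y, k):
    """Split y into k contiguous segments minimizing the within-segment
    sum of squares (mean-change model, common variance).
    Returns (bounds, cost) where bounds = [0, b1, ..., len(y)]."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums let us compute the cost of any segment y[i:j] in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations of y[i:j] from its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[m, j]: minimal cost of splitting y[:j] into m segments.
    # last[m, j]: start index of the final segment in that optimum.
    best = np.full((k + 1, n + 1), np.inf)
    last = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1, i] + cost(i, j)
                if c < best[m, j]:
                    best[m, j] = c
                    last[m, j] = i

    # Backtrack from the full profile to recover the breakpoints.
    bounds = [n]
    j = n
    for m in range(k, 0, -1):
        j = int(last[m, j])
        bounds.append(j)
    bounds.reverse()
    return bounds, float(best[k, n])
```

For a profile of `n` probes this sketch runs in O(k n²) time, which is workable for BAC-resolution arrays but would need a faster method for dense arrays.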

    Spatial clustering of array CGH features in combination with hierarchical multiple testing

    We propose a new approach for clustering DNA features using array CGH data from multiple tumor samples. We distinguish data-collapsing (joining contiguous DNA clones or probes with extremely similar data into regions) from clustering (joining contiguous, correlated regions based on a maximum likelihood principle). The model-based clustering algorithm accounts for the apparent spatial patterns in the data. We evaluate the randomness of the clustering result by a cluster stability score in combination with cross-validation. Moreover, we argue that the clustering really captures spatial genomic dependency by showing that coincidental clustering of independent regions is very unlikely. Using the region and cluster information, we test both for association with a clinical variable in a hierarchical multiple testing approach. This allows for interpreting the significance of both regions and clusters while controlling the Family-Wise Error Rate simultaneously. We prove that in the context of permutation tests and permutation-invariant clusters it is allowed to perform clustering and testing on the same data set. Our procedures are illustrated on two cancer data sets.

    Focus on 16p13.3 Locus in colon cancer

    Background: With one million new cases of colorectal cancer (CRC) diagnosed annually in the world, CRC is the third most commonly diagnosed cancer in the Western world. Patients with stage I-III CRC can be cured with surgery but are at risk for recurrence. Colorectal cancer is characterized by the presence of chromosomal deletions and gains. Large genomic profiling studies have, however, not been conducted in this disease. The number of a specific genetic aberration in a tumour sample could correlate with recurrence-free survival or overall survival, possibly leading to its use as a biomarker for therapeutic decisions. At this point there are not sufficient markers for prediction of disease recurrence in colorectal cancer that can be used in the clinic to identify the stage II patients who will benefit from adjuvant chemotherapy. For instance, the benefit of adjuvant chemotherapy has been most clearly demonstrated in stage III disease, with an approximately 30 percent relative reduction in the risk of disease recurrence. The benefits of adjuvant chemotherapy in stage II disease are less certain; the risk for relapse is much smaller in the overall group and the specific patients at risk are hard to identify. Materials and Methods: In this study, array-comparative genomic hybridization analysis (array-CGH) was applied to study high-resolution DNA copy number alterations in 93 colon carcinoma samples. These genomic data were combined with parameters like KRAS mutation status, microsatellite status and clinicopathological characteristics. Results: Both large and small chromosomal losses and gains were identified in our sample cohort. Recurrent gains were found for chromosome 1q, 7, 8q, 13 and 20 and losses were mostly found for 1p, 4, 8p, 14, 15, 17p, 18, 21 and 22. Data analysis demonstrated that loss of chromosome 4 is linked to a worse prognosis in our patient series. Besides these alterations, two interesting small regions of overlap were identified, which could be associated with disease recurrence. Gain of the 16p13.3 locus (including the RNA binding protein, fox-1 homolog gene, RBFOX1) was linked with a worse recurrence-free survival in our patient cohort. On the other hand, loss of RBFOX1 was only found in patients without disease recurrence. Most interestingly, the above-mentioned characteristics were also found in stage II patients, for whom there is a high medical need for the identification of new prognostic biomarkers. Conclusions: Copy number variation of the 16p13.3 locus seems to be an important parameter for prediction of disease recurrence in colon cancer.

    Copy number variants and selective sweeps in natural populations of the house mouse (Mus musculus domesticus)

    Copy-number variants (CNVs) may play an important role in early adaptations, potentially facilitating rapid divergence of populations. We describe an approach to study this question by investigating CNVs present in natural populations of mice in the early stages of divergence and their involvement in selective sweeps. We have analyzed individuals from two recently diverged natural populations of the house mouse (Mus musculus domesticus) from Germany and France using custom, high-density, comparative genome hybridization arrays (CGH) that covered almost 164 Mb and 2444 genes. Of those genes, 1861 were previously identified as differentially expressed between these populations, while the expression of the remaining genes was invariant. In total, we identified 1868 CNVs across all 10 samples, 200 bp to 600 kb in size and affecting 424 genic regions. Roughly two thirds of all CNVs found were deletions. We found no enrichment of CNVs among the differentially expressed genes between the populations compared to the invariant ones, nor any meaningful correlation between CNVs and gene expression changes. Among the CNV genes, we found cellular component gene ontology categories of the synapse overrepresented among all the 2444 genes tested. To investigate potential adaptive significance of the CNV regions, we selected six that showed large differences in frequency of CNVs between the two populations and analyzed variation in at least two microsatellites surrounding the loci in a sample of 46 unrelated animals from the same populations collected in field trappings. We identified two loci with large differences in microsatellite heterozygosity (Sfi1 and Glo1/Dnahc8 regions) and one locus with low variation across the populations (Cmah), thus suggesting that these genomic regions might have recently undergone selective sweeps. Interestingly, the Glo1 CNV has previously been implicated in anxiety-like behavior in mice, suggesting a differential evolution of a behavioral trait

    Bayesian Hidden Markov Modeling of Array CGH Data

    Genomic alterations have been linked to the development and progression of cancer. The technique of Comparative Genomic Hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about the number of copies in DNA. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array-CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in the number of copies based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Since the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme and breast cancer are analyzed, and comparisons are made with some widely-used algorithms to illustrate the reliability and success of the technique

    Normalized, Segmented or Called aCGH Data?

    Array comparative genomic hybridization (aCGH) is a high-throughput lab technique to measure genome-wide chromosomal copy numbers. Data from aCGH experiments require extensive pre-processing, which consists of three steps: normalization, segmentation and calling. Each of these pre-processing steps yields a different data set: normalized data, segmented data, and called data. Publications using aCGH base their findings on data from all stages of the pre-processing. Hence, there is no consensus on which should be used for further downstream analysis. This consensus is, however, important for correct reporting of findings and for comparison of results from different studies. We discuss several issues that should be taken into account when deciding on which data are to be used. We express the belief that called data are best used, but would welcome opposing views.