77 research outputs found
VegaMC: a R/bioconductor package for fast downstream analysis of large array comparative genomic hybridization datasets
Abstract
Summary: Identification of genetic alterations of tumor cells has become a common method to detect the genes involved in development and progression of cancer. In order to detect driver genes, several samples need to be simultaneously analyzed. The Cancer Genome Atlas (TCGA) project provides access to a large amount of data for several cancer types. TGCA is an invaluable source of information, but analysis of this huge dataset possess important computational problems in terms of memory and execution times. Here, we present a R/package, called VegaMC (Vega multi-channel), that enables fast and efficient detection of significant recurrent copy number alterations in very large datasets. VegaMC is integrated with the output of the common tools that convert allele signal intensities in log R ratio and B allele frequency. It also enables the detection of loss of heterozigosity and provides in output two web pages allowing a rapid and easy navigation of the aberrant genes. Synthetic data and real datasets are used for quantitative and qualitative evaluation purposes. In particular, we demonstrate the ability of VegaMC on two large TGCA datasets: colon adenocarcinoma and glioblastoma multiforme. For both the datasets, we provide the list of aberrant genes which contain previously validated genes and can be used as basis for further investigations.
Availability: VegaMC is a R/Bioconductor Package, available at http://bioconductor.org/packages/release/bioc/html/VegaMC.html.
Contact: [email protected]
Supplementary Information: Supplementary data are available at Bioinformatics online
IRIS: a method for reverse engineering of regulatory relations in gene networks
<p>Abstract</p> <p>Background</p> <p>The ultimate aim of systems biology is to understand and describe how molecular components interact to manifest collective behaviour that is the sum of the single parts. Building a network of molecular interactions is the basic step in modelling a complex entity such as the cell. Even if gene-gene interactions only partially describe real networks because of post-transcriptional modifications and protein regulation, using microarray technology it is possible to combine measurements for thousands of genes into a single analysis step that provides a picture of the cell's gene expression. Several databases provide information about known molecular interactions and various methods have been developed to infer gene networks from expression data. However, network topology alone is not enough to perform simulations and predictions of how a molecular system will respond to perturbations. Rules for interactions among the single parts are needed for a complete definition of the network behaviour. Another interesting question is how to integrate information carried by the network topology, which can be derived from the literature, with large-scale experimental data.</p> <p>Results</p> <p>Here we propose an algorithm, called inference of regulatory interaction schema (IRIS), that uses an iterative approach to map gene expression profile values (both steady-state and time-course) into discrete states and a simple probabilistic method to infer the regulatory functions of the network. These interaction rules are integrated into a factor graph model. We test IRIS on two synthetic networks to determine its accuracy and compare it to other methods. We also apply IRIS to gene expression microarray data for the <it>Saccharomyces cerevisiae </it>cell cycle and for human B-cells and compare the results to literature findings.</p> <p>Conclusions</p> <p>IRIS is a rapid and efficient tool for the inference of regulatory relations in gene networks. A topological description of the network and a matrix of gene expression profiles are required as input to the algorithm. IRIS maps gene expression data onto discrete values and then computes regulatory functions as conditional probability tables. The suitability of the method is demonstrated for synthetic data and microarray data. The resulting network can also be embedded in a factor graph model.</p
Finding recurrent copy number alterations preserving within-sample homogeneity
Abstract
Motivation: Copy number alterations (CNAs) represent an important component of genetic variation and play a significant role in many human diseases. Development of array comparative genomic hybridization (aCGH) technology has made it possible to identify CNAs. Identification of recurrent CNAs represents the first fundamental step to provide a list of genomic regions which form the basis for further biological investigations. The main problem in recurrent CNAs discovery is related to the need to distinguish between functional changes and random events without pathological relevance. Within-sample homogeneity represents a common feature of copy number profile in cancer, so it can be used as additional source of information to increase the accuracy of the results. Although several algorithms aimed at the identification of recurrent CNAs have been proposed, no attempt of a comprehensive comparison of different approaches has yet been published.
Results: We propose a new approach, called Genomic Analysis of Important Alterations (GAIA), to find recurrent CNAs where a statistical hypothesis framework is extended to take into account within-sample homogeneity. Statistical significance and within-sample homogeneity are combined into an iterative procedure to extract the regions that likely are involved in functional changes. Results show that GAIA represents a valid alternative to other proposed approaches. In addition, we perform an accurate comparison by using two real aCGH datasets and a carefully planned simulation study.
Availability: GAIA has been implemented as R/Bioconductor package. It can be downloaded from the following page http://bioinformatics.biogem.it/download/gaia
Contact: [email protected]; [email protected]
Supplementary Information: Supplementary data are available at Bioinformatics online
TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach
<p>Abstract</p> <p>Background</p> <p>One of main aims of Molecular Biology is the gain of knowledge about how molecular components interact each other and to understand gene function regulations. Using microarray technology, it is possible to extract measurements of thousands of genes into a single analysis step having a picture of the cell gene expression. Several methods have been developed to infer gene networks from steady-state data, much less literature is produced about time-course data, so the development of algorithms to infer gene networks from time-series measurements is a current challenge into bioinformatics research area. In order to detect dependencies between genes at different time delays, we propose an approach to infer gene regulatory networks from time-series measurements starting from a well known algorithm based on information theory.</p> <p>Results</p> <p>In this paper we show how the ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) algorithm can be used for gene regulatory network inference in the case of time-course expression profiles. The resulting method is called TimeDelay-ARACNE. It just tries to extract dependencies between two genes at different time delays, providing a measure of these dependencies in terms of mutual information. The basic idea of the proposed algorithm is to detect time-delayed dependencies between the expression profiles by assuming as underlying probabilistic model a stationary Markov Random Field. Less informative dependencies are filtered out using an auto calculated threshold, retaining most reliable connections. TimeDelay-ARACNE can infer small local networks of time regulated gene-gene interactions detecting their versus and also discovering cyclic interactions also when only a medium-small number of measurements are available. We test the algorithm both on synthetic networks and on microarray expression profiles. Microarray measurements concern <it>S. cerevisiae </it>cell cycle, <it>E. coli </it>SOS pathways and a recently developed network for in vivo assessment of reverse engineering algorithms. Our results are compared with ARACNE itself and with the ones of two previously published algorithms: Dynamic Bayesian Networks and systems of ODEs, showing that TimeDelay-ARACNE has good accuracy, recall and <it>F</it>-score for the network reconstruction task.</p> <p>Conclusions</p> <p>Here we report the adaptation of the ARACNE algorithm to infer gene regulatory networks from time-course data, so that, the resulting network is represented as a directed graph. The proposed algorithm is expected to be useful in reconstruction of small biological directed networks from time course data.</p
VEGAWES: variational segmentation on whole exome sequencing for copy number detection
Background
Copy number variations are important in the detection and progression of significant tumors and diseases. Recently, Whole Exome Sequencing is gaining popularity with copy number variations detection due to low cost and better efficiency. In this work, we developed VEGAWES for accurate and robust detection of copy number variations on WES data. VEGAWES is an extension to a variational based segmentation algorithm, VEGA: Variational estimator for genomic aberrations, which has previously outperformed several algorithms on segmenting array comparative genomic hybridization data.
Results
We tested this algorithm on synthetic data and 100 Glioblastoma Multiforme primary tumor samples. The results on the real data were analyzed with segmentation obtained from Single-nucleotide polymorphism data as ground truth. We compared our results with two other segmentation algorithms and assessed the performance based on accuracy and time.
Conclusions
In terms of both accuracy and time, VEGAWES provided better results on the synthetic data and tumor samples demonstrating its potential in robust detection of aberrant regions in the genome
Short inverted repeats contribute to localized mutability in human somatic cells.
Selected repetitive sequences termed short inverted repeats (SIRs) have the propensity to form secondary DNA structures called hairpins. SIRs comprise palindromic arm sequences separated by short spacer sequences that form the hairpin stem and loop respectively. Here, we show that SIRs confer an increase in localized mutability in breast cancer, which is domain-dependent with the greatest mutability observed within spacer sequences (∼1.35-fold above background). Mutability is influenced by factors that increase the likelihood of formation of hairpins such as loop lengths (of 4-5 bp) and stem lengths (of 7-15 bp). Increased mutability is an intrinsic property of SIRs as evidenced by how almost all mutational processes demonstrate a higher rate of mutagenesis of spacer sequences. We further identified 88 spacer sequences showing enrichment from 1.8- to 90-fold of local mutability distributed across 283 sites in the genome that intriguingly, can be used to inform the biological status of a tumor
The genome as a record of environmental exposure.
Whole genome sequencing of human tumours has revealed distinct patterns of mutation that hint at the causative origins of cancer. Experimental investigations of the mutations and mutation spectra induced by environmental mutagens have traditionally focused on single genes. With the advent of faster cheaper sequencing platforms, it is now possible to assess mutation spectra in experimental models across the whole genome. As a proof of principle, we have examined the whole genome mutation profiles of mouse embryo fibroblasts immortalised following exposure to benzo[a]pyrene (BaP), ultraviolet light (UV) and aristolochic acid (AA). The results reveal that each mutagen induces a characteristic mutation signature: predominantly G→T mutations for BaP, C→T and CC→TT for UV and A→T for AA. The data are not only consistent with existing knowledge but also provide additional information at higher levels of genomic organisation. The approach holds promise for identifying agents responsible for mutations in human tumours and for shedding light on the aetiology of human cancer
The topography of mutational processes in breast cancer genomes.
Somatic mutations in human cancers show unevenness in genomic distribution that correlate with aspects of genome structure and function. These mutations are, however, generated by multiple mutational processes operating through the cellular lineage between the fertilized egg and the cancer cell, each composed of specific DNA damage and repair components and leaving its own characteristic mutational signature on the genome. Using somatic mutation catalogues from 560 breast cancer whole-genome sequences, here we show that each of 12 base substitution, 2 insertion/deletion (indel) and 6 rearrangement mutational signatures present in breast tissue, exhibit distinct relationships with genomic features relating to transcription, DNA replication and chromatin organization. This signature-based approach permits visualization of the genomic distribution of mutational processes associated with APOBEC enzymes, mismatch repair deficiency and homologous recombinational repair deficiency, as well as mutational processes of unknown aetiology. Furthermore, it highlights mechanistic insights including a putative replication-dependent mechanism of APOBEC-related mutagenesis
A somatic-mutational process recurrently duplicates germline susceptibility loci and tissue-specific super-enhancers in breast cancers
Somatic rearrangements contribute to the mutagenized landscape of cancer genomes. Here, we systematically interrogated rearrangements in 560 breast cancers by using a piecewise constant fitting approach. We identified 33 hotspots of large (>100 kb) tandem duplications, a mutational signature associated with homologous-recombination-repair deficiency. Notably, these tandem-duplication hotspots were enriched in breast cancer germline susceptibility loci (odds ratio (OR) = 4.28) and breast-specific 'super-enhancer' regulatory elements (OR = 3.54). These hotspots may be sites of selective susceptibility to double-strand-break damage due to high transcriptional activity or, through incrementally increasing copy number, may be sites of secondary selective pressure. The transcriptomic consequences ranged from strong individual oncogene effects to weak but quantifiable multigene expression effects. We thus present a somatic-rearrangement mutational process affecting coding sequences and noncoding regulatory elements and contributing a continuum of driver consequences, from modest to strong effects, thereby supporting a polygenic model of cancer development.DG is supported by the EU-FP7-SUPPRESSTEM project. SN-Z is funded by a Wellcome Trust Intermediate Fellowship (WT100183MA) and is a Wellcome Beit Fellow. For more information, please visit the publisher's website
A somatic-mutational process recurrently duplicates germline susceptibility loci and tissue-specific super-enhancers in breast cancers
Somatic rearrangements contribute to the mutagenized landscape of cancer genomes. Here, we systematically interrogated rearrangements in 560 breast cancers by using a piecewise constant fitting approach. We identified 33 hotspots of large (>100 kb) tandem duplications, a mutational signature associated with homologous-recombination-repair deficiency. Notably, these tandem-duplication hotspots were enriched in breast cancer germline susceptibility loci (odds ratio (OR) = 4.28) and breast-specific 'super-enhancer' regulatory elements (OR = 3.54). These hotspots may b
- …