
    Statistically validated networks in bipartite complex systems

    Many complex systems present an intrinsic bipartite nature and are often described and modeled in terms of networks [1-5]. Examples include movies and actors [1, 2, 4], authors and scientific papers [6-9], email accounts and emails [10], and plants and the animals that pollinate them [11, 12]. Bipartite networks are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set. When one constructs a projected network with nodes from only one set, the system heterogeneity makes it very difficult to identify preferential links between the elements. Here we introduce an unsupervised method to statistically validate each link of the projected network against a null hypothesis taking into account the heterogeneity of the system. We apply our method to three different systems, namely the set of clusters of orthologous genes (COG) in completely sequenced genomes [13, 14], a set of daily returns of 500 US financial stocks, and the set of world movies of the IMDb database [15]. In all these systems, which differ in both size and level of heterogeneity, we find that our method is able to detect network structures which are informative about the system and are not simply an expression of its heterogeneity. Specifically, our method (i) identifies the preferential relationships between the elements, (ii) naturally highlights the clustered structure of the investigated systems, and (iii) allows links to be classified according to the type of statistically validated relationships between the connected nodes. Comment: Main text: 13 pages, 3 figures, and 1 table. Supplementary information: 15 pages, 3 figures, and 2 tables.
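The abstract above does not name the null hypothesis it uses. As a minimal sketch, assume a hypergeometric null for the number of shared neighbours and a Bonferroni correction over all node pairs; both choices, along with the function names and the toy memberships, are illustrative assumptions rather than the authors' stated procedure.

```python
from math import comb


def cooccurrence_p_value(n_total, deg_a, deg_b, shared):
    """P(X >= shared) when deg_a and deg_b neighbours are drawn at random
    from n_total elements (hypergeometric null, assumed here)."""
    upper = min(deg_a, deg_b)
    favourable = sum(
        comb(deg_a, k) * comb(n_total - deg_a, deg_b - k)
        for k in range(shared, upper + 1)
    )
    return favourable / comb(n_total, deg_b)


def validated_links(memberships, n_total, alpha=0.01):
    """Keep projected links whose p-value passes a Bonferroni threshold.

    memberships maps each node of the projected set (e.g. an actor) to the
    set of elements of the other set it is linked to (e.g. movies).
    """
    nodes = sorted(memberships)
    n_tests = len(nodes) * (len(nodes) - 1) // 2
    threshold = alpha / n_tests  # Bonferroni correction (assumed)
    links = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            shared = len(memberships[a] & memberships[b])
            if shared == 0:
                continue  # no link in the projected network
            p = cooccurrence_p_value(
                n_total, len(memberships[a]), len(memberships[b]), shared
            )
            if p < threshold:
                links.append((a, b, p))
    return links


# Hypothetical toy data: three "actors" and 20 "movies".
toy = {"A": {0, 1, 2, 3, 4}, "B": {0, 1, 2, 3, 4}, "C": {4, 10}}
print(validated_links(toy, n_total=20))
```

In this toy case only the A-B link survives: sharing all five neighbours is very unlikely under the null, whereas a single shared neighbour between A and C is not.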

    Application of machine learning methods to histone methylation ChIP-Seq data reveals H4R3me2 globally represses gene expression

    Background: In the last decade, biochemical studies have revealed that epigenetic modifications, including histone modifications, histone variants and DNA methylation, form a complex network that regulates the state of chromatin and the processes that depend on it, including transcription and DNA replication. Currently, a large number of these epigenetic modifications are being mapped in a variety of cell lines at different stages of development using high-throughput sequencing by members of the ENCODE consortium, the NIH Roadmap Epigenomics Program and the Human Epigenome Project. An extremely promising and underexplored area of research is the application of machine learning methods, which are designed to construct predictive network models, to these large-scale epigenomic data sets.
    Results: Using a ChIP-Seq data set of 20 histone lysine and arginine methylations and histone variant H2A.Z in human CD4+ T-cells, we built predictive models of gene expression as a function of histone modification/variant levels using Multilinear (ML) Regression and Multivariate Adaptive Regression Splines (MARS). Along with extensive crosstalk among the 20 histone methylations, we found that H4R3me2 was the most and second most globally repressive histone methylation among the 20 studied in the ML and MARS models, respectively. In support of our finding, a number of experimental studies show that PRMT5-catalyzed symmetric dimethylation of H4R3 is associated with repression of gene expression. This includes a recent study which demonstrated that H4R3me2 is required for DNMT3A-mediated DNA methylation, a known global repressor of gene expression.
    Conclusion: In stark contrast to univariate analysis of the relationship between H4R3me2 and gene expression levels, our study showed that the regulatory role of some modifications like H4R3me2 is masked by confounding variables, but can be elucidated by multivariate/systems-level approaches.
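The masking effect described in the conclusion can be illustrated with a small least-squares sketch. The numbers below are made up for illustration and are not data from the study; `activating` and `repressive` are hypothetical predictor names, and `fit_linear` is a plain normal-equations solver written for this example.

```python
def fit_linear(rows, y):
    """Ordinary least squares with an intercept, via the normal equations.

    rows is a list of feature lists; returns [intercept, coef_1, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * t for x, t in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy levels of two correlated marks (illustrative values only).
activating = [0, 1, 2, 3, 4, 5]
repressive = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
# Expression is built so the second mark truly lowers it.
expression = [2 * a - r for a, r in zip(activating, repressive)]

univariate = fit_linear([[r] for r in repressive], expression)
multivariate = fit_linear(list(zip(activating, repressive)), expression)
print("univariate slope for repressive mark:", univariate[1])
print("multivariate coefficient for repressive mark:", multivariate[2])
```

Taken alone, the repressive mark appears positively associated with expression because it co-occurs with the activating mark. Once both predictors enter the model, its coefficient becomes negative.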

    Impaired Resting-State Functional Integrations within Default Mode Network of Generalized Tonic-Clonic Seizures Epilepsy

    Generalized tonic-clonic seizures (GTCS) are characterized by unresponsiveness and convulsions, which cause complete loss of consciousness. Many recent studies have found that the ictal alterations in brain activity of GTCS epilepsy patients focally involve certain brain regions, including the thalamus, upper brainstem, medial prefrontal cortex, posterior midbrain regions, and lateral parietal cortex. Notably, many of these affected brain regions coincide or overlap considerably with the components of the so-called default mode network (DMN). Here, we hypothesize that the brain activity of the DMN of GTCS epilepsy patients is different from that of normal controls, even in the resting state. To test this hypothesis, we compared the DMN of the GTCS epilepsy patients and the controls using resting-state functional magnetic resonance imaging. Thirteen brain areas in the DMN were extracted, and a complete undirected weighted graph was used to model the DMN for each participant. When directly comparing the edges of the graph, we found significantly decreased functional connectivity within the DMN of the GTCS epilepsy patients compared with the controls. As for the nodes of the graph, we found that the degree of some brain areas within the DMN was significantly reduced in the GTCS epilepsy patients, including the anterior medial prefrontal cortex, the bilateral superior frontal cortex, and the posterior cingulate cortex. We then investigated possible mechanisms by which GTCS epilepsy could reduce the functional integration of the DMN. Our results suggest that the functional integration of the DMN is damaged in GTCS epilepsy patients even during the resting state, which could help to understand the neural correlates of the impaired consciousness of GTCS epilepsy patients.
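The graph model described above can be sketched as follows. The abstract does not say how edge weights were computed, so this example assumes each weight is the Pearson correlation between two regional time series; the region labels and signal values are placeholders.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def build_graph(series):
    """Complete undirected weighted graph: one edge per pair of regions."""
    return {
        (a, b): pearson(series[a], series[b])
        for a, b in combinations(sorted(series), 2)
    }


def weighted_degree(graph, node):
    """Sum of the weights of all edges that touch the node."""
    return sum(w for (a, b), w in graph.items() if node in (a, b))


# Placeholder time series for three hypothetical regions.
example_series = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [1, 3, 2, 4],
}
graph = build_graph(example_series)
print(graph)
print(weighted_degree(graph, "A"))
```

With the thirteen regions mentioned in the abstract, a complete graph of this kind has 13 × 12 / 2 = 78 edges per participant. Edges and node degrees can then be compared between patients and controls.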

    GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers

    We describe methods with enhanced power and specificity to identify genes targeted by somatic copy-number alterations (SCNAs) that drive cancer growth. By separating SCNA profiles into underlying arm-level and focal alterations, we improve the estimation of background rates for each category. We additionally describe a probabilistic method for defining the boundaries of selected-for SCNA regions with user-defined confidence. Here we detail this revised computational approach, GISTIC2.0, and validate its performance in real and simulated datasets

    Detection of recurrent rearrangement breakpoints from copy number data

    Background: Copy number variants (CNVs), including deletions, amplifications, and other rearrangements, are common in human and cancer genomes. Copy number data from array comparative genome hybridization (aCGH) and next-generation DNA sequencing are widely used to measure copy number variants. Comparison of copy number data from multiple individuals reveals recurrent variants. Typically, the interior of a recurrent CNV is examined for genes or other loci associated with a phenotype. However, in some cases, such as gene truncations and fusion genes, the target of the variant lies at the boundary of the variant.
    Results: We introduce Neighborhood Breakpoint Conservation (NBC), an algorithm for identifying rearrangement breakpoints that are highly conserved at the same locus in multiple individuals. NBC detects recurrent breakpoints at varying levels of resolution, including breakpoints whose location is exactly conserved and breakpoints whose location varies within a gene. NBC also identifies pairs of recurrent breakpoints such as those that result from fusion genes. We apply NBC to aCGH data from 36 primary prostate tumors and identify 12 novel rearrangements, one of which is the well-known TMPRSS2-ERG fusion gene. We also apply NBC to 227 glioblastoma tumors and predict 93 novel rearrangements, which we further classify as gene truncations, germline structural variants, and fusion genes. A number of these variants involve the protein phosphatase PTPN12, suggesting that deregulation of PTPN12, via a variety of rearrangements, is common in glioblastoma.
    Conclusions: We demonstrate that NBC is useful for detection of recurrent breakpoints resulting from copy number variants or other structural variants, and in particular identifies recurrent breakpoints that result in gene truncations or fusion genes. Software is available at http://http.//cs.brown.edu/people/braphael/software.html.
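The idea of a breakpoint recurring at the same locus in several individuals can be illustrated with a naive counting sketch. This is not the NBC algorithm, whose details are not given in the abstract. It only counts how many samples have a breakpoint within a fixed window of a candidate position. The function name, the `window` and `min_samples` parameters, and the sample data are all hypothetical.

```python
def recurrent_breakpoints(breakpoints_by_sample, window, min_samples):
    """Return (position, n_samples) for positions supported by enough samples.

    A sample supports a candidate position if it has at least one breakpoint
    within `window` bases of it. window=0 demands exactly conserved positions;
    a larger window tolerates positions that vary, e.g. within a gene.
    """
    candidates = sorted(
        {p for bps in breakpoints_by_sample.values() for p in bps}
    )
    hits = []
    for pos in candidates:
        support = sum(
            any(abs(p - pos) <= window for p in bps)
            for bps in breakpoints_by_sample.values()
        )
        if support >= min_samples:
            hits.append((pos, support))
    return hits


# Hypothetical breakpoint coordinates on one chromosome for three samples.
samples = {
    "s1": [100, 5000],
    "s2": [105, 9000],
    "s3": [98],
}
print(recurrent_breakpoints(samples, window=10, min_samples=3))
print(recurrent_breakpoints(samples, window=0, min_samples=2))
```

Varying `window` mimics the different levels of resolution mentioned in the abstract. The loose window groups the three nearby breakpoints, while the exact-match setting finds nothing in this toy data.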

    Rapid and Accurate Multiple Testing Correction and Power Estimation for Millions of Correlated Markers

    With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu
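For orientation, the permutation test that the abstract treats as the gold standard can be sketched as below. This is the baseline procedure, not SLIDE or SLIP. The association statistic (absolute difference in mean allele count between cases and controls), the function names, and the toy genotypes are assumptions made for illustration.

```python
import random


def assoc_stat(genotypes, labels):
    """Absolute difference in mean allele count between cases and controls."""
    cases = [g for g, l in zip(genotypes, labels) if l == 1]
    controls = [g for g, l in zip(genotypes, labels) if l == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_corrected_p(markers, labels, n_perm=1000, seed=0):
    """Multiple-testing-corrected p-values from the max-statistic permutation test.

    For each permutation of the case/control labels, the largest statistic
    over all markers is recorded. A marker's corrected p-value is the share
    of permutations whose maximum reaches its observed statistic.
    """
    rng = random.Random(seed)
    observed = [assoc_stat(m, labels) for m in markers]
    shuffled = list(labels)
    exceed = [0] * len(markers)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_stat = max(assoc_stat(m, shuffled) for m in markers)
        for i, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[i] += 1
    # The +1 terms count the observed labelling as one permutation.
    return [(e + 1) / (n_perm + 1) for e in exceed]


# Toy data: 8 individuals (4 cases, 4 controls), 2 markers.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
markers = [
    [2, 2, 2, 2, 0, 0, 0, 0],  # perfectly separates cases from controls
    [1, 1, 1, 1, 1, 1, 1, 1],  # carries no signal
]
corrected = permutation_corrected_p(markers, labels)
print(corrected)
```

Every permutation recomputes the statistic for every marker, so the cost grows with the product of markers and permutations. This is why the abstract calls the approach computationally impractical for large datasets.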

    Genetically-Based Olfactory Signatures Persist Despite Dietary Variation

    Individual mice have a unique odor, or odortype, that facilitates individual recognition. Odortypes, like other phenotypes, can be influenced by genetic and environmental variation. The genetic influence derives in part from genes of the major histocompatibility complex (MHC). A major environmental influence is diet, which could obscure the genetic contribution to odortype. Because odortype stability is a prerequisite for individual recognition under normal behavioral conditions, we investigated whether MHC-determined urinary odortypes of inbred mice can be identified in the face of large diet-induced variation. Mice trained to discriminate urines from panels of mice that differed both in diet and MHC type found the diet odor more salient in generalization trials. Nevertheless, when mice were trained to discriminate mice with only MHC differences (but on the same diet), they recognized the MHC difference when tested with urines from mice on a different diet. This indicates that MHC odor profiles remain despite large dietary variation. Chemical analyses of urinary volatile organic compounds (VOCs) extracted by solid phase microextraction (SPME) and analyzed by gas chromatography/mass spectrometry (GC/MS) are consistent with this inference. Although diet influenced VOC variation more than MHC, with algorithmic training (supervised classification) MHC types could be accurately discriminated across different diets. Thus, although there are clear diet effects on urinary volatile profiles, they do not obscure MHC effects

    Combined analysis of transcriptome and metabolite data reveals extensive differences between black and brown nearly-isogenic soybean (Glycine max) seed coats enabling the identification of pigment isogenes

    Background: The R locus controls the color of pigmented soybean (Glycine max) seeds. However, information about its control over seed coat biochemistry and gene expression remains limited. The seed coats of nearly-isogenic black (iRT) and brown (irT) soybean (Glycine max) were known to differ by the presence or absence of anthocyanins, respectively, with genes for only a single enzyme (anthocyanidin synthase) found to be differentially expressed between isolines. We recently identified and characterized a UDP-glycose:flavonoid-3-O-glycosyltransferase (UGT78K1) from the seed coat of black (iRT) soybean with the aim of engineering seed coat color by suppression of an anthocyanin-specific gene. However, it remained to be investigated whether UGT78K1 was overexpressed with anthocyanin biosynthesis in the black (iRT) seed coat compared to the nearly-isogenic brown (irT) tissue. In this study, we performed a combined analysis of transcriptome and metabolite data to elucidate the control of the R locus over seed coat biochemistry and to identify pigment biosynthesis genes. Two differentially expressed late-stage anthocyanin biosynthesis isogenes were further characterized, as they may serve as useful targets for the manipulation of soybean grain color while minimizing the potential for unintended effects on the plant system.
    Results: Metabolite composition differences were found not to be limited to anthocyanins, with specific proanthocyanidins, isoflavones, and phenylpropanoids present exclusively in the black (iRT) or the brown (irT) seed coat. A global analysis of gene expression identified UGT78K1 and 19 other anthocyanin, (iso)flavonoid, and phenylpropanoid isogenes as differentially expressed between isolines. A combined analysis of metabolite and gene expression data enabled the assignment of putative functions to biosynthesis and transport isogenes. The recombinant enzymes of two genes were validated to catalyze late-stage steps in anthocyanin biosynthesis in vitro, and expression profiles of the corresponding genes were shown to parallel anthocyanin biosynthesis during black (iRT) seed coat development.
    Conclusion: Metabolite composition and gene expression differences between black (iRT) and brown (irT) seed coats are far more extensive than previously thought. Putative anthocyanin, proanthocyanidin, (iso)flavonoid, and phenylpropanoid isogenes were differentially expressed between black (iRT) and brown (irT) seed coats, and UGT78K2 and OMT5 were validated to code for UDP-glycose:flavonoid-3-O-glycosyltransferase and anthocyanin 3'-O-methyltransferase proteins in vitro, respectively. Duplicate gene copies for several enzymes were overexpressed in the black (iRT) seed coat, suggesting that more than one isogene may have to be silenced to engineer seed coat color using RNA interference.