120 research outputs found

    Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

    Get PDF
    Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets

    RegPredict: an integrated system for regulon inference in prokaryotes by comparative genomics approach

    Get PDF
    RegPredict web server is designed to provide comparative genomics tools for reconstruction and analysis of microbial regulons using comparative genomics approach. The server allows the user to rapidly generate reference sets of regulons and regulatory motif profiles in a group of prokaryotic genomes. The new concept of a cluster of co-regulated orthologous operons allows the user to distribute the analysis of large regulons and to perform the comparative analysis of multiple clusters independently. Two major workflows currently implemented in RegPredict are: (i) regulon reconstruction for a known regulatory motif and (ii) ab initio inference of a novel regulon using several scenarios for the generation of starting gene sets. RegPredict provides a comprehensive collection of manually curated positional weight matrices of regulatory motifs. It is based on genomic sequences, ortholog and operon predictions from the MicrobesOnline. An interactive web interface of RegPredict integrates and presents diverse genomic and functional information about the candidate regulon members from several web resources. RegPredict is freely accessible at http://regpredict.lbl.gov

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Get PDF
    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites. google.com/site/gaussianbhc

    Parallel mutual information estimation for inferring gene regulatory networks on GPUs

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity.</p> <p>Results</p> <p>We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time.</p> <p>Conclusions</p> <p>CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.</p

    Equilibrium reconstruction for Single Helical Axis reversed field pinch plasmas

    Full text link
    Single Helical Axis (SHAx) configurations are emerging as the natural state for high current reversed field pinch (RFP) plasmas. These states feature the presence of transport barriers in the core plasma. Here we present a method for computing the equilibrium magnetic surfaces for these states in the force-free approximation, which has been implemented in the SHEq code. The method is based on the superposition of a zeroth order axisymmetric equilibrium and of a first order helical perturbation computed according to Newcomb's equation supplemented with edge magnetic field measurements. The mapping of the measured electron temperature profiles, soft X-ray emission and interferometric density measurements on the computed magnetic surfaces demonstrates the quality of the equilibrium reconstruction. The procedure for computing flux surface averages is illustrated, and applied to the evaluation of the thermal conductivity profile. The consistency of the evaluated equilibria with Ohm's law is also discussed.Comment: Submitted to Plasma Physics and Controlled Fusio

    Eigengene networks for studying the relationships between co-expression modules

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>There is evidence that genes and their protein products are organized into functional modules according to cellular processes and pathways. Gene co-expression networks have been used to describe the relationships between gene transcripts. Ample literature exists on how to detect biologically meaningful modules in networks but there is a need for methods that allow one to study the relationships between modules.</p> <p>Results</p> <p>We show that network methods can also be used to describe the relationships between co-expression modules and present the following methodology. First, we describe several methods for detecting modules that are shared by two or more networks (referred to as consensus modules). We represent the gene expression profiles of each module by an eigengene. Second, we propose a method for constructing an eigengene network, where the edges are undirected but maintain information on the sign of the co-expression information. Third, we propose methods for differential eigengene network analysis that allow one to assess the preservation of network properties across different data sets. We illustrate the value of eigengene networks in studying the relationships between consensus modules in human and chimpanzee brains; the relationships between consensus modules in brain, muscle, liver, and adipose mouse tissues; and the relationships between male-female mouse consensus modules and clinical traits. In some applications, we find that module eigengenes can be organized into higher level clusters which we refer to as meta-modules.</p> <p>Conclusion</p> <p>Eigengene networks can be effective and biologically meaningful tools for studying the relationships between modules of a gene co-expression network. The proposed methods may reveal a higher order organization of the transcriptome. R software tutorials, the data, and supplementary material can be found at the following webpage: <url>http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/EigengeneNetwork</url>.</p

    A ChIP-Seq Benchmark Shows That Sequence Conservation Mainly Improves Detection of Strong Transcription Factor Binding Sites

    Get PDF
    Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial.Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods.Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites
    • …
    corecore