    Bayesian correlated clustering to integrate multiple datasets

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.

    Expression cartography of human tissues using self organizing maps

    Background: The availability of parallel, high-throughput microarray and sequencing experiments poses the challenge of how best to arrange and analyze the resulting mass of multidimensional data in a concerted way. Self organizing maps (SOM), a machine learning method, enable a parallel sample- and gene-centered view of the data, combined with strong visualization and second-level analysis capabilities. The paper addresses aspects of the method with practical impact in the context of expression analysis of complex data sets.
Results: The method was applied to generate a SOM characterizing the whole genome expression profiles of 67 healthy human tissues selected from ten tissue categories (adipose, endocrine, homeostasis, digestion, exocrine, epithelium, sexual reproduction, muscle, immune system and nervous tissues). SOM mapping reduces the dimension of expression data from tens of thousands of genes to a few thousand metagenes, where each metagene acts as a representative of a minicluster of co-regulated single genes. Tissue-specific and common properties shared between groups of tissues emerge as a handful of localized spots in the tissue maps, collecting groups of co-regulated and co-expressed metagenes. The functional context of the spots was discovered using overrepresentation analysis with respect to pre-defined gene sets of known functional impact. We found that tissue-related spots typically contain enriched populations of gene sets that correspond well to molecular processes in the respective tissues. Analysis techniques normally used at the gene level, such as two-way hierarchical clustering, provide a better signal-to-noise ratio and better representativeness when applied to the metagenes. Metagene-based clustering analyses aggregate the tissues into essentially three clusters containing nervous, immune system and the remaining tissues.
Conclusions: The global view of the behavior of a few well-defined modules of correlated and differentially expressed genes is more intuitive and more informative than the separate discovery of the expression levels of hundreds or thousands of individual genes. The metagene approach is less sensitive to a priori selection of genes. It can detect a coordinated expression pattern whose components would not pass single-gene significance thresholds, and it is able to extract context-dependent patterns of gene expression in complex data sets.
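
The reduction from genes to metagenes described above can be illustrated with a minimal self-organizing map sketch. This is not the authors' implementation: the grid size, learning-rate schedule, Gaussian neighbourhood and the names `train_som` and `assign_metagene` are all assumptions chosen for illustration. Each grid node holds one prototype profile (a "metagene"), and every gene profile is assigned to its closest prototype.

```python
import math
import random


def train_som(profiles, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small SOM; returns (grid coordinates, prototype profiles).

    All hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # One prototype ("metagene") per node, initialised from random input profiles.
    weights = [list(rng.choice(profiles)) for _ in nodes]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        for p in rng.sample(profiles, len(profiles)):
            # Best-matching unit: the node whose prototype is closest to p.
            bmu = min(range(len(nodes)), key=lambda k: math.dist(weights[k], p))
            for k, node in enumerate(nodes):
                # Gaussian neighbourhood on the grid around the BMU.
                h = math.exp(-math.dist(node, nodes[bmu]) ** 2 / (2 * sigma ** 2))
                weights[k] = [w + lr * h * (x - w) for w, x in zip(weights[k], p)]
    return nodes, weights


def assign_metagene(profiles, weights):
    """Index of the closest prototype for every input profile."""
    return [min(range(len(weights)), key=lambda k: math.dist(weights[k], p))
            for p in profiles]


# Hypothetical toy data: four "genes" measured in three "tissues".
genes = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [5.0, 5.1, 4.9], [5.1, 4.9, 5.0]]
nodes, metagenes = train_som(genes, rows=2, cols=2)
print(assign_metagene(genes, metagenes))
```

Downstream analyses such as clustering would then operate on the prototype profiles rather than on the individual gene profiles.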

    A comparative study of the AHP and TOPSIS methods for implementing load shedding scheme in a pulp mill system

    Advances in technology have encouraged the design of increasingly useful equipment and devices for a wide range of applications. A pulp mill is a heavy industrial plant that consumes a large amount of electricity in production, so any equipment malfunction can cause substantial losses to the company. In particular, the breakdown of one generator can overload the remaining generators. When this happens, loads are shed in sequence until the remaining generators can supply the loads still connected; once the fault has been fixed, the load shedding scheme can be deactivated. Load shedding is therefore the best way to handle such a condition: selected loads are shed in order to protect the generators from damage. Multi-Criteria Decision Making (MCDM) can be applied to determine the load shedding scheme in an electric power system. In this thesis, two MCDM methods, the Analytic Hierarchy Process (AHP) and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), are introduced and applied. A series of analyses is conducted, and the results show that, of the two methods, TOPSIS is the better choice for the load shedding scheme in the pulp mill system because it achieves the higher percentage effectiveness of load shedding. The results of applying AHP and TOPSIS to the pulp mill system are very promising.
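
As a rough illustration of how TOPSIS ranks alternatives, the sketch below follows the textbook procedure: vector-normalise each criterion, apply the weights, find the ideal and anti-ideal points, and score each alternative by its relative closeness to the ideal. The loads, criteria, weights and the function name `topsis` are hypothetical and are not taken from the thesis.

```python
import math


def topsis(matrix, weights, benefit):
    """Closeness score in [0, 1] for each alternative (row of `matrix`).

    benefit[j] is True when larger values of criterion j are preferred.
    Assumes no criterion column is all zeros and the alternatives are not
    all identical (either case would divide by zero).
    """
    n_crit = len(weights)
    # Vector-normalise each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal (best) and anti-ideal (worst) value for each criterion.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)    # Euclidean distance to the ideal
        d_worst = math.dist(row, worst)  # Euclidean distance to the anti-ideal
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical example: three candidate loads scored on two made-up criteria,
# "power relieved" (higher is better) and "production impact" (lower is better).
loads = ["Load A", "Load B", "Load C"]
matrix = [[10.0, 1.0], [5.0, 5.0], [1.0, 10.0]]
scores = topsis(matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(zip(loads, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

In a load shedding setting, the alternative with the highest closeness score would be the first candidate to shed under these assumed criteria.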

    Generalized gene co-expression analysis via subspace clustering using low-rank representation

    BACKGROUND: Gene Co-expression Network Analysis (GCNA) helps identify gene modules with potential biological functions and has become a popular method in bioinformatics and biomedical research. However, most current GCNA algorithms use correlation to build gene co-expression networks and identify modules with highly correlated genes. There is a need to look beyond correlation and identify gene modules using other similarity measures for finding novel biologically meaningful modules. RESULTS: We propose a new generalized gene co-expression analysis algorithm via subspace clustering that can identify biologically meaningful gene co-expression modules with genes that are not all highly correlated. We use low-rank representation to construct gene co-expression networks and local maximal quasi-clique merger to identify gene co-expression modules. We applied our method to three large microarray datasets and a single-cell RNA sequencing dataset. We demonstrate that our method can identify gene modules with different biological functions than current GCNA methods and find gene modules with prognostic values. CONCLUSIONS: The presented method takes advantage of subspace clustering to generate gene co-expression networks rather than using correlation as the similarity measure between genes. Our generalized GCNA method can provide new insights from gene expression datasets and serve as a complement to current GCNA algorithms.

    Does Gravitational Clustering Stabilize On Small Scales?

    The stable clustering hypothesis is a key analytical anchor on the nonlinear dynamics of gravitational clustering in cosmology. It states that on sufficiently small scales the mean pair velocity approaches zero, or equivalently, that the mean number of neighbours of a particle remains constant in time at a given physical separation. In this paper we use N-body simulations of scale free spectra P(k) \propto k^n with -2 \leq n \leq 0 and of the CDM spectrum to test for stable clustering using the time evolution and shape of the correlation function \xi(x,t), and the mean pair velocity on small scales. For all spectra the results are consistent with the stable clustering predictions on the smallest scales probed, x < 0.07 x_{nl}(t), where x_{nl}(t) is the correlation length. The measured stable clustering regime corresponds to a typical range of 200 \lsim \xi \lsim 2000, though spectra with more small scale power approach the stable clustering asymptote at larger values of \xi. We test the amplitude of \xi predicted by the analytical model of Sheth \& Jain (1996), and find agreement to within 20\% in the stable clustering regime for nearly all spectra. For the CDM spectrum the nonlinear \xi is accurately approximated by this model with n \simeq -2 on physical scales \lsim 100-300 h^{-1} kpc for \sigma_8 = 0.5-1, and on smaller scales at earlier times. The growth of \xi for CDM-like models is discussed in the context of a power law parameterization often used to describe galaxy clustering at high redshifts. The growth parameter \epsilon is computed as a function of time and length scale, and found to be larger than 1 in the moderately nonlinear regime -- thus the growth of \xi is much faster on scales of interest than is commonly assumed. Comment: 13 pages, 8 figures included; submitted to MNRA
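
For orientation, the stable clustering prediction tested above can be written out explicitly. The relations below are the standard textbook forms and are not quoted from the abstract. The symbols v_{12} (mean peculiar pair velocity), H (Hubble rate), a (scale factor), r = a x (physical separation), \gamma and r_0 are assumed notation, and the last line is assumed to be the power law parameterization the abstract refers to.

```latex
% Stable clustering: the mean peculiar pair velocity cancels the Hubble flow,
% so the number of neighbours at fixed physical separation r stays constant
% while the mean density dilutes as a^{-3}.
\[
  -v_{12}(r) = H r
  \quad\Longrightarrow\quad
  1 + \xi(r, t) \propto a^{3} \quad \text{at fixed physical } r .
\]

% Combined with self-similar evolution of a scale-free spectrum
% P(k) \propto k^{n}, in comoving separation x and for \xi \gg 1:
\[
  \xi(x, t) \propto a^{3-\gamma}\, x^{-\gamma},
  \qquad
  \gamma = \frac{3(n+3)}{n+5} .
\]

% Power-law parameterization in physical separation r and redshift z;
% \epsilon = 0 corresponds to stable clustering.
\[
  \xi(r, z) = \left( \frac{r}{r_0} \right)^{-\gamma} (1+z)^{-(3+\epsilon)} .
\]
```

Under this convention, a measured \epsilon larger than 1 means \xi grows faster at fixed physical separation than the stable clustering value \epsilon = 0 would give.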

    Hoodsquare: Modeling and Recommending Neighborhoods in Location-based Social Networks

    Information garnered from activity on location-based social networks can be harnessed to characterize urban spaces and organize them into neighborhoods. In this work, we adopt a data-driven approach to the identification and modeling of urban neighborhoods using location-based social networks. We represent geographic points in the city using spatio-temporal information about Foursquare user check-ins and semantic information about places, with the goal of developing features to input into a novel neighborhood detection algorithm. The algorithm first employs a similarity metric that assesses the homogeneity of a geographic area, and then with a simple mechanism of geographic navigation, it detects the boundaries of a city's neighborhoods. The models and algorithms devised are subsequently integrated into a publicly available, map-based tool named Hoodsquare that allows users to explore activities and neighborhoods in cities around the world. Finally, we evaluate Hoodsquare in the context of a recommendation application where user profiles are matched to urban neighborhoods. By comparing with a number of baselines, we demonstrate how Hoodsquare can be used to accurately predict the home neighborhood of Twitter users. We also show that we are able to suggest neighborhoods geographically constrained in size, a desirable property in mobile recommendation scenarios for which geographical precision is key. Comment: ASE/IEEE SocialCom 201