
    clusterExperiment and RSEC: A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets

    Clustering of genes and/or samples is a common task in gene expression analysis. The goals in clustering can vary, but an important scenario is that of finding biologically meaningful subtypes within the samples. This application is particularly appropriate when there are large numbers of samples, as in many human disease studies. With the increasing popularity of single-cell transcriptome sequencing (RNA-Seq), many more controlled experiments on model organisms are similarly creating large gene expression datasets with the goal of detecting previously unknown heterogeneity within cells. When detecting novel subtypes, it is common to run many clustering algorithms and to rely on subsampling and ensemble methods to improve robustness. We introduce a Bioconductor R package, clusterExperiment, that implements a general and flexible strategy we call Resampling-based Sequential Ensemble Clustering (RSEC). RSEC enables the user to easily create multiple, competing clusterings of the data based on different techniques and associated tuning parameters, including easy integration of resampling and sequential clustering, and then provides methods for consolidating the multiple clusterings into a final consensus clustering. The package is modular and allows the user to separately apply the individual components of the RSEC procedure, i.e., apply multiple clustering algorithms, create a consensus clustering or choose tuning parameters, and merge clusters. Additionally, clusterExperiment provides a variety of visualization tools for the clustering process, as well as methods for the identification of possible cluster signatures or biomarkers. The R package clusterExperiment is publicly available through the Bioconductor Project, with a detailed manual (vignette) and well-documented help pages for each function.
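The idea of consolidating several competing clusterings into one consensus can be illustrated with a small, generic sketch. It is written in Python rather than R and is not the clusterExperiment implementation: the function names `co_association` and `consensus_clusters`, the `threshold` parameter, and the single-linkage grouping rule are all assumptions chosen for illustration. Samples are grouped together when they share a cluster in at least a given proportion of the input clusterings.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of clusterings in which each pair of samples shares a cluster."""
    n = len(clusterings[0])
    m = len(clusterings)
    co = [[0.0] * n for _ in range(n)]
    for i in range(n):
        co[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        share = sum(1 for labels in clusterings if labels[i] == labels[j]) / m
        co[i][j] = co[j][i] = share
    return co


def consensus_clusters(clusterings, threshold=0.7):
    """Hypothetical consensus rule: link samples whose co-association
    reaches `threshold`, then take connected components as clusters."""
    co = co_association(clusterings)
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] >= threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0..k-1 in order of first appearance.
    mapping, labels = {}, []
    for i in range(n):
        root = find(i)
        mapping.setdefault(root, len(mapping))
        labels.append(mapping[root])
    return labels


# Three clusterings of five samples; label names differ between runs,
# only co-membership matters.
runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
print(consensus_clusters(runs, threshold=0.7))  # [0, 0, 1, 1, 2]
print(consensus_clusters(runs, threshold=0.6))  # [0, 0, 1, 1, 1]
```

Lowering the threshold merges the fifth sample into the second cluster, because it co-clusters with samples three and four in two of the three runs. This shows how a single tuning parameter can change the consensus.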

    ONJAG, network overlays supporting distributed graph processing

    The term "Big Data" refers to the exponential growth in the production of structured and unstructured data. Because of the size of this data, deep analyses are usually required to extract its intrinsic value. Several computational models and techniques have been studied and employed to process this data in a distributed manner, since a single machine cannot carry out the computation on its own. Today, a significant part of such data is modelled as a graph. Recent graph processing frameworks orchestrate the execution as a network simulation in which vertices and edges correspond to nodes and links, respectively. In this context, the thesis exploits the Peer-to-Peer (P2P) approach. The overlay concept is introduced, and ONJAG ("Overlays Not Just A Graph"), a distributed framework, is developed. ONJAG runs over Spark, a distributed Bulk Synchronous Parallel-like data processing framework. Moreover, a well-known problem in graph theory is studied: the balanced minimum k-way partitioning problem, also called minimum k-way cut. Finally, a novel algorithm for the balanced minimum k-way cut is proposed. The proposal exploits the P2P approach and the overlays to improve a pre-existing solution.
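To make the balanced minimum k-way cut objective concrete, here is a minimal Python sketch of how a candidate partition could be scored. It is not the algorithm proposed in the thesis and does not use ONJAG or Spark. The helper names `edge_cut` and `is_balanced`, the `slack` parameter, and the balance rule (no part larger than ceil(n / k) plus slack) are assumptions made for illustration.

```python
from collections import Counter
from math import ceil


def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def is_balanced(part, k, slack=0):
    """Assumed balance rule: at most k parts, each holding no more
    than ceil(n / k) + slack vertices."""
    sizes = Counter(part.values())
    limit = ceil(len(part) / k) + slack
    return len(sizes) <= k and all(size <= limit for size in sizes.values())


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # one triangle per part
poor = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}  # alternating assignment

print(edge_cut(edges, good), is_balanced(good, k=2))  # 1 True
print(edge_cut(edges, poor), is_balanced(poor, k=2))  # 5 True
```

Both partitions are balanced, but the first cuts only the bridge edge while the second cuts five edges. A partitioning algorithm searches for assignments like the first one.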