
    clusterExperiment and RSEC: A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets

    Clustering of genes and/or samples is a common task in gene expression analysis. The goals in clustering can vary, but an important scenario is that of finding biologically meaningful subtypes within the samples. This is an application that is particularly appropriate when there are large numbers of samples, as in many human disease studies. With the increasing popularity of single-cell transcriptome sequencing (RNA-Seq), many more controlled experiments on model organisms are similarly creating large gene expression datasets with the goal of detecting previously unknown heterogeneity within cells. It is common in the detection of novel subtypes to run many clustering algorithms, as well as rely on subsampling and ensemble methods to improve robustness. We introduce a Bioconductor R package, clusterExperiment, that implements a general and flexible strategy we call Resampling-based Sequential Ensemble Clustering (RSEC). RSEC enables the user to easily create multiple, competing clusterings of the data based on different techniques and associated tuning parameters, including easy integration of resampling and sequential clustering, and then provides methods for consolidating the multiple clusterings into a final consensus clustering. The package is modular and allows the user to separately apply the individual components of the RSEC procedure, i.e., apply multiple clustering algorithms, create a consensus clustering or choose tuning parameters, and merge clusters. Additionally, clusterExperiment provides a variety of visualization tools for the clustering process, as well as methods for the identification of possible cluster signatures or biomarkers. The R package clusterExperiment is publicly available through the Bioconductor Project, with a detailed manual (vignette) as well as well-documented help pages for each function.
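The consensus ("co-clustering") idea at the heart of this workflow can be sketched outside R. The following Python sketch, with made-up toy data and parameter choices, builds many competing K-means clusterings and consolidates them via a co-occurrence matrix; it illustrates the general strategy, not the package's exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy "expression" matrix: 60 samples x 10 features with 3 latent subtypes
X, _ = make_blobs(n_samples=60, n_features=10, centers=3, random_state=0)

# Step 1: create multiple competing clusterings (varying k and seed)
labelings = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
             for k in (2, 3, 4) for s in range(5)]

# Step 2: co-clustering matrix: C[i, j] is the fraction of clusterings
# that place samples i and j in the same cluster
C = np.mean([lab[:, None] == lab[None, :] for lab in labelings], axis=0)

# Step 3: consensus clustering, treating 1 - C as a distance matrix
dist = squareform(1 - C, checks=False)
consensus = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print(consensus)
```

Averaging over many clusterings makes the final partition less sensitive to any single algorithm's tuning, which is the same robustness argument the package makes.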

    The Mycobacterium tuberculosis transposon sequencing database (MtbTnDB): a large-scale guide to genetic conditional essentiality [preprint]

    Characterization of gene essentiality across different conditions is a useful approach for predicting gene function. Transposon sequencing (TnSeq) is a powerful means of generating genome-wide profiles of essentiality and has been used extensively in Mycobacterium tuberculosis (Mtb) genetic research. Over the past two decades, dozens of TnSeq screens have been published, yielding valuable insights into the biology of Mtb in vitro, inside macrophages, and in model host organisms. However, these Mtb TnSeq profiles are distributed across dozens of research papers within supplementary materials, which makes querying them cumbersome and assembling a complete and consistent synthesis of existing data challenging. Here, we address this problem by building a central repository of publicly available TnSeq screens performed in M. tuberculosis, which we call the Mtb transposon sequencing database (MtbTnDB). The MtbTnDB encompasses 64 published and unpublished TnSeq screens; it is standardized and open-access, and gives users easy access to data, visualizations, and functional predictions through an interactive web-app (www.mtbtndb.app). We also present evidence that (i) genes in the same genomic neighborhood tend to have similar TnSeq profiles, and (ii) clusters of genes with similar TnSeq profiles tend to be enriched for genes belonging to the same functional categories. Finally, we test and evaluate machine learning models trained on TnSeq profiles to guide functional annotation of orphan genes in Mtb. In addition to facilitating the exploration of conditional genetic essentiality in this important human pathogen via a centralized TnSeq data repository, the MtbTnDB will enable hypothesis generation and the extraction of meaningful patterns by facilitating the comparison of datasets across conditions. This will provide a basis for insights into the functional organization of Mtb genes as well as gene function prediction.
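Observation (ii), that genes with similar TnSeq profiles fall into functionally coherent clusters, can be illustrated with a toy example. Everything below (the matrix size, the two "modules", the Jaccard and average-linkage choices) is hypothetical and not taken from the MtbTnDB:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Hypothetical essentiality matrix: 30 genes x 12 screens
# (True = gene scored essential in that screen)
module_a = rng.random((10, 12)) < np.array([0.9] * 6 + [0.1] * 6)  # e.g., in vitro
module_b = rng.random((10, 12)) < np.array([0.1] * 6 + [0.9] * 6)  # e.g., in vivo
background = rng.random((10, 12)) < 0.5
profiles = np.vstack([module_a, module_b, background])

# Jaccard distance between binary profiles; genes with similar
# condition-dependent essentiality land in the same cluster
dist = pdist(profiles, metric="jaccard")
clusters = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print(clusters)
```

Enrichment of such clusters for shared functional categories is then a straightforward contingency-table test on the cluster labels.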

    Machine Intelligence Identifies Soluble TNFα as a Therapeutic Target for Spinal Cord Injury

    Traumatic spinal cord injury (SCI) produces a complex syndrome that is expressed across multiple endpoints ranging from molecular and cellular changes to functional behavioral deficits. Effective therapeutic strategies for CNS injury are therefore likely to manifest multi-factorial effects across a broad range of biological and functional outcome measures. Thus, multivariate analytic approaches are needed to capture the linkage between biological and neurobehavioral outcomes. Injury-induced neuroinflammation (NI) presents a particularly challenging therapeutic target, since NI is involved in both degeneration and repair. Here, we used big-data integration and large-scale analytics to examine a large dataset of preclinical efficacy tests combining five blinded, fully counterbalanced treatment trials of acute anti-inflammatory treatments for cervical spinal cord injury in rats. Multi-dimensional discovery using topological data analysis (TDA) and principal components analysis (PCA) revealed that only one treatment showed consistent multidimensional syndromic benefit: intrathecal application of recombinant soluble TNFα receptor 1 (sTNFR1), which showed inverse-U dose-response efficacy. Using the optimal acute dose, we showed that clinically-relevant 90 min delayed treatment profoundly affected multiple biological indices of NI in the first 48 h after injury, including reduction in pro-inflammatory cytokines and gene expression of a coherent complex of acute inflammatory mediators and receptors. Further, a 90 min delayed bolus dose of sTNFR1 reduced the expression of NI markers in the chronic perilesional spinal cord, and consistently improved neurological function over 6 weeks post SCI. These results provide validation of a novel strategy for precision preclinical drug discovery that is likely to improve translation in the difficult landscape of CNS trauma, and confirm the importance of TNFα signaling as a therapeutic target.
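The PCA half of the multivariate approach can be sketched with scikit-learn. The outcome matrix below is synthetic (a shared latent "severity" factor plus noise) and stands in for real behavioral and molecular endpoints; it illustrates the idea of collapsing many correlated outcomes into a syndromic score, not the study's actual analysis:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic outcome matrix: 40 animals x 6 correlated endpoints
severity = rng.normal(size=(40, 1))                  # shared latent factor
weights = rng.normal(size=(1, 6))                    # endpoint loadings
outcomes = severity @ weights + 0.5 * rng.normal(size=(40, 6))

# Standardize endpoints, then project onto principal components;
# PC1 serves as a single multivariate "syndromic" score per animal
Z = StandardScaler().fit_transform(outcomes)
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)
print(pca.explained_variance_ratio_)
```

Because the endpoints share one latent cause, the first component absorbs most of the variance, which is what makes a single syndromic score meaningful.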

    Quantifying the Economic Costs of Global Warming

    Climate change poses a threat to the well-being of people across the globe. Rising global temperatures will increase the frequency and magnitude of extreme climate events, threatening the lives and livelihoods of vulnerable people. Yet the magnitude and persistence of these economic impacts are poorly understood, making it difficult both to design equitable mitigation and adaptation strategies and to hold emitters accountable for the impacts of their emissions. In this thesis, I combine methods from detection and attribution, climate projection, and causal inference to understand the global economic consequences of past and future climate change. I show that two extreme climate events that have not been previously integrated into climate-economy analyses (heat waves and El Niño events) reduce economic growth globally. But these impacts are highly unequal across the globe: Heat waves have their greatest effects in warm regions, and El Niño events primarily harm highly teleconnected countries. As a result, these effects fall most severely on the people that have contributed least to warming, a sign of the inequities embedded in the causes and consequences of global warming. To quantitatively understand these inequities and support efforts to hold major emitters accountable for the impacts of their emissions, I develop an end-to-end attribution framework that links individual emitters to the economic effects of the warming induced by their emissions. I show that warming from the emissions of high-income countries in the global North has driven billions of dollars of economic losses in low-income, low-emitting countries. I then combine this framework with my previous results on extreme heat, showing that the emissions of major fossil fuel firms have intensified heat waves, and the resulting economic penalties, across the global tropics. These first-of-their-kind results lend scientific support to emerging discussions over climate liability and loss and damage payments. More broadly, these findings together highlight the already-emerging economic threat of global warming, raising the importance of climate mitigation and adaptation in order to avoid accelerating losses to the most vulnerable people around the globe.

    Assessment of Stability in Partitional Clustering Using Resampling Techniques

    The assessment of stability in cluster analysis is closely related to the difficult problem of determining the number of clusters present in the data. The latter is the subject of many investigations and papers that consider different resampling techniques as practical tools. In this paper, we consider non-parametric resampling from the empirical distribution of a given dataset in order to investigate the stability of results of partitional clustering. Specifically, we investigate only the popular K-means method. The estimation of the sampling distribution of the adjusted Rand index (ARI) and the averaged Jaccard index seems to be the most general way to do this. In addition, we compare bootstrapping with different subsampling schemes (i.e., with different cardinality of the drawn samples) with respect to their performance in finding the true number of clusters for both synthetic and real data.
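A minimal sketch of this bootstrap-stability idea, using K-means and the adjusted Rand index from scikit-learn (the data, the candidate k values, and the number of resamples are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

def stability(X, k, n_boot=20):
    """Mean ARI between a reference k-means partition and partitions of
    non-parametric bootstrap resamples, compared on the resampled points."""
    base = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    aris = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        boot = KMeans(n_clusters=k, n_init=10).fit(X[idx])
        aris.append(adjusted_rand_score(base.labels_[idx], boot.labels_))
    return float(np.mean(aris))

scores = {k: stability(X, k) for k in (2, 3, 4, 5)}
print(scores)  # a stable (often the true) k yields ARI near 1
```

Subsampling variants simply replace the with-replacement draw by a smaller without-replacement draw, which is the comparison the paper pursues.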

    Machine Learning-Based Rockfalls Detection with 3D Point Clouds, Example in the Montserrat Massif (Spain)

    Rock slope monitoring using 3D point cloud data allows the creation of rockfall inventories, provided that an efficient methodology is available to quantify the activity. However, monitoring with high temporal and spatial resolution entails the processing of a great volume of data, which can become a problem for the processing system. The standard methodology for monitoring includes the steps of data capture, point cloud alignment, the measure of differences, clustering differences, and identification of rockfalls. In this article, we propose a new methodology adapted from existing algorithms (multiscale model-to-model cloud comparison and the density-based spatial clustering of applications with noise algorithm) and machine learning techniques to facilitate the identification of rockfalls from compared temporal 3D point clouds, possibly the step requiring the most user interpretation. Point clouds are processed to generate 33 new features related to the rock cliff differences, predominant differences, or orientation for classification with 11 machine learning models, combined with 2 undersampling and 13 oversampling methods. The proposed methodology is divided into two software packages: point cloud monitoring and cluster classification. The prediction model, applied to two case studies in the Montserrat conglomeratic massif (Barcelona, Spain), reveals that a reduction of 98% in the initial number of clusters is sufficient to identify the totality of rockfalls in the first case study. The second case study requires a 96% reduction to identify 90% of the rockfalls, suggesting that the homogeneity of the rockfall characteristics is a key factor for the correct prediction of the machine learning models.
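The clustering step (the DBSCAN part of the pipeline) can be illustrated on synthetic 3D "difference" points; the coordinates, eps, and min_samples below are made up and would in practice be tuned to the survey's resolution:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Synthetic change points: 3D locations where two survey epochs differ
# beyond a change-detection threshold (as produced by, e.g., M3C2)
fall1 = rng.normal(loc=(0.0, 0.0, 0.0), scale=0.2, size=(50, 3))  # one rockfall
fall2 = rng.normal(loc=(5.0, 5.0, 2.0), scale=0.2, size=(40, 3))  # another rockfall
noise = rng.uniform(-10, 10, size=(30, 3))                        # scattered outliers
pts = np.vstack([fall1, fall2, noise])

# DBSCAN groups dense difference regions into candidate rockfall clusters
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(pts)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "candidate clusters;", int((labels == -1).sum()), "noise points")
```

The downstream classifier in the paper then decides which such candidate clusters are genuine rockfalls, which is where the 33 engineered features come in.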
