
    Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

    Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step for detecting quality issues or generating new hypotheses. Principal Component Analysis (PCA) is a widely used linear method for defining the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods. Conclusions: Locally Linear Embedding and Isomap showed superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and are thus favorable alternatives for the visualization of microarray data.
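    A minimal, hypothetical sketch of the kind of benchmark described above: synthetic expression-like data are reduced to two dimensions with PCA, Isomap and Locally Linear Embedding (three of the seven methods compared; the dataset, parameters and scoring setup below are illustrative assumptions, not the study's pipeline), and each embedding is scored by cross-validated SVM accuracy.

```python
# Compare 2-D embeddings of synthetic "expression-like" data by how well a
# linear SVM separates the classes in each embedding (sketch only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# 100 samples x 2000 "genes" with only a few informative features,
# mimicking a setting with few differentially expressed genes.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=10,
                           n_redundant=0, random_state=0)

embeddings = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2),
}

for name, reducer in embeddings.items():
    Z = reducer.fit_transform(X)                      # 2-D embedding of all samples
    acc = cross_val_score(SVC(kernel="linear"), Z, y, cv=5).mean()
    print(f"{name}: mean 5-fold SVM accuracy = {acc:.3f}")
```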

    Biasogram: visualization of confounding technical bias in gene expression data.

    Gene expression profiles of clinical cohorts can be used to identify genes that are correlated with a clinical variable of interest such as patient outcome or response to a particular drug. However, expression measurements are susceptible to technical bias caused by variation in extraneous factors such as RNA quality and array hybridization conditions. If such technical bias is correlated with the clinical variable of interest, the likelihood of identifying false positive genes is increased. Here we describe a method to visualize an expression matrix as a projection of all genes onto a plane defined by a clinical variable and a technical nuisance variable. The resulting plot indicates the extent to which each gene is correlated with the clinical variable or the technical variable. We demonstrate this method by applying it to three clinical trial microarray data sets, one of which identified genes that may have been driven by a confounding technical variable. This approach can be used as a quality control step to identify data sets that are likely to yield false positive results.
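    The projection the Biasogram visualizes can be sketched roughly as follows: correlate every gene with the clinical variable and with a technical nuisance variable, then plot the two correlation coefficients against each other. The simulated data, variable names and plotting details below are assumptions for illustration, not the published implementation.

```python
# Biasogram-style scatter: each point is a gene, positioned by its correlation
# with a clinical variable (x-axis) and a technical variable (y-axis).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500
clinical = rng.integers(0, 2, n_samples).astype(float)    # e.g. responder / non-responder
technical = rng.normal(size=n_samples)                     # e.g. an RNA quality score
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :20] += 1.5 * clinical[:, None]                    # genes truly tied to outcome
expr[:, 20:60] += 1.0 * technical[:, None]                 # genes driven by technical bias

def corr_with(v):
    """Pearson correlation of every gene (column of expr) with vector v."""
    vc = v - v.mean()
    ec = expr - expr.mean(axis=0)
    return (ec * vc[:, None]).sum(0) / (np.linalg.norm(ec, axis=0) * np.linalg.norm(vc))

plt.scatter(corr_with(clinical), corr_with(technical), s=8)
plt.xlabel("correlation with clinical variable")
plt.ylabel("correlation with technical variable")
plt.title("Biasogram-style gene projection (simulated data)")
plt.show()
```

    Genes that lie far along the technical axis but near zero on the clinical axis are the candidates for confounding flagged by this kind of plot.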

    A PERCEPTRON-BASED FEATURE SELECTION APPROACH FOR DECISION TREE CLASSIFICATION

    The use of OBIA for high spatial resolution image classification can be divided into two main steps: segmentation, followed by labeling of the objects according to a particular set of features and a classifier. Decision trees are often used to represent human knowledge in the latter step. The difficulty lies in selecting a smaller number of features, from a feature space of spatial, spectral and textural variables, to describe the classes of interest, which raises the question of choosing the best or most convenient feature selection (FS) method. In this work, an approach to FS within a decision tree was introduced using a single perceptron and the Backpropagation algorithm. Three alternatives were compared: single, double and multiple inputs, using a sequential backward search (SBS). Test regions were used to evaluate the efficiency of the proposed methods. Results showed that it is possible to use a single perceptron in each node, with an overall accuracy (OA) between 77.6% and 77.9%. Only SBS reached an OA above 88%. Thus, the quality of the proposed solution depends on the number of input features.
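    A rough sketch of the sequential backward search (SBS) idea, with a single perceptron used to score candidate feature subsets. It is not the authors' implementation: scikit-learn's Perceptron (perceptron learning rule rather than Backpropagation) stands in for the trained perceptron, and the data, target subset size and scoring are illustrative assumptions.

```python
# Sequential backward search: repeatedly drop the feature whose removal
# hurts the perceptron's cross-validated accuracy the least.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

def sbs_perceptron(X, y, target_size=4):
    """Greedy backward elimination scored by a single perceptron."""
    kept = list(range(X.shape[1]))
    while len(kept) > target_size:
        scores = []
        for f in kept:
            subset = [c for c in kept if c != f]
            acc = cross_val_score(Perceptron(max_iter=1000),
                                  X[:, subset], y, cv=5).mean()
            scores.append((acc, f))
        _, drop = max(scores)        # removing this feature keeps accuracy highest
        kept.remove(drop)
    return kept

print("selected features:", sbs_perceptron(X, y))
```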

    Multi-Cell ECM compaction is predictable via superposition of nonlinear cell dynamics linearized in augmented state space

    Cells interacting through an extracellular matrix (ECM) exhibit emergent behaviors resulting from collective intercellular interaction. In wound healing and tissue development, characteristic compaction of ECM gel is induced by multiple cells that generate tensions in the ECM fibers and coordinate their actions with other cells. Computational prediction of collective cell-ECM interaction based on first principles is highly complex, especially as the number of cells increases. Here, we introduce a computationally efficient method for predicting nonlinear behaviors of multiple cells interacting mechanically through a 3-D ECM fiber network. The key enabling technique is superposition of single-cell computational models to predict multicellular behaviors. While cell-ECM interactions are highly nonlinear, they can be linearized accurately with a unique method, termed Dual-Faceted Linearization. This method recasts the original nonlinear dynamics in an augmented space where the system behaves more linearly. The independent state variables are augmented with auxiliary variables that capture the nonlinear elements of the system. The computational method involves (a) expressing the original nonlinear state equations with two sets of linear dynamic equations, (b) reducing the order of the augmented linear system via principal component analysis, and (c) superposing individual single-cell-ECM dynamics to predict the collective behaviors of multiple cells. The method is more computationally efficient than the original nonlinear dynamic simulation and more accurate than traditional Taylor-expansion linearization. Furthermore, we reproduce reported experimental results of multi-cell-induced ECM compaction.
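    A loose, toy interpretation of the augmented-state idea: for the scalar system dx/dt = -x^3, appending the auxiliary variable z = x^3 makes the state equation linear (dx/dt = -z), and the auxiliary dynamics are then approximated by a linear least-squares fit in the augmented space. This is only meant to convey the flavor of Dual-Faceted Linearization; the actual method, its two sets of linear equations, and the cell-ECM model are far richer than this sketch, and all parameters here are assumptions.

```python
# Toy augmented-state linearization of dx/dt = -x**3 (illustrative only).
import numpy as np

dt, steps = 0.01, 500
rng = np.random.default_rng(0)

# Generate training trajectories of the true nonlinear system.
X, dZ = [], []
for x0 in rng.uniform(-2, 2, 20):
    x = x0
    for _ in range(steps):
        z = x**3
        dxdt = -z                      # exactly linear in the augmented variable z
        dzdt = 3 * x**2 * dxdt         # chain rule, used only as a regression target
        X.append([x, z])
        dZ.append(dzdt)
        x += dt * dxdt
X, dZ = np.array(X), np.array(dZ)

# Least-squares linear model dz/dt ~ a*x + b*z in the augmented space.
a, b = np.linalg.lstsq(X, dZ, rcond=None)[0]

# Simulate the augmented *linear* system and compare with the true system.
x_lin, z_lin, x_true = 1.5, 1.5**3, 1.5
for _ in range(steps):
    x_lin, z_lin = x_lin + dt * (-z_lin), z_lin + dt * (a * x_lin + b * z_lin)
    x_true = x_true + dt * (-x_true**3)
print(f"true x(T) = {x_true:.4f},  augmented-linear x(T) = {x_lin:.4f}")
```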

    A primer on correlation-based dimension reduction methods for multi-omics analysis

    Continuing advances in omic technologies mean that it is now feasible to measure the numerous features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly describe network methods that apply when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, these approaches are all linked to relevant R packages that can implement the procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunities of population multi-omics.
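    As one concrete instance of a correlation-based reduction for a pair of omics blocks, canonical correlation analysis (CCA) finds linear combinations of each block that are maximally correlated. The review points to R packages; the sketch below uses scikit-learn's CCA as a stand-in on simulated data, so all sizes and names are assumptions rather than examples from the review.

```python
# CCA on two simulated omics blocks measured on the same samples.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 80
latent = rng.normal(size=(n, 2))                    # shared biological signal
omics1 = latent @ rng.normal(size=(2, 30)) + 0.5 * rng.normal(size=(n, 30))  # e.g. transcripts
omics2 = latent @ rng.normal(size=(2, 20)) + 0.5 * rng.normal(size=(n, 20))  # e.g. methylation

cca = CCA(n_components=2)
U, V = cca.fit_transform(omics1, omics2)            # paired low-dimensional scores
for k in range(2):
    r = np.corrcoef(U[:, k], V[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```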

    Adaptive grid based localized learning for multidimensional data

    Rapid advances in data-rich domains of science, technology, and business have amplified the computational challenges of Big Data synthesis needed to slow the widening gap between the rate at which data are collected and the rate at which they are analyzed for knowledge. This has led to a renewed need for efficient and accurate algorithms, frameworks, and algorithmic mechanisms essential for knowledge discovery, especially in the domains of clustering, classification, dimensionality reduction, feature ranking, and feature selection. However, data mining algorithms are frequently challenged by the sparseness that accompanies the high dimensionality of datasets in such domains, which is particularly detrimental to the performance of unsupervised learning algorithms. The motivation for the research presented in this dissertation is to develop novel data mining algorithms that address the challenges of high dimensionality, sparseness, and large data volumes by using a unique grid-based localized learning paradigm for data-movement clustering and classification. Grid-based learning is valued in data mining because such algorithms are inherently efficient: they reduce the search space by partitioning the feature space into effective partitions. However, these approaches have not previously been devised for supervised learning or sparseness reduction, as they require careful estimation of grid sizes, partitions, and data-movement error calculations. Grid-based localized learning algorithms can scale well with increases in dimensionality and dataset size. To design learning algorithms that handle data sparseness, high dimensionality, and large data size concurrently, while avoiding feature selection biases, a set of novel data mining algorithms based on grid-based localized learning principles is developed and presented. The first algorithm is a computational framework for feature ranking that employs adaptive grid-based data shrinking; it addresses the limitations of existing feature ranking methods by using a scoring function that discovers and exploits dependencies among all the features in the data, with data shrinking principles established and metricized to capture those dependencies. The second core algorithmic contribution is a novel supervised learning algorithm that uses grid-based localized learning to build a nonparametric classification model: the feature space is divided using uniform or non-uniform partitions, the data space is subdivided using a grid structure, and the resulting grid is used to build a classification model via grid-based nearest-neighbor learning. The third algorithm is an unsupervised clustering algorithm augmented with data shrinking to enhance clustering performance; it addresses the limitations of existing grid-based data shrinking and clustering algorithms by using adaptive grid-based learning. Multiple experiments on a diverse set of datasets evaluate the effectiveness of the proposed methods for dimensionality reduction, feature selection, unsupervised and supervised learning, and their scalability compared to established methods in the literature.
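    A highly simplified sketch of grid-based localized learning for classification (an assumed illustration, not the dissertation's algorithm): the feature space is partitioned into uniform grid cells, training labels are stored per cell, and a query is classified by a vote within its own cell, falling back to the global majority class when its cell is empty.

```python
# Minimal grid-based classifier: uniform bins per dimension, per-cell majority vote.
import numpy as np
from collections import Counter, defaultdict

class GridClassifier:
    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, X, y):
        self.lo, self.hi = X.min(axis=0), X.max(axis=0)
        self.cells = defaultdict(list)
        for xi, yi in zip(X, y):
            self.cells[self._cell(xi)].append(yi)
        self.default = Counter(y).most_common(1)[0][0]   # global fallback label
        return self

    def _cell(self, x):
        # Map a point to its grid-cell index along every dimension.
        idx = np.floor((x - self.lo) / (self.hi - self.lo + 1e-12) * self.n_bins)
        return tuple(np.clip(idx, 0, self.n_bins - 1).astype(int))

    def predict(self, X):
        return np.array([
            Counter(self.cells[self._cell(x)]).most_common(1)[0][0]
            if self.cells[self._cell(x)] else self.default
            for x in X
        ])

# Tiny usage example on 2-D synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = GridClassifier(n_bins=8).fit(X[:150], y[:150])
print("held-out accuracy:", (clf.predict(X[150:]) == y[150:]).mean())
```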

    Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

    Over the decades, statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in important tasks in biomedical research. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer many otherwise intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks that tackle practical problems in multi-omics data integration analysis. Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust, simple rank-based algorithm to identify rank-altered gene pairs in case/control classes. TSP usually suffers greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is established in a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies. One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group-structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization, in order to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters. We show on two real examples and simulated data that the proposed methods improve existing integrative clustering in accuracy and biological interpretation, and are able to generate coherent tight clusters. Principal component analysis (PCA) is commonly used for projection to a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks for PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that Meta-PCA identifies the true principal component space well and remains robust to noise features and outlier samples. We also propose sparse Meta-PCA, which penalizes principal components in order to selectively accommodate significant principal component projections. In several simulated and real data applications, we found Meta-PCA effective in detecting significant transcriptomic features and in recognizing visual patterns in multi-omics data sets. In the future, the success of data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data types, and will facilitate disease subtype discovery and characterization that improve hypothesis generation toward precision medicine and potentially advance public health research.
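    The "sum of variances" route to a meta principal component space can be sketched as follows (an assumed illustration, not the authors' Meta-PCA code): centre each study, sum the study-level covariance matrices, and take the leading eigenvectors of the pooled matrix as the common PC space. All study sizes and noise levels below are assumptions.

```python
# Pool covariance across simulated studies and extract a shared PC space.
import numpy as np

rng = np.random.default_rng(0)
p, n_components = 50, 2

# Three simulated studies sharing the same low-dimensional structure.
basis = np.linalg.qr(rng.normal(size=(p, n_components)))[0]
studies = [rng.normal(size=(n, n_components)) @ basis.T + 0.3 * rng.normal(size=(n, p))
           for n in (40, 60, 80)]

# Sum of study covariance matrices, then eigendecomposition.
pooled = sum(np.cov(S - S.mean(axis=0), rowvar=False) for S in studies)
eigvals, eigvecs = np.linalg.eigh(pooled)
meta_pcs = eigvecs[:, ::-1][:, :n_components]     # top eigenvectors = meta-PC space

# How well the meta-PC space recovers the true shared subspace (cosines near 1).
cos = np.linalg.svd(basis.T @ meta_pcs, compute_uv=False)
print("principal angles' cosines:", np.round(cos, 3))
```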