Search CORE

31 research outputs found

Evaluation of clustering algorithms for gene expression data

Author: A Ruepp
I Gat-Viks
J Quackenbush
JA Hartigan
JD Banfield
JT Taylor
L Kaufman
MC Abba
PJ Rousseeuw
R Shamir
S Chu
S Datta
S Datta
S Datta
S Dudoit
Somnath Datta
Susmita Datta
T Kohonen
WN Venables
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Cluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes is grouped together according to their expression profiles using one of numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from a rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other involving time course cDNA microarray data on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.

Crossref

Springer - Publisher Connector

PubMed Central

RETINOBASE: a web database, data mining and analysis platform for gene expression data on retina

Author: Berthommier Guillaume
Gagniere Nicolas
Kalathur Ravi Kiran Reddy
Léveillard Thierry
Poch Olivier
Poidevin Laetitia
Raffelsberger Wolfgang
Ripp Raymond
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The retina is a multi-layered sensory tissue that lines the back of the eye and acts at the interface of input light and visual perception. Its main function is to capture photons and convert them into electrical impulses that travel along the optic nerve to the brain where they are turned into images. It consists of neurons, nourishing blood vessels and different cell types, of which neural cells predominate. Defects in any of these cells can lead to a variety of retinal diseases, including age-related macular degeneration, retinitis pigmentosa, Leber congenital amaurosis and glaucoma. Recent progress in genomics and microarray technology provides extensive opportunities to examine alterations in retinal gene expression profiles during development and diseases. However, there is no specific database that deals with retinal gene expression profiling. In this context we have built RETINOBASE, a dedicated microarray database for retina. Description: RETINOBASE is a microarray relational database, analysis and visualization system that allows simple yet powerful queries to retrieve information about gene expression in retina. It provides access to gene expression meta-data and offers significant insights into gene networks in retina, resulting in better hypothesis framing for biological problems that can subsequently be tested in the laboratory. Public and proprietary data are automatically analyzed with 3 distinct methods, RMA, dChip and MAS5, then clustered using 2 different K-means and 1 mixture models method. Thus, RETINOBASE provides a framework to compare these methods and to optimize the retinal data analysis. RETINOBASE has three different modules, "Gene Information", "Raw Data System Analysis" and "Fold change system Analysis" that are interconnected in a relational schema, allowing efficient retrieval and cross comparison of data.
Currently, RETINOBASE contains datasets from 28 different microarray experiments performed in 5 different model systems: drosophila, zebrafish, rat, mouse and human. The database is supported by a platform that is designed to easily integrate new functionalities and is also frequently updated. Conclusion: The results obtained from various biological scenarios can be visualized, compared and downloaded. The results of a case study are presented that highlight the utility of RETINOBASE. Overall, RETINOBASE provides efficient access to the global expression profiling of retinal genes from different organisms under various conditions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Principal component tests: applied to temporal gene expression data

Author: Fang Hong-Bin
Song Jiuzhou
Zhang Wensheng
Publication venue: BioMed Central
Publication date: 01/01/2009

Clustering analysis is a common statistical tool for knowledge discovery. It is mainly conducted when a project is still in the exploratory phase without any a priori hypotheses. However, statistical significance testing between the clusters can be meaningful in helping researchers to assess whether the classification results from implementing a clustering algorithm need to be improved, even after the cluster number has been determined by a well-established criterion. This is important when we want to identify highly specific patterns through classification. We proposed to use a principal component (PC) test, which is an implementation of an exact F statistic for the measures at multiple endpoints based on elliptical distribution theory, to assess the statistical significance between clusters. A challenge in the implementation is the choice of the number (q) of principal components to be considered, which can severely influence the statistical power of the method. We optimized the determination via validation according to a permutation test based on the clustering to be evaluated. The method was applied to a public dataset in classifying genes according to their temporal gene expression profiles. The results demonstrated that PC testing was useful for determining the optimal number of clusters. https://doi.org/10.1186/1471-2105-10-S1-S2

Crossref

Springer - Publisher Connector

PubMed Central

Digital Repository at the University of Maryland

New resampling method for evaluating stability of clusters

Author: A Bhattacharjee
A Thalamuthu
B Efron
F Tschentscher
GC Tseng
H Pruscha
H Schneider
Irina M Gana Dresen
J Handl
J Quackenbush
JC Gower
JH Ward
Johannes Huesing
K Zhang
Karl-Heinz Joeckel
L Hubert
LM McShane
M Bittner
M Smolkin
Markus Neuhaeuser
MB Eisen
MK Kerr
PHA Sneath
RR Sokal
S Datta
S Datta
S Datta
S Dudoit
S Monti
T Margush
T Sørensen
Tanja Boes
WM Rand
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Hierarchical clustering is a widely applied tool in the analysis of microarray gene expression data. The assessment of cluster stability is a major challenge in clustering procedures. Statistical methods are required to distinguish between real and random clusters. Several methods for assessing cluster stability have been published, including resampling methods such as the bootstrap. We propose a new resampling method based on continuous weights to assess the stability of clusters in hierarchical clustering. While in bootstrapping approximately one third of the original items is lost, continuous weights avoid zero elements and instead allow non-integer diagonal elements, which leads to retention of the full dimensionality of space, i.e. each variable of the original data set is represented in the resampling sample. Results: Comparison of continuous weights and bootstrapping using real datasets and simulation studies reveals the advantage of continuous weights especially when the dataset has only few observations, few differentially expressed genes and the fold change of differentially expressed genes is low. Conclusion: We recommend the use of continuous weights in small as well as in large datasets, because according to our results they produce at least the same results as conventional bootstrapping and in some cases they surpass it.
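
The "approximately one third" figure for conventional bootstrapping can be checked with a short calculation: when n items are drawn with replacement from n, a given item is absent with probability (1 - 1/n)^n, which tends to 1/e (about 0.37). The sketch below is illustrative only and is not taken from the paper; the function names are hypothetical, and it does not implement the continuous-weights scheme itself, whose details the abstract does not give.

```python
import math
import random


def expected_omitted_fraction(n: int) -> float:
    """Probability that a given item is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_omitted_fraction(n: int, trials: int = 200, seed: int = 0) -> float:
    """Average fraction of the n original items missing from a bootstrap resample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        # Draw n indices with replacement; the set keeps the distinct ones.
        drawn = {rng.randrange(n) for _ in range(n)}
        total += (n - len(drawn)) / n
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, expected_omitted_fraction(n), simulated_omitted_fraction(n))
    print("limit 1/e:", math.exp(-1))
```

Both the closed form and the simulation approach 1/e as n grows, which is the loss of items that the abstract contrasts with continuous weights.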

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CellBIC: bimodality-based top-down clustering of single-cell RNA sequencing data reveals hierarchical structure of the cell type

Author: Kim Junil
Stanescu Diana E
Won Kyoung Jae
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

Copenhagen University Research Information System

Defining an informativeness metric for clustering gene expression data

Author: Akaike
Christine A. Wells
Datta
Datta
Dunn
Eisen
Gibbons
Handl
Jessica C. Mar
John Quackenbush
Kanehisa
McLachlan
Michaels
Milligan
Müller
Rousseeuw
Schwarz
Tibshirani
Yeung
Publication venue: Oxford University Press
Publication date: 01/01/2011

Motivation: Unsupervised ‘cluster’ analysis is an invaluable tool for exploratory microarray data analysis, as it organizes the data into groups of genes or samples in which the elements share common patterns. Once the data are clustered, finding the optimal number of informative subgroups within a dataset is a problem that, while important for understanding the underlying phenotypes, is one for which there is no robust, widely accepted solution.

Crossref

Harvard University - DASH

PubMed Central

Enlighten

University of Melbourne Institutional Repository

University of Queensland eSpace

Single cell transcriptional analysis reveals novel innate immune cell types

Author: Abdel-Rahman
Bajikar
Bengtsson
Brennecke
Chambers
Chang
Chen
Cohen
Dalerba
Dietrich
Enver
Feinerman
Ferreira
Flatz
Gascoigne
Geva-Zatorsky
Hoshida
Huang
Huang
Huang da
Huang da
Janes
Kalisky
Kalmar
Kobayashi
Kurimoto
Liss
Ma
MacQueen
McDavid
Morris
Mucida
Nakao
Neildez-Nguyen
Niepel
Orth
Rahman
Rajan
Sachs
Shalek
Sharma
Shi
Shi
Singh
Spencer
Spencer
Stockholm
Suzuki
Wang
Wang
Ward
White
Zhang
Publication venue: PeerJ Inc.
Publication date: 01/06/2014

Single-cell analysis has the potential to provide us with a host of new knowledge about biological systems, but it comes with the challenge of correctly interpreting the biological information. While emerging techniques have made it possible to measure inter-cellular variability at the transcriptome level, no consensus yet exists on the most appropriate method of data analysis of such single cell data. Methods for analysis of transcriptional data at the population level are well established but are not well suited to single cell analysis due to their dependence on population averages. In order to address this question, we have systematically tested combinations of methods for primary data analysis on single cell transcription data generated from two types of primary immune cells, neutrophils and T lymphocytes. Cells were obtained from healthy individuals, and single cell transcript expression data was obtained by a combination of single cell sorting and nanoscale quantitative real time PCR (qRT-PCR) for markers of cell type, intracellular signaling, and immune functionality. Gene expression analysis was focused on hierarchical clustering to determine the existence of cellular subgroups within the populations. Nine combinations of criteria for data exclusion and normalization were tested and evaluated. Bimodality in gene expression indicated the presence of cellular subgroups which were also revealed by data clustering. We observed evidence for two clearly defined cellular subtypes in the neutrophil populations and at least two in the T lymphocyte populations. When normalizing the data by different methods, we observed varying outcomes with corresponding interpretations of the biological characteristics of the cell populations. Normalization of the data by linear standardization taking into account technical effects such as plate effects, resulted in interpretations that most closely matched biological expectations. 
Single cell transcription profiling provides evidence of cellular subclasses in neutrophils and leukocytes that may be independent of traditional classifications based on cell surface markers. The choice of primary data analysis method had a substantial effect on the interpretation of the data. Adjustment for technical effects is critical to prevent misinterpretation of single cell transcript data.

Crossref

Directory of Open Access Journals

PubMed Central

Construction and analysis of tissue microarrays

Author: Khmelnitskaya N. M.
Khramtsov A. L.
Khramtsova G. F.
Хмельницкая H. М.
Храмцов А. И.
Храмцова Г. Ф.
Publication venue: Уральский Центр Медицинской и Фармацевтической Информации
Publication date: 01/01/2011

Tissue microarray technology allows investigators to detect expression of proteins on multiple tissue samples simultaneously, which helps to obtain more precise results as well as reduce the cost of the study. This high-throughput technique can become an effective way to solve many scientific and diagnostic problems. Knowledge of preparation and of the various data analysis techniques may facilitate introduction of this method into the routine practice of pathologists. This review presents practical aspects of construction of tissue microarray blocks, validation of the data, and statistical analysis using this method.

Ural State Medical University Repository

Measuring gene similarity by means of the classification distance

Author: A Ben-Dor
A Statnikov
A Thalamuthu
Alessandro Fiori
BS Everitt
CC Chang
D Huang
D Jiang
D Jiang
Elena Baralis
FR Hampel
G Petrovics
Giulia Bruno
H Liu
J Gu
JJ Chen
JL Gregg
L Davies
L Fu
L Kaufman
L Wang
M Bouguessa
M Daszykowski
M Royuela
O Gevaert
P Rosini
P Yang
PR Bushel
RC Thompson
S Datta
S Mukkamala
SB Aicha
T Bo
T Chu
TF Cox
TR Golub
U Alon
WM Rand
X He
Y Torosyan
YH Yang
Publication venue: Springer London
Publication date: 01/01/2011

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino