The Iterative Signature Algorithm for the analysis of large scale gene expression data
We present a new approach for the analysis of genome-wide expression data.
Our method is designed to overcome the limitations of traditional techniques,
when applied to large-scale data. Rather than allotting each gene to a single
cluster, we assign both genes and conditions to context-dependent and
potentially overlapping transcription modules. We provide a rigorous definition
of a transcription module as the object to be retrieved from the expression
data. An efficient algorithm, that searches for the modules encoded in the data
by iteratively refining sets of genes and conditions until they match this
definition, is established. Each iteration involves a linear map, induced by
the normalized expression matrix, followed by the application of a threshold
function. We argue that our method is in fact a generalization of Singular
Value Decomposition, which corresponds to the special case where no threshold
is applied. We show analytically that for noisy expression data our approach
leads to better classification due to the implementation of the threshold. This
result is confirmed by numerical analyses based on in-silico expression data.
We discuss briefly results obtained by applying our algorithm to expression
data from the yeast S. cerevisiae. Comment: LaTeX, 36 pages, 8 figures
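The iteration described in this abstract (a linear map induced by the normalized expression matrix, followed by a threshold function) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the z-score normalization, the form of the threshold (keep scores more than `t` standard deviations above the mean, with `t >= 0`), and the convergence test on the selected gene set are all assumptions made for illustration.

```python
import numpy as np

def zscore(a, axis):
    """Centre and scale along one axis (assumed normalisation)."""
    mean = a.mean(axis=axis, keepdims=True)
    std = a.std(axis=axis, keepdims=True)
    return (a - mean) / np.where(std == 0, 1.0, std)

def threshold(scores, t):
    """Keep z-scored entries more than t standard deviations above the mean."""
    sd = scores.std()
    if sd == 0:
        return np.zeros_like(scores)
    z = (scores - scores.mean()) / sd
    return np.where(z > t, z, 0.0)

def isa_sketch(expr, seed_genes, t_gene=1.0, t_cond=1.0, max_iter=50):
    """Refine a seed gene set into a (genes, conditions) module.

    expr is a float array of shape (n_genes, n_conditions).
    Thresholds are assumed to be non-negative.
    """
    e_g = zscore(expr, axis=0)  # each condition normalised across genes
    e_c = zscore(expr, axis=1)  # each gene normalised across conditions
    genes = np.zeros(expr.shape[0])
    genes[seed_genes] = 1.0
    conds = np.zeros(expr.shape[1])
    for _ in range(max_iter):
        # Linear map genes -> condition scores, then threshold.
        conds = threshold(e_g.T @ genes, t_cond)
        # Linear map conditions -> gene scores, then threshold.
        new_genes = threshold(e_c @ conds, t_gene)
        converged = np.array_equal(new_genes > 0, genes > 0)
        genes = new_genes
        if converged or not genes.any():
            break
    return np.flatnonzero(genes), np.flatnonzero(conds)

# Hypothetical in-silico data: three planted 10-gene x 5-condition modules.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 0.1, size=(30, 15))
for k in range(3):
    expr[10 * k:10 * (k + 1), 5 * k:5 * (k + 1)] += 3.0

genes, conds = isa_sketch(expr, seed_genes=[0, 1, 2])
print(genes, conds)
```

Starting from three seed genes of the first planted block, the sketch should settle on that block's ten genes and five conditions. If the threshold step is dropped, the two linear maps alone amount to a power iteration on the matrix, which is one way to read the abstract's remark that Singular Value Decomposition is the no-threshold special case.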
Temporal patterns of gene expression via nonmetric multidimensional scaling analysis
Motivation: Microarray experiments result in large scale data sets that
require extensive mining and refining to extract useful information. We have
been developing an efficient novel algorithm for nonmetric multidimensional
scaling (nMDS) analysis for very large data sets as a maximally unsupervised
data mining device. We wish to demonstrate its usefulness in the context of
bioinformatics. We also aim to demonstrate that
intrinsically nonlinear methods are generally advantageous in data mining.
Results: The Pearson correlation distance measure is used to indicate the
dissimilarity of the gene activities in transcriptional response of cell
cycle-synchronized human fibroblasts to serum [Iyer et al., Science vol. 283,
p83 (1999)]. These dissimilarity data have been analyzed with our nMDS
algorithm to produce an almost circular arrangement of the genes. The temporal
expression patterns of the genes rotate along this circular arrangement. If an
appropriate preprocessing procedure is applied to the original data set,
linear methods such as principal component analysis (PCA) can achieve
reasonable results, but without data preprocessing linear methods such as PCA
cannot produce a useful picture. Furthermore, even with appropriate data
preprocessing, the outcomes of linear procedures are not as clear-cut as those
of nMDS without preprocessing. Comment: 11 pages, 6 figures + online only 2 color figures, submitted to
Bioinformatics
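As a small illustration of the dissimilarity measure named in this abstract, the sketch below computes a Pearson correlation distance between expression profiles. The abstract does not give the exact formula, so the common form d = 1 - r is assumed here, and the function name and the sinusoidal test profiles are hypothetical. The nMDS embedding itself is not reproduced.

```python
import numpy as np

def pearson_distance(profiles):
    """Pairwise distance 1 - r between rows (one row per gene profile)."""
    r = np.corrcoef(profiles)
    d = 1.0 - r
    np.fill_diagonal(d, 0.0)       # remove rounding noise on the diagonal
    return np.clip(d, 0.0, 2.0)    # keep values in the valid range [0, 2]

# Hypothetical periodic profiles sampled at 12 time points over one cycle.
t = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
profiles = np.vstack([
    np.sin(t),                # reference phase
    np.sin(t - np.pi / 2),    # quarter-cycle shift
    np.sin(t - np.pi),        # half-cycle shift
    3.0 + 5.0 * np.sin(t),    # same shape as the reference, rescaled
])

dist = pearson_distance(profiles)
print(np.round(dist, 3))
```

For these profiles the distance grows with the phase difference and is unaffected by offset and scale, which is consistent with genes ordered by expression timing falling onto a roughly circular arrangement.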
Identification of an Efficient Gene Expression Panel for Glioblastoma Classification.
We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu
Global rank-invariant set normalization (GRSN) to reduce systematic distortions in microarray data
Background: Microarray technology has become very popular for globally evaluating gene expression in biological samples. However, non-linear variation associated with the technology can make data interpretation unreliable. Therefore, methods to correct this kind of technical variation are critical. Here we consider a method to reduce this type of variation applied after three common procedures for processing microarray data: MAS 5.0, RMA, and dChip®.
Results: We commonly observe intensity-dependent technical variation between samples in a single microarray experiment. This is most common when MAS 5.0 is used to process probe level data, but we also see this type of technical variation with RMA and dChip® processed data. Datasets with unbalanced numbers of up and down regulated genes seem to be particularly susceptible to this type of intensity-dependent technical variation. Unbalanced gene regulation is common when studying cancer samples or genetically manipulated animal models and preservation of this biologically relevant information, while removing technical variation has not been well addressed in the literature. We propose a method based on using rank-invariant, endogenous transcripts as reference points for normalization (GRSN). While the use of rank-invariant transcripts has been described previously, we have added to this concept by the creation of a global rank-invariant set of transcripts used to generate a robust average reference that is used to normalize all samples within a dataset. The global rank-invariant set is selected in an iterative manner so as to preserve unbalanced gene expression. Moreover, our method works well as an overlay that can be applied to data already processed with other probe set summary methods. We demonstrate that this additional normalization step at the "probe set level" effectively corrects a specific type of technical variation that often distorts samples in datasets.
Conclusion: We have developed a simple post-processing tool to help detect and correct non-linear technical variation in microarray data and demonstrate how it can reduce technical variation and improve the results of downstream statistical gene selection and pathway identification methods.
Coupled Two-Way Clustering Analysis of Gene Microarray Data
We present a novel coupled two-way clustering approach to gene microarray
data analysis. The main idea is to identify subsets of the genes and samples,
such that when one of these is used to cluster the other, stable and
significant partitions emerge. The search for such subsets is a computationally
complex task: we present an algorithm, based on iterative clustering, which
performs such a search. This analysis is especially suitable for gene
microarray data, where the contributions of a variety of biological mechanisms
to the gene expression levels are entangled in a large body of experimental
data. The method was applied to two gene microarray data sets, on colon cancer
and leukemia. By identifying relevant subsets of the data and focusing on them
we were able to discover partitions and correlations that were masked and
hidden when the full dataset was used in the analysis. Some of these partitions
have clear biological interpretation; others can serve to identify possible
directions for future research.
Whole-transcriptome analysis of [psi+] budding yeast via cDNA microarrays
Introduction: Prions of yeast present a novel analytical challenge in terms of both initial characterization and in vitro manipulation as models for human disease research. Presently, few robust analysis strategies have been successfully implemented which enable the efficient study of prion behavior in vivo. This study sought to evaluate the utilization of conventional dual-channel cDNA microarrays for the surveillance of transcriptomic regulation patterns by the [PSI+] yeast prion relative to an identical prion-deficient yeast variant, [psi-]. Methods: A data analysis and normalization workflow strategy was developed and applied to cDNA array images, yielding quality-regulated expression ratios for a subset of genes exhibiting statistical congruence across multiple experimental repetitions and nested hybridization events. The significant gene list was analyzed using classical analytical approaches including several clustering-based methods and singular value decomposition. To add biological meaning to the differential expression data in hand, functional annotation using the Gene Ontology as well as several pathway-mapping approaches was conducted. Finally, the expression patterns observed were queried against all publicly curated microarray data performed using S. cerevisiae in order to discover similar expression behavior across a vast array of experimental conditions. Results: These data collectively implicate a low level of overall genomic regulation as a result of the [PSI+] state, where the maximum statistically significant degree of differential expression was less than ±1 Log2(FC) in all cases. Notwithstanding, the [PSI+] differential expression was localized to several specific classes of structural elements and cellular functions, implying that under homeostatic conditions significant up- or down-regulation is likely unnecessary but possible in those specific systems if environmental conditions warranted.
As a result of these findings, additional work pertaining to this system should include controlled insult to both yeast variants of differing environmental properties to promote a potential [PSI+] regulatory response, coupled with co-surveillance of these conditions using transcriptomic and proteomic analysis methodologies.
Identification of a Proliferation Gene Cluster Associated with HPV E6/E7 Expression Level and Viral DNA Load in Invasive Cervical Carcinoma
Specific HPV DNA sequences are associated with more than 90% of invasive
carcinomas of the uterine cervix. Viral E6 and E7 oncogenes are key mediators
in cell transformation by disrupting TP53 and RB pathways. To investigate
molecular mechanisms involved in the progression of invasive cervical
carcinoma, we performed a gene expression study on cases selected according to
viral and clinical parameters. Using Coupled Two-Way Clustering and Sorting
Points Into Neighbourhoods methods, we identified a Cervical Cancer
Proliferation Cluster composed of 163 highly correlated transcripts, many of
which corresponded to E2F pathway genes controlling cell proliferation, whereas
no primary TP53 targets were present in this cluster. The average expression
level of the genes of this cluster was higher in tumours with an early relapse
than in tumours with a favourable course (P=0.026). Moreover, we found that
E6/E7 mRNA expression level was positively correlated with the expression level
of the cluster genes and with viral DNA load. These findings suggest that HPV
E6/E7 expression level plays a key role in the progression of invasive
carcinoma of the uterine cervix via the deregulation of cellular genes
controlling tumour cell proliferation. HPV expression level may thus correspond
to a biological marker useful for prognosis assessment and specific therapy of
the disease.
Internal standard-based analysis of microarray data. Part 1: analysis of differential gene expressions
Genome-scale microarray experiments for comparative analysis of gene expressions produce massive amounts of information. Traditional statistical approaches fail to achieve the required accuracy in sensitivity and specificity of the analysis. Since the problem can be resolved neither by increasing the number of replicates nor by manipulating thresholds, one needs a novel approach to the analysis. This article describes methods to improve the power of microarray analyses by defining internal standards to characterize features of the biological system being studied and the technological processes underlying the microarray experiments. Applying these methods, internal standards are identified and then the obtained parameters are used to define (i) genes that are distinct in their expression from background; (ii) genes that are differentially expressed; and finally (iii) genes that have similar dynamical behavior.
Distribution-free factor analysis - Estimation theory and applicability to high-dimensional data
We here provide a distribution-free approach to the random factor analysis
model. We show that it leads to the same estimating equations as for the
classical ML estimates under normality, but more easily derived, and valid also
in the case of more variables than observations (). For this case we also
advocate a simple iteration method. In an illustration with and
it was seen to lead to convergence after just a few iterations. We show that
there is no reason to expect Heywood cases to appear, and that the factor
scores will typically be precisely estimated/predicted as soon as is large.
We state as a general conjecture that the nice behaviour is not despite ,
but because . Comment: 12 pages, 2 figures