Evaluation of clustering algorithms for gene expression data
BACKGROUND: Cluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes is grouped according to expression profiles using one of the numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from the long list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other involving time course cDNA microarray data on yeast. Six well known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
MedZIM: Mediation analysis for Zero-Inflated Mediators with applications to microbiome data
The human microbiome can contribute to the pathogenesis of many complex
diseases such as cancer and Alzheimer's disease by mediating disease-leading
causal pathways. However, standard mediation analysis is not adequate in the
context of microbiome data due to the excessive number of zero values in the
data. Zero-valued sequencing reads, commonly observed in microbiome studies,
arise for technical and/or biological reasons. Mediation analysis approaches
for analyzing zero-inflated mediators are still lacking largely because of
challenges raised by the zero-inflated data structure: (a) disentangling the
mediation effect induced by the point mass at zero; and (b) identifying the
observed zero-valued data points that are actually not zero (i.e., false
zeros). We develop a novel mediation analysis method under the
potential-outcomes framework to fill this gap. We show that the mediation
effect of the microbiome can be decomposed into two components that are
inherent to the two-part nature of zero-inflated distributions. The first
component corresponds to the mediation effect attributable to a unit-change
over the positive relative abundance and the second component corresponds to
the mediation effect attributable to discrete binary change of the mediator
from zero to a non-zero state. With probabilistic models to account for
observing zeros, we also address the challenge with false zeros. A
comprehensive simulation study and the applications in two real microbiome
studies demonstrate that our approach outperforms existing mediation analysis
approaches.
Comment: Corresponding: Zhigang L
Colors of Luminous Bulges in Cluster MS1054-03 and Field Galaxies at Redshifts z ~ 0.83
Using HST images, we separate the bulge-like (pbulge) and disk-like (pdisk)
components of 71 galaxies in the rich cluster MS1054-03 and of 21 in the field.
Our key finding is that luminous pbulges are very red with restframe U-B ~
0.45, while predicted colors are bluer by 0.20 mag. Moreover, these very red
colors appear to be independent of environment, pbulge luminosity, pdisk color,
and pbulge fraction. These results challenge any models of hierarchical galaxy
formation that predict the colors of distant (z ~ 0.8) luminous field and
cluster bulges would differ. Our findings also disagree with other claims that
30% to 50% of bright bulges and ellipticals at z ~ 1 are very blue (U-B < 0).
Comment: 5 pages and 1 figure. Accepted for publication in ApJ Letter
Evidence for the association of the DAOA (G72) gene with schizophrenia and bipolar disorder but not for the association of the DAO gene with schizophrenia
Background: Previous linkage and association studies have implicated the D-amino acid oxidase activator gene (DAOA)/G30 locus or neighbouring region of chromosome 13q33.2 in the genetic susceptibility to both schizophrenia and bipolar disorder. Four single nucleotide polymorphisms (SNPs) within the D-amino acid oxidase (DAO) gene located at 12q24.11 have also been found to show allelic association with schizophrenia. Methods: We used the case control method to test for genetic association with variants at these loci in a sample of 431 patients with schizophrenia, 303 patients with bipolar disorder and 442 ancestrally matched supernormal controls, all selected from the UK population. Results: Ten SNPs spanning the DAOA locus were genotyped in these samples. In addition, three SNPs were genotyped at the DAO locus in the schizophrenia sample. Allelic association was detected between the marker rs3918342 (M23), 3' to the DAOA gene, and both schizophrenia (chi(2) = 5.824, p = 0.016) and bipolar disorder (chi(2) = 4.293, p = 0.038). A trend towards association with schizophrenia was observed for two other DAOA markers, rs3916967 (M14; chi(2) = 3.675, p = 0.055) and rs1421292 (M24; chi(2) = 3.499, p = 0.062). A test of association between schizophrenia and a three marker haplotype comprising the SNPs rs778293 (M22), rs3918342 (M23) and rs1421292 (M24) gave a global empirical significance of p = 0.015. No evidence was found to confirm the association of genetic markers at the DAO gene with schizophrenia. Conclusion: Our results provide some support for a role for DAOA in susceptibility to schizophrenia and bipolar disorder.
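As a reading aid for the allelic association statistics in the abstract above, the following minimal sketch converts a chi-squared statistic into a p-value. It assumes each allelic test has 1 degree of freedom (a 2x2 table of alleles by case/control status), which the abstract does not state; the function name and the dictionary labels are hypothetical. Under that assumption the computed values should agree with the reported p-values for the three statistics listed to within rounding.

```python
import math


def chi2_p_value_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    A chi-squared variable with 1 df is the square of a standard normal Z,
    so P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics copied from the abstract; 1 df is an assumption, not a stated fact.
reported = {
    "rs3918342 (M23), schizophrenia": 5.824,
    "rs3918342 (M23), bipolar disorder": 4.293,
    "rs3916967 (M14), schizophrenia": 3.675,
}

for label, stat in reported.items():
    print(f"{label}: chi2 = {stat}, p = {chi2_p_value_df1(stat):.3f}")
```

The haplotype result (p = 0.015) is described as an empirical significance, so it would come from permutation rather than from this closed-form tail probability.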
Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes
BACKGROUND: A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as a measure of dissimilarity), often times such groupings could be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species. RESULTS: In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). 
For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency. CONCLUSION: Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set.
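The idea behind a biological homogeneity index can be illustrated with a small sketch. The abstract does not give the formula, so the version below is an assumption: for each cluster it takes the fraction of annotated gene pairs that share a functional class, then averages over clusters. It also assumes one functional class per gene, whereas the abstract notes that genes may have more than one function. All names and the toy data are hypothetical.

```python
from itertools import combinations


def biological_homogeneity_index(clusters, functional_class):
    """Average, over clusters, of the fraction of gene pairs sharing a class.

    clusters: dict mapping a cluster label to a list of gene identifiers.
    functional_class: dict mapping an annotated gene to one functional class.
    Unannotated genes are ignored; clusters with fewer than two annotated
    genes contribute nothing to the average.
    """
    scores = []
    for genes in clusters.values():
        annotated = [g for g in genes if g in functional_class]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            continue
        same = sum(functional_class[a] == functional_class[b] for a, b in pairs)
        scores.append(same / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (invented for illustration only).
toy_clusters = {"c1": ["g1", "g2", "g3"], "c2": ["g4", "g5"]}
toy_classes = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "C"}

print(biological_homogeneity_index(toy_clusters, toy_classes))
```

Running the same function on the outputs of several clustering algorithms for one data set would give a simple way to rank them by functional homogeneity, in the spirit of the comparison the abstract describes.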
Failure to confirm allelic and haplotypic association between markers at the chromosome 6p22.3 dystrobrevin-binding protein 1 (DTNBP1) locus and schizophrenia
Background: Previous linkage and association studies may have implicated the Dystrobrevin-binding protein 1 (DTNBP1) gene locus or a gene in linkage disequilibrium with DTNBP1 on chromosome 6p22.3 in genetic susceptibility to schizophrenia. Methods: We used the case control design to test for allelic and haplotypic association with schizophrenia in a sample of four hundred and fifty research subjects with schizophrenia and four hundred and fifty ancestrally matched supernormal controls. We genotyped the SNP markers previously found to be significantly associated with schizophrenia in the original study and also other markers found to be positive in subsequent studies. Results: We could find no evidence of allelic, genotypic or haplotypic association with schizophrenia in our UK sample. Conclusion: The results suggest that the DTNBP1 gene contribution to schizophrenia must be rare or absent in our sample. The discrepant allelic association results in previous studies of association between DTNBP1 and schizophrenia could be due to population admixture. However, even positive studies of European populations do not show any consistent DTNBP1 alleles or haplotypes associated with schizophrenia. Further research is needed to resolve these issues. The possible confounding of linkage with association in family samples already showing linkage at 6p22.3 might be revealed by testing genes closely linked to DTNBP1 for allelic association and by restricting family based tests of association to only one case per family.
Identification and Characterization of Nucleolin as a COUP-TFII Coactivator of Retinoic Acid Receptor β Transcription in Breast Cancer Cells
The orphan nuclear receptor COUP-TFII plays an undefined role in breast cancer. Previously we reported lower COUP-TFII expression in tamoxifen/endocrine-resistant versus sensitive breast cancer cell lines. The identification of COUP-TFII-interacting proteins will help to elucidate its mechanism of action as a transcriptional regulator in breast cancer. FLAG-affinity purification and multidimensional protein identification technology (MudPIT) identified nucleolin among the proteins interacting with COUP-TFII in MCF-7 tamoxifen-sensitive breast cancer cells. Interaction of COUP-TFII and nucleolin was confirmed by coimmunoprecipitation of endogenous proteins in MCF-7 and T47D breast cancer cells. In vitro studies revealed that COUP-TFII interacts with the C-terminal arginine-glycine repeat (RGG) domain of nucleolin. Functional interaction between COUP-TFII and nucleolin was indicated by studies showing that siRNA knockdown of nucleolin and an oligonucleotide aptamer that targets nucleolin, AS1411, inhibited endogenous COUP-TFII-stimulated RARB2 expression in MCF-7 and T47D cells. Chromatin immunoprecipitation revealed COUP-TFII occupancy of the RARB2 promoter was increased by all-trans retinoic acid (atRA). RARβ2 regulated gene RRIG1 was increased by atRA and COUP-TFII transfection and inhibited by siCOUP-TFII. Immunohistochemical staining of breast tumor microarrays showed nuclear COUP-TFII and nucleolin staining was correlated in invasive ductal carcinomas. COUP-TFII staining correlated with ERα, SRC-1, AIB1, Pea3, MMP2, and phospho-Src and was reduced with increased tumor grade. Our data indicate that nucleolin plays a coregulatory role in transcriptional regulation of the tumor suppressor RARB2 by COUP-TFII.
Minimal information for studies of extracellular vesicles 2018 (MISEV2018): a position statement of the International Society for Extracellular Vesicles and update of the MISEV2014 guidelines
The last decade has seen a sharp increase in the number of scientific publications describing physiological and pathological functions of extracellular vesicles (EVs), a collective term covering various subtypes of cell-released, membranous structures, called exosomes, microvesicles, microparticles, ectosomes, oncosomes, apoptotic bodies, and many other names. However, specific issues arise when working with these entities, whose size and amount often make them difficult to obtain as relatively pure preparations, and to characterize properly. The International Society for Extracellular Vesicles (ISEV) proposed Minimal Information for Studies of Extracellular Vesicles (“MISEV”) guidelines for the field in 2014. We now update these “MISEV2014” guidelines based on evolution of the collective knowledge in the last four years. An important point to consider is that ascribing a specific function to EVs in general, or to subtypes of EVs, requires reporting of specific information beyond mere description of function in a crude, potentially contaminated, and heterogeneous preparation. For example, claims that exosomes are endowed with exquisite and specific activities remain difficult to support experimentally, given our still limited knowledge of their specific molecular machineries of biogenesis and release, as compared with other biophysically similar EVs. The MISEV2018 guidelines include tables and outlines of suggested protocols and steps to follow to document specific EV-associated functional activities. Finally, a checklist is provided with summaries of key points
Joint Analysis Of Psychiatric Disorders Increases Accuracy Of Risk Prediction For Schizophrenia, Bipolar Disorder, And Major Depressive Disorder
Genetic risk prediction has several potential applications in medical research and clinical practice and could be used, for example, to stratify a heterogeneous population of patients by their predicted genetic risk. However, for polygenic traits, such as psychiatric disorders, the accuracy of risk prediction is low. Here we use a multivariate linear mixed model and apply multi-trait genomic best linear unbiased prediction for genetic risk prediction. This method exploits correlations between disorders and simultaneously evaluates individual risk for each disorder. We show that the multivariate approach significantly increases the prediction accuracy for schizophrenia, bipolar disorder, and major depressive disorder in the discovery as well as in independent validation datasets. By grouping SNPs based on genome annotation and fitting multiple random effects, we show that the prediction accuracy could be further improved. The gain in prediction accuracy of the multivariate approach is equivalent to an increase in sample size of 34% for schizophrenia, 68% for bipolar disorder, and 76% for major depressive disorders using single trait models. Because our approach can be readily applied to any number of GWAS datasets of correlated traits, it is a flexible and powerful tool to maximize prediction accuracy. With current sample size, risk predictors are not useful in a clinical setting but already are a valuable research tool, for example in experimental designs comparing cases with high and low polygenic risk
Predicting survival times for neuroblastoma patients using RNA-seq expression profiles
Background: Neuroblastoma is the most common tumor of early childhood and is notorious for its high variability in clinical presentation. Accurate prognosis has remained a challenge for many patients. In this study, expression profiles from RNA-sequencing are used to predict survival times directly. Several models are investigated using various annotation levels of expression profiles (genes, transcripts, and introns), and an ensemble predictor is proposed as a heuristic for combining these different profiles. Results: The use of RNA-seq data is shown to improve accuracy in comparison to using clinical data alone for predicting overall survival times. Furthermore, clinically high-risk patients can be subclassified based on their predicted overall survival times. In this effort, the best performing model was the elastic net using both transcripts and introns together. This model separated patients into two groups with 2-year overall survival rates of 0.40±0.11 (n=22) versus 0.80±0.05 (n=68). The ensemble approach gave similar results, with groups 0.42±0.10 (n=25) versus 0.82±0.05 (n=65). This suggests that the ensemble is able to effectively combine the individual RNA-seq datasets. Conclusions: Using predicted survival times based on RNA-seq data can provide improved prognosis by subclassifying clinically high-risk neuroblastoma patients. Reviewers: This article was reviewed by Subharup Guha and Isabel Nepomuceno.