59 research outputs found

    Non-Negative Matrix Factorization for the Analysis of Complex Gene Expression Data: Identification of Clinically Relevant Tumor Subtypes

    Non-negative matrix factorization (NMF) is a relatively new approach to analyzing gene expression data that models the data as additive combinations of non-negative basis vectors (metagenes). The non-negativity constraint makes biological sense, as genes may either be expressed or not but never show negative expression. We applied NMF to five different microarray data sets. We estimated the appropriate number of metagenes by comparing the residual error of the NMF reconstruction of the data to that of the NMF reconstruction of permuted data, thus finding when a given solution contained more information than noise. This analysis also revealed that NMF could not factorize one of the data sets in a meaningful way. We used GO categories and predefined gene sets to evaluate the biological significance of the obtained metagenes. By analyzing metagenes specific for the same GO categories, we could show that individual metagenes activated different aspects of the same biological processes. Several of the obtained metagenes correlated with tumor subtypes and with tumors carrying characteristic chromosomal translocations, indicating that metagenes may correspond to specific disease entities. Hence, NMF extracts biologically relevant structures from microarray expression data and may thus contribute to a deeper understanding of tumor behavior.
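The rank-selection idea in the abstract above (compare the NMF residual on the data with the residual on permuted data) can be illustrated with a minimal sketch. It assumes the standard multiplicative-update NMF with a Frobenius-norm residual and an independent shuffle of each row; the abstract does not say which NMF variant or permutation scheme the authors used, and the names `nmf_error` and `permute_rows` and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf_error(v, rank, iters=200, seed=0, eps=1e-12):
    """Frobenius residual ||V - WH|| after multiplicative-update NMF (assumed variant)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(m) for j in range(n)) ** 0.5

def permute_rows(v, seed=0):
    """Shuffle each row independently to destroy structure (assumed permutation scheme)."""
    rng = random.Random(seed)
    return [rng.sample(row, len(row)) for row in v]

# Toy data (hypothetical): an exactly rank-1 non-negative matrix.
a, b = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 4.0, 8.0, 16.0]
data = [[x * y for y in b] for x in a]

err_data = nmf_error(data, rank=1)
err_perm = nmf_error(permute_rows(data), rank=1)
print(err_data, err_perm)
```

In this toy case one metagene reconstructs the structured matrix almost exactly, while the permuted matrix leaves a clearly larger residual. In practice the comparison would be repeated over a range of ranks and over several permutations.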

    Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier

    BACKGROUND: Genome-wide gene expression data is a rich source for identifying gene signatures suitable for clinical purposes, and a number of statistical algorithms have been described for both identifying and evaluating such signatures. Some of the algorithms employed are fairly complex and hence sensitive to over-fitting, whereas others are simpler and more straightforward. Here we present a new type of simple algorithm, based on ROC analysis and the use of metagenes, that we believe will be a good complement to existing algorithms. RESULTS: The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation, and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used to classify new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies. CONCLUSIONS: We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms, such as discriminant analysis, and as well as more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html.
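The ranking and averaging steps described above can be sketched in a few lines. This is a simplified illustration, not the rocc package: it assumes AUC is computed as the Mann-Whitney probability, ranks genes by AUC in descending order only, and leaves out the threshold selection and LOOCV steps. The function names and the toy expression values are hypothetical.

```python
from statistics import mean

def auc(pos, neg):
    """Probability that a positive-class value exceeds a negative-class value (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_genes(expr, labels):
    """Rank genes by AUC against the class labels, highest first.

    expr maps gene name -> list of expression values (one per sample);
    labels is a list of 0/1 class labels in the same sample order.
    """
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = auc(pos, neg)
    return sorted(scores, key=scores.get, reverse=True)

def metagene(expr, genes):
    """Per-sample mean expression over the chosen genes."""
    n_samples = len(next(iter(expr.values())))
    return [mean(expr[g][i] for g in genes) for i in range(n_samples)]

# Toy data (hypothetical): three genes, four samples, first two samples in the class.
expr = {
    "g1": [5.0, 6.0, 1.0, 2.0],
    "g2": [4.0, 5.0, 2.0, 1.0],
    "g3": [1.0, 3.0, 2.0, 4.0],
}
labels = [1, 1, 0, 0]

ranked = rank_genes(expr, labels)
meta = metagene(expr, ranked[:2])
print(ranked, meta)
```

Averaging the two top-ranked genes gives one value per sample, and a single cutoff on that value then acts as the classifier.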

    Independent component analysis reveals new and biologically significant structures in micro array data

    BACKGROUND: An alternative to standard approaches for uncovering biologically meaningful structures in microarray data is to treat the data as a blind source separation (BSS) problem. BSS attempts to separate a mixture of signals into their different sources and refers to the problem of recovering signals from several observed linear mixtures. In the context of microarray data, "sources" may correspond to specific cellular responses or to co-regulated genes. RESULTS: We applied independent component analysis (ICA) to three different microarray data sets: two tumor data sets and one time series experiment. To obtain reliable components we used iterated ICA to estimate component centrotypes. We found that many of the low-ranking components may indeed show strong biological coherence and hence be of biological significance. Generally, ICA achieved a higher resolution than results based on correlated expression and yielded a larger number of gene clusters significantly enriched for gene ontology (GO) categories. In addition, components characteristic of molecular subtypes and of tumors with specific chromosomal translocations were identified. ICA also identified more than one gene cluster significant for the same GO categories and hence disclosed a higher level of biological heterogeneity, even within coherent groups of genes. CONCLUSION: Although the ICA approach primarily detects hidden variables, these surfaced as highly correlated genes in the time series data and in one instance in the tumor data. This further strengthens the biological relevance of latent variables detected by ICA.

    Patients with suspected acute coronary syndrome in a university hospital emergency department: an observational study

    BACKGROUND: It is widely considered that improved diagnostics in suspected acute coronary syndrome (ACS) are needed. To help clarify the current situation and the potential for improvement, we analyzed characteristics, disposition and outcome among patients with suspected ACS at a university hospital emergency department (ED). METHODS: 157 consecutive patients with symptoms of ACS were included at the ED over 10 days. The risk of ACS was estimated in the ED for each patient based on history, physical examination and ECG by assigning the patient to one of four risk categories: I (obvious myocardial infarction, MI), II (strong suspicion of ACS), III (vague suspicion of ACS), and IV (no suspicion of ACS). RESULTS: 4, 17, 29 and 50% of the patients were allocated to risk categories I-IV, respectively. 74 patients (47%) were hospitalized, but only 19 (26%) had ACS as the discharge diagnosis. In risk categories I-IV, ACS rates were 100, 37, 12 and 0%, respectively. Of those admitted without ACS, at least 37% could probably, given perfect ED diagnostics, have been immediately discharged. 83 patients were discharged from the ED, and among them there were no hospitalizations for ACS and no cardiac mortality at 6 months. Only about three patients per 24 h were considered eligible for a potential ED chest pain unit. CONCLUSIONS: Almost 75% of the patients hospitalized with suspected ACS did not have it, and some 40% of these patients could probably, given perfect immediate diagnostics, have been managed as outpatients. The potential for diagnostic improvement in the ED seems large.
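As a quick consistency check, the headline percentages in this abstract can be reproduced from the reported counts. The variable names below are illustrative, and only figures stated in the abstract are used.

```python
# Counts reported in the abstract.
total_patients = 157
hospitalized = 74
hospitalized_with_acs = 19

def pct(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

discharged = total_patients - hospitalized
hospitalized_without_acs = hospitalized - hospitalized_with_acs

print(pct(hospitalized, total_patients))          # share of patients hospitalized
print(pct(hospitalized_with_acs, hospitalized))   # share of hospitalized with ACS
print(pct(hospitalized_without_acs, hospitalized))  # share of hospitalized without ACS
print(discharged)                                 # patients discharged from the ED
```

The counts agree with the reported 47% hospitalized, 26% ACS among those hospitalized, "almost 75%" hospitalized without ACS, and 83 patients discharged.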

    Topics in multifractal measures, nonparametrics and biostatistics

    This thesis consists of four papers. The first two papers, which comprise the main part of the thesis, deal with an unexpected connection between kernel density estimators and dimension spectra for multifractal measures. The third paper presents a fully automated expert system for the diagnosis of pulmonary embolism from ventilation/perfusion scintigraphy. The final paper concerns statistical properties of the parameters of the operational model of pharmacological agonism, a widely applied model for dose-response curves in pharmacology. In the first paper, kernel density estimators for singular distributions are studied. The density estimator f is a function of the sample size and the bandwidth. It was found that the integral of H(f), where H is a suitable “magnifying” functional, diverges as the sample size increases to infinity and the bandwidth goes to 0. In the second paper it is shown that, for a particular choice of H, the rate at which the integral of H(f) diverges depends on the qth generalized Hentschel-Procaccia dimension of the measure from which the sample is drawn. This gives a new way to estimate dimension spectra for multifractal measures. An alternative kernel-based method that gives the correlation integral as a special case is also studied, which enables the estimation of the correlation dimension. The classic way of estimating generalized fractal dimensions with the aid of grids gives the generalized Rényi dimension. For q>-1 this is proved to be equivalent to the generalized Hentschel-Procaccia dimension. For q<-1 the Rényi dimension may depend on the choice of grid and thus differ from the uniquely defined Hentschel-Procaccia dimension. Examples of such measures are given.

    An automated method for the detection of pulmonary embolism in V/Q-scans

    In this paper, a fully automatic method for the diagnosis of pulmonary embolism (PE) from V/Q-scans is presented. Image analysis is applied to the ventilation and perfusion images obtained in the V/Q-scan. The difference between the ventilation and the perfusion is calculated after transformation and hot-spot reduction of the images. From this difference image, the integrals of the underperfused areas are used as features. With the aid of these features a simple test for PE is devised. The method is evaluated on two sets of patients. One set comprises 102 patients who have undergone both V/Q-scanning and angiography. The performance, given as the area under the Receiver Operating Characteristic (ROC) curve, is 0.85. The other set is made up of 507 consecutive patients examined with V/Q-scanning at Lund University Hospital in Sweden. In this case the reference was the consensus opinion of two radiologists, and the ROC area is 0.67. A fully automatic and reasonably robust expert system is developed to aid the radiologist in the interpretation of V/Q-scans for PE. (C) 2003 Elsevier B.V. All rights reserved

    5 questions for Attila Frigyesi


    Dimension spectra for multifractal measures with connections to nonparametric density estimation

    We consider relations between Rényi's and Hentschel-Procaccia's definitions of generalized dimensions of a probability measure μ, and give conditions under which the two concepts are equivalent or different. Estimators of the dimension spectrum are developed, and strong consistency is established. Particular cases of our estimators are methods based on the sample correlation integral and box counting. We then discuss the relation between generalized dimensions and kernel density estimators f̂. It was shown in Frigyesi and Hössjer (1998) that ∫ f̂^{1+q}(x) dx diverges with increasing sample size and decreasing bandwidth if the marginal distribution μ has a singular part and q > 0. In this paper, we show that the rate of divergence depends on the qth generalized Rényi dimension of μ.
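The sample correlation integral mentioned above can be illustrated with a minimal sketch. It uses the textbook two-radius slope estimate of the correlation dimension, not the estimators developed in the paper, and the function names, the radii and the evenly spaced test points are assumptions chosen so that the expected dimension (1, for points filling an interval) is known.

```python
import math

def correlation_integral(points, r):
    """Fraction of distinct point pairs closer together than r (1-D points)."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(points[i] - points[j]) < r
    )
    return close / (n * (n - 1) / 2)

def correlation_dimension(points, r_small, r_large):
    """Slope of log C(r) against log r between two radii."""
    c_small = correlation_integral(points, r_small)
    c_large = correlation_integral(points, r_large)
    return (math.log(c_large) - math.log(c_small)) / (
        math.log(r_large) - math.log(r_small)
    )

# Evenly spaced points on [0, 1): the correlation dimension should be close to 1.
points = [i / 1000 for i in range(1000)]
dim = correlation_dimension(points, 0.01, 0.1)
print(dim)
```

For a measure with a singular part the same slope would typically come out below the dimension of the ambient space.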