35 research outputs found

Impact of the Choice of Normalization Method on Molecular Cancer Class Discovery Using Nonnegative Matrix Factorization

Author: Haixuan Yang (147680)
Cathal Seoighe (2731)
Publication date: 01/01/1975

Nonnegative Matrix Factorization (NMF) has proved to be an effective method for unsupervised clustering analysis of gene expression data. By the nonnegativity constraint, NMF provides a decomposition of the data matrix into two matrices that have been used for clustering analysis. However, the decomposition is not unique. This allows different clustering results to be obtained, resulting in different interpretations of the decomposition. To alleviate this problem, some existing methods directly enforce uniqueness to some extent by adding regularization terms in the NMF objective function. Alternatively, various normalization methods have been applied to the factor matrices; however, the effects of the choice of normalization have not been carefully investigated. Here we investigate the performance of NMF for the task of cancer class discovery, under a wide range of normalization choices. After extensive evaluations, we observe that the maximum norm showed the best performance, although the maximum norm has not previously been used for NMF. Matlab codes are freely available from: http://maths.nuigalway.ie/~haixuanyang/pNMF/pNMF.htm.
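
The effect of rescaling the factor matrices can be illustrated with a minimal sketch. It assumes a factorization A ≈ WH with samples in the columns of H, and it assigns each sample to the factor with the largest coefficient in H. The helper names (`max_normalize`, `assign_clusters`) and the toy matrices are hypothetical; this is not the authors' Matlab code. Dividing each column of W by its maximum entry and multiplying the matching row of H by the same factor leaves the product WH unchanged, yet the cluster assignments can differ.

```python
def max_normalize(W, H):
    """Scale each column of W to unit maximum norm and compensate in H.

    W is an m x k list of lists, H is a k x n list of lists, both nonnegative.
    The product W @ H is unchanged by this rescaling.
    """
    k = len(H)
    # Maximum norm of each column of W (assumed nonzero in this sketch).
    scales = [max(row[j] for row in W) for j in range(k)]
    W_norm = [[row[j] / scales[j] for j in range(k)] for row in W]
    H_norm = [[scales[j] * h for h in H[j]] for j in range(k)]
    return W_norm, H_norm


def assign_clusters(H):
    """Assign each sample (column of H) to the factor with the largest coefficient."""
    k, n = len(H), len(H[0])
    return [max(range(k), key=lambda j: H[j][s]) for s in range(n)]


# Toy factors (hypothetical values): the two columns of W have very different scales.
W = [[10.0, 0.1], [8.0, 0.2], [2.0, 0.05]]
H = [[0.3, 0.1, 0.2], [0.5, 0.4, 0.1]]

W_norm, H_norm = max_normalize(W, H)
before = assign_clusters(H)       # [1, 1, 0]
after = assign_clusters(H_norm)   # [0, 0, 0]
```

The same data and the same product WH give two different clusterings, which is why the choice of normalization matters for class discovery.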

OAKTrust Digital Repository (Texas A&M Univ)

The Francis Crick Institute

Dataset description and performance comparison in term of clustering accuracy in percentage.

Author: Cathal Seoighe (2731)
Haixuan Yang (147680)

Reported is the mean of clustering accuracies from 100 runs of Basic NMF together with the standard error of the mean. Also reported is the p-value produced by a paired two-sided t-test. Note that the proposed method uses ‘max’ normalization together with the filter.
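
As a rough illustration of these summary statistics, the sketch below computes a mean, a standard error of the mean, and a paired t statistic with the Python standard library. The accuracy values are hypothetical and are not results from the paper, and pairing the two methods run by run is an assumption. The two-sided p-value would follow from Student's t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_rel`, which is not used here.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def paired_t_statistic(x, y):
    """Paired t statistic for matched runs, with its degrees of freedom (n - 1)."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_d, sem_d = mean_and_sem(diffs)
    return mean_d / sem_d, len(diffs) - 1


# Hypothetical accuracies (percent) from five matched runs, not values from the paper.
basic = [90.0, 92.0, 91.0, 89.0, 93.0]
proposed = [94.0, 95.0, 93.0, 94.0, 95.0]

mean_basic, sem_basic = mean_and_sem(basic)
t_stat, dof = paired_t_statistic(proposed, basic)
```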

The Francis Crick Institute

Cophenetic correlation.

Author: Haixuan Yang (147680)
Cathal Seoighe (2731)
Publication date: 01/01/1989

(a) Leukemia. (b) CNS. (c) Medulloblastoma.

OAKTrust Digital Repository (Texas A&M Univ)

The Francis Crick Institute

Putative mutator allele on chromosome 11.

Author: Aylwyn Scally (109051)
Cathal Seoighe (2731)

Relationship between the number of de novo mutations in the offspring and the maternal number of highly derived haplotypes of the putative mutator allele on chromosome 11 (a). Location of the putative mutator locus on chromosome 11, defined as a peak in the difference between the maximum and interquartile mean number of derived alleles across haplotypes (b).
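
The peak statistic in (b) can be sketched as follows. This is a minimal illustration that assumes a simple definition of the interquartile mean (the mean after dropping the lowest and highest quarters of the sorted values). The function names and the counts are hypothetical, and the authors' exact windowing and trimming may differ.

```python
import statistics


def interquartile_mean(values):
    """Mean of the middle half of the sorted values (lowest and highest quarters dropped)."""
    ordered = sorted(values)
    cut = len(ordered) // 4
    return statistics.mean(ordered[cut:len(ordered) - cut])


def peak_statistic(derived_counts):
    """Difference between the maximum and the interquartile mean across haplotypes."""
    return max(derived_counts) - interquartile_mean(derived_counts)


# Hypothetical derived-allele counts per haplotype in one genomic window.
counts = [3, 4, 4, 5, 5, 5, 6, 20]
stat = peak_statistic(counts)  # 20 - mean([4, 5, 5, 5]) = 15.25
```

A single haplotype with an unusually high count drives the statistic up while leaving the interquartile mean almost untouched.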

The Francis Crick Institute

Comparison of different normalization methods in term of clustering accuracy in percentage.

Author: Cathal Seoighe (2731)
Haixuan Yang (147680)

Reported is the mean of clustering accuracies from 100 runs of Basic NMF together with the standard error of the mean.

The Francis Crick Institute

Top 20 candidate mutator loci (relative to genome build hg19), ranked by residual.

Author: Aylwyn Scally (109051)
Cathal Seoighe (2731)

The Francis Crick Institute

Accuracy as a function of noise levels for datasets Leukemia (k = 2), Leukemia (k = 3), CNS (k = 4) and Medulloblastoma (k = 2) respectively.

Author: Cathal Seoighe (2731)
Haixuan Yang (147680)

For each noise level μ, NMF is run 100 times on disturbed matrices. On each such run, a disturbed matrix A′ is generated by adding independent uniform noise: $A'_{ij} = A_{ij} + \mu r_{ij}$, where $r_{ij}$ is a random number drawn from a uniform distribution on the interval [0, max], and max is the maximum expression value in A. Plotted is the mean of the clustering accuracies from 100 runs, with error bars representing the standard error of the mean. The post-processing method uses the maximum norm together with the filter.
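
A minimal sketch of this perturbation, assuming the noise is added to each entry as μ times a uniform draw on [0, max]. The function name `perturb`, the seed argument, and the toy matrix are hypothetical.

```python
import random


def perturb(A, mu, seed=None):
    """Return A' with A'[i][j] = A[i][j] + mu * r, where r ~ Uniform[0, max(A)]."""
    rng = random.Random(seed)
    a_max = max(max(row) for row in A)  # maximum expression value in A
    # One independent draw per entry.
    return [[a + mu * rng.uniform(0.0, a_max) for a in row] for row in A]


# Toy expression matrix (hypothetical values).
A = [[1.0, 4.0], [0.0, 2.0]]
A_noisy = perturb(A, mu=0.1, seed=0)
```

Because the noise is nonnegative, the disturbed matrix stays nonnegative and can be passed to NMF directly.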

The Francis Crick Institute

Cophenetic correlation.

Author: Cathal Seoighe (2731)
Haixuan Yang (147680)

(a) Leukemia. (b) CNS. (c) Medulloblastoma.

The Francis Crick Institute

Simulation study.

Author: Aylwyn Scally (109051)
Cathal Seoighe (2731)

An example of simulated data showing a subset of haplotypes with a peak in the number of derived alleles in a 10 Kb sliding window. A region of 100 Kb was simulated over 40,000 generations using a coalescent approach with recombination. In this example a mutator allele with ϕ = 5 was introduced 20,000 generations before the present and was assumed to be weakly deleterious, with a selective coefficient of -0.0002. The red and green lines show the maximum and trimmed mean number of derived alleles in the window. Individual sampled haplotypes are shown in grey.

The Francis Crick Institute

Clustering errors as a function of the number of features (genes) for datasets Leukemia (k = 2), Leukemia (k = 3), CNS (k = 4) and Medulloblastoma (k = 2) respectively.

Author: Cathal Seoighe (2731)
Haixuan Yang (147680)

On each of these datasets, the Basic NMF (and the post-processing method using the maximum norm together with the filter) is run on subsets of the full data containing the 1000 + 100d most highly varying genes (d = 0, 1, 2, 3, …). Results are shown as continuous lines for clarity. Clustering error is the number of samples improperly clustered by an algorithm. Here the Basic NMF is the one minimizing a KL divergence in Eq (1) (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0164880#pone.0164880.e001).
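
The gene-subset selection can be sketched as below, assuming genes are the rows of the expression matrix and that "most highly varying" means largest variance across samples. Both assumptions, along with the function names and the toy matrix, are illustrative rather than taken from the authors' code.

```python
import statistics


def top_varying_genes(A, n_genes):
    """Return the row indices of the n_genes rows (genes) with the largest variance."""
    variances = [statistics.pvariance(row) for row in A]
    order = sorted(range(len(A)), key=lambda i: variances[i], reverse=True)
    return order[:n_genes]


def subset_sizes(total_genes, start=1000, step=100):
    """Sizes 1000 + 100d for d = 0, 1, 2, ... that do not exceed the number of genes."""
    return list(range(start, total_genes + 1, step))


# Toy matrix with four genes (rows) and three samples (columns), hypothetical values.
A = [[1.0, 1.0, 1.0], [0.0, 5.0, 10.0], [2.0, 3.0, 4.0], [0.0, 0.0, 3.0]]
selected = top_varying_genes(A, 2)  # [1, 3]
sizes = subset_sizes(1250)          # [1000, 1100, 1200]
```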

The Francis Crick Institute