Hierarchy of Gene Expression Data is Predictive of Future Breast Cancer Outcome
We calculate measures of hierarchy in gene and tissue networks of breast
cancer patients. We find that the likelihood of metastasis in the future is
correlated with increased values of network hierarchy for expression networks
of cancer-associated genes, due to correlated expression of cancer-specific
pathways. Conversely, future metastasis and quick relapse times are negatively
correlated with values of network hierarchy in the expression network of all
genes, due to dedifferentiation of gene pathways and circuits. These results
suggest that hierarchy of gene expression may be useful as an additional
biomarker for breast cancer prognosis.
Comment: 14 pages, 5 figures
Do logarithmic proximity measures outperform plain ones in graph clustering?
We consider a number of graph kernels and proximity measures including
commute time kernel, regularized Laplacian kernel, heat kernel, exponential
diffusion kernel (also called "communicability"), etc., and the corresponding
distances as applied to clustering nodes in random graphs and several
well-known datasets. The model of generating random graphs involves edge
probabilities for the pairs of nodes that belong to the same class or different
predefined classes of nodes. It turns out that in most cases, logarithmic
measures (i.e., measures resulting after taking logarithm of the proximities)
perform better while distinguishing underlying classes than the "plain"
measures. A comparison in terms of reject curves of inter-class and intra-class
distances confirms this conclusion. A similar conclusion can be made for
several well-known datasets. A possible origin of this effect is that most
kernels have a multiplicative nature, while the nature of distances used in
cluster algorithms is an additive one (cf. the triangle inequality). The
logarithmic transformation is a tool to transform the first nature to the
second one. Moreover, some distances corresponding to the logarithmic measures
possess a meaningful cutpoint additivity property. In our experiments, the
leader is usually the logarithmic Communicability measure. However, we indicate
some more complicated cases in which other measures, typically, Communicability
and plain Walk, can be the winners.
Comment: 11 pages, 5 tables, 9 figures. Accepted for publication in the
Proceedings of 6th International Conference on Network Analysis, May 26-28,
2016, Nizhny Novgorod, Russia
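As a rough illustration of the "plain" versus "logarithmic" distinction described in this abstract, the sketch below computes the communicability kernel exp(tA) of a small graph, takes its element-wise logarithm, and converts the result to a distance. The function names, the parameter `t`, and the conversion d_ij = (h_ii + h_jj)/2 - h_ij are assumptions chosen for illustration; the abstract does not spell out the exact transformation the authors use.

```python
import numpy as np

def communicability(A, t=1.0):
    """Communicability (exponential diffusion) kernel exp(t*A) of a symmetric adjacency matrix."""
    w, V = np.linalg.eigh(A)           # A = V diag(w) V^T
    return (V * np.exp(t * w)) @ V.T   # V diag(exp(t*w)) V^T

def proximity_to_distance(H):
    """Assumed conversion: d_ij = (h_ii + h_jj) / 2 - h_ij."""
    h = np.diag(H)
    return 0.5 * (h[:, None] + h[None, :]) - H

def log_communicability_distance(A, t=1.0):
    """Distance built from the element-wise logarithm of the kernel (connected graph assumed)."""
    K = communicability(A, t)
    return proximity_to_distance(np.log(K))

# Path graph 0 - 1 - 2: the end nodes should be farther apart than neighbours.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = log_communicability_distance(A)
```

The resulting matrix `D` could then be passed to any distance-based clustering routine; which routine the authors used is not stated here.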
Hierarchy in Gene Expression is Predictive for Adult Acute Myeloid Leukemia
Cancer progresses with a change in the structure of the gene network in
normal cells. We define a measure of organizational hierarchy in gene networks
of affected cells in adult acute myeloid leukemia (AML) patients. With a
retrospective cohort analysis based on the gene expression profiles of 116
acute myeloid leukemia patients, we find that the likelihood of future cancer
relapse and the level of clinical risk are directly correlated with the level
of organization in the cancer related gene network. We also explore the
variation of the level of organization in the gene network with cancer
progression. We find that this variation is non-monotonic, which implies the
fitness landscape in the evolution of AML cancer cells is nontrivial. We
further find that the hierarchy in gene expression at the time of diagnosis may
be a useful biomarker in AML prognosis.
Comment: 18 pages, 5 figures, to appear in Physical Biology
A Statistical Toolbox For Mining And Modeling Spatial Data
Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation, based on Moran's and Geary's coefficients, the adequacy of which is rarely challenged, despite the fact that when reporting on their properties, many users seem likely to make mistakes and to foster confusion. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. Then, I propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship, and opens the way to an augmented toolbox of statistical methods of dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to our definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here are easily implementable in existing (free) software and yield methods useful for exploiting the proposed corrections in spatial data analysis practice. From a mathematical point of view, their asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, suggests that they possess robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), qualities that are valuable in exploratory spatial data analysis.
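For readers unfamiliar with the two indices, the following minimal sketch computes the classical Moran's I and Geary's c from a spatial weight matrix. It shows the textbook definitions only, not the corrected coefficients proposed in the paper, whose exact form is not given in this abstract. The function names and the toy weight matrix are illustrative assumptions.

```python
import numpy as np

def morans_i(x, W):
    """Classical Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()                       # total weight S0
    return (len(x) / s0) * (z @ W @ z) / (z @ z)

def gearys_c(x, W):
    """Classical Geary's c: ((n - 1) / (2 * S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i z_i^2."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()
    sq_diff = (x[:, None] - x[None, :]) ** 2
    return ((len(x) - 1) / (2.0 * s0)) * (W * sq_diff).sum() / (z @ z)

# Toy example (assumed data): four regions in a row, each adjacent to its neighbours,
# carrying a smoothly increasing attribute.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = [1.0, 2.0, 3.0, 4.0]
I = morans_i(x, W)   # positive: similar values are adjacent
c = gearys_c(x, W)   # below 1: similar values are adjacent
```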
Improved Image Segmentation Algorithm Using Graph-Edges
In this paper, an efficient algorithm for segmenting digital images has been developed by measuring the evidence for a boundary between two regions in an image using graph edges. The regions in the image are treated as components, where each region in the image represents a component in the graph. The region comparison predicate evaluates whether there is evidence for a boundary between a pair of components by checking whether the difference between the components is large relative to the internal difference within at least one of the components. A threshold function is used to control the degree to which the difference between components must be larger than the minimum internal difference. An important characteristic of the method is its ability to preserve detail in important image regions while ignoring detail in unimportant regions. Classical methods depend only on the external difference and ignore the internal difference when segmenting two neighboring regions.
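The region comparison predicate described above can be sketched as follows. The threshold function tau(C) = k / |C| and the parameter name `k` are assumptions borrowed from the common formulation of graph-based segmentation; the abstract only says that some threshold function is used.

```python
def has_boundary(diff_between, internal_1, size_1, internal_2, size_2, k=300.0):
    """Return True if there is evidence for a boundary between two components.

    diff_between : difference between the two components (e.g. the minimum
                   weight of an edge connecting them)
    internal_i   : internal difference of component i (e.g. the largest edge
                   weight in its minimum spanning tree)
    size_i       : number of pixels in component i
    k            : scale of the assumed threshold function tau(C) = k / |C|
    """
    tau_1 = k / size_1
    tau_2 = k / size_2
    # Minimum internal difference, relaxed by the threshold function.
    min_internal = min(internal_1 + tau_1, internal_2 + tau_2)
    return diff_between > min_internal

# Hypothetical numbers: thresholds are 1.0 and 2.0, so the minimum is min(3.0, 5.0) = 3.0.
strong_edge = has_boundary(10.0, 2.0, 100, 3.0, 50, k=100.0)
weak_edge = has_boundary(2.5, 2.0, 100, 3.0, 50, k=100.0)
```

With this choice of threshold, a larger `k` demands stronger evidence for a boundary around small components, which favours larger regions.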
Large Scale Spectral Clustering Using Approximate Commute Time Embedding
Spectral clustering is a novel clustering method which can detect complex
shapes of data clusters. However, it requires the eigendecomposition of the
graph Laplacian matrix, which takes time proportional to O(n^3) and thus is not
suitable for large scale systems. Recently, many methods have been proposed to
accelerate the computational time of spectral clustering. These approximate
methods usually involve sampling techniques by which a lot of information of the
original data may be lost. In this work, we propose a fast and accurate
spectral clustering approach using an approximate commute time embedding, which
is similar to the spectral embedding. The method requires neither a sampling
technique nor the computation of any eigenvectors. Instead it uses random
projection and a linear time solver to find the approximate embedding. The
experiments on several synthetic and real datasets show that the proposed
approach has better clustering quality and is faster than the state-of-the-art
approximate spectral clustering methods.
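The idea of a commute time embedding obtained by random projection can be sketched as below. This is a small dense illustration under stated assumptions, not the authors' implementation: the function name, the ±1/sqrt(k) projection matrix, and in particular the use of `np.linalg.pinv` in place of the linear time solver mentioned in the abstract are all stand-ins chosen for brevity.

```python
import numpy as np

def approx_commute_embedding(A, k=50, seed=0):
    """Embed nodes so that squared Euclidean distances approximate commute times.

    A is a symmetric, non-negative weighted adjacency matrix (NumPy array) of a
    connected graph. Returns Z of shape (k, n); column i is the embedding of node i.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    rows, cols = np.triu(A, 1).nonzero()      # each undirected edge once
    w = A[rows, cols]
    m = len(w)
    B = np.zeros((m, n))                      # signed edge-node incidence matrix
    B[np.arange(m), rows] = 1.0
    B[np.arange(m), cols] = -1.0
    L = B.T @ (w[:, None] * B)                # graph Laplacian B^T W B
    Q = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)   # random projection
    Y = Q @ (np.sqrt(w)[:, None] * B)         # k x n
    # Each row of Z solves L z = y; pinv stands in for a fast Laplacian solver.
    return np.sqrt(A.sum()) * (Y @ np.linalg.pinv(L))

# Path graph 0 - 1 - 2 - 3 with unit weights. The exact commute time between the
# end nodes is volume * effective resistance = 6 * 3 = 18.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
Z = approx_commute_embedding(A, k=2000, seed=0)
approx_commute_0_3 = np.sum((Z[:, 0] - Z[:, 3]) ** 2)
```

Clustering would then proceed by running a standard algorithm such as k-means on the columns of `Z`; a much smaller `k` than the 2000 used in this toy check would normally be chosen.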