The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning
The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions for selecting an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered some striking correlations between the Minkowski profile, defined as a mapping of the Minkowski exponent values into the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from the comparison of the obtained partitions to the ground truth. Our findings were confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
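As a sketch of the mechanism described above, the following assumes the commonly cited MWK-means formulas for exponent p > 1: a feature-weighted Minkowski distance, and a weight update in which a feature's weight shrinks as its within-cluster dispersion grows. Function names are illustrative, not the authors' implementation.

```python
import numpy as np

def weighted_minkowski(x, y, w, p):
    """Weighted Minkowski distance: sum_v w_v**p * |x_v - y_v|**p."""
    return np.sum((w ** p) * np.abs(x - y) ** p)

def mwk_weights(cluster_points, centroid, p):
    """Dispersion-based feature weights for one cluster (p > 1).

    D_v is the Minkowski dispersion of feature v around the centroid;
    low-dispersion features get larger weights, and the weights of a
    cluster sum to 1.
    """
    D = np.sum(np.abs(cluster_points - centroid) ** p, axis=0)
    D = D + 1e-12  # guard against zero dispersion
    return np.array([1.0 / np.sum((D[v] / D) ** (1.0 / (p - 1)))
                     for v in range(len(D))])
```

For p = 2 the update reduces to weights proportional to the inverse dispersions, matching the intuition quoted in the abstract.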
Sparse p-Adic Data Coding for Computationally Efficient and Effective Big Data Analytics
We develop the theory and practical implementation of p-adic sparse coding of data. Rather than the standard sparsifying criterion that uses the pseudo-norm, we use the p-adic norm. We require that the hierarchy or tree be node-ranked, as is standard practice in agglomerative and other hierarchical clustering, but not necessarily with decision trees. In order to structure the data, all computational processing operations are direct readings of the data, or are bounded by a constant number of direct readings of the data, implying linear computational time. Through p-adic sparse data coding, efficient storage results, and for data stored with bounded p-adic norm, search and retrieval are constant-time operations. Examples show the effectiveness of this new approach to content-driven encoding and displaying of data.
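For readers unfamiliar with the p-adic norm invoked above, here is a minimal illustration on integers (the paper's coding of hierarchies is more involved; this shows only the norm itself and its ultrametric flavour):

```python
def padic_valuation(n, p):
    """Largest k such that p**k divides n; the valuation of 0 is infinite."""
    if n == 0:
        return float("inf")
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def padic_norm(n, p):
    """p-adic norm |n|_p = p**(-v_p(n)), with |0|_p = 0.

    Numbers divisible by high powers of p are p-adically *small*,
    which is what makes the norm useful as a sparsity criterion.
    """
    if n == 0:
        return 0.0
    return float(p) ** (-padic_valuation(n, p))
```

The norm satisfies the strong (ultrametric) triangle inequality |a + b|_p <= max(|a|_p, |b|_p), the property that ties it to node-ranked trees.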
Community detection in graphs
The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph, playing a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. We will attempt a thorough exposition of the topic, from the definition of the main elements of the problem, to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists, from the discussion of crucial issues like the significance of clustering and how methods should be tested and compared against each other, to the description of applications to real networks. (Review article, 103 pages, 42 figures, 2 tables. Final version published in Physics Reports.)
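The quality function most closely associated with the statistical-physics techniques surveyed in this review is Newman-Girvan modularity, which scores a partition by comparing intra-community edges against a degree-preserving random graph. A minimal sketch (the function name and dense-matrix representation are illustrative):

```python
import numpy as np

def modularity(A, communities):
    """Newman-Girvan modularity Q of a vertex partition.

    A: symmetric adjacency matrix of an undirected graph.
    communities: community label per vertex.
    Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * [c_i == c_j]
    """
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)        # vertex degrees
    two_m = A.sum()          # twice the number of edges
    labels = np.asarray(communities)
    same = labels[:, None] == labels[None, :]
    return np.sum((A - np.outer(k, k) / two_m) * same) / two_m
```

Larger Q means more edges fall inside communities than expected at random; many of the methods discussed in the review maximize this quantity, directly or via physical analogies such as spin models.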
RecG directs DNA synthesis during double-strand break repair
Homologous recombination provides a mechanism of DNA double-strand break repair (DSBR) that requires an intact, homologous template for DNA synthesis. When DNA synthesis associated with DSBR is convergent, the broken DNA strands are replaced and repair is accurate. However, if divergent DNA synthesis is established, over-replication of flanking DNA may occur with deleterious consequences. The RecG protein of Escherichia coli is a helicase and translocase that can re-model 3-way and 4-way DNA structures such as replication forks and Holliday junctions. However, the primary role of RecG in live cells has remained elusive. Here we show that, in the absence of RecG, attempted DSBR is accompanied by divergent DNA replication at the site of an induced chromosomal DNA double-strand break. Furthermore, DNA double-strand ends are generated in a recG mutant at sites known to block replication forks. These double-strand ends also trigger DSBR and the divergent DNA replication characteristic of this mutant, which can explain over-replication of the terminus region of the chromosome. The loss of DNA associated with unwinding joint molecules, previously observed in the absence of RuvAB and RecG, is suppressed by a helicase-deficient PriA mutation (priA300), arguing that the action of RecG ensures that PriA is bound correctly on D-loops to direct DNA replication rather than to unwind joint molecules. This has led us to put forward a revised model of homologous recombination in which the re-modelling of branched intermediates by RecG plays a fundamental role in directing DNA synthesis and thus maintaining genomic stability.
Laplacian normalization for deriving thematic fuzzy clusters with an additive spectral approach
This paper presents a further investigation into computational properties of a novel fuzzy additive spectral clustering method, Fuzzy Additive Spectral clustering (FADDIS), recently introduced by the authors. Specifically, we extend our analysis to "difficult" data structures from the recent literature and develop two synthetic data generators simulating affinity data of Gaussian clusters and genuine additive similarity data, with a controlled level of noise. FADDIS is experimentally verified on these data in comparison with two state-of-the-art fuzzy clustering methods. The claimed ability of FADDIS to help in determining the right number of clusters is experimentally tested, and the role of the pseudo-inverse Laplacian data transformation in this is highlighted. A potentially useful extension of the method to biclustering is introduced.
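The pseudo-inverse Laplacian transformation highlighted above can be sketched as follows. This assumes the combinatorial Laplacian L = D - W of a symmetric similarity matrix W; FADDIS may apply a normalized variant, so treat this as an illustration of the transformation, not the method itself.

```python
import numpy as np

def pseudo_inverse_laplacian(W):
    """Moore-Penrose pseudo-inverse of the combinatorial Laplacian.

    W: symmetric similarity (affinity) matrix.
    L = D - W is singular (its rows sum to zero), so the ordinary
    inverse does not exist; the pseudo-inverse plays its role and
    tends to sharpen cluster structure in the transformed similarities.
    """
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.pinv(L)
```

The transformed matrix is then fed to the additive spectral clustering step in place of the raw similarities.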
A hybrid cluster-lift method for the analysis of research activities
A hybrid of two novel methods - additive fuzzy spectral clustering and lifting method over a taxonomy - is applied to analyse the research activities of a department. To be specific, we concentrate on the Computer Sciences area represented by the ACM Computing Classification System (ACM-CCS), but the approach is applicable also to other taxonomies. Clusters of the taxonomy subjects are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are then parsimoniously lifted to higher ranks of the taxonomy by minimizing the count of "head subjects" along with their "gaps" and "offshoots". An example is given illustrating the method applied to real-world data.
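A toy sketch of the lifting trade-off described above: raising a cluster to a single higher-rank "head subject" saves heads but pays penalties for "gaps" (leaves under the head that are not in the cluster) and "offshoots" (cluster leaves the head does not cover). The tree encoding, penalty values, and function name are all illustrative assumptions, not the authors' formulation.

```python
def lift_cost(head, cluster, children, gap_pen=0.5, off_pen=0.9):
    """Cost of lifting `cluster` (a set of taxonomy leaves) to one `head`.

    children: dict mapping each internal taxonomy node to its child nodes;
    nodes absent from the dict are leaves.  Cost = 1 head + penalties
    for gaps and offshoots (penalty weights are illustrative).
    """
    def leaves(node):
        if node not in children:
            return {node}
        out = set()
        for ch in children[node]:
            out |= leaves(ch)
        return out

    covered = leaves(head)
    gaps = covered - cluster
    offshoots = cluster - covered
    return 1 + gap_pen * len(gaps) + off_pen * len(offshoots)
```

Comparing this cost across candidate head nodes reproduces the parsimony principle: a high-rank head wins only when the gaps it introduces are cheaper than the offshoots it avoids.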
- …