We face a need of discovering a pattern in locations of a great number of
points in a high-dimensional space. Goal is to group the close points together.
We are interested in a hierarchical structure, like a B-tree. B-Trees are
hierarchical, balanced, and they can be constructed dynamically. B-Tree
approach allows to determine the structure without any supervised learning or a
priori knowlwdge. The space is Euclidean and isotropic. Unfortunately, there
are no B-Tree implementations processing indices in a symmetrical and
isotropical way. Some implementations are based on constructing compound
asymmetrical indices from point coordinates; and the others split the nodes
along the coordinate hyper-planes. We need to process tens of millions of
points in a thousand-dimensional space. The application has to be scalable.
Ideally, a cluster should be an ellipsoid, but it would require to store O(n2)
ellipse axes. So, we are using multi-dimensional balls defined by the centers
and radii. Calculation of statistical values like the mean and the average
deviation, can be done in an incremental way. While adding a point to a tree,
the statistical values for nodes recalculated in O(1) time. We support both,
brute force O(2n) and greedy O(n2) split algorithms. Statistical and aggregated
node information also allows to manipulate (to search, to delete) aggregated
sets of closely located points. Hierarchical information retrieval. When
searching, the user is provided with the highest appropriate nodes in the tree
hierarchy, with the most important clusters emerging in the hierarchy
automatically. Then, if interested, the user may navigate down the tree to more
specific points. The system is implemented as a library of Java classes
representing Points, Sets of points with aggregated statistical information,
B-tree, and Nodes with a support of serialization and storage in a MySQL
database.Comment: 6 pages with 3 example

Rutishauser, Oliver

Sadikov, Victor

English

arXiv

We face a business need of discovering a pattern in locations of a great number of points in a high-dimensional space. We assume that there should be a certain structure, so that in some locations the points are close while in other locations the points are more dispersed. Our goal is to group the close points together. The process of grouping close objects is known under the name of clustering. 1.	We are particularly interested in a hierarchical structure. A plain structure may reduce the number of objects, but the data are still difficult to manage or present. 2.	The classical technique suited for the task at hand is a B-Tree. The key properties of the B-Tree are that it is hierarchical and balanced, and it can be dynamically constructed from the input data. In these terms, B-Tree has certain advantages over other clustering algorithms, where the number of clusters needs to be defined a priori. The BTree approach allows to hope that the structure of input data will be well determine without any supervised learning. 3.	The space is Euclidean and isotropic. This is the most challenging part of the project, because currently there are no B-Tree implementations processing indices in a symmetrical and isotropical way. Some known implementations are based on constructing compound asymmetrical indices from point coordinates, where the main index works as a key, while the function of other (999!) indices is lost; and the other known implementations split the nodes along the coordinate hyper-planes, sacrificing the isotropy of the original space. In the latter case the clusters become coordinate parallelepipeds, which is a rather artificial and unnecessary assumption. Our implementation of a B Tree for a high-dimensional space is based directly on concepts of factor analysis. 4.	We need to process a great deal of data, something like tens of millions of points in a thousand-dimensional space. The application has to be scalable, even though, technically, out task is not co

Isotropic Dynamic Hierarchical Clustering

Abstract

Similar works

Full text

Available Versions

Global Journal of Computer Science and Technology (GJCST)