New Techniques for Clustering Complex Objects

Author: Claudia Plant
Hans-Peter Kriegel
Heribert Mühlberger
Karin Kailing
Thomas Müller
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 15/11/2004

The tremendous amount of data produced nowadays in various application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which is the task of partitioning points of a data set into distinct groups (clusters) such that two points from one cluster are similar to each other whereas two points from distinct clusters are not. Due to modern database technology, e.g. object-relational databases, a huge number of complex objects from scientific, engineering or multimedia applications are stored in database systems. Modelling such complex data often results in very high-dimensional vector data ("feature vectors"). In the context of clustering, this causes many fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because in such high-dimensional feature spaces data does not cluster anymore. But usually, there are clusters embedded in lower dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of features is regarded for clustering. The subset of features may even be different for varying clusters. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-connected Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces of the feature space. Two subspace selection techniques called RIS (Ranking Interesting Subspaces) and SURFING (SUbspaces Relevant For clusterING) are proposed.
They do not compute the subspace clusters directly, but generate a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C searches for groups of objects which exhibit an arbitrary but uniform correlation. Often, the traditional approach of modelling data as high-dimensional feature vectors is no longer able to capture the intuitive notion of similarity between complex objects. Thus, objects like chemical compounds, CAD drawings, XML data or color images are often modelled by using more complex representations like graphs or trees. If a metric distance function like the edit distance for graphs and trees is used as similarity measure, traditional clustering approaches like density-based clustering are applicable to those data. However, we face the problem that a single distance calculation can be very expensive. As clustering performs a lot of distance calculations, approaches like filter and refinement and metric indices become important. The second part of this thesis deals with special approaches for clustering in application domains with complex similarity models. We show how appropriate filters can be used to enhance the performance of query processing and, thus, clustering of hierarchical objects. Furthermore, we describe how the two paradigms of filtering and metric indexing can be combined. As complex objects can often be represented by using different similarity models, a new clustering approach is presented that is able to cluster objects that provide several different complex representations.
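
To make the density-based notion in this abstract concrete, the following is a minimal sketch of plain DBSCAN plus a projection helper, written for illustration only. It is not SUBCLU, RIS, SURFING or 4C, and it is not taken from the thesis. The names `region_query`, `dbscan` and `project`, and the parameter values, are assumptions chosen for this example. It shows how points that do not cluster in the full feature space can form clusters once only a subset of features is considered.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: expand
        cluster += 1
    return labels


def project(points, dims):
    """Keep only the features listed in dims (hypothetical helper)."""
    return [tuple(p[d] for d in dims) for p in points]


# Feature 0 carries two groups; feature 1 is spread out and hides them.
data = [(0.0, 0), (0.1, 50), (0.2, 100), (5.0, 0), (5.1, 50), (5.2, 100)]
full_space = dbscan(data, eps=1.0, min_pts=3)
subspace = dbscan(project(data, (0,)), eps=1.0, min_pts=3)
```

In this toy data set every point is noise in the full two-dimensional space, while the projection onto feature 0 yields two clusters. Running such a search over all subspaces naively is exponential in the number of features, which is the cost the subspace methods named in the abstract aim to avoid.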

CiteSeerX

Digitale Hochschulschriften der LMU

Exploration of Large Digital Sky Surveys

Author: A. A. Mahabal
D. Curkendall
J. Jacob
P. Stolorz
R. Granat
R. J. Brunner
R. R De
R. R. Gal
S. C. Odewahn
S. Castro
S. G. Djorgovski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2000

We review some of the scientific opportunities and technical challenges posed by the exploration of the large digital sky surveys, in the context of a Virtual Observatory (VO). The VO paradigm will profoundly change the way observational astronomy is done. Clustering analysis techniques can be used to discover samples of rare, unusual, or even previously unknown types of astronomical objects and phenomena. Exploration of the previously poorly probed portions of the observable parameter space is especially promising. We illustrate some of the possible types of studies with examples drawn from DPOSS; much more complex and interesting applications are forthcoming. Development of the new tools needed for an efficient exploration of these vast data sets requires a synergy between astronomy and information sciences, with great potential returns for both fields.

Comment: To appear in: Mining the Sky, eds. A. Banday et al., ESO Astrophysics Symposia, Berlin: Springer Verlag, in press (2001). Latex file, 18 pages, 6 encapsulated postscript figures, style files included.

arXiv.org e-Print Archive

Efficient and Effective Similarity Search on Complex Objects

Author: Brecheisen Stefan
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 22/02/2007

Due to the rapid development of computer technology and new methods for the extraction of data in the last few years, more and more applications of databases have emerged, for which an efficient and effective similarity search is of great importance. Application areas of similarity search include multimedia, computer aided engineering, marketing, image processing and many more. Of special interest is the task of finding similar objects in large amounts of data having complex representations. For example, set-valued objects as well as tree or graph structured objects are among these complex object representations. The grouping of similar objects, the so-called clustering, is a fundamental analysis technique, which makes it possible to search through extensive data sets. The goal of this dissertation is to develop new efficient and effective methods for similarity search in large quantities of complex objects. Furthermore, the efficiency of existing density-based clustering algorithms is to be improved when applied to complex objects. The first part of this work motivates the use of vector sets for similarity modeling. For this purpose, a metric distance function is defined, which is suitable for various application areas, but time-consuming to compute. Therefore, a filter-refinement technique is suggested to efficiently process range queries and k-nearest neighbor queries, two basic query types within the field of similarity search. Several filter distances are presented, which approximate the exact object distance and can be computed efficiently. Moreover, a multi-step query processing approach is described, which can be directly integrated into the well-known density-based clustering algorithms DBSCAN and OPTICS. In the second part of this work, new application areas for density-based hierarchical clustering using OPTICS are discussed.
A prototype is introduced, which has been developed for these new application areas and is based on the aforementioned similarity models and accelerated clustering algorithms for complex objects. This prototype facilitates interactive semi-automatic cluster analysis and allows visual search for similar objects in multimedia databases. Another prototype extends these concepts and enables the user to analyze multi-represented and multi-instance data. Finally, the problem of music genre classification is addressed as another application supporting multi-represented and multi-instance data objects. An extensive experimental evaluation examines efficiency and effectiveness of the presented techniques using real-world data and points out advantages in comparison to conventional approaches.
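
The filter-refinement idea for range queries mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's method: strings with the Levenshtein edit distance stand in for the expensive exact distance on complex objects, and the length difference serves as the cheap filter distance, since it never exceeds the edit distance. The function names `edit_distance`, `length_filter` and `range_query` are hypothetical.

```python
def edit_distance(a, b):
    """Exact (expensive) distance: Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def length_filter(a, b):
    """Cheap filter distance: a lower bound on the edit distance."""
    return abs(len(a) - len(b))


def range_query(db, query, eps):
    """Return all objects within eps of query and the number of exact distance calls."""
    results, exact_calls = [], 0
    for obj in db:
        # Filter step: if the lower bound already exceeds eps, the exact
        # distance must exceed it too, so the object is safely discarded.
        if length_filter(obj, query) > eps:
            continue
        # Refinement step: only surviving candidates pay for the exact distance.
        exact_calls += 1
        if edit_distance(obj, query) <= eps:
            results.append(obj)
    return results, exact_calls


db = ["cluster", "clusters", "clustering", "cat", "density"]
hits, exact_calls = range_query(db, "cluster", eps=1)
```

Because the filter is a lower bound, no true result is lost; the filter only reduces how often the exact distance is evaluated. A density-based algorithm such as DBSCAN could call a query of this shape wherever it needs an eps-neighborhood.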

Digitale Hochschulschriften der LMU

Visual Analytics for Understanding Spatial Situations from Episodic Movement Data

Author: Andrienko G.
Andrienko N.
Hecker D.
Liebig T.
Stange H.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/03/2012

Continuing advances in modern data acquisition techniques result in rapidly growing amounts of geo-referenced data about moving objects and in the emergence of new data types. We define episodic movement data as a new complex data type to be considered in the research fields relevant to data analysis. In episodic movement data, position measurements may be separated by large time gaps, in which the positions of the moving objects are unknown and cannot be reliably reconstructed. Many of the existing methods for movement analysis are designed for data with fine temporal resolution and cannot be applied to discontinuous trajectories. We present an approach utilizing Visual Analytics methods to explore and understand the temporal variation of spatial situations derived from episodic movement data by means of spatio-temporal aggregation. The situations are defined in terms of the presence of moving objects in different places and in terms of flows (collective movements) among the places. The approach, which combines interactive visual displays with clustering of the spatial situations, is presented using the example of a real dataset collected by Bluetooth sensors.
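
As a rough illustration of the spatio-temporal aggregation described in this abstract, the sketch below turns episodic records into presence counts per place and time interval and into flow counts between places. The record layout `(object_id, time, place)`, the function name `aggregate`, and the choice to assign a flow to the interval of the later observation are all assumptions made for this example; the paper may define these details differently.

```python
from collections import Counter, defaultdict


def aggregate(records, interval):
    """Aggregate (object_id, time, place) records by place and time interval.

    Returns two counters:
      presence[(slot, place)]     -> number of distinct objects seen there
      flows[(slot, origin, dest)] -> number of moves between two places,
                                     assigned to the slot of the later observation
    """
    presence, flows = Counter(), Counter()
    seen = set()
    by_object = defaultdict(list)
    for obj, t, place in records:
        slot = t // interval
        if (obj, slot, place) not in seen:  # count each object once per slot and place
            seen.add((obj, slot, place))
            presence[(slot, place)] += 1
        by_object[obj].append((t, place))
    for visits in by_object.values():
        visits.sort()  # order each object's observations by time
        for (_, origin), (t1, dest) in zip(visits, visits[1:]):
            if origin != dest:
                flows[(t1 // interval, origin, dest)] += 1
    return presence, flows


records = [
    ("a", 0, "A"), ("a", 50, "A"), ("a", 130, "B"),
    ("b", 10, "A"), ("b", 140, "B"),
    ("c", 20, "B"),
]
presence, flows = aggregate(records, interval=100)
```

Each time interval then yields one vector of presence and flow counts, which is one possible representation of a "spatial situation" that a clustering step could group by similarity. Nothing is interpolated between observations, in keeping with the point that positions inside the time gaps are unknown.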

City Research Online

Mining Structural Databases: An Evolutionary Multi-Objective Conceptual Clustering Methodology

Author: Cordón Óscar
Harari Óscar
Romero Zaliz Rocío
Rubio Escudero Cristina
Val Coral del
Zwir Igor
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

The increased availability of biological databases containing representations of complex objects permits access to vast amounts of data. In spite of the recent renewed interest in knowledge-discovery techniques (or data mining), there is a dearth of data analysis methods intended to facilitate understanding of the represented objects and related systems by their most representative features and those relationships derived from these features (i.e., structural data). In this paper we propose a conceptual clustering methodology termed EMO-CC for Evolutionary Multi-Objective Conceptual Clustering that uses multi-objective and multi-modal optimization techniques based on Evolutionary Algorithms that uncover representative substructures from structural databases. Besides, EMO-CC provides annotations of the uncovered substructures, and based on them, applies an unsupervised classification approach to retrieve new members of previously discovered substructures. We apply EMO-CC to the Gene Ontology database to recover interesting substructures that describe problems from different points of view and use them to explain immuno-inflammatory responses measured in terms of gene expression profiles derived from the analysis of longitudinal blood expression profiles of human volunteers treated with intravenous endotoxin compared to placebo.

idUS. Depósito de Investigación Universidad de Sevilla

Virtual Astronomy, Information Technology, and the New Scientific Methodology

Author: Djorgovski S. G.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005

All sciences, including astronomy, are now entering the era of information abundance. The exponentially increasing volume and complexity of modern data sets promise to transform scientific practice, but also pose a number of common technological challenges. The Virtual Observatory concept is the astronomical community's response to these challenges: it aims to harness the progress in information technology in the service of astronomy, and at the same time provide a valuable testbed for information technology and applied computer science. Challenges broadly fall into two categories: data handling (or "data farming"), including issues such as archives, intelligent storage, databases, interoperability, fast networks, etc., and data mining, data understanding, and knowledge discovery, which include issues such as automated clustering and classification, multivariate correlation searches, pattern recognition, visualization in highly hyperdimensional parameter spaces, etc., as well as various applications of machine learning in these contexts. Such techniques are forming a methodological foundation for science with massive and complex data sets in general, and are likely to have a much broader impact on modern society, commerce, information economy, security, etc. There is a powerful emerging synergy between the computationally enabled science and the science-driven computing, which will drive the progress in science, scholarship, and many other venues in the 21st century.

Crossref

Caltech Authors

Query processing of spatial objects: Complexity versus Redundancy

Author: A. Braun
B. Chazelle
B. Seeger
C. L. Lawson
F. P. Preparata
H.-P. Kriegel
J. Nievergelt
M. Schiwietz
R. Schneider
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/1993

The management of complex spatial objects in applications, such as geography and cartography, imposes stringent new requirements on spatial database systems, in particular on efficient query processing. As shown before, the performance of spatial query processing can be improved by decomposing complex spatial objects into simple components. Up to now, only decomposition techniques generating a linear number of very simple components, e.g. triangles or trapezoids, have been considered. In this paper, we will investigate the natural trade-off between the complexity of the components and the redundancy, i.e. the number of components, with respect to its effect on efficient query processing. In particular, we present two new decomposition methods generating a better balance between the complexity and the number of components than previously known techniques. We compare these new decomposition methods to the traditional undecomposed representation as well as to the well-known decomposition into convex polygons with respect to their performance in spatial query processing. This comparison points out that for a wide range of query selectivity the new decomposition techniques clearly outperform both the undecomposed representation and the convex decomposition method. More important than the absolute gain in performance by a factor of up to an order of magnitude is the robust performance of our new decomposition techniques over the whole range of query selectivity.

Crossref

Open Access LMU

Exploration of Parameter Spaces in a Virtual Observatory

Author: Brunner R.
Curkendall D.
Djorgovski S. G.
Granat R.
Jacob J.
Mahabal A.
Stolorz P.
Williams R.
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date: 01/01/2001

Like every other field of intellectual endeavor, astronomy is being revolutionised by the advances in information technology. There is an ongoing exponential growth in the volume, quality, and complexity of astronomical data sets, mainly through large digital sky surveys and archives. The Virtual Observatory (VO) concept represents a scientific and technological framework needed to cope with this data flood. Systematic exploration of the observable parameter spaces, covered by large digital sky surveys spanning a range of wavelengths, will be one of the primary modes of research with a VO. This is where the truly new discoveries will be made, and new insights be gained about the already known astronomical objects and phenomena. We review some of the methodological challenges posed by the analysis of large and complex data sets expected in the VO-based research. The challenges are driven by the size and the complexity of the data sets (billions of data vectors in parameter spaces of tens or hundreds of dimensions), by the heterogeneity of the data and measurement errors, including differences in basic survey parameters for the federated data sets (e.g., in the positional accuracy and resolution, wavelength coverage, time baseline, etc.), by various selection effects, and by the intrinsic clustering properties (functional form, topology) of the data distributions in the parameter spaces of observed attributes. Answering these challenges will require substantial collaborative efforts and partnerships between astronomers, computer scientists, and statisticians.

Comment: Invited review, 10 pages, Latex file with 4 eps figures, style files included. To appear in Proc. SPIE, v. 4477 (2001).

arXiv.org e-Print Archive