ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is open-source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and the plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
Efficient Similarity Indexing and Searching in High Dimensions
Efficient indexing and searching of high dimensional data has been an area of
active research due to the growing exploitation of high dimensional data and
the vulnerability of traditional search methods to the curse of dimensionality.
This paper presents a new approach for fast and effective searching and
indexing of high dimensional features using random partitions of the feature
space. Experiments on both handwritten digits and 3-D shape descriptors have
shown the proposed algorithm to be highly effective and efficient in indexing
and searching real data sets of several hundred dimensions. We also compare its
performance to that of the state-of-the-art locality-sensitive hashing
algorithm.
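The random-partition idea can be illustrated with a minimal sketch (not the paper's algorithm): hash each point by the signs of a few random projections, and at query time scan only the buckets that share the query's code, using several tables to improve recall. All names and parameters here are illustrative.

```python
import numpy as np

class RandomPartitionIndex:
    """Sketch: partition the feature space with random hyperplanes and
    hash points by the sign pattern of their projections. A query scans
    only the points whose code matches in at least one table."""

    def __init__(self, data, n_tables=8, n_bits=6, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, float)
        d = self.data.shape[1]
        # one set of random hyperplanes per table
        self.planes = rng.normal(size=(n_tables, n_bits, d))
        self.tables = []
        for t in range(n_tables):
            codes = self.data @ self.planes[t].T > 0
            table = {}
            for i, row in enumerate(codes):
                table.setdefault(row.tobytes(), []).append(i)
            self.tables.append(table)

    def query(self, q, k=1):
        q = np.asarray(q, float)
        cand = set()
        for t, table in enumerate(self.tables):
            code = (self.planes[t] @ q > 0).tobytes()
            cand.update(table.get(code, []))
        if not cand:                      # fall back to a full scan
            cand = range(len(self.data))
        cand = np.fromiter(cand, int)
        dist = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dist)[:k]].tolist()
```

Unlike classic LSH, which the paper compares against, this sketch keeps only exact-code collisions as candidates; multiple tables compensate for the coarse partitioning.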
Behavioral Simulations in MapReduce
In many scientific domains, researchers are turning to large-scale behavioral
simulations to better understand important real-world phenomena. While there
has been a great deal of work on simulation tools from the high-performance
computing community, behavioral simulations remain challenging to program and
automatically scale in parallel environments. In this paper we present BRACE
(Big Red Agent-based Computation Engine), which extends the MapReduce framework
to process these simulations efficiently across a cluster. We can leverage
spatial locality to treat behavioral simulations as iterated spatial joins and
greatly reduce the communication between nodes. In our experiments we achieve
nearly linear scale-up on several realistic simulations.
Though processing behavioral simulations in parallel as iterated spatial
joins can be very efficient, it can be much simpler for the domain scientists
to program the behavior of a single agent. Furthermore, many simulations
include a considerable amount of complex computation and message passing
between agents, which makes it important to optimize the performance of a
single node and the communication across nodes. To address both of these
challenges, BRACE includes a high-level language called BRASIL (the Big Red
Agent SImulation Language). BRASIL has object oriented features for programming
simulations, but can be compiled to a data-flow representation for automatic
parallelization and optimization. We show that by using various optimization
techniques, we can achieve both scalability and single-node performance similar
to that of a hand-coded simulation.
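The iterated-spatial-join view of a simulation tick can be sketched on a single machine (a toy, not BRACE itself): bucket agents into grid cells the size of the interaction radius, so each agent inspects only the 3x3 neighborhood of cells rather than every other agent.

```python
from collections import defaultdict

def spatial_join_step(agents, radius):
    """One tick as a spatial join: group agents by grid cell of side
    `radius`; each agent then moves halfway toward the centroid of its
    neighbours within `radius` (an illustrative behaviour rule)."""
    cell = lambda x, y: (int(x // radius), int(y // radius))
    grid = defaultdict(list)
    for i, (x, y) in enumerate(agents):
        grid[cell(x, y)].append(i)
    out = []
    for i, (x, y) in enumerate(agents):
        cx, cy = cell(x, y)
        nbrs = []
        # the "join": only 9 cells can contain agents within range
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if j != i:
                        jx, jy = agents[j]
                        if (jx - x) ** 2 + (jy - y) ** 2 <= radius ** 2:
                            nbrs.append((jx, jy))
        if nbrs:
            mx = sum(p[0] for p in nbrs) / len(nbrs)
            my = sum(p[1] for p in nbrs) / len(nbrs)
            out.append(((x + mx) / 2, (y + my) / 2))
        else:
            out.append((x, y))
    return out
```

In a MapReduce setting the cell id becomes the shuffle key, which is how spatial locality cuts inter-node communication.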
Spatial Indexing of Large Multidimensional Databases
Scientific endeavors such as large astronomical surveys generate databases on
the terabyte scale. These usually multidimensional databases must be
visualized and mined in order to find interesting objects or to extract
meaningful and qualitatively new relationships. Many statistical algorithms
required for these tasks run reasonably fast when operating on small sets of
in-memory data, but take noticeable performance hits when operating on large
databases that do not fit into memory. We utilize new software technologies to
develop and evaluate fast multidimensional indexing schemes that inherently
follow the underlying, highly non-uniform distribution of the data: they are
layered uniform grid indices, hierarchical binary space partitioning, and
sampled flat Voronoi tessellation of the data. Our working database is the
5-dimensional magnitude space of the Sloan Digital Sky Survey with more than
270 million data points, where we show that these techniques can dramatically
speed up data mining operations such as finding similar objects by example,
classifying objects or comparing extensive simulation sets with observations.
We are also developing tools to interact with the multidimensional database and
visualize the data at multiple resolutions in an adaptive manner.
Comment: 12 pages, 16 figures; CIDR 200
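Of the three indexing schemes, hierarchical binary space partitioning is the easiest to sketch. The toy below builds a kd-tree-style partition and answers axis-aligned range queries, pruning subtrees that cannot intersect the box (illustrative code, not the paper's implementation).

```python
def build_bsp(points, depth=0):
    """Hierarchical binary space partition: split on the median of one
    coordinate per level, so dense regions get deeper subdivision."""
    if len(points) <= 2:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build_bsp(pts[:mid], depth + 1),
            build_bsp(pts[mid:], depth + 1))

def range_query(tree, lo, hi):
    """Return points inside the axis-aligned box [lo, hi], visiting
    only subtrees that can intersect it."""
    if tree[0] == "leaf":
        return [p for p in tree[1]
                if all(l <= c <= h for c, l, h in zip(p, lo, hi))]
    _, axis, split, left, right = tree
    out = []
    if lo[axis] < split:
        out += range_query(left, lo, hi)
    if hi[axis] >= split:
        out += range_query(right, lo, hi)
    return out
```

Queries like "find similar objects by example" in the 5-dimensional magnitude space reduce to such box (or ball) searches against the index.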
GeoP2P: An adaptive peer-to-peer overlay for efficient search and update of spatial information
This paper proposes a fully decentralized peer-to-peer overlay structure
GeoP2P, to facilitate geographic location based search and retrieval of
information. Certain limitations of centralized geographic indexes favor
peer-to-peer organization of the information, which, in addition to avoiding
performance bottleneck, allows autonomy over local information. Peer-to-peer
systems for geographic or multidimensional range queries built on existing DHTs
suffer from the inaccuracy in linearization of the multidimensional space.
Other overlay structures that are based on hierarchical partitioning of the
search space are not scalable because they use special super-peers to represent
the nodes in the hierarchy. GeoP2P partitions the search space hierarchically,
maintains the overlay structure and performs the routing without the need
for any super-peers. Although similar fully-decentralized overlays have been
previously proposed, they lack the ability to dynamically grow and retract the
partition hierarchy when the number of peers changes. GeoP2P provides such
adaptive features with minimum perturbation of the system state. Such
adaptation makes both the routing delay and the state size of each peer
logarithmic to the total number of peers, irrespective of the size of the
multidimensional space. Our analysis also reveals that the overlay structure
and the routing algorithm are generic and independent of several aspects of the
partitioning hierarchy, such as the geometric shape of the zones or the
dimensionality of the search space.
Comment: 13 pages, Submitted to VLDB-2009 conference
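The hierarchical partitioning and prefix-style routing can be caricatured in a few lines. The quadrant labels and the hop-count model below are our illustrative assumptions, not GeoP2P's actual protocol; they only show why delay scales with tree depth rather than with the size of the space.

```python
def zone_of(point, depth, bounds=(0.0, 0.0, 1.0, 1.0)):
    """Label the quadtree zone containing `point` with one character
    per level of the partition hierarchy (hypothetical labelling)."""
    x0, y0, x1, y1 = bounds
    label = ""
    for _ in range(depth):
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        qx = "0" if point[0] < mx else "1"
        qy = "0" if point[1] < my else "1"
        # shrink the bounding box to the chosen quadrant
        x0, x1 = (x0, mx) if qx == "0" else (mx, x1)
        y0, y1 = (y0, my) if qy == "0" else (my, y1)
        label += {"00": "a", "10": "b", "01": "c", "11": "d"}[qx + qy]
    return label

def route_hops(src_zone, dst_zone):
    """Greedy prefix routing: hops = levels below the deepest common
    ancestor, so they grow with tree depth (log of the peer count),
    independent of the coordinate space's extent."""
    common = 0
    for a, b in zip(src_zone, dst_zone):
        if a != b:
            break
        common += 1
    return (len(src_zone) - common) + (len(dst_zone) - common)
```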
SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval
Hashing methods have been widely used for efficient similarity retrieval on
large-scale image databases. Traditional hashing methods learn hash functions to
generate binary codes from hand-crafted features, which achieve limited
accuracy since the hand-crafted features cannot optimally represent the image
content and preserve the semantic similarity. Recently, several deep hashing
methods have shown better performance because the deep architectures generate
more discriminative feature representations. However, these deep hashing
methods are mainly designed for supervised scenarios, which only exploit the
semantic similarity information, but ignore the underlying data structures. In
this paper, we propose the semi-supervised deep hashing (SSDH) approach, to
perform more effective hash function learning by simultaneously preserving
semantic similarity and underlying data structures. The main contributions are
as follows: (1) We propose a semi-supervised loss to jointly minimize the
empirical error on labeled data, as well as the embedding error on both labeled
and unlabeled data, which can preserve the semantic similarity and capture the
meaningful neighbors on the underlying data structures for effective hashing.
(2) A semi-supervised deep hashing network is designed to extensively exploit
both labeled and unlabeled data, in which we propose an online graph
construction method to benefit from the evolving deep features during training
to better capture semantic neighbors. To the best of our knowledge, the
proposed deep network is the first deep hashing method that can perform hash
code learning and feature learning simultaneously in a semi-supervised fashion.
Experimental results on 5 widely-used datasets show that our proposed approach
outperforms the state-of-the-art hashing methods.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology
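A minimal sketch of the two-term semi-supervised objective (illustrative, not the paper's exact loss): an empirical term over labeled pairs, plus an embedding term that keeps graph neighbours close whether labeled or not. The margin value and averaging scheme are our assumptions.

```python
import numpy as np

def semi_supervised_hash_loss(codes, labels, knn_graph, alpha=1.0):
    """Sketch of a semi-supervised hashing loss.
    codes:     relaxed (real-valued) hash codes, one row per point
    labels:    class label per point, or None if unlabeled
    knn_graph: list of neighbour-index lists (the online graph idea)
    The supervised term pulls same-label codes together and pushes
    different-label codes apart (margin 4.0, an assumption); the
    embedding term pulls graph neighbours together regardless of
    labels. Total = supervised + alpha * embedding."""
    codes = np.asarray(codes, float)
    n = len(codes)
    sup, n_sup = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is None or labels[j] is None:
                continue
            d = float(np.sum((codes[i] - codes[j]) ** 2))
            sup += d if labels[i] == labels[j] else max(0.0, 4.0 - d)
            n_sup += 1
    emb, n_emb = 0.0, 0
    for i, nbrs in enumerate(knn_graph):
        for j in nbrs:
            emb += float(np.sum((codes[i] - codes[j]) ** 2))
            n_emb += 1
    sup = sup / n_sup if n_sup else 0.0
    emb = emb / n_emb if n_emb else 0.0
    return sup + alpha * emb
```

In the paper this objective trains a deep network end to end; here the codes are plain vectors so the structure of the loss is visible.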
Improved Density-Based Spatio-Textual Clustering on Social Media
DBSCAN may not be sufficient when the input data type is heterogeneous in
terms of textual description. When we aim to discover clusters of geo-tagged
records relevant to a particular point-of-interest (POI) on social media,
examining only one type of input data (e.g., the tweets relevant to a POI) may
draw an incomplete picture of clusters due to noisy regions. To overcome this
problem, we introduce DBSTexC, a newly defined density-based clustering
algorithm using spatio-textual information. We first characterize POI-relevant
and POI-irrelevant tweets as the texts that do and do not include a POI
name or its semantically coherent variations, respectively. By leveraging the
proportion of POI-relevant and POI-irrelevant tweets, the proposed algorithm
demonstrates much higher clustering performance than DBSCAN in terms
of the F1 score and its variants. While DBSTexC performs exactly the same as
DBSCAN on textually homogeneous inputs, it far outperforms DBSCAN on
textually heterogeneous inputs. Furthermore, to improve the clustering
quality by fully capturing the geographic distribution of tweets, we
present fuzzy DBSTexC (F-DBSTexC), an extension of DBSTexC that incorporates
the notion of fuzzy clustering. We then demonstrate the robustness of
F-DBSTexC via intensive experiments. The computational complexity of our
algorithms is also analytically and numerically shown.
Comment: 14 pages, 10 figures, 6 tables, Submitted for publication to the IEEE Transactions on Knowledge and Data Engineering
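The relevant/irrelevant proportion idea can be sketched as a DBSCAN variant whose core condition counts the two tweet types separately: a point is core only if its neighbourhood holds enough POI-relevant points and few enough POI-irrelevant ones. The parameters and thresholds below are illustrative, not the paper's definitions.

```python
def dbstexc_sketch(points, relevant, eps, min_rel, max_irr):
    """Density clustering over 2-D points with a boolean relevance
    flag per point. Clusters grow only through relevant points whose
    eps-neighbourhood satisfies the two-sided density condition."""
    n = len(points)

    def nbrs(i):
        px, py = points[i]
        return [j for j in range(n)
                if (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2
                <= eps ** 2]

    labels = [None] * n          # None = noise / unassigned
    cid = 0
    for i in range(n):
        if labels[i] is not None or not relevant[i]:
            continue
        nb = nbrs(i)
        rel = sum(1 for j in nb if relevant[j])
        irr = len(nb) - rel
        if rel < min_rel or irr > max_irr:
            continue             # not a core point
        labels[i] = cid          # grow a cluster from this core point
        stack = [j for j in nb if relevant[j]]
        while stack:
            j = stack.pop()
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb2 = nbrs(j)
            rel2 = sum(1 for k in nb2 if relevant[k])
            irr2 = len(nb2) - rel2
            if rel2 >= min_rel and irr2 <= max_irr:
                stack += [k for k in nb2 if relevant[k]]
        cid += 1
    return labels
```

With every point marked relevant and max_irr unbounded this degenerates to ordinary DBSCAN, matching the abstract's claim about textually homogeneous inputs.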
Efficient techniques for mining spatial databases
Clustering is one of the major tasks in data mining. In the last few years,
clustering of spatial data has received a lot of research attention. Spatial
databases are components of many advanced information systems such as
geographic information systems and VLSI design systems. In this thesis, we
introduce several efficient algorithms for clustering spatial data. First, we
present a grid-based clustering algorithm with performance comparable to the
most efficient well-known clustering algorithms. The algorithm has several
advantages. It does not require many input parameters: only three, the number
of points in the data space, the number of cells in the grid, and a
percentage. The number of cells in the grid reflects the accuracy that should
be achieved by the algorithm. The algorithm is capable of discovering clusters
of arbitrary shapes, and its computational complexity is comparable to that of
the most efficient clustering algorithms. The algorithm has been implemented
and tested against different ranges of database sizes. The performance results
show that the running time of the algorithm is superior to that of the most
well-known algorithms (CLARANS [23]). The results also show that the
performance of the algorithm does not degrade as the number of data points
increases.
Comment: 112 pages; M.Sc. thesis, Department of Mathematics, Faculty of Science, Cairo University, 200
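A rough sketch of such a grid-based scheme, based on our reading of the abstract rather than the thesis's actual algorithm: bucket points into a uniform grid, keep cells holding at least a given percentage of the points, and merge adjacent dense cells into clusters, which naturally yields arbitrary shapes.

```python
from collections import defaultdict

def grid_cluster(points, n_cells, pct):
    """Grid-based clustering sketch. A cell is 'dense' if it holds at
    least pct * len(points) points; 4-adjacent dense cells are merged
    by flood fill. Returns a cluster id per point (None = noise)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    w = (x1 - x0) / n_cells or 1.0
    h = (y1 - y0) / n_cells or 1.0
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cx = min(int((x - x0) / w), n_cells - 1)
        cy = min(int((y - y0) / h), n_cells - 1)
        cells[(cx, cy)].append(i)
    dense = {c for c, idx in cells.items()
             if len(idx) >= pct * len(points)}
    labels, cid = {}, 0
    for c in sorted(dense):              # flood-fill dense cells
        if c in labels:
            continue
        stack = [c]
        while stack:
            cur = stack.pop()
            if cur in labels or cur not in dense:
                continue
            labels[cur] = cid
            cx, cy = cur
            stack += [(cx + 1, cy), (cx - 1, cy),
                      (cx, cy + 1), (cx, cy - 1)]
        cid += 1
    out = [None] * len(points)
    for c, idx in cells.items():
        if c in dense:
            for i in idx:
                out[i] = labels[c]
    return out
```

The three parameters mirror those in the abstract: the point count is implicit in the input, `n_cells` controls accuracy, and `pct` is the density percentage.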
Anisotropic k-Nearest Neighbor Search Using Covariance Quadtree
We present a variant of the hyper-quadtree that divides a multidimensional
space according to the hyperplanes associated with the principal components of
the data in each hyper-quadrant. Each hyper-quadrant is a data partition in a
lower-dimensional subspace, whose intrinsic dimensionality is reduced from the
root dimensionality by principal component analysis, which discards the
irrelevant eigenvalues of the local covariance matrix. In the present method a
component is irrelevant if its length is smaller than, or comparable to, the
local inter-data spacing. Thus, the covariance hyper-quadtree is fully adaptive
to the local dimensionality. The proposed data structure is used to compute the
anisotropic k-nearest neighbors (kNN), supported by the Mahalanobis metric.
application, we used the present k nearest neighbors method to perform density
estimation over a noisy data distribution. Such estimation method can be
further incorporated to the smoothed particle hydrodynamics, allowing computer
simulations of anisotropic fluid flows.
Comment: Work presented at the Minisymposia of Computational Geometry in the joint events IX Argentinian Congress on Computational Mechanics, XXXI Iberian-Latin-American Congress on Computational Methods in Engineering, II South American Congress on Computational Mechanics, held in Buenos Aires, 15-18 November 2010; Mecánica Computacional (Computational Mechanics) Vol. XXIX, 2010, ISSN 1666-607
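The anisotropic-neighbour idea reduces to ranking points by Mahalanobis distance under a covariance estimate, so the neighbourhood stretches along the principal components instead of being a Euclidean sphere. A brute-force sketch, without the quadtree acceleration the paper builds:

```python
import numpy as np

def mahalanobis_knn(data, q, k):
    """Return the indices of the k nearest neighbours of q under the
    Mahalanobis metric induced by the data's covariance matrix. A tiny
    jitter keeps the covariance invertible for degenerate samples."""
    data = np.asarray(data, float)
    cov = np.cov(data, rowvar=False)
    inv = np.linalg.inv(cov + 1e-9 * np.eye(data.shape[1]))
    diff = data - np.asarray(q, float)
    # squared Mahalanobis distance per point: diff_i^T inv diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.argsort(d2)[:k].tolist()
```

In the paper the covariance (and hence the metric) is local to each hyper-quadrant; here a single global covariance keeps the sketch short.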
The Sloan Digital Sky Survey and its Archive
The next-generation astronomy archives will cover most of the universe at
fine resolution in many wavelengths. One of the first of these projects, the
Sloan Digital Sky Survey (SDSS) will create a 5-wavelength catalog over 10,000
square degrees of the sky. The 200 million objects in the multi-terabyte
database will have mostly numerical attributes, defining a space of 100+
dimensions. Points in this space have highly correlated distributions. The
archive will enable astronomers to explore the data interactively. Data access
will be aided by multidimensional spatial indices. The data will be partitioned
in many ways. Small tag objects consisting of the most popular attributes speed
up frequent searches. Splitting the data among multiple servers enables
parallel, scalable I/O. Hashing techniques allow efficient clustering and
pairwise comparison algorithms. Randomly sampled subsets allow debugging
otherwise large queries at the desktop. Central servers will operate a data
pump that supports sweeping searches that touch most of the data.
Comment: 10 pages, ADASS '99 conference
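Two of the listed techniques are easy to sketch: "tag objects" as a narrow projection of the most popular attributes, and random sampling for cheap query debugging before a full sweep. The attribute names below are made up for illustration.

```python
import random

def make_tags(objects, popular_keys):
    """'Tag objects': narrow copies holding only the most-queried
    attributes, so frequent searches scan far less data than the
    full multi-terabyte records."""
    return [{k: o[k] for k in popular_keys} for o in objects]

def sample_debug(objects, frac, seed=0):
    """Debug a candidate query on a small random subset first, as the
    abstract suggests, before running it against the whole archive."""
    rng = random.Random(seed)
    return [o for o in objects if rng.random() < frac]
```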