Spatial database implementation of fuzzy region connection calculus for analysing the relationship of diseases
Analyzing huge amounts of spatial data plays an important role in many
emerging analysis and decision-making domains such as healthcare, urban
planning, agriculture and so on. For extracting meaningful knowledge from
geographical data, the relationships between spatial data objects need to be
analyzed. An important class of such relationships are topological relations
like the connectedness or overlap between regions. While real-world
geographical regions such as lakes or forests do not have exact boundaries and
are fuzzy, most of the existing analysis methods neglect this inherent feature
of topological relations. In this paper, we propose a method for handling the
topological relations in spatial databases based on fuzzy region connection
calculus (RCC). The proposed method is implemented in PostGIS spatial database
and evaluated in analyzing the relationship of diseases as an important
application domain. We also used our fuzzy RCC implementation for fuzzification
of the skyline operator in spatial databases. The results of the evaluation
show that our method provides a more realistic view of spatial relationships
and gives more flexibility to the data analyst to extract meaningful and
accurate results in comparison with the existing methods.
Comment: ICEE201
Multi-Source Spatial Entity Linkage
Besides the traditional cartographic data sources, spatial information can
also be derived from location-based sources. However, even though different
location-based sources refer to the same physical world, each one has only
partial coverage of the spatial entities, describes them with different
attributes, and sometimes provides contradicting information. Hence, we
introduce the spatial entity linkage problem, which is to find the pairs of
spatial entities that refer to the same physical spatial entity. Our proposed
solution (QuadSky) starts with a time-efficient spatial blocking technique
(QuadFlex), compares pairwise the spatial entities in the same block, ranks the
pairs using Pareto optimality with the SkyRank algorithm, and finally,
classifies the pairs with our novel SkyEx-* family of algorithms that yield
0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs
and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of
777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the
SkyEx-FES algorithm that explores only 27% of the skylines without any loss in
F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates
the optimal result with an F-measure loss of just 0.01. Finally, QuadSky
provides the best trade-off between precision and recall, and the best
F-measure compared to the existing baselines and clustering techniques, and
approximates the results of supervised learning solutions.
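To make the ranking step in the abstract above more concrete, here is a minimal sketch of ranking candidate pairs by Pareto optimality, peeling off successive skylines. It is not the SkyRank or SkyEx-* algorithm from the paper; the pair identifiers, the similarity scores, and the use of two score dimensions are hypothetical, and higher scores are assumed to be better.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """a dominates b if a is >= b in every dimension and > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_layers(pairs: Dict[str, Scores]) -> List[List[str]]:
    """Repeatedly extract the set of non-dominated pairs (one skyline per layer)."""
    remaining = dict(pairs)
    layers: List[List[str]] = []
    while remaining:
        # A pair belongs to the current layer if nothing left dominates it.
        layer = sorted(
            key
            for key, scores in remaining.items()
            if not any(dominates(other, scores) for other in remaining.values())
        )
        layers.append(layer)
        for key in layer:
            del remaining[key]
    return layers


# Hypothetical similarity scores per candidate pair: (name similarity, address similarity)
candidate_pairs = {
    "p1": (0.90, 0.80),
    "p2": (0.70, 0.95),
    "p3": (0.60, 0.60),
    "p4": (0.65, 0.50),
    "p5": (0.20, 0.10),
}

layers = skyline_layers(candidate_pairs)
```

Pairs in earlier layers are the stronger linkage candidates under this ordering; how many layers to accept as matches is a separate decision that this sketch does not address.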
Efficient Computation of Subspace Skyline over Categorical Domains
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed
the way we search for accommodation, restaurants, etc. The underlying datasets
in such applications have numerous attributes that are mostly Boolean or
Categorical. Discovering the skyline of such datasets over a subset of
attributes would identify entries that stand out while enabling numerous
applications. There are only a few algorithms designed to compute the skyline
over categorical attributes, and they are applicable only when the number of
attributes is small.
In this paper, we place the problem of skyline discovery over categorical
attributes into perspective and design efficient algorithms for two cases. (i)
In the absence of indices, we propose two algorithms, ST-S and ST-P, that
exploit the categorical characteristics of the datasets, organizing tuples in
a tree data structure, supporting efficient dominance tests over the candidate
set. (ii) We then consider the existence of widely used precomputed sorted
lists. After discussing several approaches, and studying their limitations, we
propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists.
Moreover, we further optimize TA-SKY and explore its progressive nature, making
it suitable for applications with strict interactive requirements. In addition
to the extensive theoretical analysis of the proposed algorithms, we conduct a
comprehensive experimental evaluation on a combination of real (including the
entire AirBnB data collection) and synthetic datasets to study the practicality
of the proposed algorithms. The results showcase the superior performance of
our techniques, outperforming applicable approaches by orders of magnitude.
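As a point of reference for the problem described above, the following is a minimal sketch of a subspace skyline over Boolean attributes using a naive pairwise dominance test. It is not ST-S, ST-P, or TA-SKY; the listings, the attribute names, and the convention that `True` is preferred over `False` are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def dominates(a: Row, b: Row, attrs: Sequence[str]) -> bool:
    """a dominates b on attrs if a is at least as good everywhere and
    strictly better somewhere (assumption: True is preferred to False)."""
    at_least_as_good = all(a[x] >= b[x] for x in attrs)
    strictly_better = any(a[x] > b[x] for x in attrs)
    return at_least_as_good and strictly_better


def subspace_skyline(rows: List[Row], attrs: Sequence[str]) -> List[str]:
    """Naive O(n^2) skyline restricted to the chosen subset of attributes."""
    return [
        r["name"]
        for r in rows
        if not any(dominates(other, r, attrs) for other in rows)
    ]


# Hypothetical listings with Boolean attributes
listings = [
    {"name": "A", "wifi": True, "parking": False, "kitchen": True},
    {"name": "B", "wifi": True, "parking": True, "kitchen": False},
    {"name": "C", "wifi": False, "parking": False, "kitchen": True},
    {"name": "D", "wifi": True, "parking": True, "kitchen": True},
]

on_wifi_parking = subspace_skyline(listings, ["wifi", "parking"])
on_kitchen = subspace_skyline(listings, ["kitchen"])
on_all = subspace_skyline(listings, ["wifi", "parking", "kitchen"])
```

The quadratic scan is only meant to pin down what "skyline over a subset of attributes" means; with few distinct values per attribute many tuples tie, which is why changing the subspace can change the result set substantially.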
Elastic-PPQ: A two-level autonomic system for spatial preference query processing over dynamic data streams
Paradigms like the Internet of Things and the more recent Internet of Everything are shifting attention towards systems able to process unbounded sequences of items in the form of data streams. In the real world, data streams may be highly variable, exhibiting burstiness in the arrival rate and non-stationarities such as trends and cyclic behaviors. Furthermore, input items may not be ordered according to timestamps. This raises the complexity of stream processing systems, which must support elastic resource management and autonomic QoS control through sophisticated strategies and run-time mechanisms. In this paper we present Elastic-PPQ, a system for processing spatial preference queries over dynamic data streams. The key aspect of the system design is the existence of two adaptation levels handling workload variations at different time-scales. To address fast time-scale variations we design a fine regulatory mechanism of load balancing supported by a control-theoretic approach. The logic of the second adaptation level, targeting slower time-scale variations, is incorporated in a Fuzzy Logic Controller that makes scale in/out decisions on the system parallelism degree. The approach has been successfully evaluated on synthetic and real-world datasets.
SkyLens: Visual analysis of skyline on multi-dimensional data
Skyline queries have wide-ranging applications in fields that involve
multi-criteria decision making, including tourism, retail industry, and human
resources. By automatically removing incompetent candidates, skyline queries
allow users to focus on a subset of superior data items (i.e., the skyline),
thus reducing the decision-making overhead. However, users are still required
to interpret and compare these superior items manually before making a
successful choice. This task is challenging because of two issues. First,
people usually have fuzzy, unstable, and inconsistent preferences when
presented with multiple candidates. Second, skyline queries do not reveal the
reasons for the superiority of certain skyline points in a multi-dimensional
space. To address these issues, we propose SkyLens, a visual analytic system
aiming at revealing the superiority of skyline points from different
perspectives and at different scales to aid users in their decision making. Two
scenarios demonstrate the usefulness of SkyLens on two datasets with a dozen
attributes. A qualitative study is also conducted to show that users can
efficiently accomplish skyline understanding and comparison tasks with SkyLens.
Comment: 10 pages. Accepted for publication at IEEE VIS 2017 (in proceedings
of VAST)
Energy-Efficient β
Because the first priority of query processing in wireless sensor networks is to save the limited energy of sensor nodes, and because in many sensing applications a part of the skyline result is enough to meet the user's requirement, calculating the exact skyline is comparatively energy-inefficient. Therefore, this paper proposes a new approximate skyline query, the β-approximate skyline query, which is limited by a guaranteed error bound. To reduce the communication cost of evaluating β-approximate skyline queries, we also propose an energy-efficient processing algorithm using mapping and filtering strategies, named Actual Approximate Skyline (AAS). Moreover, an extended algorithm named Hypothetical Approximate Skyline (HAS), which replaces the real tuples with hypothetical ones, is proposed to further reduce the communication cost. Extensive experiments on synthetic data have demonstrated the efficiency and effectiveness of our proposed approaches under various experimental settings.
Coping with new Challenges in Clustering and Biomedical Imaging
Recent years have seen a tremendous increase in data acquisition in different scientific fields such as molecular biology, bioinformatics, and biomedicine. Therefore, novel methods are needed for automatic processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups so as to maximize the intra-cluster similarity and to minimize the inter-cluster similarity. In contrast to unsupervised learning such as clustering, classification is a supervised learning problem that aims to predict the group membership of data objects on the basis of rules learned from a training set where the group membership is known.
Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with problems of conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle, which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way the search space is explored more effectively.
Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications.
In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering, and classification. As a result, a set of highly selective features discriminating patients with Alzheimer's disease from healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. Therefore, we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.