On Feeding Business Systems with Linked Resources from the Web of Data
Business systems that are fed with data from the Web of Data require transparent interoperability. The Linked Data principles establish that, to this end, different resources that represent the same real-world entity must be linked. Link rules are paramount to transparent interoperability, since they are what produce the links between resources. State-of-the-art link rules are learnt by genetic programming and build on comparing the values of the attributes of the resources. Unfortunately, this approach falls short when resources have similar values for their attributes but represent different real-world entities. In this paper, we present a proposal that combines genetic programming, which learns the link rules, with an ad-hoc filtering technique that boosts them by deciding whether the links they produce should be selected. Our analysis of the literature reveals that our approach is novel, and our experimental analysis confirms that it improves the F1 score by increasing precision without a significant penalty on recall. (Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-)
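To make the idea concrete, here is a minimal sketch in Python, under assumptions the abstract does not state: resources are plain attribute dictionaries, the learnt rule is a simple averaged string similarity, and the filter is a margin test over candidate scores. All names, attributes, and thresholds are illustrative, not the paper's actual rules or filter.

```python
from difflib import SequenceMatcher

def attr_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two attribute values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_rule(res1: dict, res2: dict, attrs=("name", "label")) -> float:
    """Toy link rule: average similarity over shared attributes.
    Real rules are expression trees learnt by genetic programming."""
    sims = [attr_similarity(res1[a], res2[a]) for a in attrs
            if a in res1 and a in res2]
    return sum(sims) / len(sims) if sims else 0.0

def filtered_link(source, targets, threshold=0.9, margin=0.05):
    """Illustrative filter: accept the best-scoring target only if it
    clears the threshold AND beats the runner-up by a clear margin,
    suppressing links between similar-but-distinct entities."""
    scored = sorted(((link_rule(source, t), t) for t in targets),
                    key=lambda x: x[0], reverse=True)
    if not scored:
        return None
    best = scored[0]
    runner_up = scored[1] if len(scored) > 1 else (0.0, None)
    if best[0] >= threshold and best[0] - runner_up[0] >= margin:
        return best[1]
    return None
```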
Similarity Join for Low- and High-Dimensional Data
The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins, suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and the EGO*-join, and study their performance in comparison to the state-of-the-art EGO-join algorithm and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical ε parameter. We also point out the significance of choosing this parameter so that the selectivity achieved is reasonable. The proposed EGO*-join algorithm always, often significantly, outperforms the EGO-join. For low-dimensional data the Grid-join outperforms both the EGO- and EGO*-joins. An analysis of the cost of the Grid-join is presented and highly accurate cost-estimator functions are developed. These are used to choose an appropriate grid size for optimal performance and can also be used by a query optimizer to compute the estimated cost of the Grid-join.
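A hedged sketch of the grid idea behind such ε-joins (illustrative, not the paper's Grid-join implementation): points are hashed into cells of side ε, so each probe point only needs to be compared against the 3×3 neighborhood of its cell. This also hints at why grids suit low dimensions: the neighborhood grows as 3^d.

```python
import math
from collections import defaultdict
from itertools import product

def grid_epsilon_join(points_a, points_b, eps):
    """Report all pairs (p, q), p in points_a, q in points_b, with
    Euclidean distance <= eps. Points are 2-D tuples. With cell side
    eps, every match lies in the 3x3 neighborhood of p's cell."""
    grid = defaultdict(list)
    for q in points_b:
        cell = (int(q[0] // eps), int(q[1] // eps))
        grid[cell].append(q)

    result = []
    for p in points_a:
        cx, cy = int(p[0] // eps), int(p[1] // eps)
        for dx, dy in product((-1, 0, 1), repeat=2):
            for q in grid.get((cx + dx, cy + dy), ()):
                if math.dist(p, q) <= eps:
                    result.append((p, q))
    return result
```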
Efficient querying of constantly evolving data
This thesis addresses important challenges in the emerging areas of sensor (streaming data) databases and moving object databases. It focuses on the important class of applications characterized by (a) constant change in the data values; (b) long-running (continuous) queries that have to be repeatedly evaluated as the data changes; (c) inherent imprecision in the data; and (d) the need for near-real-time results. The thesis addresses the scalability and performance challenges faced by these applications. The first part of the thesis studies the problem of scalable, efficient processing of continuous range queries on moving objects. We introduce two novel, highly scalable solutions to the problem: a disk-based technique called Velocity Constrained Indexing (VCI) and an in-memory technique called grid indexing. VCI is a technique for maintaining an index on moving objects that allows the index to remain useful without being constantly updated as the data values change. For in-memory settings, we show the superiority of our grid indexing solution over other methods. The second part of the thesis covers the problem of similarity joins for low- and high-dimensional data. Two new similarity join algorithms are introduced: the Grid-join for low-dimensional data and the EGO*-join for high-dimensional data. Both show substantial improvement over the state-of-the-art similarity join algorithms in their respective domains. Finally, the third part of the thesis presents an analysis of, and novel solutions to, the important problem of handling the uncertainty inherent in environments with constantly changing data. Probabilistic queries are introduced, and a classification of queries is developed based on the nature of the query result set. Algorithms are provided for solving typical probabilistic queries from each class. We show that, unlike standard queries, probabilistic queries have a notion of quality of answer. We introduce several metrics for measuring this quality as well as various update policies for improving it.
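As an illustration of the VCI idea (a sketch under assumptions, not the thesis's actual index structure): if every object's maximum speed is known, a range query at time t_now can be answered against positions indexed at an earlier time t_index by expanding the query rectangle by v_max · (t_now − t_index). The stale index then returns a guaranteed superset of candidates, which is refined with fresh positions, so the index stays useful without constant updates.

```python
def vci_candidates(index_entries, query_rect, v_max, t_index, t_now):
    """Velocity-constrained filtering sketch (illustrative names).
    index_entries: iterable of (obj_id, x, y) recorded at t_index.
    query_rect: (xmin, ymin, xmax, ymax) range query issued at t_now.
    Expanding the query by the maximum possible displacement since
    indexing guarantees that no qualifying moving object is missed."""
    r = v_max * (t_now - t_index)  # maximum displacement since indexing
    xmin, ymin, xmax, ymax = query_rect
    return [oid for oid, x, y in index_entries
            if xmin - r <= x <= xmax + r and ymin - r <= y <= ymax + r]
```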
Querying Imprecise Data in Moving Object Environments
In moving object environments it is infeasible for the database tracking the movement of objects to store the exact locations of objects at all times. Typically, the location of an object is known with certainty only at the time of the update; the uncertainty in its location increases until the next update. In this environment, it is possible for queries to produce incorrect results based upon old data. However, if the degree of uncertainty is controlled, the error in the answers to certain queries can be reduced. More generally, query answers can be augmented with probabilistic estimates of their validity. In this paper we study the execution of such probabilistic nearest-neighbor queries. The imprecision in the answers to these queries is an inherent property of such applications, owing to uncertainty in the data, unlike approximate nearest-neighbor techniques that trade accuracy for performance.
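The following Monte Carlo sketch conveys the flavour of a probabilistic nearest-neighbor answer. It is illustrative only: the sampling approach and the uniform-disc uncertainty model are assumptions of this sketch, not the paper's evaluation technique, which computes the probabilities from the uncertainty regions themselves.

```python
import math
import random
from collections import Counter

def prob_nearest_neighbor(query, objects, n_samples=10_000, rng=random):
    """Estimate P(object is nearest neighbor of `query`) under location
    uncertainty. objects: dict obj_id -> (cx, cy, radius); the true
    position is assumed uniform in that disc (an illustrative model)."""
    if not objects:
        return {}
    wins = Counter()
    for _ in range(n_samples):
        best_id, best_d = None, math.inf
        for oid, (cx, cy, r) in objects.items():
            # draw a uniform point in the uncertainty disc
            ang = rng.uniform(0, 2 * math.pi)
            rad = r * math.sqrt(rng.random())
            x, y = cx + rad * math.cos(ang), cy + rad * math.sin(ang)
            d = math.dist(query, (x, y))
            if d < best_d:
                best_id, best_d = oid, d
        wins[best_id] += 1
    return {oid: wins[oid] / n_samples for oid in objects}
```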
Evaluating Probabilistic Queries over Imprecise Data
Many applications employ sensors for monitoring entities such as temperature and wind speed. A centralized database tracks these entities to enable query processing. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), it is often infeasible to store the exact values at all times. A similar situation exists for moving object environments that track the constantly changing locations of objects. In this environment, it is possible for database queries to produce incorrect or invalid results based upon old data. However, if the degree of error (or uncertainty) between the actual value and the database value is controlled, we can place more confidence in the answers to queries. More generally, query answers can be augmented with probabilistic estimates of the validity of the answers. In this chapter we study probabilistic query evaluation based upon uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies.
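As a concrete instance of one such query class, here is a minimal sketch assuming a uniform uncertainty pdf (an illustrative model chosen for this sketch): a probabilistic range query over 1-D sensor values returns, per sensor, the probability that its true value lies within the query interval.

```python
def prob_range_query(sensors, lo, hi):
    """Probabilistic range query sketch over 1-D uncertain values.
    sensors: dict sensor_id -> (l, u), the uncertainty interval around
    the last reported value, assumed uniformly distributed.
    Returns sensor_id -> P(true value in [lo, hi]), omitting zeros."""
    answer = {}
    for sid, (l, u) in sensors.items():
        if u == l:  # exact value: probability is 0 or 1
            p = 1.0 if lo <= l <= hi else 0.0
        else:       # uniform pdf: probability = overlap / interval width
            p = max(0.0, min(hi, u) - max(lo, l)) / (u - l)
        if p > 0:
            answer[sid] = p
    return answer
```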