28,550 research outputs found
CSD: Discriminance with Conic Section for Improving Reverse k Nearest Neighbors Queries
The reverse nearest neighbor (RNN) query finds all points that have
the query point as one of their nearest neighbors (NN), where the NN
query finds the closest points to its query point. Based on the
characteristics of conic section, we propose a discriminance, named CSD (Conic
Section Discriminance), to determine points whether belong to the RNN set
without issuing any queries with non-constant computational complexity. By
using CSD, we also implement an efficient RNN algorithm CSD-RNN with a
computational complexity at . The comparative
experiments are conducted between CSD-RNN and other two state-of-the-art
RkNN algorithms, SLICE and VR-RNN. The experimental results indicate that
the efficiency of CSD-RNN is significantly higher than its competitors
Ontology-based explanation of classifiers
The rise of data mining and machine learning use in many applications has brought new challenges related to classification. Here, we deal with the following challenge: how to interpret and understand the reason behind a classifier's prediction. Indeed, understanding the behaviour of a classifier is widely recognized as a very important task for wide and safe adoption of machine learning and data mining technologies, especially in high-risk domains, and in dealing with bias.We present a preliminary work on a proposal of using the Ontology-Based Data Management paradigm for explaining the behavior of a classifier in terms of the concepts and the relations that are meaningful in the domain that is relevant for the classifier
Reverse spatial visual top-k query
With the wide application of mobile Internet techniques an location-based services (LBS), massive multimedia data with geo-tags has been generated and collected. In this paper, we investigate a novel type of spatial query problem, named reverse spatial visual top- query (RSVQ k ) that aims to retrieve a set of geo-images that have the query as one of the most relevant geo-images in both geographical proximity and visual similarity. Existing approaches for reverse top- queries are not suitable to address this problem because they cannot effectively process unstructured data, such as image. To this end, firstly we propose the definition of RSVQ k problem and introduce the similarity measurement. A novel hybrid index, named VR 2 -Tree is designed, which is a combination of visual representation of geo-image and R-Tree. Besides, an extension of VR 2 -Tree, called CVR 2 -Tree is introduced and then we discuss the calculation of lower/upper bound, and then propose the optimization technique via CVR 2 -Tree for further pruning. In addition, a search algorithm named RSVQ k algorithm is developed to support the efficient RSVQ k query. Comprehensive experiments are conducted on four geo-image datasets, and the results illustrate that our approach can address the RSVQ k problem effectively and efficiently
Source File Set Search for Clone-and-Own Reuse Analysis
Clone-and-own approach is a natural way of source code reuse for software
developers. To assess how known bugs and security vulnerabilities of a cloned
component affect an application, developers and security analysts need to
identify an original version of the component and understand how the cloned
component is different from the original one. Although developers may record
the original version information in a version control system and/or directory
names, such information is often either unavailable or incomplete. In this
research, we propose a code search method that takes as input a set of source
files and extracts all the components including similar files from a software
ecosystem (i.e., a collection of existing versions of software packages). Our
method employs an efficient file similarity computation using b-bit minwise
hashing technique. We use an aggregated file similarity for ranking components.
To evaluate the effectiveness of this tool, we analyzed 75 cloned components in
Firefox and Android source code. The tool took about two hours to report the
original components from 10 million files in Debian GNU/Linux packages. Recall
of the top-five components in the extracted lists is 0.907, while recall of a
baseline using SHA-1 file hash is 0.773, according to the ground truth recorded
in the source code repositories.Comment: 14th International Conference on Mining Software Repositorie
SQL Query Completion for Data Exploration
Within the big data tsunami, relational databases and SQL are still there and
remain mandatory in most of cases for accessing data. On the one hand, SQL is
easy-to-use by non specialists and allows to identify pertinent initial data at
the very beginning of the data exploration process. On the other hand, it is
not always so easy to formulate SQL queries: nowadays, it is more and more
frequent to have several databases available for one application domain, some
of them with hundreds of tables and/or attributes. Identifying the pertinent
conditions to select the desired data, or even identifying relevant attributes
is far from trivial. To make it easier to write SQL queries, we propose the
notion of SQL query completion: given a query, it suggests additional
conditions to be added to its WHERE clause. This completion is semantic, as it
relies on the data from the database, unlike current completion tools that are
mostly syntactic. Since the process can be repeated over and over again --
until the data analyst reaches her data of interest --, SQL query completion
facilitates the exploration of databases. SQL query completion has been
implemented in a SQL editor on top of a database management system. For the
evaluation, two questions need to be studied: first, does the completion speed
up the writing of SQL queries? Second , is the completion easily adopted by
users? A thorough experiment has been conducted on a group of 70 computer
science students divided in two groups (one with the completion and the other
one without) to answer those questions. The results are positive and very
promising
Km4City Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Presently, a very large number of public and private data sets are available
from local governments. In most cases, they are not semantically interoperable
and a huge human effort would be needed to create integrated ontologies and
knowledge base for smart city. Smart City ontology is not yet standardized, and
a lot of research work is needed to identify models that can easily support the
data reconciliation, the management of the complexity, to allow the data
reasoning. In this paper, a system for data ingestion and reconciliation of
smart cities related aspects as road graph, services available on the roads,
traffic sensors etc., is proposed. The system allows managing a big data volume
of data coming from a variety of sources considering both static and dynamic
data. These data are mapped to a smart-city ontology, called KM4City (Knowledge
Model for City), and stored into an RDF-Store where they are available for
applications via SPARQL queries to provide new services to the users via
specific applications of public administration and enterprises. The paper
presents the process adopted to produce the ontology and the big data
architecture for the knowledge base feeding on the basis of open and private
data, and the mechanisms adopted for the data verification, reconciliation and
validation. Some examples about the possible usage of the coherent big data
knowledge base produced are also offered and are accessible from the RDF-Store
and related services. The article also presented the work performed about
reconciliation algorithms and their comparative assessment and selection
High-Performance Reachability Query Processing under Index Size Restrictions
In this paper, we propose a scalable and highly efficient index structure for
the reachability problem over graphs. We build on the well-known node interval
labeling scheme where the set of vertices reachable from a particular node is
compactly encoded as a collection of node identifier ranges. We impose an
explicit bound on the size of the index and flexibly assign approximate
reachability ranges to nodes of the graph such that the number of index probes
to answer a query is minimized. The resulting tunable index structure generates
a better range labeling if the space budget is increased, thus providing a
direct control over the trade off between index size and the query processing
performance. By using a fast recursive querying method in conjunction with our
index structure, we show that in practice, reachability queries can be answered
in the order of microseconds on an off-the-shelf computer - even for the case
of massive-scale real world graphs. Our claims are supported by an extensive
set of experimental results using a multitude of benchmark and real-world
web-scale graph datasets.Comment: 30 page
- …