CSD: Discriminance with Conic Section for Improving Reverse k Nearest Neighbors Queries

Author: Bai Mingyuan
Gao Junbin
Li Yang
Liu Gang
Ming Zi
Ye Lixin
Publication venue
Publication date: 18/05/2020
Field of study

The reverse k nearest neighbor (RkNN) query finds all points that have the query point as one of their k nearest neighbors (kNN), where the kNN query finds the k closest points to its query point. Based on the characteristics of conic sections, we propose a discriminance, named CSD (Conic Section Discriminance), to determine whether points belong to the RkNN set without issuing any queries with non-constant computational complexity. By using CSD, we also implement an efficient RkNN algorithm, CSD-RkNN, with a computational complexity of $O(k^{1.5} \cdot \log k)$. Comparative experiments are conducted between CSD-RkNN and two other state-of-the-art RkNN algorithms, SLICE and VR-RkNN. The experimental results indicate that the efficiency of CSD-RkNN is significantly higher than that of its competitors.

arXiv.org e-Print Archive
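
To make the RkNN definition in the abstract above concrete, here is a minimal brute-force sketch in Python. It is not the CSD-RkNN algorithm (the abstract gives no details of it); it only restates the definition directly, and the function names `knn` and `rknn_brute_force` are hypothetical.

```python
from math import dist


def knn(points, q_idx, k):
    """Indices of the k points closest to points[q_idx], excluding itself."""
    others = [i for i in range(len(points)) if i != q_idx]
    others.sort(key=lambda i: dist(points[i], points[q_idx]))
    return others[:k]


def rknn_brute_force(points, q_idx, k):
    """Indices of all points whose k nearest neighbors include points[q_idx].

    This runs one kNN query per candidate point, which is exactly the cost
    that dedicated RkNN algorithms try to avoid.
    """
    return [
        i
        for i in range(len(points))
        if i != q_idx and q_idx in knn(points, i, k)
    ]


# Four points on a line; the query is the point at x = 1.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(rknn_brute_force(points, 1, 1))  # [0]
```

With k = 1, only the point at x = 0 has the query as its nearest neighbor, so the result contains index 0 alone.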

Ontology-based explanation of classifiers

Author: Catarci T.
Cima G.
Croce F.
Lenzerini M.
Publication venue: CEUR-WS
Publication date: 01/01/2020
Field of study

The rise of data mining and machine learning use in many applications has brought new challenges related to classification. Here, we deal with the following challenge: how to interpret and understand the reason behind a classifier's prediction. Indeed, understanding the behaviour of a classifier is widely recognized as a very important task for wide and safe adoption of machine learning and data mining technologies, especially in high-risk domains, and in dealing with bias. We present preliminary work on a proposal to use the Ontology-Based Data Management paradigm for explaining the behavior of a classifier in terms of the concepts and relations that are meaningful in the domain relevant to the classifier.

Archivio della ricerca- Università di Roma La Sapienza

Reverse spatial visual top-k query

Author: Song Jiayu
Yu Hao
Yu Weiren
Zhang Chengyuan
Zhang Zuping
Zhu Lei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/01/2020
Field of study

With the wide application of mobile Internet techniques and location-based services (LBS), massive multimedia data with geo-tags has been generated and collected. In this paper, we investigate a novel type of spatial query problem, named the reverse spatial visual top-k query (RSVQ_k), that aims to retrieve a set of geo-images that have the query as one of their most relevant geo-images in both geographical proximity and visual similarity. Existing approaches for reverse top-k queries are not suitable for this problem because they cannot effectively process unstructured data, such as images. To this end, we first define the RSVQ_k problem and introduce the similarity measurement. A novel hybrid index, named VR²-Tree, is designed, which combines a visual representation of geo-images with an R-Tree. In addition, an extension of the VR²-Tree, called the CVR²-Tree, is introduced; we then discuss the calculation of lower and upper bounds and propose an optimization technique based on the CVR²-Tree for further pruning. Finally, a search algorithm, named the RSVQ_k algorithm, is developed to support efficient RSVQ_k queries. Comprehensive experiments are conducted on four geo-image datasets, and the results illustrate that our approach can address the RSVQ_k problem effectively and efficiently.

Warwick Research Archives Portal Repository

Source File Set Search for Clone-and-Own Reuse Analysis

Author: Inoue Katsuro
Ishio Takashi
Ito Kaoru
Sakaguchi Yusuke
Publication venue
Publication date: 01/01/2017
Field of study

The clone-and-own approach is a natural way of reusing source code for software developers. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify the original version of the component and understand how the cloned component differs from the original one. Although developers may record the original version information in a version control system and/or directory names, such information is often either unavailable or incomplete. In this research, we propose a code search method that takes as input a set of source files and extracts all the components including similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using the b-bit minwise hashing technique. We use an aggregated file similarity for ranking components. To evaluate the effectiveness of this tool, we analyzed 75 cloned components in Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top-five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hash is 0.773, according to the ground truth recorded in the source code repositories. Comment: 14th International Conference on Mining Software Repositories

arXiv.org e-Print Archive

NAIST Academic Repository

Crossref
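
The file similarity computation mentioned in the abstract above relies on b-bit minwise hashing. The sketch below illustrates the general technique only, not the authors' tool: the tokenization, the SHA-1-based hash family, the parameter values, and the function names are all assumptions. The bias correction uses the common sparse-data approximation, where two unrelated b-bit values collide with probability of about 2^-b.

```python
import hashlib


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (assumed family)."""
    digest = hashlib.sha1(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def bbit_signature(tokens, num_hashes=256, b=2):
    """Keep only the lowest b bits of each min-hash value.

    `tokens` must be a non-empty set, e.g. the tokens of one source file.
    """
    mask = (1 << b) - 1
    return [min(_hash(seed, t) for t in tokens) & mask for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b, b=2):
    """Estimate Jaccard similarity from two b-bit signatures.

    The raw match rate is inflated by accidental b-bit collisions, so it is
    corrected with the sparse-data approximation c = 2^-b.
    """
    match = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    c = 1.0 / (1 << b)
    return max(0.0, (match - c) / (1.0 - c))


file_a = {"int", "main", "(", ")", "{", "return", "0", ";", "}"}
file_b = set(file_a)  # an identical copy of the same file
print(estimate_jaccard(bbit_signature(file_a), bbit_signature(file_b)))  # 1.0
```

Storing b bits per hash instead of a full 64-bit value is what keeps signatures small enough to compare against millions of files; the price is a noisier estimate, which more hash functions can offset.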

SQL Query Completion for Data Exploration

Author: Guilly Marie Le
Petit Jean-Marc
Scuturici Vasile-Marian
Publication venue
Publication date: 07/02/2018
Field of study

Within the big data tsunami, relational databases and SQL are still there and remain mandatory in most cases for accessing data. On the one hand, SQL is easy to use for non-specialists and makes it possible to identify pertinent initial data at the very beginning of the data exploration process. On the other hand, it is not always so easy to formulate SQL queries: nowadays, it is more and more frequent to have several databases available for one application domain, some of them with hundreds of tables and/or attributes. Identifying the pertinent conditions to select the desired data, or even identifying relevant attributes, is far from trivial. To make it easier to write SQL queries, we propose the notion of SQL query completion: given a query, it suggests additional conditions to be added to its WHERE clause. This completion is semantic, as it relies on the data from the database, unlike current completion tools, which are mostly syntactic. Since the process can be repeated over and over again (until the data analyst reaches her data of interest), SQL query completion facilitates the exploration of databases. SQL query completion has been implemented in a SQL editor on top of a database management system. For the evaluation, two questions need to be studied: first, does the completion speed up the writing of SQL queries? Second, is the completion easily adopted by users? A thorough experiment has been conducted on 70 computer science students divided into two groups (one with the completion and the other without) to answer those questions. The results are positive and very promising.

arXiv.org e-Print Archive

HAL

Hal-Diderot

Km4City Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Author: Bellini Pierfrancesco
Benigni Monica
Billero Riccardo
Nesi Paolo
Rauch Nadia
Publication venue: 'Elsevier BV'
Publication date: 01/01/2014
Field of study

Presently, a very large number of public and private data sets are available from local governments. In most cases, they are not semantically interoperable, and a huge human effort would be needed to create integrated ontologies and a knowledge base for a smart city. Smart-city ontology is not yet standardized, and a lot of research work is needed to identify models that can easily support data reconciliation and the management of complexity, and that allow reasoning over the data. In this paper, a system for data ingestion and reconciliation of smart-city-related aspects, such as the road graph, services available on the roads, traffic sensors, etc., is proposed. The system allows managing a large volume of data coming from a variety of sources, considering both static and dynamic data. These data are mapped to a smart-city ontology, called KM4City (Knowledge Model for City), and stored in an RDF-Store, where they are available to applications via SPARQL queries to provide new services to users via specific applications of public administrations and enterprises. The paper presents the process adopted to produce the ontology and the big data architecture for feeding the knowledge base on the basis of open and private data, and the mechanisms adopted for data verification, reconciliation and validation. Some examples of the possible usage of the coherent big data knowledge base produced are also offered and are accessible from the RDF-Store and related services. The article also presents the work performed on reconciliation algorithms and their comparative assessment and selection.

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Florence Research

High-Performance Reachability Query Processing under Index Size Restrictions

Author: Anand Avishek
Bedathur Srikanta
Seufert Stephan
Weikum Gerhard
Publication venue
Publication date: 01/01/2012
Field of study

In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes to answer a query is minimized. The resulting tunable index structure generates a better range labeling if the space budget is increased, thus providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that in practice, reachability queries can be answered in the order of microseconds on an off-the-shelf computer - even for the case of massive-scale real-world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets. Comment: 30 pages

arXiv.org e-Print Archive

MPG.PuRe
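
The node interval labeling scheme that the abstract above builds on can be sketched as follows. This shows only the exact base scheme on a small DAG, not the paper's size-bounded approximate index; the function names and the example graph are hypothetical.

```python
def merge(ranges):
    """Merge overlapping or adjacent integer ranges into a minimal sorted list."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def build_interval_labels(graph):
    """Label every node of a DAG with ranges of post-order ids it can reach.

    `graph` maps each node to a list of its successors. Recursion depth
    grows with path length, so this is only suitable for small graphs.
    """
    post, labels = {}, {}

    def visit(u):
        if u in labels:
            return
        ranges = []
        for v in graph.get(u, []):
            visit(v)
            ranges.extend(labels[v])  # inherit everything v can reach
        post[u] = len(post)  # post-order id, assigned after all successors
        ranges.append((post[u], post[u]))
        labels[u] = merge(ranges)

    for u in graph:
        visit(u)
    return post, labels


def reachable(post, labels, u, v):
    """True if v is reachable from u: v's id falls inside one of u's ranges."""
    return any(lo <= post[v] <= hi for lo, hi in labels[u])


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["c"]}
post, labels = build_interval_labels(graph)
print(reachable(post, labels, "a", "d"))  # True
print(reachable(post, labels, "e", "b"))  # False
```

In this exact form a node may need many ranges, which is where an index size budget comes in: the abstract describes replacing some exact ranges with approximate ones so that the index fits a given bound.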