Multi-node approach for map data processing
OpenStreetMap (OSM) is a popular collaborative open-source project that offers a free, editable map of the whole world. However, this data often needs further purpose-specific processing before it becomes valuable information to work with. The main motivation of this paper is therefore to propose a design for big data processing, together with data mining, that yields statistics focused on the details of traffic data, in order to create graphs representing a road network. To ensure that the routing algorithms on our High-Performance Computing (HPC) platform work correctly, it is essential to prepare the OSM data so that it is usable for the above-mentioned graph, and to store this persistent data in both a spatial database and HDF5 format.
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there is no rigorous study of how exactly
cleaning affects ML: the ML community usually focuses on developing ML
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone, without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both
algorithms commonly used in practice and state-of-the-art solutions from the
academic literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control the false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
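The Benjamini-Yekutieli procedure named in this abstract can be illustrated with a minimal sketch. This is the textbook step-up rule, not code from the CleanML study; the function name `benjamini_yekutieli` and the default `alpha=0.05` are assumptions chosen for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Textbook BY step-up rule (illustrative, not the CleanML code):
    with m hypotheses and c(m) = 1 + 1/2 + ... + 1/m, find the largest
    rank k whose sorted p-value satisfies p_(k) <= k * alpha / (m * c(m)),
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction that makes the rule valid under arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank  # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value passes at a higher rank.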
Gravity optimised particle filter for hand tracking
This paper presents a gravity optimised particle filter (GOPF) in which the magnitude of the gravitational force for every particle is proportional to its weight. GOPF attracts nearby particles and replicates new particles as if moving the particles towards the peak of the likelihood distribution, improving the sampling efficiency. GOPF is incorporated into a technique for hand feature tracking. A fast approach to hand feature detection and labelling using convexity defects is also presented. Experimental results show that GOPF outperforms the standard particle filter and its variants, as well as the state-of-the-art CamShift-guided particle filter, using a significantly reduced number of particles.
Shinren: Non-monotonic trust management for distributed systems
The open and dynamic nature of modern distributed systems and pervasive environments presents significant challenges to security management. One solution may be trust management, which utilises the notion of trust in order to specify and interpret security policies and make decisions on security-related actions. Most trust management systems assume monotonicity, where additional information can only increase trust. The monotonic assumption oversimplifies the real world by not considering negative information, and thus cannot handle many real-world scenarios. In this paper we present Shinren, a novel non-monotonic trust management system based on bilattice theory and the anyworld assumption. Shinren takes into account negative information and supports reasoning with incomplete information, uncertainty and inconsistency. Information from multiple sources such as credentials, recommendations, reputation and local knowledge can be used and combined in order to establish trust. Shinren also supports prioritisation, which is important in decision making and in resolving modality conflicts that are caused by non-monotonicity.
Accuracy of Author Names in Bibliographic Data Sources: An Italian Case Study
We investigate how accurately author names are reported in bibliographic records excerpted from four prominent sources: WoS, Scopus, PubMed, and CrossRef. We take as a case study 44,549 publications stored in the internal database of Sapienza University of Rome, one of the largest universities in Europe. While our results indicate generally good accuracy for all bibliographic data sources considered, we highlight a number of issues that undermine the accuracy for certain classes of author names, including compound names and names with diacritics, which are features common to Italian and other Western languages.
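A minimal sketch shows why diacritics and compound names complicate comparisons of author names across sources. The abstract does not describe the study's matching method, so the helper `name_key` and its normalisation choices (Unicode decomposition, stripping combining marks, treating punctuation as spaces) are assumptions for illustration only.

```python
import unicodedata


def name_key(name):
    """Build a comparison key for an author name (hypothetical helper).

    Two spellings that differ only in diacritics, letter case, hyphens,
    apostrophes, or spacing map to the same key.
    """
    # Decompose accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation such as hyphens and apostrophes with spaces.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped)
    # Case-fold and collapse runs of whitespace.
    return " ".join(cleaned.casefold().split())
```

A key like this is deliberately lossy: it is useful for spotting records that probably refer to the same author, but it cannot tell which spelling is the correct one.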