Provenance-based Reconciliation In Conflicting Data
Data fusion is the process of resolving conflicting data from multiple data sources. As the data sources are inherently heterogeneous, an expert is needed to resolve the conflicting data. The traditional approach requires the expert to resolve a considerable number of conflicts in order to obtain a high-quality dataset. In this project, we consider how to acquire a high-quality dataset while keeping the expert's effort minimal. First, we achieve this goal by building a model that leverages the provenance of the data to reconcile conflicting values. Second, we improve our model by taking the dependencies between data sources into account. Finally, we empirically show that our solution significantly reduces the user effort while obtaining a high-quality dataset, in comparison with the traditional method.
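The abstract does not include implementation details, so the following is a minimal illustrative sketch (not the authors' model) of provenance-weighted conflict resolution: each claimed value is supported by the reliability of its sources, and sources suspected of copying are discounted. The reliability scores, the `depends_on` map, and the discount factor are hypothetical.

```python
# Illustrative sketch only: provenance-weighted voting for one conflicting attribute.
# Source reliabilities and the dependency discount are hypothetical, not from the paper.

def resolve_conflict(claims, reliability, depends_on=None, dependency_discount=0.5):
    """claims: list of (source, value); reliability: source -> weight in [0, 1];
    depends_on: optional map marking sources that likely copy another source."""
    depends_on = depends_on or {}
    scores = {}
    for source, value in claims:
        weight = reliability.get(source, 0.5)
        if source in depends_on:            # discount sources that copy others
            weight *= dependency_discount
        scores[value] = scores.get(value, 0.0) + weight
    return max(scores, key=scores.get)      # value with the highest support

claims = [("A", "Paris"), ("B", "Paris"), ("C", "Lyon")]
reliability = {"A": 0.9, "B": 0.4, "C": 0.7}
print(resolve_conflict(claims, reliability, depends_on={"B": "A"}))
```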
Crowdsourcing Literature Review
Our user feedback framework requires robust techniques to tackle the scalability issue of the schema matching network. One approach is to employ crowdsourcing/human computation models. Crowdsourcing is a cutting-edge research area that involves human workers in performing pre-defined tasks. In this literature review, we explore concepts such as tasks, workflows, feedback aggregation, quality control, and reward systems. We show that many of these aspects can be integrated into our user feedback framework.
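One of the quality-control mechanisms commonly discussed in the crowdsourcing literature is scoring workers against gold (test) questions. The sketch below is only an illustration of that idea; the task identifiers and the neutral prior are hypothetical and not taken from the review.

```python
# Illustrative sketch only: estimating a worker's accuracy from gold questions,
# a standard quality-control step before aggregating crowd feedback.

def estimate_accuracy(worker_answers, gold):
    """worker_answers: task_id -> label given by one worker; gold: task_id -> true label."""
    scored = [task for task in worker_answers if task in gold]
    if not scored:
        return 0.5                              # no evidence yet: neutral prior
    correct = sum(worker_answers[t] == gold[t] for t in scored)
    return correct / len(scored)

gold = {"t1": "match", "t2": "no-match"}
print(estimate_accuracy({"t1": "match", "t2": "match", "t3": "match"}, gold))  # 0.5
```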
Fighting Rumours on Social Media
With the advance of social platforms, people are sharing content at an unprecedented scale. This makes social platforms an ideal place for spreading rumors. As rumors may have negative impacts on the real world, many rumor detection techniques have been proposed. In this proposal, we summarize several works that focus on two important steps of rumor detection. The first step detects controversial events in the data streams, which are candidates for rumors. The aim of the second step is to determine the truth values of these events, i.e., whether they are rumors or not. Although some techniques achieve state-of-the-art results, they do not cope well with the streaming nature of social platforms. In addition, they usually leverage only one type of information available on social platforms, such as only the posts. To overcome these limitations, we propose two research directions that emphasize 1) detecting rumors in a progressive manner and 2) combining different types of information for better detection.
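As a toy illustration of the first step (finding controversial event candidates in a stream), the sketch below tracks the share of "denial" posts per topic in a sliding window. The marker words, window size, and threshold are hypothetical and are not drawn from the surveyed works.

```python
# Illustrative sketch only: flag a topic as a controversy candidate when the
# fraction of denial-style posts in its recent window exceeds a threshold.

from collections import defaultdict, deque

DENIAL_MARKERS = ("fake", "hoax", "not true", "debunked")
WINDOW = 1000            # number of recent posts kept per topic
MIN_POSTS = 50
CONTROVERSY_THRESHOLD = 0.3

windows = defaultdict(lambda: deque(maxlen=WINDOW))

def observe(topic, text):
    """Ingest one post and return True if the topic looks controversial."""
    is_denial = any(marker in text.lower() for marker in DENIAL_MARKERS)
    window = windows[topic]
    window.append(is_denial)
    if len(window) < MIN_POSTS:
        return False
    return sum(window) / len(window) >= CONTROVERSY_THRESHOLD
```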
Investigating Graph Embedding Methods for Cross-Platform Binary Code Similarity Detection
IoT devices are increasingly present, both in industry and in consumer markets, but their security remains weak, which leads to an unprecedented number of attacks against them. In order to reduce the attack surface, one approach is to analyze the binary code of these devices so as to detect early whether they contain potential security vulnerabilities. More specifically, knowing some vulnerable function, we can determine whether the firmware of an IoT device contains a security flaw by searching for this function. However, searching for similar vulnerable functions is in general challenging, because the source code is often not openly available and because the code can be compiled for different architectures, using different compilers and compilation settings. In order to handle these varying settings, we can compare the similarity between graph embeddings derived from the binary functions. In this paper, inspired by recent advances in deep learning, we propose a new method, GESS (graph embeddings for similarity search), to derive graph embeddings, and we compare it with various state-of-the-art methods. Our empirical evaluation shows that GESS reaches an AUC of 0.979, thereby outperforming the best known approach. Furthermore, for a fixed low false positive rate, GESS provides a true positive rate (or recall) about 36% higher than the best previous approach. Finally, for a large search space, GESS provides a recall between 50% and 60% higher than the best previous approach.
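The search step itself reduces to nearest-neighbor retrieval over embedding vectors. The sketch below shows that generic retrieval step only; it assumes embeddings are already produced by some model (e.g., one trained on control-flow graphs) and does not implement GESS. The corpus entries are placeholders.

```python
# Illustrative sketch only: rank binary functions by cosine similarity of their
# graph embeddings; the embedding model itself is assumed, not shown.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, corpus):
    """corpus: dict mapping function name -> embedding vector."""
    return sorted(corpus.items(),
                  key=lambda item: cosine_similarity(query_embedding, item[1]),
                  reverse=True)

corpus = {"memcpy_arm": np.random.rand(64), "ssl_read_mips": np.random.rand(64)}
query = np.random.rand(64)   # embedding of a known vulnerable function
print(search(query, corpus)[0][0])
```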
A comparison of network embedding approaches
Network embedding automatically learns to encode a graph into multi-dimensional vectors. The embedded representation appears to outperform hand-crafted features in many downstream machine learning tasks. A plethora of network embedding approaches have emerged in the last decade, building on the advances and successes of deep learning. However, there is no absolute winner, as the network structure varies from application to application and the notion of a connection in a graph carries its own semantics in different domains. In this report, we compare different network embedding approaches on real and synthetic datasets covering different graph structures. Although our prototype currently includes only two network embedding techniques, it can easily be extended thanks to our systematic evaluation methodology and available source code.
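A common way to compare embedding methods is to feed the learned node vectors into the same downstream classifier and compare the resulting scores. The sketch below illustrates that evaluation loop with toy data; it is not the report's prototype, and the random arrays stand in for embeddings produced by two different methods on the same graph.

```python
# Illustrative sketch only: evaluate two embedding methods on the same downstream
# node-classification task and compare cross-validated accuracy.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_embedding(embeddings, labels):
    """embeddings: (n_nodes, dim) array; labels: (n_nodes,) node classes."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, labels, cv=5).mean()

labels = np.random.randint(0, 2, size=200)       # toy node labels
method_a = np.random.rand(200, 64)               # stand-in for embedding method A
method_b = np.random.rand(200, 64)               # stand-in for embedding method B
print("A:", evaluate_embedding(method_a, labels), "B:", evaluate_embedding(method_b, labels))
```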
Towards Enabling Probabilistic Databases for Participatory Sensing
Participatory sensing has emerged as a new data collection paradigm in which humans use their own devices (cell phone accelerometers, cameras, etc.) as sensors. This paradigm makes it possible to collect huge amounts of data from the crowd for world-wide applications, without the cost of buying dedicated sensors. Despite this benefit, the data collected from human sensors are inherently uncertain, as there is no quality guarantee from the participants. Moreover, participatory sensing data are time series that not only exhibit highly irregular dependencies on time, but also vary from sensor to sensor. To overcome these issues, we study in this paper the problem of creating probabilistic data from (uncertain) time series collected by participatory sensors. We approach the problem in two steps. In the first step, we generate probabilistic time series from the raw time series using a dynamical model from the time series literature. In the second step, we combine the probabilistic time series from multiple sensors based on the mutual relationship between the reliability of the sensors and the quality of their data. Through extensive experimentation, we demonstrate the efficiency of our approach on both real and synthetic data.
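To give a flavor of the second step, the sketch below fuses per-sensor probabilistic readings for one timestamp using reliability-scaled precision weighting. The Gaussian assumption, the reliability values, and the fusion rule are hypothetical simplifications, not the paper's dynamical model.

```python
# Illustrative sketch only: reliability-weighted fusion of per-sensor Gaussian
# readings (mean, variance) for a single timestamp.

def fuse_readings(readings):
    """readings: list of (mean, variance, reliability) tuples for one timestamp."""
    weights = [r / v for _, v, r in readings]          # reliability-scaled precision
    total = sum(weights)
    mean = sum(w * m for w, (m, _, _) in zip(weights, readings)) / total
    variance = 1.0 / total
    return mean, variance

# three sensors observing the same quantity at one timestamp
print(fuse_readings([(20.1, 0.5, 0.9), (19.4, 1.0, 0.6), (25.0, 2.0, 0.2)]))
```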
Retaining data from streams of social platforms with minimal regret
Today's social platforms, such as Twitter and Facebook, continuously generate massive volumes of data. The resulting data streams exceed any reasonable limit for permanent storage, especially since the data are often redundant, overlapping, sparse, and generally of low value. This calls for means to retain only a small fraction of the data in an online manner. In this paper, we propose techniques to effectively decide which data to retain, such that the induced loss of information, i.e., the regret of neglecting certain data, is minimized. These techniques not only enable efficient processing of massive streaming data, but are also adaptive and address the dynamic nature of social media. Experiments on large-scale real-world datasets illustrate the feasibility of our approach in terms of both runtime and information quality.
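As a toy illustration of online retention under a fixed budget, the sketch below greedily keeps the items with the highest utility seen so far, replacing the weakest retained item when a better one arrives. The word-novelty score is a hypothetical stand-in for a regret-based criterion and is not the paper's technique.

```python
# Illustrative sketch only: a greedy, budgeted retention policy over a stream.

import heapq

def novelty(item, retained):
    """Toy utility: fraction of the item's words not yet covered by retained items."""
    words = set(item.split())
    covered = set(w for _, r in retained for w in r.split())
    return len(words - covered) / max(len(words), 1)

def retain_stream(stream, budget=3):
    retained = []                                     # min-heap of (utility, item)
    for item in stream:
        score = novelty(item, retained)
        if len(retained) < budget:
            heapq.heappush(retained, (score, item))
        elif score > retained[0][0]:                  # replace the weakest retained item
            heapq.heapreplace(retained, (score, item))
    return [item for _, item in retained]

print(retain_stream(["a b c", "a b", "d e f", "a c", "g h"]))
```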