Graph Convolutional Networks for Traffic Forecasting with Missing Values
Traffic forecasting has attracted widespread attention recently. In reality,
traffic data usually contains missing values due to sensor or communication
errors. The spatio-temporal nature of traffic data brings additional challenges
for processing such missing values, for which classic techniques (e.g., data
imputation) are limited: 1) on the temporal axis, values can be missing
randomly or consecutively; 2) on the spatial axis, missing values can occur on
a single sensor or on multiple sensors simultaneously. Recent models powered by
Graph Neural Networks have achieved satisfactory performance on traffic
forecasting tasks. However, few of them are applicable to such a complex
missing-value context. To this end, we propose GCN-M, a Graph Convolutional
Network model able to handle complex missing values in the spatio-temporal
context. In particular, we jointly model the missing-value processing and
traffic forecasting tasks, considering both local spatio-temporal features and
global historical patterns in an attention-based memory network. We also
propose a dynamic graph learning module based on the learned local-global
features. Experimental results on real-life datasets show the reliability of
our proposed method.
Comment: To appear in Data Mining and Knowledge Discovery (DMKD), Springer
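The two missing-value patterns the abstract distinguishes can be illustrated with a small observation mask over a spatio-temporal traffic tensor. This is only an illustrative sketch of the problem setting, not GCN-M itself (which is a neural model); the array shapes, sensor indices, and value ranges here are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 24, 5  # time steps, road sensors
traffic = rng.uniform(20, 80, size=(T, N))  # e.g. speed readings from N sensors
mask = np.ones((T, N), dtype=bool)          # True = observed, False = missing

# 1) Temporal axis: random missing values on sensor 0 ...
mask[rng.choice(T, size=4, replace=False), 0] = False
# ... and consecutive missing values (an outage) on sensor 1
mask[8:14, 1] = False

# 2) Spatial axis: at one time step, several sensors fail simultaneously
mask[20, 2:5] = False

observed = np.where(mask, traffic, np.nan)
print(f"missing ratio: {1 - mask.mean():.2%}")
```

A joint model like the one the abstract describes would take `observed` and `mask` together, rather than first imputing the NaNs with a classic technique and then forecasting.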
DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection
Graph Anomaly Detection (GAD) has recently become a hot research topic due to
its practical and theoretical value. Since GAD emphasizes applications and
anomalous samples are rare, enriching the variety of its datasets is
fundamental work. To this end, this paper presents DGraph, a real-world dynamic
graph in the finance domain. DGraph overcomes many limitations of current GAD
datasets. It contains about 3M nodes, 4M dynamic edges, and 1M ground-truth
nodes. We provide a comprehensive observation of DGraph, revealing that
anomalous nodes and normal nodes generally differ in structure, neighbor
distribution, and temporal dynamics. Moreover, it suggests that unlabeled nodes
are also essential for detecting fraudsters. Furthermore, we conduct extensive
experiments on DGraph. Observations and experiments demonstrate that DGraph can
propel GAD research and enable in-depth exploration of anomalous nodes.
Comment: 9 pages
Data Imputation Using Differential Dependency and Fuzzy Multi-Objective Linear Programming
Missing or incomplete data is a serious problem when collecting and analyzing data for forecasting, estimation, and decision making. Since data quality is so important in machine learning, in most cases imputing missing data is far more appropriate than ignoring it. Missing-data imputation is often based on the equality, similarity, or distance of neighbors. Researchers use different approaches to neighbors' equality or similarity, and every approach has its advantages and limitations. Instead of equality, some researchers use inequalities together with a few relationship or similarity rules.

In this thesis, after recalling some basic imputation methods, we discuss data imputation based on differential dependencies (DDs). DDs are conditional rules in which the closeness of the values of a pair of tuples in some attribute implies the closeness of the values of those tuples in another attribute. Given these rules, a few candidate rows are created for each incomplete row and placed in that row's candidate set. Then one row is selected from each set such that the selections are mutually compatible; these selections are made by an integer linear programming (ILP) model.

We first propose an algorithm to generate DDs. Then, to improve on previous approaches and increase the percentage of imputed values, we suggest a fuzzy relaxation that allows small violations of the DDs. Finally, we propose a multi-objective fuzzy linear programming model that increases the imputation percentage while decreasing the total violation. A variety of datasets from “Kaggle” is used to support our approach.
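The core DD idea (closeness in one attribute implies closeness in another) can be sketched for candidate generation. This is a simplified illustration, not the thesis's ILP or fuzzy formulation; the attributes, thresholds, and rows are invented:

```python
# Assumed DD for illustration: if two rows' "age" values differ by at most 2,
# their "salary" values should differ by at most 500.
AGE_EPS, SALARY_EPS = 2, 500

rows = [
    {"age": 30, "salary": 4000},
    {"age": 31, "salary": 4300},
    {"age": 45, "salary": 9000},
    {"age": 29, "salary": None},  # incomplete row to impute
]

def candidates(incomplete, complete_rows):
    """Salary candidates for an incomplete row, drawn from DD-close rows."""
    cands = set()
    for r in complete_rows:
        if abs(r["age"] - incomplete["age"]) <= AGE_EPS:
            # Any value within SALARY_EPS of r's salary would satisfy the DD;
            # here we take r's own salary as the representative candidate.
            cands.add(r["salary"])
    return cands

complete = [r for r in rows if r["salary"] is not None]
print(candidates(rows[3], complete))  # → {4000, 4300}
```

The ILP (and its fuzzy relaxation) would then pick one candidate per incomplete row so that the chosen values are mutually compatible under all DDs.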
Local Embeddings for Relational Data Integration
Deep learning based techniques have been recently used with promising results
for data integration problems. Some methods directly use pre-trained embeddings
that were trained on a large corpus such as Wikipedia. However, they may not
always be an appropriate choice for enterprise datasets with custom vocabulary.
Other methods adapt techniques from natural language processing to obtain
embeddings for the enterprise's relational data. However, this approach blindly
treats a tuple as a sentence, thus losing a large amount of contextual
information present in the tuple.
We propose algorithms for obtaining local embeddings that are effective for
data integration tasks on relational databases. We make four major
contributions. First, we describe a compact graph-based representation that
allows the specification of a rich set of relationships inherent in the
relational world. Second, we propose how to derive sentences from such a graph
that effectively "describe" the similarity across elements (tokens, attributes,
rows) in the two datasets. The embeddings are learned based on such sentences.
Third, we propose effective optimizations to improve the quality of the learned
embeddings and the performance of integration tasks. Finally, we propose a
diverse collection of criteria to evaluate relational embeddings and perform an
extensive set of experiments validating them against multiple baseline methods.
Our experiments show that our framework, EmbDI, produces meaningful results for
data integration tasks such as schema matching and entity resolution, in both
supervised and unsupervised settings.
Comment: Accepted to SIGMOD 2020 as "Creating Embeddings of Heterogeneous
Relational Datasets for Data Integration Tasks". Code can be found at
https://gitlab.eurecom.fr/cappuzzo/embd
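The graph-to-sentences idea the abstract describes can be sketched as follows: connect row, attribute, and token nodes, then take random walks whose node sequences serve as "sentences" for embedding training. This is a simplified illustration of the representation, not EmbDI's actual implementation; the table and node naming are invented:

```python
import random

# Toy table: each cell value becomes a token node, linked to its row and attribute.
table = [
    {"name": "alice", "city": "paris"},
    {"name": "bob",   "city": "paris"},
]

adj = {}
def link(a, b):
    """Add an undirected edge between graph nodes a and b."""
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

for i, row in enumerate(table):
    rid = f"row_{i}"
    for attr, tok in row.items():
        link(rid, f"tok_{tok}")          # row <-> token
        link(f"tok_{tok}", f"attr_{attr}")  # token <-> attribute

def random_walk(start, length, rng):
    """One walk over the graph; shared tokens (e.g. 'paris') bridge rows."""
    node, sent = start, [start]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        sent.append(node)
    return sent

rng = random.Random(0)
sentence = random_walk("row_0", 6, rng)
print(" ".join(sentence))  # a "sentence" to feed a word2vec-style trainer
```

Because `tok_paris` is shared by both rows, walks starting at `row_0` can reach `row_1`, which is what lets the learned embeddings capture cross-row similarity.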