78 research outputs found
Discriminative Distance-Based Network Indices with Application to Link Prediction
In large networks, using the length of shortest paths as the distance measure
has shortcomings. A well-studied shortcoming is that extending it to
disconnected graphs and directed graphs is controversial. The second
shortcoming is that a huge number of vertices may have exactly the same score.
The third shortcoming is that in many applications, the distance between two
vertices not only depends on the length of shortest paths, but also on the
number of shortest paths. In this paper, first we develop a new distance
measure between vertices of a graph that yields discriminative distance-based
centrality indices. This measure is proportional to the length of shortest
paths and inversely proportional to the number of shortest paths. We present
algorithms for exact computation of the proposed discriminative indices.
Second, we develop randomized algorithms that precisely estimate average
discriminative path length and average discriminative eccentricity and show
that they give (ε, δ)-approximations of these indices. Third, we
perform extensive experiments over several real-world networks from different
domains. In our experiments, we first show that, compared to the traditional
indices, discriminative indices usually have much greater discriminability. Then,
we show that our randomized algorithms can very precisely estimate average
discriminative path length and average discriminative eccentricity, using only
a few samples. Then, we show that real-world networks usually have a tiny average
discriminative path length, bounded by a constant (e.g., 2). Fourth, in order
to better motivate the usefulness of our proposed distance measure, we present
a novel link prediction method that uses discriminative distance to decide
which vertices are more likely to form a link in the future, and show its superior
performance compared to well-known existing measures.
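The abstract describes a distance that grows with the shortest-path length and shrinks with the number of shortest paths. A minimal Python sketch of that idea follows. It assumes the simplest form consistent with that description, length divided by path count, on an unweighted graph given as an adjacency dictionary. The exact definition used in the paper may differ, and the function names are illustrative only.

```python
from collections import deque


def shortest_path_stats(adj, source):
    """BFS from source on an unweighted graph.

    Returns (dist, sigma): dist[v] is the shortest-path length from source
    to v, and sigma[v] is the number of distinct shortest paths to v.
    """
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:  # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma


def discriminative_distance(adj, u, v):
    """Assumed form: shortest-path length divided by number of shortest paths."""
    dist, sigma = shortest_path_stats(adj, u)
    if v not in dist:
        return float("inf")  # one possible convention for unreachable pairs
    return dist[v] / sigma[v]


# Small undirected example: two parallel routes a-b-d and a-c-d, then d-e.
example_adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
print(discriminative_distance(example_adj, "a", "d"))
print(discriminative_distance(example_adj, "a", "e"))
```

In this example the vertices b and d get the same score from a even though d is one hop farther, because d is reached by two shortest paths; under the plain shortest-path distance they would differ only by hop count.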
Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs
Graphs are an important tool to model data in different domains, including
social networks, bioinformatics and the world wide web. Most of the networks
formed in these domains are directed graphs, where all the edges have a
direction and they are not symmetric. Betweenness centrality is an important
index widely used to analyze networks. In this paper, first, given a directed
network and one of its vertices, we propose a new exact algorithm to
compute the betweenness score of that vertex. Our algorithm pre-computes a set
that is used to prune a large amount of computation that does
not contribute to the betweenness score of the vertex. The time complexity of our exact
algorithm depends on the size of this set, with separate bounds
for unweighted graphs and for weighted graphs with positive weights.
The size of the set is bounded from above and, in most cases, it
is a small constant. Then, for the cases where the set is large, we
present a simple randomized algorithm that samples from the set and
performs computations for only the sampled elements. We show that this
algorithm provides an (ε, δ)-approximation of the betweenness
score of the vertex. Finally, we perform extensive experiments over several real-world
datasets from different domains for several randomly chosen vertices as well as
for the vertices with the highest betweenness scores. Our experiments reveal
that in most cases, our algorithm significantly outperforms the most efficient
existing randomized algorithms, in terms of both running time and accuracy. Our
experiments also show that our proposed algorithm computes betweenness scores
of all vertices in sets of sizes 5, 10 and 15 much faster and more
accurately than the most efficient existing algorithms. Comment: arXiv admin note: text overlap with arXiv:1704.0735
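To make the idea of computing the betweenness score of a single vertex concrete, here is a hedged Python sketch. It is not the paper's algorithm: it does not build the pruning set described above. Instead it uses the standard Brandes-style dependency accumulation for an unweighted directed graph, plus a generic estimator that samples source vertices uniformly. All function names are illustrative.

```python
import random
from collections import deque


def dependency_on(adj, s, r):
    """Dependency of source s on vertex r in an unweighted directed graph.

    This is the sum, over targets t, of the fraction of shortest s-t paths
    that pass through r (Brandes-style accumulation).
    """
    if s == r:
        return 0.0
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    order = []  # vertices in non-decreasing distance from s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                preds[v] = []
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                preds[v].append(u)
    delta = {v: 0.0 for v in order}
    for w in reversed(order):  # farthest vertices first
        for u in preds[w]:
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    return delta.get(r, 0.0)


def exact_betweenness(adj, vertices, r):
    """Unnormalized betweenness of r: sum of dependencies over all sources."""
    return sum(dependency_on(adj, s, r) for s in vertices)


def sampled_betweenness(adj, vertices, r, num_samples, rng=None):
    """Unbiased estimate from uniformly sampled sources (with replacement)."""
    rng = rng or random.Random()
    total = sum(dependency_on(adj, rng.choice(vertices), r)
                for _ in range(num_samples))
    return len(vertices) * total / num_samples


# Directed path a -> b -> c -> d: b lies on the paths a->c and a->d.
path_adj = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(exact_betweenness(path_adj, ["a", "b", "c", "d"], "b"))
```

Each sampled source costs one breadth-first search, so the estimator trades accuracy for running time through `num_samples`.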
Scikit-Multiflow: A Multi-output Streaming Framework
Scikit-multiflow is a multi-output/multi-label and stream data mining
framework for the Python programming language. Conceived to serve as a platform
to encourage democratization of stream learning research, it provides multiple
state-of-the-art methods for stream learning, stream generators and evaluators.
Scikit-multiflow builds upon popular open source frameworks including
scikit-learn, MOA and MEKA. Development follows the FOSS principles and quality
is enforced by complying with PEP8 guidelines and using continuous integration
and automatic testing. The source code is publicly available at
https://github.com/scikit-multiflow/scikit-multiflow. Comment: 5 pages, Open Source Software
Access Control in Social Networks: A Reachability-Based Approach
Social networks are attracting more and more users. Their subscribers may share personal and sensitive information with a large, constantly evolving set of possibly unknown users. This raises the need to give users more control over the distribution of their shared content, which can be accessed by a community far wider than they may expect. Our concern is to devise and enforce an appropriate access control model for online social networks that enables users to specify their privacy preferences in an expressive way and scales well regardless of the size of the social graph. In this paper, we propose an access control model for online social networks based on connection characteristics between users, in an extended sense that includes indirect connections. This model provides conditional access to shared resources based on reachability constraints between the owner and the requester of a piece of information. We then describe the work we have done to scale access control enforcement to large social graphs. This paper describes PhD work carried out at Télécom ParisTech under the guidance of Talel Abdessalem.
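As an illustration of what a reachability constraint between owner and requester could look like, the following Python sketch grants access when the requester can be reached from the owner within a bounded number of hops, optionally following only certain relationship types. The graph encoding, the function name and the hop and relation parameters are assumptions made for this example; the abstract does not specify the model's actual constraint language or enforcement technique.

```python
from collections import deque


def within_hops(graph, owner, requester, max_hops, allowed_relations=None):
    """Return True if requester is reachable from owner in at most max_hops edges.

    graph maps a user to a list of (neighbor, relation) pairs. When
    allowed_relations is given, only edges with those labels are followed.
    """
    if owner == requester:
        return True
    seen = {owner}
    queue = deque([(owner, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, relation in graph.get(user, ()):
            if allowed_relations is not None and relation not in allowed_relations:
                continue
            if neighbor == requester:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return False


# Hypothetical social graph with labeled, directed connections.
social_graph = {
    "alice": [("bob", "friend"), ("carol", "colleague")],
    "bob": [("dave", "friend")],
    "carol": [("erin", "friend")],
}

# Example policy: friends of friends may view alice's content.
print(within_hops(social_graph, "alice", "dave", 2, {"friend"}))
print(within_hops(social_graph, "alice", "erin", 2, {"friend"}))
```

A plain breadth-first search like this visits every user within the hop limit for each request, which is one reason enforcement on large graphs needs more scalable techniques.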
Adaptive XGBoost for evolving data streams
Boosting is an ensemble method that combines base models in a sequential manner to achieve high predictive accuracy. A popular learning algorithm based on this ensemble method is eXtreme Gradient Boosting (XGB). We present an adaptation of XGB for classification of evolving data streams. In this setting, new data arrives over time and the relationship between the class and the features may change in the process, thus exhibiting concept drift. The proposed method creates new members of the ensemble from mini-batches of data as new data becomes available. The maximum ensemble size is fixed, but learning does not stop when this size is reached because the ensemble is updated on new data to ensure consistency with the current concept. We also explore the use of concept drift detection to trigger a mechanism to update the ensemble. We test our method on real and synthetic data with concept drift and compare it against batch-incremental and instance-incremental classification methods for data streams.
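The fixed-size, mini-batch-driven ensemble described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the proposed method: the base learner is a hypothetical majority-class model rather than a gradient-boosted tree, members vote instead of being combined additively, and dropping the oldest member once the ensemble is full is an assumed update policy.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Hypothetical stand-in base learner: predicts its mini-batch's most frequent label."""

    def __init__(self, labels):
        self.label = Counter(labels).most_common(1)[0][0]

    def predict(self, x):
        return self.label


class MiniBatchEnsemble:
    """Ensemble with a fixed maximum size, grown one member per mini-batch."""

    def __init__(self, max_members, batch_size):
        # deque(maxlen=...) silently drops the oldest member when full
        # (assumed policy for this sketch).
        self.members = deque(maxlen=max_members)
        self.batch_size = batch_size
        self._buffer = []

    def learn_one(self, x, y):
        """Buffer one labeled instance; train a new member when the batch is full."""
        self._buffer.append((x, y))
        if len(self._buffer) == self.batch_size:
            labels = [label for _, label in self._buffer]
            self.members.append(MajorityClassModel(labels))
            self._buffer = []

    def predict(self, x):
        """Majority vote over current members; None before any member exists."""
        if not self.members:
            return None
        votes = Counter(member.predict(x) for member in self.members)
        return votes.most_common(1)[0][0]


ensemble = MiniBatchEnsemble(max_members=3, batch_size=2)
for label in [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]:  # the concept changes from 0 to 1
    ensemble.learn_one({"feature": 1.0}, label)
print(len(ensemble.members), ensemble.predict({"feature": 1.0}))
```

Because old members are replaced by ones trained on recent mini-batches, the vote shifts toward the new concept after the change without the ensemble ever exceeding its size limit.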
Adaptive random forests for evolving data stream classification
Random forests is currently one of the most widely used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts at replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
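The test-then-train evaluation mentioned above can be illustrated with a short Python sketch: each instance is first used to test the model and only then to train it, so every prediction is made on unseen data. The `predict_one` and `learn_one` method names and the running-majority learner are assumptions chosen for this example; they stand in for any incremental classifier and are not the ARF algorithm.

```python
from collections import Counter


class RunningMajority:
    """Hypothetical incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each instance before learning from it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict_one(x) == y:  # test first
            correct += 1
        total += 1
        model.learn_one(x, y)  # then train
    return correct / total if total else 0.0


stream = [({}, "a"), ({}, "a"), ({}, "b"), ({}, "a")]
print(prequential_accuracy(RunningMajority(), stream))
```

A delayed-labelling variant would postpone the `learn_one` call until the label becomes available instead of calling it immediately after the prediction.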
River: Machine learning for streaming data in Python
River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result of the merger of two popular packages for stream learning in Python: Creme and scikit-multiflow. River introduces a revamped architecture based on the lessons learnt from the seminal packages. River's ambition is to be the go-to library for doing machine learning on streaming data. Additionally, this open source package brings under the same umbrella a large community of practitioners and researchers. The source code is available at https://github.com/online-ml/river.