Absorbing random-walk centrality: Theory and algorithms
We study a new notion of graph centrality based on absorbing random walks.
Given a graph G and a set of query nodes Q, we aim to identify the most
central nodes in G with respect to Q. Specifically, we consider central nodes
to be absorbing for random walks that start at the query nodes Q. The goal is
to find the set of central nodes that
minimizes the expected length of a random walk until absorption. The proposed
measure, which we call absorbing random-walk centrality, favors diverse
sets, as it is beneficial to place the absorbing nodes in different parts
of the graph so as to "intercept" random walks that start from different query
nodes.
Although similar problem definitions have been considered in the literature,
e.g., in information-retrieval settings where the goal is to diversify
web-search results, in this paper we study the problem formally and prove some
of its properties. We show that the problem is NP-hard, while the objective
function is monotone and supermodular, implying that a greedy algorithm
provides solutions with an approximation guarantee. On the other hand, the
greedy algorithm involves expensive matrix operations that make it prohibitive
to employ on large datasets. To confront this challenge, we develop more
efficient algorithms based on spectral clustering and on personalized
PageRank.
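For intuition, here is a minimal sketch of the greedy approach described above: for a candidate absorbing set, the expected absorption time can be read off the fundamental matrix of the induced absorbing chain, and each greedy step adds the node giving the largest reduction. This is illustrative only (dense NumPy matrices, uniform start over Q), not the paper's optimized algorithms; the function names and the row-stochastic transition matrix P are assumptions.

```python
# Illustrative sketch, not the paper's implementation. Assumes P is a
# row-stochastic transition matrix (numpy array), Q a list of query nodes.
import numpy as np

def expected_absorption_time(P, Q, C):
    """Expected walk length until absorption in C, averaged over starts in Q."""
    n = P.shape[0]
    transient = [v for v in range(n) if v not in C]
    idx = {v: i for i, v in enumerate(transient)}
    starts = [q for q in Q if q in idx]          # query nodes already in C contribute 0 steps
    if not starts:
        return 0.0
    P_tt = P[np.ix_(transient, transient)]       # transitions among non-absorbing nodes
    # Fundamental matrix N = (I - P_tt)^(-1); its row sums are expected steps to absorption.
    N = np.linalg.inv(np.eye(len(transient)) - P_tt)
    steps = N.sum(axis=1)
    return float(sum(steps[idx[q]] for q in starts) / len(Q))

def greedy_absorbing_centrality(P, Q, k):
    """Greedily add the node that most reduces the expected absorption time."""
    C = set()
    for _ in range(k):
        best = min((v for v in range(P.shape[0]) if v not in C),
                   key=lambda v: expected_absorption_time(P, Q, C | {v}))
        C.add(best)
    return C
```

The repeated matrix inversions inside the greedy loop are exactly the expensive operations that motivate the spectral-clustering and personalized-PageRank heuristics mentioned above.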
Markov Chain Monitoring
In networking applications, one often wishes to obtain estimates about the
number of objects at different parts of the network (e.g., the number of cars
at an intersection of a road network or the number of packets expected to reach
a node in a computer network) by monitoring the traffic in a small number of
network nodes or edges. We formalize this task by defining the 'Markov Chain
Monitoring' problem.
Given an initial distribution of items over the nodes of a Markov chain, we
wish to estimate the distribution of items at subsequent times. We do this by
asking a limited number of queries that retrieve, for example, how many items
transitioned to a specific node or over a specific edge at a particular time.
We consider different types of queries, each defining a different variant of
the Markov chain monitoring problem. For each variant, we design efficient
algorithms
for choosing the queries that make our estimates as accurate as possible. In
our experiments with synthetic and real datasets we demonstrate the efficiency
and the efficacy of our algorithms in a variety of settings.
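As a rough illustration of the setting (not the paper's query-selection algorithms), the sketch below propagates the expected item distribution one step and spends a small query budget on the nodes whose predicted counts are most uncertain under independent item moves. The variance heuristic, function names, and the use of NumPy arrays for x and P are assumptions made for illustration.

```python
# Illustrative sketch only: one monitoring round with node-count queries.
import numpy as np

def propagate(x, P):
    """Expected item counts after one step of the chain: x_{t+1} = x_t P."""
    return x @ P

def choose_node_queries(x, P, budget):
    """Pick `budget` nodes whose predicted next-step counts are most uncertain."""
    # Var[count at node j] ~ sum_i x_i * P_ij * (1 - P_ij) if items move independently.
    var = (x[:, None] * P * (1.0 - P)).sum(axis=0)
    return np.argsort(-var)[:budget]

def monitor_step(x, P, true_next_counts, budget):
    """Predict the next distribution, query a few nodes, and correct the estimate."""
    estimate = propagate(x, P)
    queried = choose_node_queries(x, P, budget)
    corrected = estimate.copy()
    corrected[queried] = true_next_counts[queried]   # exact counts returned by the queries
    rest = np.setdiff1d(np.arange(len(x)), queried)
    # Rescale the unqueried estimates so the total number of items is preserved.
    remaining = x.sum() - corrected[queried].sum()
    if corrected[rest].sum() > 0:
        corrected[rest] *= remaining / corrected[rest].sum()
    return corrected
```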
Hydrology of Contaminant Flow Regimes to Groundwater, Streams, and the Ocean Waters of Kaneohe Bay, Oahu.
M.S. Thesis. University of Hawaiʻi at Mānoa 2018
Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study
Machine unlearning is the task of updating machine learning (ML) models after a subset of their training data is deleted. Methods for the task should combine effectiveness and efficiency: they should effectively “unlearn” the deleted data, but without requiring excessive computational effort (e.g., a full retraining) for a small number of deletions. Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of “the right to be forgotten” have given rise to requirements for certifiability, i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model. In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for logistic regression and demonstrate the trade-offs between efficiency, effectiveness, and certifiability offered by each method. In implementing this study, we extend some of the existing works and describe a common unlearning pipeline to compare and evaluate the unlearning methods on six real-world datasets under a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retraining of the ML model.
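To make the setting concrete, here is a hedged sketch of one representative approximate unlearning step for L2-regularized logistic regression: a single Newton update of the loss over the remaining data, in the spirit of certified-removal methods. It is illustrative only and not the specific pipeline or methods evaluated in the paper; the function names and the regularization constant are assumptions.

```python
# Hedged sketch of a Newton-step unlearning update for L2-regularized
# logistic regression; illustrative, not the paper's evaluated pipeline.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_unlearn(w, X_remain, y_remain, lam=1e-2):
    """One Newton step of the regularized logistic loss over the remaining data.

    w: current weights (trained on the full data).
    X_remain, y_remain: training data with the deleted subset removed; y in {0, 1}.
    lam: L2 regularization strength (assumed value, not from the paper).
    """
    p = sigmoid(X_remain @ w)
    grad = X_remain.T @ (p - y_remain) + lam * w
    # Hessian of the regularized logistic loss on the remaining data.
    D = p * (1.0 - p)
    H = (X_remain.T * D) @ X_remain + lam * np.eye(len(w))
    return w - np.linalg.solve(H, grad)
```

In methods of this family, certifiability is usually argued by bounding the residual gradient of the approximate solution; an online strategy like the one described above can then trigger a full retrain once the accumulated error estimate grows too large.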
Joint Use of Node Attributes and Proximity for Node Classification
Node classification aims to infer unknown node labels from known labels and other node attributes. Standard approaches for this task assume homophily, whereby a node’s label is predicted from the labels of other nodes nearby in the network. However, there are also networks where labels are better predicted from the individual attributes of each node than from the labels of nearby nodes. Ideally, node classification methods should flexibly adapt to a range of settings wherein unknown labels are predicted from the labels of nearby nodes, from individual node attributes, or from a combination of both. In this paper, we propose a principled approach, JANE, based on a generative probabilistic model that jointly weighs the role of attributes and node proximity via embeddings in predicting labels. Experiments on multiple network datasets demonstrate that JANE exhibits the desired combination of versatility and competitive performance compared to baselines.
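As a minimal illustration of the two signals involved (not JANE's generative model), the sketch below summarizes proximity with a spectral embedding of the adjacency matrix and lets a simple logistic model weigh that embedding against the raw node attributes. The embedding choice, function names, and library calls are assumptions for illustration.

```python
# Illustrative only: combine node attributes with a proximity embedding.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.linear_model import LogisticRegression

def spectral_embedding(A, dim=16):
    """Leading eigenvectors of a symmetric adjacency matrix as node embeddings."""
    vals, vecs = eigsh(A.astype(float), k=dim, which="LA")
    return vecs

def fit_attributes_plus_proximity(A, X, labels, train_idx, dim=16):
    """Fit a classifier on [attributes | embedding]; the learned coefficients
    indicate how much each signal contributes for a given dataset."""
    Z = np.hstack([X, spectral_embedding(A, dim)])
    clf = LogisticRegression(max_iter=1000).fit(Z[train_idx], labels[train_idx])
    return clf, Z
```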
Cost-Effective Retraining of Machine Learning Models
It is important to retrain a machine learning (ML) model in order to maintain
its performance as the data changes over time. However, this can be costly as
it usually requires processing the entire dataset again. This creates a
trade-off between retraining too frequently, which leads to unnecessary
computing costs, and not retraining often enough, which results in stale and
inaccurate ML models. To address this challenge, we propose ML systems that
make automated and cost-effective decisions about when to retrain an ML model.
We aim to optimize the trade-off by considering the costs associated with each
decision. Our research focuses on determining whether to retrain or keep an
existing ML model based on various factors, including the data, the model, and
the predictive queries answered by the model. Our main contribution is a
Cost-Aware Retraining Algorithm called Cara, which optimizes the trade-off over
streams of data and queries. To evaluate the performance of Cara, we analyzed
synthetic datasets and demonstrated that Cara can adapt to different data
drifts and retraining costs while performing similarly to an optimal
retrospective algorithm. We also conducted experiments with real-world datasets
and showed that Cara achieves better accuracy than drift detection baselines
while making fewer retraining decisions, ultimately resulting in lower total
costs.
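A toy version of the underlying trade-off (not the Cara algorithm itself) can be written as a simple online loop: accumulate the estimated cost of answering queries with a stale model and retrain once that accumulated cost exceeds the fixed cost of retraining. The function names and cost callbacks below are placeholders, assumed for illustration.

```python
# Toy illustration of the retrain-or-keep trade-off; not Cara.
def cost_aware_retraining(batches, retrain_cost, staleness_cost, retrain):
    """batches: iterable of (data_batch, query_batch) pairs arriving over time.
    retrain_cost: fixed cost charged for every retraining.
    staleness_cost(model, data_batch, query_batch): estimated extra cost of
        answering these queries with the current, possibly stale, model.
    retrain(seen_data): returns a fresh model trained on all data seen so far.
    """
    seen, model = [], None
    accumulated, total_cost = 0.0, 0.0
    for data_batch, query_batch in batches:
        seen.extend(data_batch)
        if model is None or accumulated >= retrain_cost:
            # Staleness is estimated to cost more than a retrain: pay it and refresh.
            model, accumulated = retrain(seen), 0.0
            total_cost += retrain_cost
        batch_cost = staleness_cost(model, data_batch, query_batch)
        accumulated += batch_cost
        total_cost += batch_cost
    return model, total_cost
```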
- …