96 research outputs found

    Absorbing random-walk centrality: Theory and algorithms

    We study a new notion of graph centrality based on absorbing random walks. Given a graph G = (V, E) and a set of query nodes Q ⊆ V, we aim to identify the k most central nodes in G with respect to Q. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes Q. The goal is to find the set of k central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call k absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the k absorbing nodes in different parts of the graph so as to "intercept" random walks that start from different query nodes. Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We show that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we develop more efficient algorithms based on spectral clustering and on personalized PageRank.
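    Below is a minimal sketch of the greedy selection this abstract describes, under a few assumptions not stated above: the graph is given as a row-stochastic transition matrix P, walks start uniformly at the query nodes, and expected absorption times are computed exactly through the fundamental matrix (I - P_TT)^-1. Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def expected_absorption_time(P, query_nodes, absorbing):
    """Average number of steps until a walk started at a query node is absorbed."""
    n = P.shape[0]
    transient = [v for v in range(n) if v not in absorbing]
    idx = {v: i for i, v in enumerate(transient)}
    # Fundamental matrix N = (I - P_TT)^-1; its row sums are expected steps to absorption.
    P_tt = P[np.ix_(transient, transient)]
    N = np.linalg.inv(np.eye(len(transient)) - P_tt)
    steps = N.sum(axis=1)
    # Query nodes that are themselves absorbing contribute zero steps.
    return float(np.mean([steps[idx[q]] if q in idx else 0.0 for q in query_nodes]))

def greedy_centrality(P, query_nodes, k):
    """Greedily add the absorbing node that most reduces expected absorption time."""
    chosen = set()
    for _ in range(k):
        best = min(
            (v for v in range(P.shape[0]) if v not in chosen),
            key=lambda v: expected_absorption_time(P, query_nodes, chosen | {v}),
        )
        chosen.add(best)
    return chosen

# Usage (hypothetical): centers = greedy_centrality(P, query_nodes=[0, 4, 7], k=3)
```
    Each greedy step inverts a near-full-size matrix, which is exactly the cost that the faster spectral-clustering and personalized-PageRank algorithms mentioned in the abstract are designed to avoid.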

    Markov Chain Monitoring

    In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the 'Markov Chain Monitoring' problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov chain monitoring problem. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets, we demonstrate the efficiency and efficacy of our algorithms in a variety of settings.
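    A small sketch of the monitoring idea, under assumptions that are mine rather than the paper's: the item distribution is kept as fractions of the total, it is propagated with the chain's transition matrix, and "node queries" reveal the true mass at a few chosen nodes, after which the unqueried entries are rescaled so that the total mass is preserved.

```python
import numpy as np

def propagate(x, P, t):
    """Predicted distribution after t steps of the chain: x P^t."""
    for _ in range(t):
        x = x @ P
    return x

def refine_with_node_queries(predicted, observed, queried):
    """Overwrite queried entries with their observed mass and rescale the rest
    so that the total mass stays the same."""
    est = predicted.copy()
    free = [i for i in range(len(est)) if i not in queried]
    observed_mass = sum(observed[i] for i in queried)
    free_pred = est[free].sum()
    for i in queried:
        est[i] = observed[i]
    if free_pred > 0:
        est[free] *= max(0.0, 1.0 - observed_mass) / free_pred
    return est

# Usage (hypothetical):
# est = refine_with_node_queries(propagate(x0, P, 3), observed={2: 0.4}, queried={2})
```
    Choosing which nodes or edges to query so that the refined estimate is as accurate as possible is the optimization problem the paper studies.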

    Joint Use of Node Attributes and Proximity for Node Classification

    Node classification aims to infer unknown node labels from known labels and other node attributes. Standard approaches for this task assume homophily, whereby a node’s label is predicted from the labels of other nodes nearby in the network. However, there are also cases of networks where labels are better predicted from the individual attributes of each node rather than from the labels of nearby nodes. Ideally, node classification methods should flexibly adapt to a range of settings wherein unknown labels are predicted either from the labels of nearby nodes, or from individual node attributes, or partly from both. In this paper, we propose a principled approach, JANE, based on a generative probabilistic model that jointly weighs the role of attributes and node proximity, via embeddings, in predicting labels. Experiments on multiple network datasets demonstrate that JANE exhibits the desired combination of versatility and competitive performance compared to baselines.
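    As a rough illustration of combining the two signal sources (not the generative model JANE actually uses), the sketch below concatenates raw node attributes with a spectral proximity embedding and lets a single classifier weigh both. The embedding choice and all names are assumptions made for the sketch; A is taken to be a symmetric adjacency matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def proximity_embeddings(A, dim=16):
    """Spectral embedding: top-`dim` eigenvectors of a symmetric adjacency matrix A."""
    _, vecs = np.linalg.eigh(A)
    return vecs[:, -dim:]

def fit_joint_classifier(A, X, y, labeled_idx, dim=16):
    """Train on the labeled nodes using [attributes | proximity embedding] features."""
    features = np.hstack([X, proximity_embeddings(A, dim)])
    clf = LogisticRegression(max_iter=1000).fit(features[labeled_idx], y[labeled_idx])
    return clf, features

# Usage (hypothetical):
# clf, F = fit_joint_classifier(A, X, y, labeled_idx)
# predicted_labels = clf.predict(F[unlabeled_idx])
```
    A single classifier over the concatenated features can lean on whichever signal (attributes or proximity) is more predictive, which is the flexibility the abstract argues for.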

    Hydrology of Contaminant Flow Regimes to Groundwater, Streams, and the Ocean Waters of Kaneohe Bay, Oahu.

    M.S. Thesis, University of Hawaiʻi at Mānoa, 2018.

    Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study

    Machine unlearning is the task of updating machine learning (ML) models after a subset of the training data they were trained on is deleted. Methods for the task should combine effectiveness and efficiency: they should effectively “unlearn” deleted data, but without requiring excessive computational effort (e.g., a full retraining) for a small number of deletions. Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of “the right to be forgotten” have given rise to requirements for certifiability (i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model). In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for logistic regression and demonstrate the trade-offs between efficiency, effectiveness, and certifiability offered by each method. In implementing this study, we extend some of the existing works and describe a common unlearning pipeline to compare and evaluate the unlearning methods on six real-world datasets and in a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and into the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retraining of the ML model.
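    To make the kind of update such pipelines rely on concrete, here is a sketch of one common approximate-unlearning step for L2-regularized logistic regression: a single Newton step on the objective of the remaining data, starting from the weights trained on the full data. It illustrates the general technique only and is not a faithful implementation of any of the three methods compared in the study; labels are assumed to be in {-1, +1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_unlearn(w, X_remain, y_remain, lam):
    """One Newton step on the remaining data's L2-regularized logistic loss:
    w <- w - H_remain^{-1} grad_remain(w).

    At the full-data optimum, grad_remain(w) roughly equals minus the summed
    gradients of the deleted points, so the step mostly removes their influence."""
    margins = y_remain * (X_remain @ w)
    # Gradient of the remaining objective at the current weights.
    grad = -(X_remain.T @ (y_remain * sigmoid(-margins))) + lam * w
    # Hessian of the remaining objective.
    s = sigmoid(margins) * sigmoid(-margins)
    H = (X_remain.T * s) @ X_remain + lam * np.eye(len(w))
    return w - np.linalg.solve(H, grad)
```
    The approximation error that accumulates over repeated updates of this kind is what the proposed online strategy monitors to decide when a full retraining is warranted.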

    Cost-Effective Retraining of Machine Learning Models

    It is important to retrain a machine learning (ML) model in order to maintain its performance as the data changes over time. However, this can be costly as it usually requires processing the entire dataset again. This creates a trade-off between retraining too frequently, which leads to unnecessary computing costs, and not retraining often enough, which results in stale and inaccurate ML models. To address this challenge, we propose ML systems that make automated and cost-effective decisions about when to retrain an ML model. We aim to optimize the trade-off by considering the costs associated with each decision. Our research focuses on determining whether to retrain or keep an existing ML model based on various factors, including the data, the model, and the predictive queries answered by the model. Our main contribution is a Cost-Aware Retraining Algorithm called Cara, which optimizes the trade-off over streams of data and queries. To evaluate the performance of Cara, we analyzed synthetic datasets and demonstrated that Cara can adapt to different data drifts and retraining costs while performing similarly to an optimal retrospective algorithm. We also conducted experiments with real-world datasets and showed that Cara achieves better accuracy than drift-detection baselines while making fewer retraining decisions, ultimately resulting in lower total costs.
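    The sketch below illustrates the general shape of such a cost-aware decision rule: accumulate an estimated staleness cost as queries are answered by the current model, and retrain once that estimate exceeds the cost of retraining. The drift score and the decision rule are placeholders of my own, not the Cara algorithm itself.

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    retrain_cost: float          # cost of one full retraining
    error_cost_per_query: float  # cost of answering a query with a stale model
    accumulated_cost: float = 0.0

    def observe(self, drift_score: float, n_queries: int) -> bool:
        """Accumulate estimated staleness cost for the queries just served;
        return True when retraining becomes the cheaper option."""
        self.accumulated_cost += drift_score * self.error_cost_per_query * n_queries
        if self.accumulated_cost >= self.retrain_cost:
            self.accumulated_cost = 0.0
            return True   # retrain now
        return False      # keep the current model

# Usage (hypothetical): after each batch of queries, call
# policy.observe(drift, len(batch_queries)) and retrain when it returns True.
```
    Weighing an estimated cost of staleness against the cost of retraining, rather than retraining on a fixed schedule or on every detected drift, is the trade-off the abstract describes.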