Clustering on the Edge: Learning Structure in Graphs
With the recent popularity of graphical clustering methods, there has been an
increased focus on the information between samples. We show how learning
cluster structure using edge features naturally and simultaneously determines
the most likely number of clusters and addresses data scale issues. These
results are particularly useful in instances where (a) there are a large number
of clusters and (b) we have some labeled edges. Applications in this domain
include image segmentation, community discovery and entity resolution. Our
model is an extension of the planted partition model, and our solution uses
results from correlation clustering, which achieves a partition
O(log(n))-close to the log-likelihood of the true clustering.
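As an illustrative sketch (not the paper's actual algorithm), a standard greedy "pivot" heuristic for correlation clustering on a graph with labeled positive edges can be written as follows; the function name and interface are hypothetical:

```python
import random

def pivot_correlation_clustering(nodes, positive_edges, seed=0):
    """Greedy 'pivot' heuristic for correlation clustering:
    repeatedly pick a random pivot node and cluster it together
    with every remaining node it shares a positive edge with."""
    rng = random.Random(seed)
    remaining = set(nodes)
    pos = {frozenset(e) for e in positive_edges}
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | {v for v in remaining
                             if frozenset((pivot, v)) in pos}
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```

On a toy graph of two positive-edge cliques {0,1} and {2,3}, the heuristic recovers both clusters without being told the number of clusters in advance, which mirrors the abstract's point that edge information determines the cluster count.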
System-Level Predictive Maintenance: Review of Research Literature and Gap Analysis
This paper reviews current literature in the field of predictive maintenance
from the system point of view. We differentiate the existing capabilities of
condition estimation and failure risk forecasting as currently applied to
simple components, from the capabilities needed to solve the same tasks for
complex assets. System-level analysis faces more complex latent degradation
states: it must comprehensively account for active maintenance programs at
each component level, consider coupling between different maintenance
actions, and reflect the increased monetary and safety costs of system
failures. As a result, methods that are effective for forecasting risk and
informing maintenance decisions regarding individual components do not readily
scale to provide reliable sub-system or system level insights. A novel holistic
modeling approach is needed to incorporate available structural and physical
knowledge and naturally handle the complexities of actively fielded and
maintained assets.
Comment: 24 pages, 3 figures
Performance Bounds for Pairwise Entity Resolution
One significant challenge to scaling entity resolution algorithms to massive
datasets is understanding how performance changes after moving beyond the realm
of small, manually labeled reference datasets. Unlike traditional machine
learning tasks, when an entity resolution algorithm performs well on small
hold-out datasets, there is no guarantee this performance holds on larger
hold-out datasets. We prove simple bounding properties between the performance
of a match function on a small validation set and the performance of a pairwise
entity resolution algorithm on arbitrarily sized datasets. Thus, our approach
enables optimization of pairwise entity resolution algorithms for large
datasets, using a small set of labeled data.
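To make the setting concrete, here is a minimal sketch (with a hypothetical interface, not the paper's notation) of estimating a match function's pairwise precision and recall on a small labeled validation set; these are the quantities the paper's bounds relate to large-scale performance:

```python
def pairwise_metrics(match_fn, labeled_pairs):
    """Estimate precision/recall of a pairwise match function on a
    small labeled validation set of (record_a, record_b, is_match)
    triples. match_fn(a, b) returns a truthy value for a predicted
    match."""
    tp = fp = fn = 0
    for a, b, is_match in labeled_pairs:
        pred = bool(match_fn(a, b))
        if pred and is_match:
            tp += 1
        elif pred and not is_match:
            fp += 1
        elif not pred and is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

The abstract's claim is that such small-sample estimates can be used to bound the behavior of the full pairwise entity resolution algorithm on arbitrarily sized datasets.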
Deep Survival Machines: Fully Parametric Survival Regression and Representation Learning for Censored Data with Competing Risks
We describe a new approach to estimating relative risks in time-to-event
prediction problems with censored data in a fully parametric manner. Our
approach does not require the strong assumption of constant proportional
hazards imposed on the underlying survival distribution by the Cox
proportional hazards model. By jointly learning deep nonlinear
representations of the input covariates, we demonstrate the benefits of our
approach when used to estimate survival risks through extensive experimentation
on multiple real world datasets with different levels of censoring. We further
demonstrate advantages of our model in the competing risks scenario. To the
best of our knowledge, this is the first work involving fully parametric
estimation of survival times with competing risks in the presence of censoring.
Comment: Also appeared in NeurIPS 2019 Workshop on Machine Learning for
Healthcare (ML4H).
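For intuition about what "fully parametric with censoring" means, the following sketch writes out the censored log-likelihood of a single Weibull survival distribution; Deep Survival Machines learns a deep mixture of such parametric primitives, which is not reproduced here:

```python
import math

def weibull_censored_loglik(times, events, shape, scale):
    """Log-likelihood of right-censored data under one Weibull
    survival distribution. events[i] = 1 means an observed failure
    (contributes log f(t)); events[i] = 0 means censored
    (contributes log S(t))."""
    ll = 0.0
    for t, e in zip(times, events):
        z = (t / scale) ** shape
        log_surv = -z                         # log S(t) = -(t/scale)^shape
        log_pdf = (math.log(shape / scale)
                   + (shape - 1) * math.log(t / scale) - z)
        ll += log_pdf if e else log_surv
    return ll
```

Because the likelihood is fully parametric, censored observations contribute an exact survival term rather than requiring the partial-likelihood machinery of the Cox model.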
Double Adaptive Stochastic Gradient Optimization
Adaptive moment methods have been remarkably successful in deep learning
optimization, particularly in the presence of noisy and/or sparse gradients. We
further the advantages of adaptive moment techniques by proposing a family of
double adaptive stochastic gradient methods, DASGrad. They leverage the
complementary ideas of the adaptive moment algorithms widely used by the deep
learning community and recent advances in adaptive probabilistic algorithms. We
analyze the theoretical convergence improvements of our approach in a
stochastic convex optimization setting, and provide empirical validation of our
findings with convex and non-convex objectives. We observe that the benefits
of DASGrad increase with model complexity and gradient variability, and we
explore the resulting utility in extensions of distribution-matching multitask
learning.
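For reference, a single update step of Adam, the canonical adaptive moment method this family builds on, looks like the sketch below; DASGrad's second layer of adaptivity (over the sampling distribution) is not shown, as its details are not given in the abstract:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient (m)
    and squared gradient (v), bias-corrected, then a per-coordinate
    scaled update. t is the 1-based step count."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

The per-coordinate scaling by the second-moment estimate is what makes such methods robust to noisy and sparse gradients, the regime the abstract highlights.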
On the Interaction Effects Between Prediction and Clustering
Machine learning systems increasingly depend on pipelines of multiple
algorithms to provide high quality and well structured predictions. This paper
argues that interaction effects between clustering and prediction (e.g.,
classification, regression) algorithms can cause subtle adverse behaviors
during cross-validation that may not be initially apparent. In particular, we
focus on the problem of estimating the out-of-cluster (OOC) prediction loss
given an approximate clustering with a known probabilistic error rate.
Traditional cross-validation techniques exhibit significant empirical bias in
this setting, and the few attempts to estimate and correct for these effects
are intractable on larger datasets. Further, no previous work has been able to
characterize the conditions under which these empirical effects occur, and if
they do, what properties they have. We precisely answer these questions by
providing theoretical properties which hold in various settings, and prove that
expected out-of-cluster loss behavior rapidly decays with even minor clustering
errors. Fortunately, we are able to leverage these same properties to construct
hypothesis tests and scalable estimators necessary for correcting the problem.
Empirical results on benchmark datasets validate our theoretical results and
demonstrate how scaling techniques provide solutions to new classes of
problems.
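As a minimal sketch of the evaluation setting (not the paper's estimator), the following splitter holds out entire clusters at a time, so the test fold measures out-of-cluster loss rather than the optimistic within-cluster loss that standard cross-validation can produce:

```python
def out_of_cluster_splits(cluster_ids, n_folds=2):
    """Cross-validation splits that hold out whole clusters. Each
    yielded (train, test) pair contains index lists such that no
    cluster appears on both sides of the split."""
    clusters = sorted(set(cluster_ids))
    folds = [clusters[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        test = [i for i, c in enumerate(cluster_ids) if c in held]
        train = [i for i, c in enumerate(cluster_ids) if c not in held]
        yield train, test
```

When the clustering itself is only approximate, even this split leaks information across folds, which is precisely the interaction effect the paper characterizes and corrects for.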
Pairwise Feedback for Data Programming
The scalability of the labeling process and the attainable quality of labels
have become limiting factors for many applications of machine learning. The
programmatic creation of labeled datasets via the synthesis of noisy heuristics
provides a promising avenue to address this problem. We propose to improve
modeling of latent class variables in the programmatic creation of labeled
datasets by incorporating pairwise feedback into the process. We discuss the
ease with which such pairwise feedback can be obtained or generated in many
application domains. Our experiments show that even a small number of sources
of pairwise feedback can substantially improve the quality of the posterior
estimate of the latent class variable.
Comment: Presented at the NeurIPS 2019 workshop on Learning with Rich
Experience: Integration of Learning Paradigms.
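For context, the simplest label model that data programming improves upon is a plain vote over noisy heuristic labelers; the sketch below (hypothetical interface, not the paper's model) shows the baseline that pairwise feedback would refine:

```python
from collections import Counter

def majority_vote(labeling_fns, x):
    """Combine noisy heuristic labelers ('labeling functions') by
    majority vote. A labeler returns a class label, or None to
    abstain; returns None if every labeler abstains."""
    votes = [lf(x) for lf in labeling_fns]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```

Generative label models replace this uniform vote with learned per-source accuracies; the paper's contribution is to sharpen that posterior further using pairwise feedback about whether two items share a latent class.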
Novel Prediction Techniques Based on Clusterwise Linear Regression
In this paper we explore different regression models based on Clusterwise
Linear Regression (CLR). CLR aims to find the partition of the data into
clusters such that linear regressions fitted to each cluster minimize the
overall mean squared error on the whole dataset. The main obstacle to using the
fitted regression models for prediction on unseen test points is the absence of
a reasonable way to obtain CLR cluster labels when the values of the target
variable are unknown. In this paper we propose two novel approaches to solving
this problem. The first approach, predictive CLR, builds a separate
classification model to predict test CLR labels. The second approach,
constrained CLR, utilizes a set of user-specified constraints that require
certain points to be assigned to the same cluster. Assuming the constraint values are
known for the test points, they can be directly used to assign CLR labels. We
evaluate these two approaches on three UCI ML datasets as well as on a large
corpus of health insurance claims. We show that both of the proposed algorithms
significantly improve over the known CLR-based regression methods. Moreover,
predictive CLR consistently outperforms linear regression and random forest,
and shows comparable performance to support vector regression on UCI ML
datasets. The constrained CLR approach achieves the best performance on the
health insurance dataset, while incurring only a modest increase in
computational time over linear regression.
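A standard way to fit CLR, which the sketch below follows (an illustrative alternating scheme, not necessarily the paper's exact fitting procedure), is to alternate between assigning each point to the line that fits it best and refitting a least-squares line per cluster:

```python
import numpy as np

def clusterwise_linear_regression(x, y, k=2, iters=20, seed=0):
    """Alternating fit for Clusterwise Linear Regression (CLR) on
    1-D inputs: assign each point to its best-fitting line, then
    refit a least-squares line for each cluster."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(x))
    lines = []
    for _ in range(iters):
        lines = []
        for c in range(k):
            mask = labels == c
            if mask.sum() < 2:              # keep degenerate clusters alive
                lines.append((0.0, float(y.mean())))
                continue
            slope, intercept = np.polyfit(x[mask], y[mask], 1)
            lines.append((slope, intercept))
        resid = np.stack([(y - (a * x + b)) ** 2 for a, b in lines])
        labels = resid.argmin(axis=0)       # reassign to best line
    return labels, lines
```

Note that the assignment step uses the target y, which is exactly why, as the abstract says, CLR labels cannot be computed directly for test points where y is unknown.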
An Entity Resolution approach to isolate instances of Human Trafficking online
Human trafficking is a challenging law enforcement problem, and a large
amount of such activity manifests itself on various online forums. Given the
large, heterogeneous and noisy structure of this data, building models to
predict instances of trafficking is an even more convoluted task. In this
paper we propose an entity resolution pipeline using a notion of proxy labels,
in order to extract clusters from this data with a prior history of human
trafficking activity. We apply this pipeline to 5M records from backpage.com
and report on the performance of this approach, challenges in terms of
scalability, and some significant domain-specific characteristics of our
resolved entities.
Lass-0: sparse non-convex regression by local search
We compute approximate solutions to L0 regularized linear regression using L1
regularization, also known as the Lasso, as an initialization step. Our
algorithm, the Lass-0 ("Lass-zero"), uses a computationally efficient stepwise
search to determine a locally optimal L0 solution given any L1 regularization
solution. We present theoretical results of consistency under orthogonality and
appropriate handling of redundant features. Empirically, we use synthetic data
to demonstrate that Lass-0 solutions are closer to the true sparse support than
L1 regularization models. Additionally, in real-world data Lass-0 finds more
parsimonious solutions than L1 regularization while maintaining similar
predictive accuracy.
Comment: 8 pages, 1 figure. NIPS 2015 Workshop on Optimization (OPT2015).
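The stepwise search can be sketched as below; this is a deliberately simplified reading of the abstract (greedy single-feature toggles on the L0-penalized residual sum of squares, starting from a Lasso support), and the interface is hypothetical:

```python
import numpy as np

def lass0(X, y, init_support, lam=0.1):
    """Local stepwise search for an L0-regularized linear model,
    initialized at the support of an L1 (Lasso) solution: greedily
    add or drop one feature at a time while the penalized RSS
    improves."""
    def score(support):
        if not support:
            return float(y @ y)
        cols = sorted(support)
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ beta
        return float(r @ r) + lam * len(support)

    support = set(init_support)
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            cand = support ^ {j}            # toggle feature j in/out
            if score(cand) < score(support) - 1e-12:
                support, improved = cand, True
    return sorted(support)
```

Starting from an over-selected Lasso support, each toggle that removes a redundant feature lowers the L0 penalty without hurting the fit, which is how the search reaches a more parsimonious local optimum.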