Predictive Liability Models and Visualizations of High Dimensional Retail Employee Data
Employee theft and dishonesty are major contributors to loss in the retail
industry. Retailers have reported the need for more automated analytic tools to
assess the liability of their employees. In this work, we train and optimize
several machine learning models for regression prediction and analysis on this
data, which will help retailers identify and manage risky employees. Since the
data we use is very high dimensional, we use feature selection techniques to
identify the factors that contribute most to an employee's assessed risk. We also
use dimension reduction and data embedding techniques to present the dataset
in an easy-to-interpret format.
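To make this kind of pipeline concrete, here is a minimal sketch using scikit-learn. The retail employee data set is not public, so synthetic high-dimensional regression data stands in for it, and the specific estimator choices (f_regression scores, gradient boosting, t-SNE) are illustrative assumptions, not the authors' methods.

```python
# Minimal sketch with scikit-learn; synthetic data stands in for the
# proprietary employee data, and all estimator choices are assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 500 "employees", 1000 features, continuous risk score.
X, y = make_regression(n_samples=500, n_features=1000, n_informative=20,
                       noise=10.0, random_state=0)

# Feature selection: keep the 20 features most associated with the target.
selector = SelectKBest(f_regression, k=20).fit(X, y)
X_sel = selector.transform(X)
print("selected feature indices:", selector.get_support(indices=True))

# Regression on the reduced feature set.
model = GradientBoostingRegressor(random_state=0)
print("CV R^2:", cross_val_score(model, X_sel, y, cv=5).mean())

# 2-D embedding of the selected features for visual inspection.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_sel)
```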
Hyperparameter Learning via Distributional Transfer
Bayesian optimisation is a popular technique for hyperparameter learning but
typically requires initial exploration even in cases where similar prior tasks
have been solved. We propose to transfer information across tasks using learnt
representations of training datasets used in those tasks. This results in a
joint Gaussian process model on hyperparameters and data representations.
Representations make use of the framework of distribution embeddings into
reproducing kernel Hilbert spaces. The developed method converges faster than
existing baselines, in some cases requiring only a few evaluations of the
target objective.
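The central ingredient, representing a whole training set by its kernel mean embedding in an RKHS, can be sketched as follows. This is not the authors' implementation: the RBF kernel, the MMD-based task similarity, and the product form of the joint kernel are assumptions standing in for the paper's learnt representations and joint Gaussian process.

```python
# Sketch of an RKHS mean-embedding task representation and a product kernel
# over (dataset, hyperparameter) pairs; kernels and bandwidths are assumptions.
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def mmd2(Xa, Xb, ls=1.0):
    """Squared MMD: RKHS distance between the mean embeddings of two datasets."""
    return rbf(Xa, Xa, ls).mean() + rbf(Xb, Xb, ls).mean() - 2 * rbf(Xa, Xb, ls).mean()

def joint_kernel(Xa, ha, Xb, hb, ls_emb=1.0, ls_task=1.0, ls_hp=1.0):
    """Covariance for a GP over (dataset, hyperparameter) pairs: similar tasks
    evaluated at similar hyperparameters get correlated objective values."""
    k_task = np.exp(-mmd2(Xa, Xb, ls_emb) / (2 * ls_task ** 2))
    k_hp = np.exp(-((ha - hb) ** 2).sum() / (2 * ls_hp ** 2))
    return k_task * k_hp

rng = np.random.default_rng(0)
X_old = rng.normal(0.0, 1.0, (100, 5))   # training set of a solved prior task
X_new = rng.normal(0.2, 1.0, (80, 5))    # training set of the target task
print(joint_kernel(X_old, np.array([0.10]), X_new, np.array([0.15])))
```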
Covariate dimension reduction for survival data via the Gaussian process latent variable model
The analysis of high dimensional survival data is challenging, primarily due
to the problem of overfitting which occurs when spurious relationships are
inferred from data that subsequently fail to exist in test data. Here we
propose a novel method of extracting a low dimensional representation of
covariates in survival data by combining the popular Gaussian Process Latent
Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The
combined model offers a flexible non-linear probabilistic method of detecting
and extracting any intrinsic low dimensional structure from high dimensional
data. By reducing the covariate dimension we aim to diminish the risk of
overfitting and increase the robustness and accuracy with which we infer
relationships between covariates and survival outcomes. In addition, we can
simultaneously combine information from multiple data sources by expressing
multiple datasets in terms of the same low dimensional space. We present
results from several simulation studies that illustrate a reduction in
overfitting and an increase in predictive performance, as well as successful
detection of intrinsic dimensionality. We provide evidence that it is
advantageous to combine dimensionality reduction with survival outcomes rather
than performing unsupervised dimensionality reduction on its own. Finally, we
use our model to analyse experimental gene expression data and detect and
extract a low dimensional representation that allows us to distinguish high and
low risk groups with superior accuracy compared to regression on the
original high dimensional data.
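As a rough illustration of the survival half of the combined model, the sketch below fits a Weibull proportional hazards likelihood over a given low-dimensional covariate representation z by maximum likelihood. In the paper, z would be learnt jointly with a GPLVM; here z is fixed synthetic data, and the parameterisation is an assumption.

```python
# Sketch of a Weibull proportional hazards likelihood over low-dimensional
# covariates z; in the paper z comes from a GPLVM learnt jointly with this term.
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, z, t, delta):
    """Negative Weibull-PH log-likelihood.
    z: (n, q) latent covariates; t: event/censoring times; delta: 1 if event.
    params = [log_lam, log_k, beta_1..beta_q]; logs keep lam and k positive."""
    lam, k, beta = np.exp(params[0]), np.exp(params[1]), params[2:]
    eta = z @ beta                                             # linear predictor
    log_h = params[0] + params[1] + (k - 1) * np.log(t) + eta  # log hazard
    cum_h = lam * t ** k * np.exp(eta)                         # cumulative hazard
    return -(delta * log_h - cum_h).sum()

rng = np.random.default_rng(1)
z = rng.normal(size=(200, 2))                        # stand-in latent points
t = rng.weibull(1.5, 200) * np.exp(-0.5 * z[:, 0])   # times depending on z
delta = (rng.random(200) < 0.8).astype(float)        # roughly 20% censoring
res = minimize(neg_log_lik, np.zeros(4), args=(z, t, delta))
print("lam, k:", np.exp(res.x[:2]), "beta:", res.x[2:])
```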
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
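The statistical core of this approach, that an MLE fitted on a sample of size n differs from the full-data MLE by an approximately normal amount whose covariance shrinks as the sample grows, can be sketched as follows for logistic regression. This is an illustration of the idea, not BlinkML's API; the sandwich covariance and the 95% margin test below are standard MLE asymptotics, and all names are hypothetical.

```python
# Sketch of the idea for logistic regression: the sample MLE differs from the
# full-data MLE by ~N(0, (1/n - 1/N) * H^-1 J H^-1); predictions whose margin
# exceeds the induced score uncertainty agree with the full model w.h.p.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

N, n = 100_000, 2_000
X, y = make_classification(n_samples=N, n_features=5, random_state=0)
idx = np.random.default_rng(0).choice(N, n, replace=False)

# Approximate model trained on the sample (large C ~ unpenalised MLE).
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])
w = np.c_[clf.coef_, clf.intercept_].ravel()

Xs = np.c_[X[idx], np.ones(n)]                    # add intercept column
p = clf.predict_proba(X[idx])[:, 1]
g = Xs * (p - y[idx])[:, None]                    # per-example NLL gradients
H = (Xs * (p * (1 - p))[:, None]).T @ Xs / n      # observed Fisher information
J = g.T @ g / n
Hinv = np.linalg.inv(H)
cov = (1 / n - 1 / N) * Hinv @ J @ Hinv           # covariance of w_n - w_N

# Flag a test point whose prediction matches the full model's w.h.p.
x_test = np.r_[X[0], 1.0]
score, sd = x_test @ w, np.sqrt(x_test @ cov @ x_test)
print("same prediction with ~95% probability:", abs(score) > 1.96 * sd)
```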
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Similarity-based approaches represent a promising direction for time series
analysis. However, many such methods rely on parameter tuning, and some have
shortcomings when the time series are multivariate (MTS), owing to dependencies
between attributes, or when the time series contain missing data. In this paper, we
address these challenges within the powerful context of kernel methods by
proposing the robust time series cluster kernel (TCK). The approach
taken leverages the missing data handling properties of Gaussian mixture models
(GMM) augmented with informative prior distributions. An ensemble learning
approach is exploited to ensure robustness to parameters by combining the
clustering results of many GMM to form the final kernel.
We evaluate the TCK on synthetic and real data and compare to other
state-of-the-art techniques. The experimental results demonstrate that the TCK
is robust to parameter choices, provides competitive results for MTS without
missing data, and outstanding results when data are missing.
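A simplified sketch of the ensemble-kernel construction: fit many GMMs with randomised settings and let two series be similar when the mixtures repeatedly assign them to the same components. The real TCK uses MAP-EM with informative priors to handle missing values and time dependencies; sklearn's GaussianMixture, applied here to fully observed, flattened series, does not, so this shows only the co-assignment idea.

```python
# Simplified co-assignment sketch; the real TCK uses MAP-EM with informative
# priors to handle missing data, which sklearn's GaussianMixture does not.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in data: 60 multivariate series from two classes, each flattened
# to a single fully observed vector.
X = np.vstack([rng.normal(m, 1.0, (30, 40)) for m in (0.0, 1.5)])

K = np.zeros((len(X), len(X)))
n_gmms = 30
for q in range(n_gmms):                           # ensemble of randomised GMMs
    gmm = GaussianMixture(n_components=int(rng.integers(2, 6)),
                          covariance_type="diag", random_state=q).fit(X)
    P = gmm.predict_proba(X)                      # posterior component weights
    K += P @ P.T                                  # co-assignment similarity
K /= n_gmms
print("within- vs between-class similarity:",
      K[:30, :30].mean(), K[:30, 30:].mean())
```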