Automatic discovery of the statistical types of variables in a dataset

Ghahramani, Z; Valera, I

Authors: Z Ghahramani, I Valera
Publication date: 1 January 2017
Publisher: 34th International Conference on Machine Learning, ICML 2017

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of the variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Humboldt Research Fellowship for Postdoctoral Researchers, which funded this research during her stay at the Max Planck Institute for Software Systems. ATI Grant EP/N510129/1, EPSRC Grant EP/N014162/1, Googl

Available Versions

Apollo (Cambridge)

oai:www.repository.cam.ac.uk:1...

Last time updated on 05/12/2017

Apollo (Cambridge)

oai:www.repository.cam.ac.uk:1...

Last time updated on 12/01/2019