Converting to Optimization in Machine Learning: Perturb-and-MAP, Differential Privacy, and Program Synthesis
On a mathematical level, most computational problems encountered in machine learning are instances of one of four abstract, fundamental problems: sampling, integration, optimization, and search.
Thanks to the rich history of the respective mathematical fields, disparate methods with different properties have been developed for these four problem classes.
As a result, it can be beneficial to convert a problem from one abstract class into another, because the latter may come with insights, techniques, and algorithms better suited to the particular problem at hand.
In particular, this thesis contributes four new methods and generalizations of existing methods for converting specific non-optimization machine learning tasks into optimization problems with more appealing properties.
The first example is partition function estimation (an integration problem), where an existing algorithm for converting to MAP optimization -- the Gumbel trick -- is generalized into a broader family of algorithms, other instances of which have better statistical properties (a minimal sketch of the basic trick follows this abstract).
Second, this family of algorithms is further generalized to another integration problem, the problem of estimating Rényi entropies.
The third example shows how an intractable sampling problem, which arises when a database containing sensitive data is to be publicly released in a safe ("differentially private") manner, can be converted into an optimization problem using the theory of Reproducing Kernel Hilbert Spaces.
Finally, the fourth case study casts the challenging discrete search problem of program synthesis from input-output examples as a supervised learning task that can be efficiently tackled using gradient-based optimization.
In all four instances, the conversions result in novel algorithms with desirable properties.
In the first instance, new generalizations of the Gumbel trick can be used to construct statistical estimators of the partition function that achieve the same estimation error while using up to 40% fewer samples.
The second instance shows that unbiased estimators of the Rényi entropy can be constructed in the Perturb-and-MAP framework.
The main contribution of the third instance is theoretical: the conversion shows that it is possible to construct an algorithm for releasing synthetic databases that approximate databases containing sensitive data in a mathematically precise sense, and to prove results about their approximation errors.
Finally, the fourth conversion yields an algorithm for synthesising program source code from input-output examples that solves test problems 1-3 orders of magnitude faster than a wide range of baselines.
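For intuition, the basic Gumbel trick that the thesis generalizes can be sketched in a few lines: perturbing each state's potential with i.i.d. Gumbel noise and taking the maximum yields a Gumbel random variable whose mean is the log partition function plus the Euler-Mascheroni constant. The toy model below is illustrative, not an experiment from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: potentials phi(x) over a small state space, so the
# exact partition function Z = sum_x exp(phi(x)) is available for checking.
# In real applications only a MAP oracle max_x (phi(x) + noise(x)) is
# assumed tractable. The potentials here are arbitrary illustrative values.
phi = rng.normal(size=1000)
log_Z_exact = np.log(np.sum(np.exp(phi)))

# Gumbel trick: if gamma(x) ~ Gumbel(0, 1) i.i.d., then
# max_x (phi(x) + gamma(x)) ~ Gumbel(log Z, 1), whose mean is
# log Z + Euler-Mascheroni constant. Averaging M perturbed MAP values
# and subtracting that constant therefore estimates log Z.
M = 5000
noise = rng.gumbel(loc=0.0, scale=1.0, size=(M, phi.size))
perturbed_maps = (phi + noise).max(axis=1)
log_Z_estimate = perturbed_maps.mean() - np.euler_gamma

print(f"exact log Z:           {log_Z_exact:.4f}")
print(f"Gumbel trick estimate: {log_Z_estimate:.4f}")
```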
Generating Realistic Synthetic Population Datasets
Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and activities of synthetic, city-level populations. For instance, in epidemiology, synthetic population datasets are necessary to study disease propagation and intervention measures before implementation. In social science, synthetic population datasets are needed to understand how policy decisions might affect preferences and behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic and procedural characteristics of patient records without violating the confidentiality of individuals. To generate such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle to formalize a generative model, so that, in a statistically well-founded way, we optimally utilize the given prior information about the data and are unbiased otherwise. An efficient inference algorithm is designed to estimate the maximum entropy model, and we demonstrate how our approach is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application.
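As a rough illustration of the underlying idea (not the paper's algorithm), iterative proportional fitting computes the maximum entropy contingency table consistent with given marginals, from which a synthetic population can then be sampled. The marginals and category names below are made up for the demo.

```python
import numpy as np

# Target marginals (the "prior information"); values are made up.
age_marginal = np.array([0.3, 0.5, 0.2])       # e.g. young / adult / senior
income_marginal = np.array([0.4, 0.35, 0.25])  # e.g. low / mid / high

# Start from the uniform table (maximum entropy with no constraints) and
# alternately rescale rows and columns until both marginals are matched.
table = np.ones((3, 3)) / 9.0
for _ in range(100):
    table *= (age_marginal / table.sum(axis=1))[:, None]  # match row sums
    table *= income_marginal / table.sum(axis=0)          # match column sums

# Draw a synthetic population of 10,000 individuals from the fitted joint.
rng = np.random.default_rng(0)
cells = rng.choice(table.size, size=10_000, p=table.ravel())
age_idx, income_idx = np.unravel_index(cells, table.shape)
```

With only univariate marginals, the maximum entropy table is simply the product of the marginals; iterative scaling becomes genuinely useful when overlapping multi-way marginals are supplied, which is closer to the setting the paper addresses.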
Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey
Data quality is the key factor for the development of trustworthy AI in healthcare. A large volume of curated datasets with controlled confounding factors can help improve the accuracy, robustness and privacy of downstream AI algorithms. However, access to good-quality datasets is limited by the technical difficulty of data acquisition, and large-scale sharing of healthcare data is hindered by strict ethical restrictions. Data synthesis algorithms, which generate data with a distribution similar to that of real clinical data, can serve as a potential solution to the scarcity of good-quality data during the development of trustworthy AI. However, state-of-the-art data synthesis algorithms, especially deep learning algorithms, focus more on imaging data while neglecting the synthesis of non-imaging healthcare data, including clinical measurements, medical signals and waveforms, and electronic healthcare records (EHRs). Thus, in this paper, we review synthesis algorithms, particularly for non-imaging medical data, with the aim of providing trustworthy AI in this domain. This tutorial-styled review paper provides comprehensive descriptions of non-imaging medical data synthesis, covering algorithms, evaluations, limitations and future research directions.
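For context, one classical family of non-imaging synthesisers that surveys of this kind cover is copula-based models. Below is a minimal, illustrative sketch (not a method from the paper) of a Gaussian-copula synthesiser for continuous tabular data such as clinical measurements; all names and numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def copula_synthesize(real: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Gaussian-copula synthesiser for continuous tabular data: model each
    column's marginal empirically, and couple the columns through the
    correlation of their normal scores."""
    rng = np.random.default_rng(seed)
    # Map each column to normal scores via its empirical CDF (ranks).
    ranks = stats.rankdata(real, axis=0) / (real.shape[0] + 1)
    z = stats.norm.ppf(ranks)
    corr = np.corrcoef(z, rowvar=False)
    # Sample correlated normals, then invert through each empirical marginal.
    z_new = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n)
    u_new = stats.norm.cdf(z_new)
    synth = np.empty_like(u_new)
    for j in range(real.shape[1]):
        synth[:, j] = np.quantile(real[:, j], u_new[:, j])
    return synth

# Demo on made-up "clinical measurements" (systolic BP, diastolic BP, HR).
rng = np.random.default_rng(1)
real = rng.multivariate_normal(
    [120, 80, 70],
    [[100, 40, 10], [40, 64, 8], [10, 8, 25]],
    size=500,
)
synth = copula_synthesize(real, n=1000)
print(np.corrcoef(synth, rowvar=False).round(2))  # correlations preserved
```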
Partially Synthesised Dataset to Improve Prediction Accuracy (Case Study: Prediction of Heart Diseases)
Real-world data sources, such as statistical agencies, library data banks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, such data are often unavailable or inaccessible, for the following reasons. First, privacy and confidentiality concerns, since the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time-consuming. Third, the data may simply not exist, particularly in newly arising research subjects. Therefore, many studies have advocated the use of fully and/or partially synthesised data instead of real-world data, owing to its simplicity of creation, the relatively small amount of time required, and the ability to generate sufficient quantities to fit the requirements. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart diseases from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data (a sketch of this idea follows the abstract). In the conducted experiment, more than 85% of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with World Health Organisation criteria. The analysis revealed an improvement in the variance of the data using the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, in which the partially synthesised data considerably improved the prediction of heart diseases: the majority of classifiers approximately doubled their predictive performance using the extra risk factor.
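To make the rule-based partial synthesis concrete, the sketch below appends a single rule-derived risk-factor column to otherwise observed records. The column name and thresholds are hypothetical placeholders; the study's actual rules follow World Health Organisation criteria that the abstract does not spell out.

```python
import pandas as pd

def add_synthetic_risk_factor(df: pd.DataFrame) -> pd.DataFrame:
    """Rule-based partial synthesis: keep all observed columns and append
    one synthesised risk-factor column derived by fixed rules. The column
    name and cut-offs are hypothetical placeholders, not the WHO criteria
    used in the study."""
    out = df.copy()
    out["bp_risk"] = pd.cut(
        df["systolic_bp"],                      # assumed existing column
        bins=[0, 120, 140, 160, float("inf")],  # placeholder thresholds
        labels=["normal", "elevated", "high", "very_high"],
    )
    return out

# Example: most of each record stays observed; only "bp_risk" is synthetic.
records = pd.DataFrame({"age": [54, 61, 47], "systolic_bp": [118, 152, 134]})
print(add_synthetic_risk_factor(records))
```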