Missing Data Imputation using Optimal Transport
Missing data is a crucial issue when applying machine learning algorithms to
real-world datasets. Starting from the simple assumption that two batches
extracted randomly from the same dataset should share the same distribution, we
leverage optimal transport distances to quantify that criterion and turn it
into a loss function to impute missing data values. We propose practical
methods to minimize these losses using end-to-end learning, which may or may
not exploit parametric assumptions on the underlying distributions of values.
We evaluate our methods on datasets from the UCI repository, in missing
completely at random (MCAR), missing at random (MAR) and missing not at random
(MNAR) settings. These experiments show that OT-based methods match or
outperform state-of-the-art imputation methods, even for high percentages of
missing values.
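The batch criterion described in the abstract above can be illustrated with a small sketch. The abstract only says "optimal transport distances"; the entropy-regularised Sinkhorn divergence used here is an assumption, chosen as one common option, and the function names, the regularisation strength `eps` and the iteration count are all hypothetical. The sketch only evaluates the loss between two batches; an actual imputation procedure would also need gradients of this loss with respect to the imputed entries, which is not shown.

```python
import numpy as np


def _softmin(z, eps, axis):
    """Numerically stable -eps * log(sum(exp(-z / eps))) along one axis."""
    zmin = z.min(axis=axis, keepdims=True)
    lse = np.log(np.exp(-(z - zmin) / eps).sum(axis=axis, keepdims=True))
    return (zmin - eps * lse).squeeze(axis)


def entropic_ot(x, y, eps=0.5, n_iters=200):
    """Approximate dual value of entropic OT between two uniform point clouds."""
    # Squared Euclidean ground cost between every pair of rows.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    log_a = -np.log(x.shape[0])  # log of the uniform weight on each row of x
    log_b = -np.log(y.shape[0])  # log of the uniform weight on each row of y
    f = np.zeros(x.shape[0])
    g = np.zeros(y.shape[0])
    # Log-domain Sinkhorn iterations on the dual potentials f and g.
    for _ in range(n_iters):
        f = _softmin(cost - g[None, :] - eps * log_b, eps, axis=1)
        g = _softmin(cost - f[:, None] - eps * log_a, eps, axis=0)
    # With uniform weights the dual value is mean(f) + mean(g).
    return float(f.mean() + g.mean())


def sinkhorn_divergence(x, y, eps=0.5, n_iters=200):
    """Debiased entropic OT: OT(x, y) - OT(x, x) / 2 - OT(y, y) / 2."""
    return (entropic_ot(x, y, eps, n_iters)
            - 0.5 * entropic_ot(x, x, eps, n_iters)
            - 0.5 * entropic_ot(y, y, eps, n_iters))


# Two batches drawn from the same distribution, and one shifted copy.
rng = np.random.default_rng(0)
batch_1 = rng.normal(size=(32, 2))
batch_2 = rng.normal(size=(32, 2))
shifted = batch_2 + 3.0

same = sinkhorn_divergence(batch_1, batch_2)
diff = sinkhorn_divergence(batch_1, shifted)
print("same distribution:", same)
print("shifted batch:", diff)
```

In this toy setting the divergence between the two batches from the same distribution should be much smaller than the divergence to the shifted batch, which is the property that makes such a quantity usable as an imputation loss.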
A novel feature selection framework for incomplete data
Feature selection on incomplete datasets is an exceptionally challenging
task. Existing methods address this challenge by first employing imputation
methods to complete the incomplete data and then conducting feature selection
based on the imputed data. Since imputation and feature selection are entirely
independent steps, the importance of features cannot be considered during
imputation. However, in real-world datasets, different features
have varying degrees of importance. To address this, we propose a novel
feature selection framework for incomplete data that takes feature importance
into account. The framework consists mainly of two alternating iterative
stages: the M-stage and the W-stage. In the M-stage, missing values are imputed
based on a given feature importance vector and multiple initial imputation
results. In the W-stage, an improved ReliefF algorithm is employed to learn the
feature importance vector from the imputed data. The feature importance vector
obtained in the current iteration of the W-stage then serves as input for the
next iteration of the M-stage. Experimental results on both artificially
generated and real incomplete datasets demonstrate that the proposed method
significantly outperforms other approaches.
Enhancing imputation techniques performance utilizing uncertainty aware predictors and adversarial learning
One crucial problem in applying machine learning algorithms to real-world datasets is missing data. The objective of data imputation is to fill in the missing values of a dataset so that it resembles the complete dataset as accurately as possible. Many methods have been proposed in the literature, differing mostly in the objective function and the types of variables considered. Traditional machine learning methods perform poorly when the relationship between features is nonlinear and complex. Recently, deep learning methods have been introduced to estimate the data distribution and generate values for missing entries. However, these methods were originally developed for large datasets and specialised data types such as image, video, and text. Thus, adapting them to the small, structured datasets that are prevalent in real-world applications is not straightforward and often yields unsatisfactory results. Moreover, neither type of method considers uncertainty in the imputed data. We address these issues by developing a simple neural network-based architecture that works well with small tabular datasets and by using a novel adversarial strategy to estimate the uncertainty of imputed data. The estimated uncertainty scores of the features are then passed to the imputer module, which fills in the missing values by paying more attention to the more reliable feature values. The result is an uncertainty-aware imputer with promising performance. Extensive experiments on several real-world datasets confirm that the proposed methods considerably outperform state-of-the-art imputers, while their execution time remains comparable to that of peer state-of-the-art methods.
Solving the transportation problem with machine learning methods
Master's thesis: 87 pages, 27 figures, 24 tables, 21 sources.
This work considers the classical optimal transportation problem. Known
methods for solving it are studied, together with their advantages and
disadvantages and the necessary conditions for the existence of an optimal
solution. In addition, a machine learning method for solving the problem is
proposed, based on building and training a model built on a generative neural
network.
The work reviews general information on methods for solving the optimal
transportation problem in the unbalanced case and at scale. The results for
three different types of problems solved by the machine learning method are
analyzed.
The object of the study is the classical optimal transportation problem in
three different forms.
The subject of the study is machine learning methods, in particular the
generative adversarial neural network.
Immersive analytics for oncology patient cohorts
This thesis proposes a novel interactive immersive analytics tool, Virtual Reality to Observe Oncology data Models (VROOM), and methods to interrogate cancer patient cohorts in an immersive virtual environment. The overall objective is to develop an immersive analytics platform, which includes a data analytics pipeline from raw gene expression data to immersive visualisation on virtual and augmented reality platforms using a game engine. Unity3D has been used to implement the visualisation. The work in this thesis could provide oncologists and clinicians with an interactive visualisation and visual analytics platform that helps them drive their analysis of treatment efficacy and achieve the goal of evidence-based personalised medicine. The thesis integrates the latest discoveries and developments in cancer patient prognosis, immersive technologies, machine learning, decision support systems and interactive visualisation to form an immersive analytics platform for complex genomic data. The experimental paradigm followed is the study of transcriptomics in cancer samples. Specifically, the thesis investigates gene expression data to determine the biological similarity between patients revealed by the transcriptomic profiles of their tumour samples, which show the genes active in each patient.
In summary, the thesis contributes i) a novel immersive analytics platform for patient cohort data interrogation in similarity space, where the similarity space is based on the patients' biological and genomic similarity; ii) an effective immersive environment optimisation design based on a usability study of exocentric and egocentric visualisation, and audio and sound design optimisation; iii) an integration of trusted and familiar 2D biomedical visual analytics methods into the immersive environment; iv) a novel use of game theory as the decision-making system engine to support the analytics process, and an application of optimal transport theory to missing data imputation to ensure that the data distribution is preserved; and v) case studies to showcase the real-world application of the visualisation and its effectiveness.