A Missing Value Filling Model Based on Feature Fusion Enhanced Autoencoder
With the advent of the big data era, the data quality problem is becoming
more critical. Among many factors, data with missing values is one primary
issue, and thus developing effective imputation models is a key topic in the
research community. Recently, a major research direction is to employ neural
network models such as self-organizing mappings or automatic encoders for
filling missing values. However, these classical methods can hardly discover
interrelated features and common features simultaneously among data attributes.
In particular, a typical problem of classical autoencoders is that they often
learn invalid constant mappings, which dramatically hurts the filling
performance. To solve the above-mentioned problems, we propose a
missing-value-filling model based on a feature-fusion-enhanced autoencoder. We
first incorporate into an autoencoder a hidden layer that consists of
de-tracking neurons and radial basis function neurons, which can enhance the
ability of learning interrelated features and common features. Besides, we
develop a missing value filling strategy based on dynamic clustering that is
incorporated into an iterative optimization process. This design can enhance
the multi-dimensional feature fusion ability and thus improves the dynamic
collaborative missing-value-filling performance. The effectiveness of the
proposed model is validated by extensive experiments against a variety of
baseline methods on thirteen data sets.
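The iterative filling loop described above can be sketched in simplified form. The snippet below is a minimal stand-in: instead of the paper's feature-fusion-enhanced autoencoder with de-tracking and RBF neurons, it uses a plain linear bottleneck (a truncated SVD, which is equivalent to a linear autoencoder) purely for illustration. Missing cells are initialised with column means, then repeatedly overwritten with the bottleneck reconstruction:

```python
import numpy as np

def iterative_autoencoder_impute(X, rank=2, n_iter=50):
    """Fill NaNs by alternating between a low-rank reconstruction and
    re-imputation of the missing entries.

    A rank-k truncated SVD plays the role of the autoencoder bottleneck
    here; the paper's actual model uses a far richer hidden layer."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        recon = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # bottleneck reconstruction
        filled[mask] = recon[mask]                     # overwrite only missing cells
    return filled
```

On data that is genuinely low-rank, this fixed-point iteration converges to the value consistent with the observed entries, which is the same dynamic the model's iterative optimization exploits.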
Computational intelligence techniques for missing data imputation
Despite considerable advances in missing data imputation techniques over the last three decades, the
problem of missing data remains largely unsolved. Many techniques have emerged in the literature
as candidate solutions, including Expectation Maximisation (EM) and the
combination of auto-associative neural networks and genetic algorithms (NN-GA).
The merits of both these techniques have been discussed at length in the
literature, but they have never been compared to each other. This thesis
contributes to knowledge by, firstly, conducting a comparative study of these
two techniques.
The significance of the difference in performance of the methods is presented. Secondly, predictive
analysis methods suitable for the missing data problem are presented. The
predictive analysis is aimed at determining whether the data in question are
predictable and, hence, at helping to choose the estimation techniques
accordingly. Thirdly, a novel treatment of missing data for online
condition monitoring problems is presented. An ensemble of three autoencoders together with
hybrid Genetic Algorithms (GA) and fast simulated annealing was used to approximate missing
data. Several significant insights were deduced from the simulation results. It was deduced that for
the problem of missing data using computational intelligence approaches, the choice of optimisation
methods plays a significant role in prediction. Although hybrid GA and Fast
Simulated Annealing (FSA) were observed to converge to the same search space
and to almost the same values, they differ significantly in duration. This
unique contribution demonstrated that particular attention must be paid to the
choice of optimisation techniques and their decision boundaries.
Another unique contribution of this work was not only to demonstrate that
dynamic programming is applicable to the problem of missing data, but also to
show that it is efficient in addressing it. An NN-GA model was built to impute
missing data using the principle of dynamic programming. This approach makes it
possible to modularise the missing data problem for maximum efficiency. With
the advancements in parallel computing, various modules of
the problem could be solved by different processors, working together in parallel. Furthermore, a
method for imputing missing data in non-stationary time series that learns
incrementally even in the presence of concept drift is proposed. This method
measures heteroskedasticity to detect concept drift and exploits an online
learning technique. Its introduction opens a new direction for research in
which missing data can be estimated for non-stationary applications. Many other
methods remain to be developed so that they can be compared to the approach
proposed in this thesis.
Another novel technique for dealing with missing data in on-line condition
monitoring problems was also presented and studied. The problem of
classification in the presence of missing data was addressed,
where no attempts are made to recover the missing values. The problem domain was then extended
to regression. The proposed technique performs better than the NN-GA approach, both in accuracy
and time efficiency during testing. The advantage of the proposed technique is that it eliminates
the need for finding the best estimate of the data, and hence, saves time. Lastly, instead of using
complicated techniques to estimate missing values, an imputation approach based on rough sets is
explored. Empirical results obtained using both real and synthetic data are
given, and they provide valuable and promising insight into the problem of
missing data. The work has confirmed that rough sets can be reliable for
missing data estimation in large, real-world databases.
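The NN-GA idea described above can be illustrated with a minimal sketch: a model trained on complete records supplies a reconstruction-error function, and an evolutionary search proposes values for the missing entries that minimise it. The `recon_error` callback, the population size, and the truncation-selection-with-Gaussian-mutation scheme below are simplifying assumptions for illustration, not the thesis's exact GA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_style_impute(record, missing_idx, recon_error, pop=40, gens=60,
                    lo=-3.0, hi=3.0):
    """Evolve candidate values for the missing entries of one record so that
    a pretrained model's reconstruction error is minimised.

    recon_error: callable mapping a fully filled record to a scalar error
    (e.g. the autoencoder's squared reconstruction residual)."""
    n = len(missing_idx)
    population = rng.uniform(lo, hi, size=(pop, n))
    for _ in range(gens):
        scores = []
        for cand in population:
            trial = record.copy()
            trial[missing_idx] = cand
            scores.append(recon_error(trial))
        order = np.argsort(scores)
        elite = population[order[: pop // 4]]           # truncation selection
        children = elite[rng.integers(0, len(elite), size=pop - len(elite))]
        children = children + rng.normal(0.0, 0.1, children.shape)  # mutation
        population = np.vstack([elite, children])       # elitism keeps the best
    out = record.copy()
    out[missing_idx] = population[0]                    # best evaluated individual
    return out
```

The key design point the thesis compares is exactly this inner optimiser: swapping the evolutionary step for fast simulated annealing reaches a similar optimum but with very different runtime.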
Recovering Loss to Followup Information Using Denoising Autoencoders
Loss to followup is a significant issue in healthcare and has serious
consequences for a study's validity and cost. Methods available at present for
recovering loss to followup information are restricted by their expressive
capabilities and struggle to model highly non-linear relations and complex
interactions. In this paper we propose a model based on overcomplete denoising
autoencoders to recover loss to followup information. Designed to work with
high volume data, results on various simulated and real life datasets show our
model is appropriate under varying dataset and loss to followup conditions and
outperforms state-of-the-art methods by a wide margin in some scenarios, while
preserving the dataset utility for final analysis.
Comment: Copyright IEEE 2017, IEEE International Conference on Big Data (Big Data)
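The denoising-autoencoder recipe the paper builds on has two parts: training pairs are made by corrupting clean records, and at inference the trained network's output is kept only at the missing positions. The sketch below shows both steps with zero-masking corruption; the masking fraction and helper names are illustrative assumptions, and the network itself is left abstract as a `model` callable:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt(X, frac=0.3):
    """Build (corrupted, clean) training pairs for a denoising autoencoder:
    a random subset of entries is zero-masked, mimicking loss to followup,
    and the network is trained to reconstruct the clean record."""
    mask = rng.random(X.shape) < frac
    return np.where(mask, 0.0, X), mask

def dae_impute(model, record):
    """At inference, zero the missing (NaN) entries, run the trained network,
    and keep its outputs only at the missing positions."""
    missing = np.isnan(record)
    recon = model(np.where(missing, 0.0, record))
    return np.where(missing, recon, record)
```

An overcomplete architecture, as used in the paper, simply makes the hidden layer wider than the input, giving the network capacity for the highly non-linear relations the abstract mentions.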
Autoencoder for clinical data analysis and classification : data imputation, dimensional reduction, and pattern recognition
Over the last decade, research has focused on machine learning and data mining to develop frameworks that can improve data analysis and output performance; to build accurate decision support systems that benefit from real-life datasets. This leads to the field of clinical data analysis, which has attracted a significant amount of interest in the computing, information systems, and medical fields. To create and develop models by machine learning algorithms, there is a need for a particular type of data for the existing algorithms to build an efficient model. Clinical datasets pose several issues that can affect the classification of the dataset: missing values, high dimensionality, and class imbalance. In order to build a framework for mining the data, it is necessary first to preprocess data, by eliminating patients’ records that have too many missing values, imputing missing values, addressing high dimensionality, and classifying the data for decision support.This thesis investigates a real clinical dataset to solve their challenges. Autoencoder is employed as a tool that can compress data mining methodology, by extracting features and classifying data in one model. The first step in data mining methodology is to impute missing values, so several imputation methods are analysed and employed. Then high dimensionality is demonstrated and used to discard irrelevant and redundant features, in order to improve prediction accuracy and reduce computational complexity. Class imbalance is manipulated to investigate the effect on feature selection algorithms and classification algorithms.The first stage of analysis is to investigate the role of the missing values. Results found that techniques based on class separation will outperform other techniques in predictive ability. The next stage is to investigate the high dimensionality and a class imbalance. 
However it was found a small set of features that can improve the classification performance, the balancing class does not affect the performance as much as imbalance class
Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations
Modeling multivariate time series as temporal signals over a (possibly
dynamic) graph is an effective representational framework that allows for
developing models for time series analysis. In fact, discrete sequences of
graphs can be processed by autoregressive graph neural networks to recursively
learn representations at each discrete point in time and space. Spatiotemporal
graphs are often highly sparse, with time series characterized by multiple,
concurrent, and long sequences of missing data, e.g., due to the unreliable
underlying sensor network. In this context, autoregressive models can be
brittle and exhibit unstable learning dynamics. The objective of this paper is,
then, to tackle the problem of learning effective models to reconstruct, i.e.,
impute, missing data points by conditioning the reconstruction only on the
available observations. In particular, we propose a novel class of
attention-based architectures that, given a set of highly sparse discrete
observations, learn a representation for points in time and space by exploiting
a spatiotemporal propagation architecture aligned with the imputation task.
Representations are trained end-to-end to reconstruct observations w.r.t. the
corresponding sensor and its neighboring nodes. Compared to the state of the
art, our model handles sparse data without propagating prediction errors or
requiring a bidirectional model to encode forward and backward time
dependencies. Empirical results on representative benchmarks show the
effectiveness of the proposed method.
Comment: Accepted at NeurIPS 2022
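The central mechanism above, conditioning attention only on available observations, can be illustrated at the level of a single sensor. The toy function below (names and the single-step setting are mine, not the paper's) computes dot-product attention from a target node's embedding to its neighbours and masks unobserved neighbours out of the softmax, so no predicted value is ever propagated:

```python
import numpy as np

def attention_impute(target_emb, neighbor_embs, neighbor_vals, observed):
    """Impute one sensor's value as an attention-weighted combination of its
    neighbours' observed readings only.

    observed: boolean array; masked-out neighbours get weight exactly zero.
    Assumes at least one neighbour is observed."""
    scores = neighbor_embs @ target_emb                 # dot-product attention
    scores = np.where(observed, scores, -np.inf)        # mask missing neighbours
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return float(weights @ np.where(observed, neighbor_vals, 0.0))
```

In the full model the embeddings are learned end-to-end and the propagation runs across time as well as space; here they are fixed inputs, but the masking shows why the approach needs neither error propagation nor a bidirectional encoder.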
Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the first priorities are patient care and billing. Because of this, the data are not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons, ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the EHR. In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy; we also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.
Designing the next generation intelligent transportation sensor system using big data driven machine learning techniques
Accurate traffic data collection is essential for supporting advanced traffic management system operations. This study investigated a large-scale data-driven sequential traffic sensor health monitoring (TSHM) module that can be used to monitor sensor health conditions over large traffic networks. Our proposed module consists of three sequential steps for detecting different types of abnormal sensor issues. The first step detects sensors with abnormally high missing data rates, while the second step uses clustering anomaly detection to detect sensors reporting abnormal records. The final step introduces a novel Bayesian changepoint modeling technique to detect sensors reporting abnormal traffic data fluctuations by assuming a constant vehicle length distribution based on average effective vehicle length (AEVL). Our proposed method is then compared with two benchmark algorithms to show its efficacy. Results obtained by applying our method to the statewide traffic sensor data of Iowa show it can successfully detect different classes of sensor issues. This demonstrates that sequential TSHM modules can help transportation agencies determine traffic sensors’ exact problems, thereby enabling them to take the required corrective steps.
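The first step of the sequential pipeline above, screening for sensors with abnormally high missing-data rates, can be sketched with a simple outlier rule. The abstract does not specify the exact criterion, so the z-score threshold and the function name below are assumptions for illustration:

```python
import numpy as np

def flag_high_missing_sensors(counts_missing, counts_total, z_thresh=3.0):
    """Step-1 screen of a sensor-health pipeline: flag sensors whose
    missing-data rate is an outlier relative to the network-wide
    distribution (simple z-score rule; the study's actual detector
    may differ)."""
    rates = counts_missing / counts_total
    mu, sigma = rates.mean(), rates.std()
    if sigma == 0:                      # all sensors identical: nothing to flag
        return np.zeros(len(rates), dtype=bool)
    return (rates - mu) / sigma > z_thresh
```

Sensors passing this screen would then proceed to the clustering-based record check and the Bayesian changepoint test described above.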
The second research objective focuses on traffic data imputation after we discard the anomalous/missing data collected from failing traffic sensors. Sufficient high-quality traffic data are a crucial component of various Intelligent Transportation System (ITS) applications and research related to congestion prediction, speed prediction, incident detection, and other traffic operation tasks. Nonetheless, missing traffic data are a common issue in sensor data, inevitable for several reasons, such as malfunctioning, poor maintenance or calibration, and intermittent communications. Such missing data issues often make data analysis and decision-making complicated and challenging. In this study, we have developed a generative adversarial network (GAN) based traffic sensor data imputation framework (TSDIGAN) to efficiently reconstruct the missing data by generating realistic synthetic data. In recent years, GANs have shown impressive success in image data generation. However, generating traffic data by taking advantage of GAN-based modeling is a challenging task, since traffic data have strong time dependency. To address this problem, we propose a novel time-dependent encoding method called the Gramian Angular Summation Field (GASF) that converts the problem of traffic time-series data generation into that of image generation. We have evaluated and tested our proposed model using the benchmark dataset provided by Caltrans Performance Management Systems (PeMS). This study shows that the proposed model can significantly improve the traffic data imputation accuracy in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) compared to state-of-the-art models on the benchmark dataset. Further, the model achieves reasonably high accuracy in imputation tasks even under a very high missing data rate (>50%), which shows the robustness and efficiency of the proposed model.
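The GASF encoding mentioned above has a standard closed form: the series is rescaled to [-1, 1], each value is mapped to an angle phi = arccos(x), and the image is the matrix of summed-angle cosines, GASF[i, j] = cos(phi_i + phi_j). A minimal implementation:

```python
import numpy as np

def gasf(series):
    """Gramian Angular Summation Field: rescale a 1-D series to [-1, 1],
    map each value to an angle phi = arccos(x), and form the symmetric
    matrix cos(phi_i + phi_j), turning the time series into a 2-D image
    that an image-generation GAN can model."""
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0  # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)                            # guard rounding error
    phi = np.arccos(x)
    return np.cos(phi[:, None] + phi[None, :])
```

Because the polar-angle mapping is invertible on the rescaled range, the diagonal of the GASF (cos(2*phi_i) = 2*x_i^2 - 1) preserves enough information to recover the original series, which is what lets the GAN's generated images be decoded back into imputed traffic readings.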
Besides loop and radar sensors, traffic cameras have shown great ability to provide insightful traffic information using image and video processing techniques. Therefore, the third and final part of this work introduces an end-to-end, real-time, cloud-enabled intelligent video analysis (IVA) framework to support the development of the future smart city. As Artificial Intelligence (AI) grows rapidly, Computer Vision (CV) techniques are expected to significantly improve the development of intelligent transportation systems (ITS), which are anticipated to be a key component of future Smart City (SC) frameworks. Powered by computer vision techniques, converting existing traffic cameras into connected "smart sensors", called intelligent video analysis (IVA) systems, has shown great capability of producing insightful data to support ITS applications. However, developing such IVA systems for large-scale, real-time application deserves further study, as current research efforts focus more on model effectiveness than on model efficiency. Therefore, we have introduced a real-time, large-scale, cloud-enabled traffic video analysis framework using NVIDIA DeepStream, a streaming analysis toolkit for AI-based video and image analysis. In this study, we have evaluated the technical and economic feasibility of our proposed framework to help traffic agencies build IVA systems more efficiently. Our study shows that the daily operating cost of our proposed framework on Google Cloud Platform (GCP) is less than $0.14 per camera, and that, compared with manual inspections, our framework achieves an average vehicle-counting accuracy of 83.7% on sunny days.
Deep learning for the early detection of harmful algal blooms and improving water quality monitoring
Climate change will affect how water sources are managed and monitored. The frequency of algal blooms will increase with climate change, as it presents favourable conditions for the reproduction of phytoplankton. During monitoring, possible sensory failures in monitoring systems result in partially filled data, which may affect critical systems. Therefore, imputation becomes necessary to decrease error and increase data quality. This work investigates two issues in water quality data analysis: improving data quality and anomaly detection. It consists of three main topics: data imputation, early algal bloom detection using in-situ data, and early algal bloom detection using multiple modalities. The data imputation problem is addressed by experimenting with various methods on a water quality dataset that includes four locations around the North Sea and the Irish Sea with different characteristics and high missing rates, testing model generalisability. A novel neural network architecture with self-attention is proposed in which imputation is done in a single pass, reducing execution time. The self-attention components increase the interpretability of the imputation process at each stage of the network, providing knowledge to domain experts. After data curation, algal activity is predicted using transformer networks, between 1 and 7 days ahead, and the importance of the input with regard to the output of the prediction model is explained using SHAP, aiming to explain model behaviour to domain experts, which is overlooked in previous approaches. The prediction model improves bloom detection performance by 5% on average, and the explanation summarizes the complex structure of the model as input-output relationships. Performance improvements on the initial unimodal bloom detection model are made by incorporating multiple modalities into the detection process, which were previously used only for validation purposes.
The problem of missing data is also tackled by using coordinated representations, replacing low-quality in-situ data with satellite data and vice versa, instead of imputation, which may yield biased results.
Delivering Reliable AI to Clinical Contexts: Addressing the Challenge of Missing Data
Clinical data are essential in the medical domain, ensuring quality of care and improving
decision-making. However, their heterogeneous and incomplete nature leads to a ubiquity
of data quality problems, particularly missing values. Inevitable challenges arise in
delivering reliable Decision Support Systems (DSSs), as missing data yield negative effects
on the learning process of Machine Learning models. The interest in developing missing
value imputation strategies has been growing, in an endeavour to overcome this issue.
This dissertation aimed to study missing data and their relationships with observed
values, and to later employ that information in a technique that addresses the predicaments
posed by incomplete datasets in real-world scenarios. Moreover, the concept of correlation
was explored within the context of missing value imputation, a promising but rather
overlooked approach in biomedical research.
First, a comprehensive correlational study was performed, which considered key
aspects from missing data analysis. Afterwards, the gathered knowledge was leveraged to
create three novel correlation-based imputation techniques. These were not only validated
on datasets with a controlled and synthetic missingness, but also on real-world medical
datasets. Their performance was evaluated against competing imputation methods, both
traditional and state-of-the-art.
The contributions of this dissertation encompass a systematic view of theoretical concepts
regarding the analysis and handling of missing values. Additionally, an extensive
literature review concerning missing data imputation was conducted, which comprised a
comparative study of ten methods under diverse missingness conditions. The proposed
techniques exhibited similar results when compared to their competitors, sometimes
even superior in terms of imputation precision and classification performance, evaluated
through the Mean Absolute Error and the Area Under the Receiver Operating Characteristic
curve, respectively. Therefore, this dissertation corroborates the potential of correlation
to improve the robustness of DSSs to missing values, and provides answers to current
flaws shared by correlation-based imputation strategies in real-world medical problems.
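The core idea, letting inter-feature correlation drive imputation, can be illustrated with a minimal sketch. The snippet below fills each missing cell via a simple linear regression on the feature most correlated with its column (computed on complete rows); it is an assumption-laden illustration of the general approach, not one of the dissertation's three proposed techniques:

```python
import numpy as np

def correlation_impute(X):
    """Correlation-guided imputation sketch: for every missing cell, pick the
    feature most correlated with that cell's column (correlations estimated
    on complete rows) and predict the value with univariate linear
    regression; falls back to the column mean if the predictor is missing."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    C = np.corrcoef(X[complete], rowvar=False)
    np.fill_diagonal(C, 0.0)                            # never predict from itself
    for i, j in zip(*np.where(np.isnan(X))):
        k = np.nanargmax(np.abs(C[j]))                  # best correlated feature
        xs, ys = X[complete, k], X[complete, j]
        slope = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)
        intercept = ys.mean() - slope * xs.mean()
        val = X[i, k]
        out[i, j] = slope * val + intercept if not np.isnan(val) else ys.mean()
    return out
```

This captures why correlation helps: when features are strongly related, the observed part of a record carries most of the information needed to recover the missing part, improving both imputation precision and downstream classification.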