509 research outputs found
Time series cluster kernels to exploit informative missingness and incomplete label information
The time series cluster kernel (TCK) provides a powerful tool for analysing multivariate time series subject to missing data. TCK is designed using an ensemble learning approach in which Bayesian mixture
models form the base models. Because of the Bayesian approach, TCK can naturally deal with missing
values without resorting to imputation, and the ensemble strategy ensures robustness to hyperparameters, making it particularly well suited for unsupervised learning.
However, TCK assumes that data are missing at random and that the underlying missingness mechanism is ignorable, i.e. uninformative, an assumption that does not hold in many real-world applications, such as
medicine. To overcome this limitation, we present a kernel capable of exploiting the potentially rich information in the missing values and patterns, as well as the information from the observed data. In our
approach, we create a representation of the missing pattern, which is incorporated into mixed mode mixture models in such a way that the information provided by the missing patterns is effectively exploited.
Moreover, we also propose a semi-supervised kernel, capable of taking advantage of incomplete label
information to learn more accurate similarities.
Experiments on benchmark data, as well as a real-world case study of patients described by longitudinal
electronic health record data who potentially suffer from hospital-acquired infections, demonstrate the
effectiveness of the proposed method.
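The abstract describes incorporating a representation of the missing patterns alongside the observed values. Not taken from the paper itself, the following is a minimal sketch of one common way to build such a representation: splitting a multivariate time series into an observed-value array and a binary missingness mask, so that a downstream model (such as a mixed mode mixture model) can exploit both. The function name and zero-fill convention are illustrative assumptions.

```python
import numpy as np

def mask_representation(X):
    """Split a multivariate time series with NaNs into an observed-value
    array and a binary missingness mask (illustrative helper, not the
    paper's implementation).

    X : array of shape (timesteps, variables), NaN marking missing entries.
    Returns (values, mask) where mask[t, v] = 1.0 if X[t, v] was observed.
    """
    mask = (~np.isnan(X)).astype(float)   # 1 = observed, 0 = missing
    values = np.nan_to_num(X, nan=0.0)    # zero-fill for storage only; the
                                          # mask records what is actually real
    return values, mask

# Toy example: 3 timesteps, 2 variables, two missing lab values
X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 0.7]])
V, M = mask_representation(X)
```

The key point is that the mask itself carries information: if, say, a lab test is ordered only for sicker patients, the pattern of ones and zeros is predictive on its own, which is what "informative missingness" refers to.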
Multi-Output Gaussian Processes for Crowdsourced Traffic Data Imputation
Traffic speed data imputation is a fundamental challenge for data-driven
transport analysis. In recent years, with the ubiquity of GPS-enabled devices
and the widespread use of crowdsourcing alternatives for the collection of
traffic data, transportation professionals increasingly look to such
user-generated data for many analysis, planning, and decision support
applications. However, due to the mechanics of the data collection process,
crowdsourced traffic data such as probe-vehicle data is highly prone to missing
observations, making accurate imputation crucial for the success of any
application that makes use of that type of data. In this article, we propose
the use of multi-output Gaussian processes (GPs) to model the complex spatial
and temporal patterns in crowdsourced traffic data. While the Bayesian
nonparametric formalism of GPs allows us to model observation uncertainty, the
multi-output extension based on convolution processes effectively enables us to
capture complex spatial dependencies between nearby road segments. Using 6
months of crowdsourced traffic speed data or "probe vehicle data" for several
locations in Copenhagen, the proposed approach is empirically shown to
significantly outperform popular state-of-the-art imputation methods.
Comment: 10 pages, IEEE Transactions on Intelligent Transportation Systems, 201
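As a rough illustration of GP-based imputation (a single-output toy with synthetic data, not the paper's multi-output convolution-process model), the posterior mean at unobserved times can serve as the imputed value, with the posterior standard deviation quantifying observation uncertainty:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical example: traffic speed sampled every 5 minutes over 2 hours
rng = np.random.default_rng(0)
t = np.arange(0, 120, 5, dtype=float)
speed = 50 + 10 * np.sin(t / 20) + rng.normal(0, 1, t.size)

# Simulate crowdsourced gaps: some probe-vehicle observations are missing
observed = np.ones(t.size, dtype=bool)
observed[[4, 5, 6, 15]] = False

# RBF captures smooth temporal correlation; WhiteKernel models sensor noise
kernel = RBF(length_scale=15.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t[observed, None], speed[observed])

# Posterior mean at the missing times is the imputation; sd its uncertainty
mu, sd = gp.predict(t[~observed, None], return_std=True)
```

The multi-output extension in the paper goes further by sharing latent processes across nearby road segments, so that a well-observed segment can inform imputation on a sparsely observed neighbour.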
A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are also both crucial and challenging.
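A common harness for comparing imputation techniques, in the spirit of the study above though not its actual protocol, is to hide known values, impute them, and score each method against the ground truth. The sketch below assumes MCAR masking for simplicity and compares mean imputation with an iterative (MICE-style) imputer:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
X_full = rng.normal(size=(200, 4))
X_full[:, 1] += 0.8 * X_full[:, 0]   # correlated columns give MICE leverage

# Hide 20% of entries completely at random (a simplifying MCAR assumption;
# real EHR missingness is rarely this benign)
mask = rng.random(X_full.shape) < 0.2
X_miss = X_full.copy()
X_miss[mask] = np.nan

def rmse(imputer):
    """Impute the masked matrix and score against the hidden truth."""
    X_hat = imputer.fit_transform(X_miss)
    return float(np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2)))

scores = {"mean": rmse(SimpleImputer(strategy="mean")),
          "iterative": rmse(IterativeImputer(random_state=0))}
```

On real cohort data, the ranking of methods can change with the missingness mechanism and rate, which is precisely why a systematic comparison method is needed.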
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
A Kernel to Exploit Informative Missingness in Multivariate Time Series from EHRs
A large fraction of the electronic health records (EHRs) consists of clinical
measurements collected over time, such as lab tests and vital signs, which
provide important information about a patient's health status. These sequences
of clinical measurements are naturally represented as time series,
characterized by multiple variables and large amounts of missing data, which
complicate the analysis. In this work, we propose a novel kernel which is
capable of exploiting both the information from the observed values as well as the
information hidden in the missing patterns in multivariate time series (MTS)
originating e.g. from EHRs. The kernel, called TCK, is designed using an
ensemble learning strategy in which the base models are novel mixed mode
Bayesian mixture models which can effectively exploit informative missingness
without having to resort to imputation methods. Moreover, the ensemble approach
ensures robustness to hyperparameters and therefore TCK is particularly
well suited if there is a lack of labels - a known challenge in medical
applications. Experiments on three real-world clinical datasets demonstrate the
effectiveness of the proposed kernel.
Comment: 2020 International Workshop on Health Intelligence, AAAI-20. arXiv admin note: text overlap with arXiv:1907.0525
- …