
    Machine Learning Models for High-dimensional Biomedical Data

    Recent technological advances enable the collection of complex, heterogeneous, and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates a need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns, and improve decision making. All the proposed methods can generalize to other industrial fields. The first topic of this dissertation is data clustering, which is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner. The second part of this dissertation develops data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, summarizes the key characteristics of a genome sequence in a small number of features with good interpretability. The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both accuracy and interpretation. GCRNN contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of temporal characteristics. A feed-forward fully connected network with sparse group lasso regularization generates the final classification and provides good interpretability. The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making, and pattern visualization of time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
    Dissertation/Thesis. Doctoral Dissertation, Industrial Engineering, 201

    Identification of nonlinear time-varying systems using an online sliding-window and common model structure selection (CMSS) approach with applications to EEG

    The identification of nonlinear time-varying systems using linear-in-the-parameters models is investigated. A new, efficient Common Model Structure Selection (CMSS) algorithm is proposed to select a common model structure. The key procedure is as follows. First, generate K+1 data sets using an online sliding-window method (the first K data sets are used for training, and the (K+1)th is used for testing). Then, detect significant model terms to form a common model structure that fits all K training data sets using the proposed CMSS approach. Finally, estimate and refine the time-varying parameters of the identified common-structure model using a Recursive Least Squares (RLS) parameter estimation method. The new method can effectively detect and adaptively track the transient variation of nonstationary signals. Two examples are presented to illustrate the effectiveness of the new approach, including an application to an EEG data set.
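The windowing and parameter-tracking steps named in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the CMSS term-selection step is omitted, the regressor matrix `Phi` is assumed to hold the already selected model terms, and the forgetting factor `lam`, the initial covariance scale `delta`, and both function names are hypothetical choices that do not come from the abstract.

```python
import numpy as np

def sliding_windows(x, width, step):
    """Cut a 1-D signal into overlapping windows (hypothetical helper)."""
    return [x[s:s + width] for s in range(0, len(x) - width + 1, step)]

def rls_track(Phi, y, lam=0.98, delta=1e3):
    """Recursive least squares with an exponential forgetting factor.

    Phi: (N, p) regressor matrix, one row per time step.
    y:   (N,) observed outputs.
    Returns an (N, p) array with the parameter estimate after each step.
    lam and delta are assumed tuning values, not taken from the source.
    """
    n, p = Phi.shape
    theta = np.zeros(p)           # current parameter estimate
    P = delta * np.eye(p)         # inverse-correlation matrix, large at start
    history = np.empty((n, p))
    for t in range(n):
        phi = Phi[t]
        P_phi = P @ phi
        gain = P_phi / (lam + phi @ P_phi)
        theta = theta + gain * (y[t] - phi @ theta)   # correct by prediction error
        P = (P - np.outer(gain, P_phi)) / lam         # discount older samples
        history[t] = theta
    return history
```

A forgetting factor below 1 lets the estimate follow slowly drifting parameters, at the cost of noisier estimates; setting `lam=1.0` recovers ordinary recursive least squares.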

    Tiled Sparse Coding in Eigenspaces for Image Classification

    Automating the diagnosis of medical images is currently a challenging task. Computer Aided Diagnosis (CAD) systems can be a powerful tool for clinicians, especially when hospitals are overwhelmed. These tools are usually based on artificial intelligence (AI), a field that has recently been revolutionized by deep learning approaches. These alternatives usually achieve high performance through complex solutions, which leads to a high computational cost and the need for large databases. In this work, we propose a classification framework based on sparse coding. Images are first partitioned into tiles, and a dictionary is built after applying PCA to these tiles. The original signals are then expressed as a linear combination of the elements of the dictionary. Then, they are reconstructed by iteratively deactivating the elements associated with each component. Classification is finally performed using the resulting reconstruction errors as features. Performance is evaluated in a real context that requires distinguishing among four classes: control versus bacterial pneumonia versus viral pneumonia versus COVID-19. Our system differentiates between pneumonia patients and controls with an accuracy of 97.74%, whereas in the 4-class context the accuracy is 86.73%. The excellent results and the pioneering use of sparse coding in this scenario show that our proposal can assist clinicians when their workload is high.
    Funding: MCIN/AEI/10.13039/501100011033/FEDER “Una manera de hacer Europa” under the RTI2018-098913-B100 project; Consejería de Economía, Innovación, Ciencia y Empleo (Junta de Andalucía) and FEDER under the CV20-45250, A-TIC-080-UGR18, B-TIC-586-UGR20 and P20-00525 projects.
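One possible reading of the tiling, PCA dictionary, and deactivation steps in this abstract is sketched below. It is an illustration under stated assumptions rather than the authors' pipeline: tiles are non-overlapping squares, the dictionary atoms are the leading principal directions, and "deactivating" is interpreted as zeroing one atom's coefficient at a time. The function names `tiles`, `pca_dictionary`, and `deactivation_errors` are hypothetical.

```python
import numpy as np

def tiles(img, size):
    """Split a 2-D image into non-overlapping size x size tiles, one per row."""
    h, w = img.shape
    h, w = h - h % size, w - w % size          # drop any ragged border
    t = img[:h, :w].reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return t.reshape(-1, size * size)

def pca_dictionary(X, k):
    """Return the tile mean and the k leading principal directions (atoms)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                        # rows of vt are orthonormal

def deactivation_errors(X, mean, atoms):
    """Mean squared reconstruction error with each atom switched off in turn."""
    codes = (X - mean) @ atoms.T               # coefficients, shape (n, k)
    errs = np.empty(len(atoms))
    for j in range(len(atoms)):
        kept = codes.copy()
        kept[:, j] = 0.0                       # deactivate atom j (assumed meaning)
        recon = mean + kept @ atoms
        errs[j] = np.mean((X - recon) ** 2)
    return errs
```

The resulting error vector has one entry per atom and could be passed to any standard classifier as a feature vector; the abstract does not say which classifier is used.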

    Online Machine Learning for Inference from Multivariate Time-series

    Inference and data analysis over networks have become significant areas of research due to the increasing prevalence of interconnected systems and the growing volume of data they produce. Many of these systems generate data in the form of multivariate time series, which are collections of time series observed simultaneously across multiple variables. For example, EEG measurements of the brain produce multivariate time series that record the electrical activity of different brain regions over time. Cyber-physical systems generate multivariate time series that capture the behaviour of physical systems in response to cybernetic inputs. Similarly, financial time series reflect the dynamics of multiple financial instruments or market indices over time. Through the analysis of these time series, one can uncover important details about the behavior of the system, detect patterns, and make predictions. Therefore, designing effective methods for data analysis and inference over networks of multivariate time series is a crucial area of research with numerous applications across various fields. In this Ph.D. thesis, our focus is on identifying the directed relationships between time series and leveraging this information to design algorithms for data prediction as well as missing data imputation. The thesis is organized as a compendium of papers and consists of seven chapters and appendices. The first chapter is dedicated to motivation and a literature survey, whereas the second chapter presents the fundamental concepts that readers need in order to follow the material in the dissertation. In the third chapter, we present three online nonlinear topology identification algorithms, namely NL-TISO, RFNL-TISO, and RFNL-TIRSO. In this chapter, we assume the data is generated from a sparse nonlinear vector autoregressive (VAR) model and propose online data-driven solutions for identifying the nonlinear VAR topology. We also provide convergence guarantees in terms of dynamic regret for the proposed algorithm RFNL-TIRSO. Chapters four and five of the dissertation address the issue of missing data and explore how the learned topology can be leveraged to handle it. Chapter five is distinct from the other chapters in its exclusive focus on edge flow data and introduces an online imputation strategy based on a simplicial complex framework that leverages the known network structure in addition to the learned topology. Chapter six takes a different approach, assuming that the data is generated from nonlinear structural equation models. In this chapter, we propose an online topology identification algorithm using a time-structured approach, incorporating information from both the data and the model evolution. The algorithm is shown to have convergence guarantees obtained by bounding the dynamic regret. Finally, chapter seven provides concluding remarks and outlines potential future research directions.

    Automatic detection of epileptic seizure onset and termination using intracranial EEG

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 87-90).
    This thesis addresses the problem of real-time epileptic seizure detection from intracranial EEG (IEEG). One difficulty in creating an approach that can be used for many patients is the heterogeneity of seizure IEEG patterns across different patients and even within a patient. In addition, simultaneously maximizing sensitivity and minimizing latency and false detection rates has been challenging as these are competing objectives. Automated machine learning systems provide a mechanism for dealing with these hurdles. Here we present and evaluate an algorithm for real-time seizure onset detection from IEEG using a machine-learning approach that permits a patient-specific solution. We extract temporal and spectral features across all intracranial EEG channels. A pattern recognition component is trained using these feature vectors and tested against unseen continuous data from the same patient. When tested on more than 875 hours of IEEG data from 10 patients, the algorithm detected 97% of 67 test seizures of several types with a median detection delay of 5 seconds and a median false alarm rate of 0.6 false alarms per 24-hour period. The sensitivity was 100% for 8 out of 10 patients. These results indicate that a sensitive, specific and relatively short-latency detection system based on machine learning can be employed for seizure detection tailored to individual patients. In addition, we describe and evaluate an algorithm for the detection of the cessation of seizure activity within IEEG. Seizure end detection algorithms can enable important clinical applications such as the delivery of therapy to ameliorate post-ictal symptoms, the detection of status epilepticus, and the estimation of seizure duration. Our machine-learning-based approach is patient-specific. The algorithm is designed to search for the termination of electrographic seizure activity once a seizure has been discovered by a seizure onset detector. When tested on 65 seizures, 88% of all seizure ends were detected within 15 seconds of the time determined by a clinical expert to represent the electrographic end of a seizure. We explore the effects of channel pre-selection on seizure onset detection. We evaluate and present the results from a seizure detector that has been restricted to use only a small subset of the channels available. These channels are manually chosen to be those that show the earliest ictal activity. The results indicate that performance can suffer in many cases when the algorithm uses a small set of selected channels, often in the form of an increase in false alarm rate. This suggests that the inclusion of a full channel set allows the system to leverage information that is not readily apparent to a clinical reader (from regions seemingly not involved in the onset) to better differentiate ictal and inter-ictal patterns. Finally, we present and evaluate an algorithm for patient-specific feature extraction, where the feature extraction process for a given patient leverages the training data available for that patient. The results from an evaluation of a detector that supplemented the original spectral energy features with features computed in a patient-specific manner show a significant improvement in 3 out of 5 patients. The results suggest that this is a promising avenue for further improvement in the performance of the seizure onset detector.
    By Alaa Amin Kharbouch. Ph.D.
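The spectral energy features mentioned in this abstract could be computed along the following lines. This is a hedged sketch only: the thesis abstract does not give the band edges, window length, or exact feature definition, so the `bands` argument, the FFT-based energy estimate, and the function name `band_energies` are all assumptions made for illustration.

```python
import numpy as np

def band_energies(window, fs, bands):
    """Per-channel spectral energy in each frequency band.

    window: (channels, samples) array holding one analysis window.
    fs:     sampling rate in Hz.
    bands:  list of (low_hz, high_hz) pairs; values are assumed, not from the source.
    Returns a flat vector ordered channel by channel, band by band.
    """
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / fs)
    power = np.abs(np.fft.rfft(window, axis=1)) ** 2
    feats = [power[:, (freqs >= lo) & (freqs < hi)].sum(axis=1) for lo, hi in bands]
    return np.stack(feats, axis=1).ravel()
```

Concatenating the band energies of all channels into one vector keeps information from every electrode, which matches the abstract's observation that restricting the detector to a few hand-picked channels tends to raise the false alarm rate.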

    Decomposition methods for machine learning with small, incomplete or noisy datasets

    In many machine learning applications, measurements are incomplete or noisy, resulting in missing features. In other cases, and for different reasons, the datasets are originally small, and therefore more data samples are required to derive useful supervised or unsupervised classification methods. Correct handling of incomplete, noisy or small datasets in machine learning is a fundamental and classic challenge. In this article, we provide a unified review of recently proposed methods based on signal decomposition for missing features imputation (data completion), classification of noisy samples and artificial generation of new data samples (data augmentation). We illustrate the application of these signal decomposition methods in diverse selected practical machine learning examples, including brain computer interface, epileptic intracranial electroencephalogram signals classification, face recognition/verification and water networks data analysis. We show that a signal decomposition approach can provide valuable tools to improve machine learning performance with low quality datasets.
    Affiliation: Caiafa, César Federico. Provincia de Buenos Aires. Gobernación. Comisión de Investigaciones Científicas. Instituto Argentino de Radioastronomía. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto Argentino de Radioastronomía; Argentina
    Affiliation: Sole Casals, Jordi. Center for Advanced Intelligence; Japan
    Affiliation: Marti Puig, Pere. University of Catalonia; Spain
    Affiliation: Sun, Zhe. RIKEN; Japan
    Affiliation: Tanaka, Toshihisa. Tokyo University of Agriculture and Technology; Japan

    Biomedical time series analysis based on bag-of-words model

    This research proposes a number of new methods for biomedical time series classification and clustering based on a novel Bag-of-Words (BoW) representation. It is anticipated that the objective and automatic biomedical time series clustering and classification technologies developed in this work will benefit a wide range of applications in the future, such as biomedical data management, archiving, and retrieval, as well as disease diagnosis and prognosis.

    Feature-based time-series analysis

    This work presents an introduction to feature-based time-series analysis. The time series as a data type is first described, along with an overview of the interdisciplinary time-series analysis literature. I then summarize the range of feature-based representations for time series that have been developed to aid interpretable insights into time-series structure. Particular emphasis is given to emerging research that facilitates wide comparison of feature-based representations that allow us to understand the properties of a time-series dataset that make it suited to a particular feature-based representation or analysis algorithm. The future of time-series analysis is likely to embrace approaches that exploit machine learning methods to partially automate human learning to aid understanding of the complex dynamical patterns in the time series we measure from the world.
    Comment: 28 pages, 9 figures