200 research outputs found

    Permutation Entropy and Bubble Entropy: Possible interactions and synergies between order and sorting relations

    Despite its widely demonstrated usefulness, the basic Permutation Entropy (PE) algorithm still has room for improvement, as several studies in recent years have proposed. For example, some improved PE variants try to address possible PE weaknesses, such as its exclusive focus on ordinal information rather than amplitude, or the possible detrimental impact of equal values in subsequences due to motif ambiguity. Other evolved PE methods try to reduce the influence of input parameters. A good representative of this last group is the Bubble Entropy (BE) method. BE is based on sorting relations instead of ordinal patterns, and its promising capabilities have not yet been extensively assessed. The objective of the present study was to comparatively assess the classification performance of this new method, and to study and exploit the possible synergies between PE and BE. The claimed superior performance of BE over PE was first evaluated by conducting a series of time series classification tests over a varied experimental set. The results of this assessment suggested that there is a complementary relationship between PE and BE, instead of a superior/inferior relationship. A second set of experiments, using PE and BE simultaneously as the input features of a clustering algorithm, demonstrated that with a proper algorithm configuration, classification accuracy and robustness can benefit from both measures.
    Cuesta Frau, D.; Vargas-Rojo, B. (2020). Permutation Entropy and Bubble Entropy: Possible interactions and synergies between order and sorting relations. Mathematical Biosciences and Engineering. 17(2):1637-1658. https://doi.org/10.3934/mbe.2020086
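To make the distinction between ordinal patterns and sorting relations concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names, the default embedding dimension `m=3`, and the tie-breaking rule (equal values ordered by position) are assumptions made for illustration. The second helper only counts the swaps bubble sort needs for one window, which is the raw quantity a sorting-based measure works from; the full Bubble Entropy definition is not reproduced here.

```python
from collections import Counter
from math import factorial, log


def permutation_entropy(series, m=3, normalize=True):
    """Shannon entropy of the ordinal patterns of length m found in series."""
    if m < 2 or len(series) < m:
        raise ValueError("need m >= 2 and at least m samples")
    patterns = Counter()
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        # Ordinal pattern: the index order that sorts the window.
        # sorted() is stable, so ties are broken by position (an assumed convention).
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        patterns[pattern] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * log(c / total) for c in patterns.values())
    # Optional normalisation by the maximum entropy log(m!).
    return h / log(factorial(m)) if normalize else h


def bubble_sort_swaps(window):
    """Number of adjacent swaps bubble sort performs on one window."""
    values = list(window)
    swaps = 0
    for end in range(len(values) - 1, 0, -1):
        for j in range(end):
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
                swaps += 1
    return swaps
```

A strictly increasing series contains a single ordinal pattern, so its permutation entropy is zero, while an alternating series with `m=2` splits evenly between two patterns and reaches the normalised maximum.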

    Deep Representation Learning of Electronic Health Records to Unlock Patient Stratification at Scale

    Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here we present an unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. We considered EHRs of 1,608,741 patients from a diverse hospital cohort comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., ConvAE) to transform patient trajectories into low-dimensional latent vectors. We evaluated how well these representations enable patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. ConvAE significantly outperformed several baselines in a clustering task to identify patients with different complex conditions, with 2.61 entropy and 0.31 purity average scores. When applied to stratify patients within a certain condition, ConvAE led to various clinically relevant subtypes for different disorders, including type 2 diabetes, Parkinson's disease and Alzheimer's disease, largely related to comorbidities, disease progression, and symptom severity. With these results, we demonstrate that ConvAE can generate patient representations that lead to clinically meaningful insights. This scalable framework can help better understand varying etiologies in heterogeneous sub-populations and unlock patterns for EHR-based research in the realm of personalized medicine.
    Comment: C.F. and R.M. share senior authorship
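The abstract reports clustering quality as entropy and purity scores without defining them. The sketch below uses the common textbook definitions, which may differ from the ones used in the paper: purity is the fraction of items that belong to the majority class of their cluster, and entropy is the size-weighted average of the class entropy within each cluster. The function name and the base-2 logarithm are assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against known class labels."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    n = len(class_labels)
    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        # Purity: share of all items that match their cluster's majority class.
        purity += max(counts.values()) / n
        # Entropy: class entropy inside the cluster, weighted by cluster size.
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        entropy += (size / n) * h
    return purity, entropy
```

Under these definitions higher purity and lower entropy both indicate clusters that align more closely with the reference classes.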

    Machine Learning for Physiological Time Series: Representing and Controlling Blood Glucose for Diabetes Management

    Type 1 diabetes is a chronic health condition affecting over one million patients in the US, in which blood glucose (sugar) levels are not well regulated by the body. Researchers have sought to use physiological data (e.g., blood glucose measurements) collected from wearable devices to manage this disease, either by forecasting future blood glucose levels for predictive alarms, or by automating insulin delivery for blood glucose management. However, the application of machine learning (ML) to these data is hampered by latent context, limited supervision and complex temporal dependencies. To address these challenges, we develop and evaluate novel ML approaches in the context of i) representing physiological time series, particularly for forecasting blood glucose values, and ii) decision making for when and how much insulin to deliver. When learning representations, we leverage the structure of the physiological sequence as an implicit information stream. In particular, we a) incorporate latent context when predicting adverse events by jointly modeling patterns in the data and the context those patterns occurred under, b) propose novel types of self-supervision to handle limited data and c) propose deep models that predict functions underlying trajectories to encode temporal dependencies. In the context of decision making, we use reinforcement learning (RL) for blood glucose management. Through the use of an FDA-approved simulator of the glucoregulatory system, we achieve strong performance using deep RL with and without human intervention. However, the success of RL typically depends on realistic simulators or experimental real-world deployment, neither of which is currently practical for problems in health. Thus, we propose techniques for leveraging imperfect simulators and observational data. Beyond diabetes, representing and managing physiological signals is an important problem. By adapting techniques to better leverage the structure inherent in the data we can help overcome these challenges.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/163134/1/ifox_1.pd

    Data Mining Techniques for Complex User-Generated Data

    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.

    LSTM Networks for Detection and Classification of Anomalies in Raw Sensor Data

    In order to ensure the validity of sensor data, it must be thoroughly analyzed for various types of anomalies. Traditional machine learning methods of anomaly detection in sensor data are based on domain-specific feature engineering. A typical approach is to use domain knowledge to analyze sensor data and manually create statistics-based features, which are then used to train the machine learning models to detect and classify the anomalies. Although this methodology is used in practice, it has a significant drawback: feature extraction is usually labor intensive and requires considerable effort from domain experts. An alternative approach is to use deep learning algorithms. Research has shown that modern deep neural networks are very effective in automated extraction of abstract features from raw data in classification tasks. Long short-term memory networks, or LSTMs for short, are a special kind of recurrent neural network capable of learning long-term dependencies. These networks have proved to be especially effective in the classification of raw time-series data in various domains. This dissertation systematically investigates the effectiveness of the LSTM model for anomaly detection and classification in raw time-series sensor data. As a proof of concept, this work used time-series data from sensors that measure blood glucose levels. A large number of time-series sequences were created based on a genuine medical diabetes dataset. Anomalous series were constructed by six methods that interspersed patterns of common anomaly types in the data. An LSTM network model was trained with k-fold cross-validation on both anomalous and valid series to classify raw time-series sequences into one of seven classes: non-anomalous, and classes corresponding to each of the six anomaly types. As a control, the accuracy of detection and classification of the LSTM was compared to that of four traditional machine learning classifiers: support vector machines, Random Forests, naive Bayes, and shallow neural networks. The performance of all the classifiers was evaluated on nine metrics: precision, recall, and the F1-score, each measured from the micro, macro and weighted perspectives. While the traditional models were trained on feature vectors derived from the raw data using knowledge of common sources of anomaly, the LSTM was trained on raw time-series data. Experimental results indicate that the performance of the LSTM was comparable to the best traditional classifiers, achieving 99% on all nine metrics. The model requires no labor-intensive feature engineering, and the fine-tuning of its architecture and hyper-parameters can be done in a fully automated way. This study, therefore, finds LSTM networks an effective solution to anomaly detection and classification in sensor data.
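The three averaging perspectives named in the abstract (micro, macro and weighted) can be illustrated for the F1-score with a short standard-library sketch. It is not the dissertation's code. The function name and the choice to return 0.0 for a class with no true or predicted samples are assumptions. Micro averaging pools the counts over all classes, macro averaging takes the unweighted mean of per-class scores, and weighted averaging weights each class by its number of true samples.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Micro, macro and weighted F1 for a single-label multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0  # assumed convention for empty classes

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }
```

Precision and recall follow the same three averaging schemes with `t / (t + p)` and `t / (t + n)` in place of the F1 formula, which gives the nine metrics in total.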

    Contributions to time series data mining towards the detection of outliers/anomalies

    148 p. Recent technological advances have brought great progress in data collection, making it possible to gather large amounts of data over time. These data commonly take the form of time series, where observations are recorded chronologically and are correlated in time. These temporal dependencies often contain meaningful and useful information, so in recent years there has been great interest in extracting it. The research area that focuses on this task is called time series data mining. The research community in this area has worked on different tasks such as classification, forecasting, clustering, and outlier/anomaly detection. Outliers or anomalies are observations that do not follow the expected behavior in a time series. They usually represent unwanted measurements or events of interest, so detecting them is often relevant, since they can degrade data quality or reflect phenomena of interest to the analyst. This thesis presents several contributions to the field of time series data mining, more specifically to outlier or anomaly detection. These contributions can be divided into two parts. On the one hand, the thesis presents contributions to outlier or anomaly detection in time series: it offers a review of the techniques in the literature and presents a new anomaly detection technique for univariate time series, based on self-supervised learning, for detecting water leaks. On the other hand, the thesis also introduces contributions related to the handling of time series with missing values and demonstrates their applicability in the field of anomaly detection.

    Algorithms for time series clustering applied to biomedical signals

    Thesis submitted in fulfillment of the requirements for the Degree of Master in Biomedical Engineering.
    The increasing number of biomedical systems and applications for understanding the human body creates a need for tools that extract information from biosignals. It is important to comprehend the changes in a biosignal's morphology over time, as they often contain critical information on the condition of the subject or the status of the experiment. Tools that automatically analyze and extract relevant attributes from biosignals, providing important information to the user, have significant value in the biosignal processing field. The present dissertation introduces new algorithms for time series clustering, which separate and organize unlabeled data into groups of signals that are similar to each other. Signal processing algorithms were developed for the detection of a meanwave, which represents the signal's morphology and behavior. The algorithm computes the meanwave by separating and averaging all cycles of a cyclic continuous signal. To increase the quality of the information given by the meanwave, a set of wave-alignment techniques was also developed and its relevance was evaluated on a real database. To evaluate the algorithm's applicability to time series clustering, a distance metric based on the automatically computed meanwave was designed and its measurements were given as input to a K-Means clustering algorithm. For that purpose, we collected a series of data containing two different modes. The resulting algorithm successfully separates the two modes in the collected data with 99.3% efficiency. The results of this clustering procedure were compared to a mechanism widely used in this area, which models the data and uses the distance between its cepstral coefficients to measure the similarity between the time series. The algorithms were also validated in different study projects. These projects show the variety of contexts in which our algorithms are highly applicable and offer suitable answers to the problems of exhaustive signal analysis and expert intervention. The algorithms produced are signal-independent, and can therefore be applied to any type of signal provided it is cyclic. The fact that this approach does not require any prior information, together with its good preliminary performance, makes these algorithms powerful tools for biosignal analysis and classification.
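The meanwave step described in the abstract, separating a cyclic signal into cycles and averaging them, can be sketched as follows. This is a simplified illustration, not the thesis algorithm: it assumes the cycle boundaries are already known (the hypothetical `cycle_starts` argument), trims every cycle to the length of the shortest one instead of applying the wave-alignment techniques the thesis mentions, and does not reproduce the meanwave-based distance metric.

```python
def meanwave(signal, cycle_starts):
    """Average the cycles of a cyclic signal, sample by sample.

    cycle_starts holds the index where each cycle begins; the last entry
    marks the end of the final complete cycle.
    """
    cycles = [signal[a:b] for a, b in zip(cycle_starts, cycle_starts[1:])]
    if not cycles or min(len(c) for c in cycles) == 0:
        raise ValueError("need at least two increasing cycle boundaries")
    # Trim to the shortest cycle so every position has one sample per cycle
    # (an assumed simplification; real cycles usually need alignment first).
    length = min(len(c) for c in cycles)
    return [sum(c[i] for c in cycles) / len(cycles) for i in range(length)]
```

For example, `meanwave([0, 1, 2, 0, 3, 4, 0, 5, 6, 0], [0, 3, 6, 9])` averages three cycles of three samples each.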

    Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains

    Existing approaches for multivariate functional principal component analysis are restricted to data on the same one-dimensional interval. The presented approach focuses on multivariate functional data on different domains that may differ in dimension, e.g. functions and images. The theoretical basis for multivariate functional principal component analysis is given in terms of a Karhunen-Loève Theorem. For the practically relevant case of a finite Karhunen-Loève representation, a relationship between univariate and multivariate functional principal component analysis is established. This offers an estimation strategy to calculate multivariate functional principal components and scores based on their univariate counterparts. For the resulting estimators, asymptotic results are derived. The approach can be extended to finite univariate expansions in general, not necessarily orthonormal, bases. It is also applicable to sparse functional data or data with measurement error. A flexible R implementation is available on CRAN. The new method is shown to be competitive with existing approaches for data observed on a common one-dimensional domain. The motivating application is a neuroimaging study, where the goal is to explore how longitudinal trajectories of a neuropsychological test score covary with FDG-PET brain scans at baseline. Supplementary material, including detailed proofs, additional simulation results and software, is available online.
    Comment: Revised Version. R-Code for the online appendix is available in the .zip file associated with this article in subdirectory "/Software". The software associated with this article is available on CRAN (packages funData and MFPCA).

    Data Mining

    Data mining is a branch of computer science used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from large amounts of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. The book brings together many successful data mining studies in areas such as health, banking, education, software engineering, animal science, and the environment.