
    Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data

    Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.
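    As an illustration of the alternating structure described above, the sketch below (ours, not the authors' code) repeats two steps: a dynamic-programming pass assigning each time point to a cluster under a switching penalty, and a per-cluster covariance refit. It deliberately simplifies TICC, using regularized empirical precisions instead of block-Toeplitz MRFs over stacked windows fit by ADMM; all function and parameter names are hypothetical.

```python
# Simplified TICC-style alternating minimization (illustrative only).
import numpy as np

def assign_dp(nll, beta):
    """nll: (T, K) negative log-likelihoods; beta: switching penalty.
    Returns the minimum-cost cluster path via dynamic programming."""
    T, K = nll.shape
    cost = nll[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        stay = cost                       # remain in the same cluster
        switch = cost.min() + beta        # switch from the cheapest state
        back[t] = np.where(stay <= switch, np.arange(K), cost.argmin())
        cost = np.minimum(stay, switch) + nll[t]
    path = np.empty(T, dtype=int)
    path[-1] = cost.argmin()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def fit(X, K=3, beta=10.0, iters=20, seed=0):
    """X: (T, d) multivariate series. Alternates assignment and refit."""
    T, d = X.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=T)
    for _ in range(iters):
        nll = np.full((T, K), np.inf)
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) <= d:              # skip near-empty clusters
                continue
            mu = Xk.mean(0)
            theta = np.linalg.inv(np.cov(Xk.T) + 1e-3 * np.eye(d))
            diff = X - mu
            maha = np.einsum('ti,ij,tj->t', diff, theta, diff)
            nll[:, k] = 0.5 * (maha - np.linalg.slogdet(theta)[1])
        new = assign_dp(nll, beta)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```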

    A Review of Subsequence Time Series Clustering

    Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in many fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. Pattern recognition is one area where subsequence time series clustering is particularly useful, since it operates directly on sequences of time series data. This paper reviews definitions and background related to subsequence time series clustering. The reviewed literature is categorized into three periods: pre-proof, inter-proof, and post-proof. Various state-of-the-art approaches to subsequence time series clustering are then discussed under each of these categories, and the strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
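    For readers new to the topic, a minimal sketch of the pipeline this review surveys: slide a window over a series, z-normalize each window, and cluster the windows (here with scikit-learn's k-means; names and parameters are ours). The pre-/inter-/post-proof periods above refer, we assume, to the well-known result that naively clustering highly overlapping windows in this way can produce meaningless centroids.

```python
# Naive subsequence time series clustering (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def sts_cluster(series, w=32, k=4, step=1):
    # Extract overlapping windows of length w.
    subs = np.array([series[i:i + w]
                     for i in range(0, len(series) - w + 1, step)])
    # z-normalize each window so shape, not offset/scale, drives clustering.
    subs = (subs - subs.mean(1, keepdims=True)) / (subs.std(1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(subs)
    return subs, labels
```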

    Diagnosis and Prognosis of Occupational disorders based on Machine Learning Techniques applied to Occupational Profiles

    Work-related disorders have a global influence on people’s well-being and quality of life and are a financial burden for organizations because they reduce productivity, increase absenteeism, and promote early retirement. Work-related musculoskeletal disorders, in particular, represent a significant fraction of the total in all occupational contexts. In automotive and industrial settings where workers are exposed to work-related musculoskeletal disorder risk factors, occupational physicians are responsible for monitoring workers’ health protection profiles. Occupational technicians report to the Occupational Health Protection Profiles database to understand which exposures to occupational work-related musculoskeletal disorder risk factors should be controlled for a given worker. These databases record the occupational physician’s assessment and the exposures the physician considers admissible to ensure the worker’s health protection in terms of their functional work ability. Human-Centered explainable artificial intelligence can support decision making by going from a worker’s Functional Work Ability to explanations, integrating explainability into medical restrictions and supporting two decision contexts: prognosis and diagnosis of individual, work-related, and organizational risk conditions. Although previous machine learning approaches provided good predictions, their application in an actual occupational setting is limited because their predictions are difficult to interpret and hence not actionable. This thesis targets the injured body parts for which the ability changed in a worker’s functional work ability status. On the one hand, artificial intelligence algorithms can help technical teams, occupational physicians, and ergonomists determine a worker’s workplace risk via the diagnosis and prognosis of body part injuries; on the other hand, these approaches can help prevent work-related musculoskeletal disorders by identifying which processes are lacking in working-condition improvement and which workplaces best match the remaining functional work abilities. A sample of 2025 Occupational Health Protection Profiles (from 2019 to 2020) was used for the prognosis part and 7857 for the diagnosis part, based on Functional Work Ability textual reports in Portuguese from an automotive industry factory. Machine learning-based Natural Language Processing methods were implemented to extract standardized information. The prognosis and diagnosis of Occupational Health Protection Profiles factors were implemented in a reliable, trustworthy Human-Centered explainable artificial intelligence system (entitled the Industrial microErgo application). The most suitable regression models for predicting the next medical appointment for the injured body regions were based on CatBoost regression, with an R-squared of 0.84 and an RMSLE of 1.23 weeks. In parallel, based on these two errors, CatBoost was also the best regression model for predicting the next injured body parts for most body parts.
This information can help technical industrial teams understand potential risk factors for Occupational Health Protection Profiles and identify warning signs of the early stages of musculoskeletal disorders.
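    A hedged sketch of the reported regression setup: a CatBoost model predicting the weeks until the next medical appointment for an injured body region. Column names are hypothetical, and since the thesis reports RMSLE we train on log1p-transformed targets with an RMSE objective (which optimizes RMSLE on the original scale); that transformation is our assumption, not a documented detail.

```python
# Hypothetical CatBoost setup for next-appointment regression.
import numpy as np
from catboost import CatBoostRegressor, Pool

features = ["body_part", "risk_factor", "age", "tenure_years"]  # hypothetical columns
cat_features = ["body_part", "risk_factor"]

def train(df):
    """df: pandas DataFrame with the hypothetical columns above plus
    a 'weeks_to_next_appointment' target."""
    X = df[features]
    y = np.log1p(df["weeks_to_next_appointment"])   # RMSE on log1p ~ RMSLE
    model = CatBoostRegressor(loss_function="RMSE", depth=6,
                              iterations=500, verbose=False)
    model.fit(Pool(X, y, cat_features=cat_features))
    return model

# Predictions on the original scale: np.expm1(model.predict(X_new))
```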

    ICA model order selection of task co-activation networks

    Independent component analysis (ICA) has become a widely used method for extracting functional networks in the brain during rest and task. Historically, preferred ICA dimensionality has varied widely within the neuroimaging community, but typically falls between 20 and 100 components. This can be problematic when comparing results across multiple studies because of the impact ICA dimensionality has on the topology of its resultant components. Recent studies have demonstrated that ICA can be applied to peak activation coordinates archived in a large neuroimaging database (i.e., the BrainMap Database) to yield whole-brain task-based co-activation networks. A strength of applying ICA to BrainMap data is that the vast amount of metadata in BrainMap can be used to quantitatively assess the tasks and cognitive processes contributing to each component. In this study, we investigated the effect of model order on the distribution of functional properties across networks as a method for identifying the most informative decompositions of BrainMap-based ICA components. Our findings suggest a dimensionality of 20 for low-model-order ICA to examine large-scale brain networks, and a dimensionality of 70 to provide insight into how large-scale networks fractionate into sub-networks. We also provide a functional and organizational assessment of visual, motor, emotion, and interoceptive task co-activation networks as they fractionate from low to high model orders.
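    The model-order comparison can be pictured with any ICA implementation; the toy sketch below uses scikit-learn's FastICA on placeholder data (the study itself decomposed BrainMap coordinate data, which is not reproduced here), decomposing the same matrix at orders 20 and 70 so large-scale components can be matched against their higher-order fractions.

```python
# Same data, two ICA model orders (toy illustration).
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.default_rng(0).normal(size=(1000, 200))  # placeholder data matrix

components = {order: FastICA(n_components=order, max_iter=1000)
                     .fit(X).components_
              for order in (20, 70)}
# components[20]: 20 coarse, large-scale maps;
# components[70]: finer maps to compare against the order-20 networks.
```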

    Biomedical Discovery Acceleration, with Applications to Craniofacial Development

    The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, are both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
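    To make the reasoning step concrete, here is a sketch of merging per-source gene-gene evidence into one weighted knowledge network. The noisy-OR combination of source reliabilities is our illustrative assumption, not necessarily the Hanalyzer's actual scoring formula, and all names are hypothetical.

```python
# Merge multi-source gene-pair evidence into one weighted network.
import networkx as nx

def combine(evidence):
    """evidence: iterable of (gene_a, gene_b, source, reliability in [0, 1])."""
    g = nx.Graph()
    for a, b, source, r in evidence:
        if g.has_edge(a, b):
            g[a][b]["sources"].append(source)
            g[a][b]["not_p"] *= (1.0 - r)     # accumulate failure probability
        else:
            g.add_edge(a, b, sources=[source], not_p=1.0 - r)
    for a, b, d in g.edges(data=True):
        d["reliability"] = 1.0 - d["not_p"]   # noisy-OR combined score
    return g
```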

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray, and mass spectrometry data has proved extremely helpful for identifying the genes and proteins, called biomarkers, that answer both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have shown that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease a still-open issue. The reasons for these differences are attributable both to the data dimensions (few subjects with respect to the number of features) and to the heterogeneity of complex diseases, characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to analyze come from different subjects and different phenotypes (e.g., normal and pathological). The most widely used methodologies for identifying disease-related genes from microarray data compute differential gene expression between phenotypes using univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, can be extremely important for characterizing a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and thus have the potential to select a more complete set of experimentally relevant features. In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests, and support vector machines, have been applied, especially in cancer studies. Although high accuracy is often achieved by classification approaches, the reproducibility of biomarker lists remains an open issue: many possible sets of biological features (i.e., genes or proteins) can be considered equally relevant in terms of prediction, so a lack of stability is possible in principle even when the best accuracy is achieved. This thesis studies several computational aspects of biomarker discovery in genomic studies, from the classification and feature selection strategies to the type and reliability of the biological information used, and proposes new approaches able to cope with the problem of the reproducibility of biomarker lists. The study highlights that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker list stability by using prior information on the biological interplay and functional correlation among the analyzed features. Both approaches improved biomarker selection. The first approach, which uses prior information to divide the application of the method into different subproblems, improves the interpretability of the results and offers an alternative way to assess list reproducibility.
The second, which integrates prior information into the kernel function of the learning algorithm, improves list stability. Finally, the interpretability of the results is strongly affected by the quality of the available biological information, and the analysis of the heterogeneities performed on the Gene Ontology database reveals the importance of new methods able to verify the reliability of the biological properties assigned to a specific feature, discriminating missing or less specific information from possible inconsistencies among the annotations. These aspects will become even more important in the future, as new sequencing technologies monitor an increasing number of features and the number of functional annotations in genomic databases grows considerably.
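    As a flavor of the second approach, the sketch below folds prior feature relevance into a kernel and trains an SVM with it. The weighted linear kernel K(a, b) = a' W b is our illustrative stand-in; the thesis's actual kernel construction may differ, and all names are assumptions.

```python
# Prior-knowledge-weighted kernel for biomarker selection (sketch).
import numpy as np
from sklearn.svm import SVC

def make_kernel(prior_w):
    """prior_w: (n_features,) nonnegative relevance scores from prior
    biological knowledge (hypothetical input)."""
    W = np.diag(prior_w)
    def kernel(A, B):
        return A @ W @ B.T    # weighted linear kernel K(a, b) = a' W b
    return kernel

# Usage sketch: clf = SVC(kernel=make_kernel(prior_w)).fit(X_train, y_train)
```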

    Discovery of Type 2 Diabetes Trajectories from Electronic Health Records

    University of Minnesota Ph.D. dissertation. September 2020. Major: Health Informatics. Advisor: Gyorgy Simon. 1 computer file (PDF); xiii, 110 pages. Type 2 diabetes (T2D) is one of the fastest growing public health concerns in the United States. There were 30.3 million patients (9.4% of the US population) suffering from diabetes in 2015. Diabetes, the seventh leading cause of death in the United States, is a non-reversible (incurable) chronic disease that leads to severe complications, including chronic kidney disease, amputation, blindness, and various cardiac and vascular diseases. Early identification of patients at high risk is regarded as the most effective clinical tool to prevent or delay the development of diabetes, allowing patients to change their lifestyle or to receive medication earlier. In turn, these interventions can decrease the risk of diabetes by 30-60%. Many studies have aimed at the early identification of high-risk patients in clinical settings. These studies typically consider only the patient's current state at the time of the assessment and do not fully utilize all available information, such as the patient's medical history. Past history is important: laboratory results and vital signs can differ between diabetic and non-diabetic patients as early as 15-20 years before the onset of diabetes. We have also shown in our study that the order in which patients develop diabetes-related comorbidities is predictive of their diabetes risk even after adjusting for the severity of the comorbidities. In this thesis, we develop multiple novel methods to discover T2D trajectories from Electronic Health Records (EHR). We define a trajectory as the order in which diseases developed. We aim to discover typical and atypical trajectories, where typical trajectories represent predominant patterns of progression and atypical trajectories refer to the rest. Revealing trajectories allows us to divide patients into subpopulations that can uncover the underlying etiology of diabetes. More importantly, by assessing risk correctly and by better understanding the heterogeneity of diabetes, we can provide better care. Since data collected in EHRs pose several challenges to identifying trajectories directly, we devise four specific studies to address them: first, we propose a new knowledge-driven representation for clinical data mining; second, we demonstrate a method for estimating the onset time of slow-onset diseases from intermittently observable laboratory results in the specific context of T2D; third, we present a method to infer trajectories, the sequences of comorbidities potentially leading up to a particular disease of interest; and finally, we propose a novel method to discover multiple trajectories from EHR data. The patterns discovered in these four studies address a clinical issue, are clinically verifiable, and are amenable to deployment in practice to improve the quality of individual patient care and promote public health in the United States.
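    As one concrete (and much simpler) way to look for progression orders of the kind described, the sketch below counts, across patients, how often one comorbidity precedes another in onset-ordered EHR sequences. It is our illustration, not the thesis's algorithm.

```python
# Pairwise precedence counts over onset-ordered condition sequences.
from collections import Counter
from itertools import combinations

def precedence_counts(patients):
    """patients: list of condition sequences, each ordered by onset time."""
    counts = Counter()
    for seq in patients:
        for a, b in combinations(seq, 2):   # a developed before b
            counts[(a, b)] += 1
    return counts

# Example (hypothetical conditions):
# precedence_counts([["obesity", "hypertension", "T2D"],
#                    ["hypertension", "T2D"]])[("hypertension", "T2D")] == 2
```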

    Low-dimensional representations of neural time-series data with applications to peripheral nerve decoding

    Bioelectronic medicines, implanted devices that influence physiological states by peripheral neuromodulation, hold promise as a new way of treating diverse conditions from rheumatism to diabetes. We here explore ways of creating nerve-based feedback so that implanted systems can act in a dynamically adapting closed loop. In a first empirical component, we carried out decoding studies on in vivo recordings of cat and rat bladder afferents. In a low-resolution dataset, we selected informative frequency bands of the neural activity using information theory and related them to bladder pressure. In a second, high-resolution dataset, we analysed the population code for bladder pressure, again using information theory, and proposed an informed decoding approach that promises enhanced robustness and automatic re-calibration by creating a low-dimensional population vector. Coming from the different direction of more general time-series analysis, we embedded a set of peripheral nerve recordings in a space of main firing characteristics by dimensionality reduction in a high-dimensional feature space, and automatically proposed a single, efficiently implementable estimator for each identified characteristic. For bioelectronic medicines, this feature-based pre-processing method enables online signal characterisation of low-resolution data where spike sorting is impossible but simple power measures discard informative structure. Analyses were based on surrogate data from a self-developed and flexibly adaptable computer model that we have made publicly available. The wider utility of two feature-based analysis methods developed in this work was demonstrated on a variety of datasets from across science and industry. (1) Our feature-based generation of interpretable low-dimensional embeddings for unknown time-series datasets answers a need for simplifying and harvesting the growing body of sequential data that characterises modern science. (2) We propose an additional, supervised pipeline to tailor feature subsets to collections of classification problems. On a literature-standard library of time-series classification tasks, we distilled 22 generically useful estimators and made them easily accessible.
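    The feature-based embedding idea can be sketched as follows: summarize each recording with a handful of cheaply implementable statistics, then reduce the standardized feature matrix to a low-dimensional map. The toy feature set below stands in for the distilled estimators mentioned above; all names are ours.

```python
# Feature-based low-dimensional embedding of time-series recordings.
import numpy as np
from sklearn.decomposition import PCA

def features(x):
    """A toy stand-in for efficiently implementable signal estimators."""
    dx = np.diff(x)
    return [x.mean(), x.std(),
            np.abs(dx).mean(),                 # roughness
            (dx[:-1] * dx[1:] < 0).mean(),     # turning-point rate
            np.corrcoef(x[:-1], x[1:])[0, 1]]  # lag-1 autocorrelation

def embed(recordings, dims=2):
    F = np.array([features(np.asarray(r)) for r in recordings])
    F = (F - F.mean(0)) / (F.std(0) + 1e-12)   # standardize features
    return PCA(n_components=dims).fit_transform(F)
```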

    Dynamical structure in neural population activity

    The question of how the collective activity of neural populations in the brain gives rise to complex behaviour is fundamental to neuroscience. At the core of this question lie considerations about how neural circuits can perform computations that enable sensory perception, motor control, and decision making. It is thought that such computations are implemented by the dynamical evolution of distributed activity in recurrent circuits. Thus, identifying and interpreting dynamical structure in neural population activity is a key challenge towards a better understanding of neural computation. In this thesis, I make several contributions in addressing this challenge. First, I develop two novel methods for neural data analysis. Both methods aim to extract trajectories of low-dimensional computational state variables directly from the unbinned spike-times of simultaneously recorded neurons on single trials. The first method separates inter-trial variability in the low-dimensional trajectory from variability in the timing of progression along its path, and thus offers a quantification of inter-trial variability in the underlying computational process. The second method simultaneously learns a low-dimensional portrait of the underlying nonlinear dynamics of the circuit, as well as the system's fixed points and locally linearised dynamics around them. This approach facilitates extracting interpretable low-dimensional hypotheses about computation directly from data. Second, I turn to the question of how low-dimensional dynamical structure may be embedded within a high-dimensional neurobiological circuit with excitatory and inhibitory cell-types. I analyse how such circuit-level features shape population activity, with particular focus on responses to targeted optogenetic perturbations of the circuit. Third, I consider the problem of implementing multiple computations in a single dynamical system. I address this in the framework of multi-task learning in recurrently connected networks and demonstrate that a careful organisation of low-dimensional, activity-defined subspaces within the network can help to avoid interference across tasks.
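    The thesis's methods operate on unbinned spike times, but the standard baseline for seeing low-dimensional dynamical structure, sketched below under that caveat, is to bin, smooth, and project population activity onto a few principal components; the function and parameter names are ours.

```python
# Baseline low-dimensional state trajectory from binned, smoothed spikes.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.decomposition import PCA

def trajectory(spike_times, t_max, bin_ms=10, smooth_bins=3, dims=3):
    """spike_times: list of per-neuron spike-time arrays (ms).
    Returns a (time, dims) trajectory of population state."""
    edges = np.arange(0.0, t_max + bin_ms, bin_ms)
    rates = np.stack([np.histogram(st, bins=edges)[0] for st in spike_times])
    rates = gaussian_filter1d(rates.astype(float), smooth_bins, axis=1)
    return PCA(n_components=dims).fit_transform(rates.T)
```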