28 research outputs found

    What are the true clusters?

    Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about what requirements and what idea of "true clusters" their research is based on, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to defining a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, and on the choice of a clustering method and related decisions in practice.

    Exploring behaviour patterns with self-organizing map for personalised mental stress detection

    Abstract. Stress is an important health problem and a cause of many illnesses and lost working days. It is often measured with questionnaires that capture only current stress levels and may come too late for early prevention. Questionnaires are also prone to subjective inaccuracies, since both the feeling of stress and the physiological response to it have been found to be individual. Real-time stress detectors trained on biosignals such as heart rate variability exist, but the majority of them employ supervised learning, which requires collecting a large amount of labelled data from each system user. They are also commonly tested in situations where the stress response is deliberately induced (e.g. in the laboratory), so they may not generalise to real-life conditions where more general behavioural data could be used. In this study, the issues of labelling and individuality are addressed by fitting unsupervised stress detection models at several levels of personalisation. The method explored, the Self-Organizing Map (SOM), is combined with different clustering algorithms to find personal, semi-personal and general behaviour patterns that are converted to stress predictions. Laboratory biosignal data are used for method validation. To provide always-on stress detection, experiments are also run on real-life behavioural data consisting of biosignals and smartphone data. The results show that personalisation improves the predictions. The best classification performance for the laboratory data was achieved with the fully personalised model (F1-score 0.89 vs. 0.45 with the general model), but for the real-life data there was little difference between the fully personal model (F1-score 0.57) and the general model, as long as the behaviour patterns were mapped to stress individually (F1-score 0.60).
    While the scores also validate the feasibility of the SOM for mental stress detection, further research is needed to determine the most suitable and practical level of personalisation and an unambiguous mapping between behaviour patterns and stress.
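    The abstract above does not include the SOM implementation itself, so the following is only a minimal sketch of the general idea: train a small self-organizing map on unlabelled "behaviour" vectors, then treat map units as candidate behaviour patterns by mapping samples to their best-matching unit. All function names, grid sizes, and decay schedules here are illustrative assumptions, not the study's actual setup.

    ```python
    import numpy as np

    def train_som(data, grid=(4, 4), iters=1000, lr0=0.5, sigma0=1.5, seed=0):
        """Fit a small SOM codebook to `data` (n_samples x n_features)."""
        rng = np.random.default_rng(seed)
        h, w = grid
        codebook = rng.normal(size=(h * w, data.shape[1]))
        coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
        for t in range(iters):
            x = data[rng.integers(len(data))]                    # random sample
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))   # best-matching unit
            frac = 1.0 - t / iters                               # linear decay
            lr, sigma = lr0 * frac, sigma0 * frac + 1e-3
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)       # grid distances
            nb = np.exp(-d2 / (2 * sigma ** 2))                  # neighbourhood kernel
            codebook += lr * nb[:, None] * (x - codebook)        # pull units toward x
        return codebook

    def best_unit(codebook, x):
        """Index of the map unit (behaviour pattern) closest to sample x."""
        return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))
    ```

    On simulated data with two well-separated behaviour regimes, samples from the two regimes land on different map units, which is the property the personalised models exploit when mapping patterns to stress labels.
    
    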

    Deep learning for clustering of multivariate clinical patient trajectories with missing values

    BACKGROUND: Precision medicine requires a stratification of patients by disease presentation that is sufficiently informative to allow for selecting treatments on a per-patient basis. For many diseases, such as neurological disorders, this stratification problem translates into a complex problem of clustering multivariate and relatively short time series, because (i) these diseases are multifactorial and not well described by single clinical outcome variables and (ii) disease progression needs to be monitored over time. Additionally, clinical data are often hindered by the presence of many missing values, further complicating any clustering attempts. FINDINGS: The problem of clustering multivariate short time series with many missing values is generally not well addressed in the literature. In this work, we propose a deep learning-based method to address this issue, variational deep embedding with recurrence (VaDER). VaDER relies on a Gaussian mixture variational autoencoder framework, which is further extended to (i) model multivariate time series and (ii) directly deal with missing values. We validated VaDER by accurately recovering clusters from simulated and benchmark data with known ground truth clustering, while varying the degree of missingness. We then used VaDER to successfully stratify patients with Alzheimer disease and patients with Parkinson disease into subgroups characterized by clinically divergent disease progression profiles. Additional analyses demonstrated that these clinical differences reflected known underlying aspects of Alzheimer disease and Parkinson disease. CONCLUSIONS: We believe our results show that VaDER can be of great value for future efforts in patient stratification, and multivariate time-series clustering in general.
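    VaDER itself is a recurrent Gaussian-mixture variational autoencoder; the abstract does not spell out how missing values enter the objective, so the snippet below only illustrates the generic idea behind "directly dealing with missing values": restrict the reconstruction loss to entries that were actually observed, so imputed or absent measurements contribute no gradient. The function name and shapes are assumptions for illustration, not VaDER's API.

    ```python
    import numpy as np

    def masked_mse(x, x_hat, observed):
        """Mean squared reconstruction error over observed entries only.

        x, x_hat : arrays of shape (time_steps, variables)
        observed : boolean mask, True where a value was actually measured
        """
        mask = observed.astype(float)
        se = (mask * (x - x_hat)) ** 2   # unobserved entries contribute zero
        return se.sum() / mask.sum()     # normalise by number of observations
    ```

    With this kind of masking, a model can be trained on patient trajectories of varying completeness without a separate imputation step.
    
    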

    Doctor of Philosophy

    With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from data by discovering the patterns and structures present in it. In this dissertation, we largely focus on clustering, which is often the first step in any exploratory data mining task: items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature of the task, i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges, we focus on developing a framework that can generate, compare, combine, and evaluate different solutions in order to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on recent ideas on efficient representations of clusters of partitions to develop a well-founded, spatially aware metric for comparing clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions into a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction.
    We present a geometric approach by defining regions of influence for data items and clusters, and use this to develop adaptive sampling techniques that speed up machine learning algorithms. This dissertation is therefore a systematic approach to studying the landscape of clusterings in an attempt to provide a better understanding of the data.
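    The dissertation's metric is spatially aware, which the abstract does not specify further; as a baseline illustration of what "comparing clusterings" means, here is the classic pair-counting Rand index, which scores agreement between two partitions while ignoring spatial location entirely. It is a reference point, not the dissertation's metric.

    ```python
    from itertools import combinations

    def rand_index(labels_a, labels_b):
        """Fraction of point pairs on which two clusterings agree,
        i.e. both place the pair together, or both place it apart."""
        pairs = list(combinations(range(len(labels_a)), 2))
        agree = sum(
            (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
            for i, j in pairs
        )
        return agree / len(pairs)
    ```

    Identical partitions score 1.0; the limitation motivating a spatially aware metric is that two clusterings that disagree only on points near cluster boundaries are penalised as heavily as clusterings that disagree on well-separated points.
    
    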

    Neurobiological Divergence of the Positive and Negative Schizophrenia Subtypes Identified on a New Factor Structure of Psychopathology Using Non-negative Factorization: An International Machine Learning Study

    Objective: Disentangling psychopathological heterogeneity in schizophrenia is challenging, and previous results remain inconclusive. We employed advanced machine learning to identify a stable and generalizable factorization of the Positive and Negative Syndrome Scale (PANSS), and used it to identify psychopathological subtypes as well as their neurobiological differentiation. Methods: PANSS data from the Pharmacotherapy Monitoring and Outcome Survey cohort (1545 patients, 586 followed up after 1.35 ± 0.70 years) were used to learn the factor structure by an orthonormal projective non-negative factorization. An international sample, pooled from nine medical centers across Europe, the USA, and Asia (490 patients), was used for validation. Patients were clustered into psychopathological subtypes based on the identified factor structure, and the neurobiological divergence between the subtypes was assessed by classification analysis on functional MRI connectivity patterns. Results: A four-factor structure representing negative, positive, affective, and cognitive symptoms was identified as the most stable and generalizable representation of psychopathology. It showed higher internal consistency than the original PANSS subscales and previously proposed factor models. Based on this representation, the positive-negative dichotomy was confirmed as the only robust psychopathological subtyping, and these subtypes were longitudinally stable in about 80% of the repeatedly assessed patients. Finally, the individual subtype could be predicted with good accuracy from functional connectivity profiles of the ventromedial frontal cortex, temporoparietal junction, and precuneus. Conclusions: Machine learning applied to multi-site data with cross-validation yielded a factorization generalizable across populations and medical systems.
    Together with the subtyping and the demonstrated ability to predict subtype membership from neuroimaging data, this work further disentangles the heterogeneity in schizophrenia.
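    The study uses an orthonormal projective non-negative factorization, whose update rules the abstract does not give. To illustrate the underlying idea of decomposing non-negative symptom-score matrices into additive factors, here is the standard Lee–Seung multiplicative-update NMF (X ≈ W H with W, H ≥ 0); it is a related baseline, not the authors' orthonormal projective variant, and all parameter choices below are illustrative.

    ```python
    import numpy as np

    def nmf(X, k, iters=500, seed=0, eps=1e-9):
        """Standard NMF via multiplicative updates: X ≈ W @ H, W, H >= 0."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.random((n, k)) + eps
        H = rng.random((k, m)) + eps
        for _ in range(iters):
            H *= (W.T @ X) / (W.T @ W @ H + eps)   # update factor loadings
            W *= (X @ H.T) / (W @ H @ H.T + eps)   # update factor scores
        return W, H
    ```

    The multiplicative form keeps both factors non-negative throughout, which is what makes the resulting factors interpretable as additive symptom dimensions; the orthonormal projective variant additionally constrains the factor matrix toward orthonormality so that items load cleanly on one factor.
    
    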

    Graph-based Methods for Visualization and Clustering

    The amount of data that we produce and consume is larger than at any point in the history of mankind, and it keeps growing exponentially. All this information, gathered in overwhelming volumes, often comes with two problematic characteristics: it is complex and deprived of semantic context. A common step to address these issues is to embed the raw data in lower dimensions, by finding a mapping which preserves the similarity between data points from their original space to a new one. Measuring similarity between large sets of high-dimensional objects is, however, problematic for two main reasons: first, high-dimensional points are subject to the curse of dimensionality, and second, the number of pairwise distances between points is quadratic in the number of data points. Both problems can be addressed by using nearest-neighbour graphs to understand the structure in the data. As a matter of fact, most dimensionality reduction methods use similarity matrices that can be interpreted as graph adjacency matrices. Yet, despite recent progress, dimensionality reduction is still very challenging when applied to very large datasets. Indeed, although recent methods specifically address the problem of scalability, processing datasets of millions of elements remains a very lengthy process. In this thesis, we propose new contributions which address the problem of scalability using the framework of Graph Signal Processing (GSP), which extends traditional signal processing to graphs. We do so motivated by the premise that graphs are well suited to represent the structure of the data. In the first part of this thesis, we look at quantitative measures for the evaluation of dimensionality reduction methods. Using tools from graph theory and Graph Signal Processing, we show that specific characteristics related to quality can be assessed by taking measures on the graph, which indirectly validates the hypothesis relating graphs to structure.
    The second contribution is a new method for fast eigenspace approximation of the graph Laplacian. Using principles of GSP and random matrices, we show that an approximate eigensubspace can be recovered very efficiently, which can be used for fast spectral clustering or visualization. Next, we propose a compressive scheme to accelerate any dimensionality reduction technique. The idea is based on compressive sampling and transductive learning on graphs: after computing the embedding for a small subset of data points, we propagate the information everywhere using transductive inference. The key components of this technique are a good sampling strategy to select the subset and the application of transductive learning on graphs. Finally, we address the problem of over-discriminative feature spaces by proposing a hierarchical clustering structure combined with multi-resolution graphs. Using efficient coarsening and refinement procedures on this structure, we show that dimensionality reduction algorithms can be run on intermediate levels and up-sampled to all points, leading to a very fast dimensionality reduction method. For all contributions, we provide extensive experiments on both synthetic and natural datasets, including large-scale problems. This allows us to show the pertinence of our models and the validity of our proposed algorithms. Following reproducibility principles, we provide everything needed to repeat the examples and the experiments presented in this work.
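    The thesis's contribution is a fast approximation of the Laplacian eigenspace; the sketch below shows only the exact, dense baseline it accelerates: build a symmetric k-nearest-neighbour graph, form the combinatorial Laplacian, and embed points via the eigenvectors of its smallest eigenvalues. Function names and the choice of k are illustrative assumptions; a dense O(n²) distance matrix is exactly what the thesis's methods are designed to avoid at scale.

    ```python
    import numpy as np

    def knn_graph(X, k=5):
        """Symmetric k-nearest-neighbour adjacency matrix (dense, exact)."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
        np.fill_diagonal(d2, np.inf)                          # no self-edges
        A = np.zeros_like(d2)
        nn = np.argsort(d2, axis=1)[:, :k]                    # k closest per point
        A[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
        return np.maximum(A, A.T)                             # make graph undirected

    def spectral_embedding(A, dim=2):
        """Embed nodes with eigenvectors of the smallest Laplacian eigenvalues."""
        L = np.diag(A.sum(axis=1)) - A   # combinatorial Laplacian L = D - A
        _, vecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
        return vecs[:, :dim]
    ```

    On data with two well-separated groups, the kNN graph splits into two components, the embedding collapses each component to a single point, and any clustering of the embedded rows recovers the groups; the thesis's random-matrix approach recovers a comparable eigensubspace without the full eigendecomposition.
    
    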