6 research outputs found

    A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis

    Funding Information: The authors acknowledge Fundação para a Ciência e Tecnologia, LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020, and Instituto de Telecomunicações Research Unit, ref. UIDB/50008/2020 and UIDP/50008/2020. The authors also acknowledge the Project PREDICT (PTDC/CCI-CIF/29877/2017), funded by Fundo Europeu de Desenvolvimento Regional (FEDER), through Programa Operacional Regional LISBOA (LISBOA2020), and by national funds, through Fundação para a Ciência e Tecnologia (FCT), and projects MATISSE (DSAIPA/DS/0026/2019), MONET (PTDC/CCI-BIO/4180/2020) and SmartGlauco (PTDC/CTM-REF/2679/2020). Publisher Copyright: © 2023 by the authors.

    The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually rely on distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command-line tool, the module builds on it by providing new text file compressors, a new compression-by-column option for tabular data such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder can augment categorical data, allowing the NCD to be used with new data types. One advantage of this feature is that it requires no knowledge or context of the data.
    Furthermore, because the proposed module is written in Python, one of the most popular programming languages for machine learning, developers can readily adopt it to tackle problems with a new, compression-based approach. The pipeline was tested on clinical data and proved a promising computational strategy, stratifying patients into clusters in support of precision medicine.
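
As an illustration of the distance described in this abstract, the sketch below computes the NCD with its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is a minimal sketch and not the Zgli API: the use of `zlib` as the compressor and the function names `compressed_size`, `ncd`, and `ncd_matrix` are assumptions made for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: list[bytes]) -> list[list[float]]:
    """Symmetric pairwise NCD matrix, usable as input to distance-based clustering.

    The diagonal is set to 0.0 by convention; with a real compressor
    NCD(x, x) is small but usually not exactly zero.
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = ncd(objects[i], objects[j])
            matrix[i][j] = distance
            matrix[j][i] = distance
    return matrix


if __name__ == "__main__":
    repetitive = b"the quick brown fox " * 100
    varied = bytes(range(256)) * 8
    for row in ncd_matrix([repetitive, repetitive, varied]):
        print([round(value, 3) for value in row])
```

A matrix built this way can be passed to any clustering method that accepts precomputed distances, such as the hierarchical clustering or K-medoids mentioned in the abstract.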

    Using compression for profiling rheumatoid arthritis disease progression through data mining techniques

    Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências

    Ankylosing spondylitis (AS) is a chronic autoimmune inflammatory condition belonging to the spondyloarthropathy category of rheumatic diseases, which are highly debilitating and have a high impact on patients' physical and mental health as well as their social life and quality of life. Biological treatment for this pathology is difficult to choose and lacks clear selection criteria; usually, treatment is chosen based on patient convenience. Our goal is to use clustering by compression, an approach based on algorithmic information theory that has no domain-specific parameters to set and requires no background knowledge. We iterate on the current state of the art so that the approach can be better integrated into Python pipelines and better suit our specific problem, and we apply it to data from patients with AS to establish patterns between biological treatments and patient profiles, thereby helping clinicians make a better treatment choice for each patient. Unsupervised clustering models are generated using normalized compression distance matrices and are then evaluated using V-measure and the adjusted Rand score, and analyzed visually, taking into account the model contingency matrix and the feature distribution per cluster. Possible patterns between biological treatment success and patient profiles were identified. Furthermore, we observed that the compression by column developed and implemented in this new tool for clustering by compression seemed to yield better results than the previous approach.
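
One of the evaluation measures named in this abstract, the adjusted Rand score, can be computed directly from the contingency matrix between reference labels and cluster assignments. The sketch below uses the standard Hubert-Arabie formula in pure Python; it is an illustrative sketch rather than the thesis's own evaluation code, and the function name `adjusted_rand_index` and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (up to label renaming); values near 0.0
    are what random assignment would give; negative values are possible.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivial agreement

    # Contingency counts: how many items share each (true, predicted) pair.
    contingency = Counter(zip(labels_true, labels_pred))
    row_totals = Counter(labels_true)
    col_totals = Counter(labels_pred)

    pairs_in_cells = sum(comb(count, 2) for count in contingency.values())
    pairs_in_rows = sum(comb(count, 2) for count in row_totals.values())
    pairs_in_cols = sum(comb(count, 2) for count in col_totals.values())

    expected = pairs_in_rows * pairs_in_cols / comb(n, 2)
    maximum = (pairs_in_rows + pairs_in_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (pairs_in_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    reference = ["A", "A", "B", "B"]  # hypothetical reference classes
    clusters = [1, 1, 0, 0]           # hypothetical cluster assignments
    print(adjusted_rand_index(reference, clusters))
```

Because the index depends only on which items are grouped together, renaming cluster labels does not change the score.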

    Applied Metaheuristic Computing

    For decades, Applied Metaheuristic Computing (AMC) has been a prevailing optimization technique for tackling perplexing engineering and business problems, such as scheduling, routing, ordering, bin packing, assignment, and facility layout planning. This is partly because classic exact methods are constrained by prior assumptions, and partly because heuristics are problem-dependent and lack generalization. AMC, by contrast, guides low-level heuristics to search beyond local optima, which limit the capability of traditional computation methods. This topic series has collected quality papers proposing cutting-edge methodology and innovative applications that drive the advances of AMC.
