22 research outputs found

    Biclustering electronic health records to unravel disease presentation patterns

    Get PDF
    Master's thesis, Ciência de Dados, Universidade de Lisboa, Faculdade de Ciências, 2019. Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with highly variable presentation patterns. Given the heterogeneous nature of ALS patients and targeting a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of functional decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, Machine Learning models able to unravel the complexity of disease presentation patterns are paramount for disease understanding, targeting improved patient care and longer survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a given model's result before making a decision that can impact a patient's life.
    We therefore aim to unravel disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which combines Biclustering, Biclustering-based Classification and Class Association Rule Mining. These patterns (called Metafeatures) are composed of discriminative subsets of features together with their values, allowing us to distinguish and characterize subgroups of patients with similar disease presentation patterns. The Electronic Health Record (EHR) data used in this work come from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, comprising standardized questionnaire answers on risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls followed by ENCALS (European Network to Cure ALS), a consortium of several European countries, including Portugal. In this work the proposed methodology was applied to the Portuguese part of the ONWebDUALS dataset to find disease presentation patterns that: 1) distinguish the ALS patients from their controls, and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. However, in the second task the patterns found for each of the three progression groups were recognized and validated by ALS expert clinicians as relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering-based approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases.
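
    As a rough illustration of the DMD idea, the hedged sketch below biclusters a binary patient-by-feature matrix and then scores each bicluster's feature subset as a class association rule against a progression label. The use of scikit-learn's SpectralCoclustering, the synthetic data and the thresholds are assumptions for illustration only; the thesis's actual biclustering and rule-mining algorithms may differ.

```python
# Minimal sketch: bicluster patients x features, then treat each bicluster's
# feature subset as a candidate class-association rule for a progression group.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20))            # 100 patients x 20 binary features
y = rng.choice(["Slow", "Neutral", "Fast"], 100)  # progression labels

model = SpectralCoclustering(n_clusters=5, random_state=0).fit(X)

for k in range(5):
    rows = np.where(model.rows_[k])[0]     # patients in bicluster k
    cols = np.where(model.columns_[k])[0]  # features in bicluster k
    if len(rows) == 0:
        continue
    # Confidence of the rule "bicluster pattern => class c" for each class.
    for c in ("Slow", "Neutral", "Fast"):
        conf = np.mean(y[rows] == c)
        support = len(rows) / len(y)
        if conf > 0.5:  # arbitrary cutoff for a "discriminative" metafeature
            print(f"bicluster {k}: features {cols.tolist()} -> {c} "
                  f"(confidence={conf:.2f}, support={support:.2f})")
```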

    Genome classification by gene distribution: An overlapping subspace clustering approach

    Get PDF
    Background: Genomes of lower organisms harbour a large number of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes based on gene order and genome position, which helps to reveal species relationships that may not be apparent from traditional phylogenetic methods. Results: We propose the use of an overlapping subspace clustering algorithm for such genome classification problems. The advantage of subspace clustering over traditional clustering is that it can associate clusters with gene arrangement patterns, preserving genomic information in the clusters produced. Additionally, overlapping capability is desirable for the discovery of multiple conserved patterns within a single genome, such as those acquired from different species via horizontal gene transfers. The proposed method involves a novel strategy to vectorize genomes based on their gene distribution. A number of existing subspace clustering and biclustering algorithms were evaluated to identify the best framework upon which to develop our algorithm; we extended a generic subspace clustering algorithm called HARP to incorporate overlapping capability. The proposed algorithm was assessed and applied to bacteriophage genomes. The phage grouping results are overall consistent with the Phage Proteomic Tree and showed common genomic characteristics among the TP901-like, Sfi21-like and sk1-like phage groups. Among 441 phage genomes, we identified four significantly conserved distribution patterns structured by the terminase, portal, integrase, holin and lysin genes. We also observed a subgroup of Sfi21-like phages with a distinctively divergent genome organization and identified nine new phage members of the Sfi21-like genus: Staphylococcus 71, phiPVL108, Listeria A118, 2389, Lactobacillus phi AT3, A2, Clostridium phi3626, Geobacillus GBSV1, and Listeria monocytogenes PSA. Conclusion: The method described in this paper can assist evolutionary study by objectively classifying genomes based on their resemblance in gene order, gene content and gene positions. The method is suitable for application to genomes with high genetic exchange and various conserved gene arrangements, as demonstrated through our application to phages.
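
    As a loose illustration of the vectorization step described above, the sketch below maps each genome to a vector of relative gene positions and compares two genomes over the subspace of genes they share. The marker genes, positions and similarity measure are invented for the example; the paper's HARP-based overlapping algorithm is considerably more elaborate.

```python
# Each genome becomes a vector of relative positions (0..1) of marker genes,
# with NaN where a gene is absent; similarity is computed on the shared subspace.
import numpy as np

MARKERS = ["terminase", "portal", "integrase", "holin", "lysin"]

def vectorize(genome_length, gene_positions):
    """Map absolute gene start positions (bp) to fractions of genome length."""
    return np.array([gene_positions.get(g, np.nan) / genome_length
                     for g in MARKERS])

# Two hypothetical phage genomes (positions in bp).
v1 = vectorize(40000, {"terminase": 1000, "portal": 4000, "holin": 30000})
v2 = vectorize(42000, {"terminase": 1200, "portal": 4500, "lysin": 33000})

def subspace_similarity(a, b, tol=0.05):
    """Fraction of genes present in both genomes whose relative positions
    agree within `tol`; returns 0 when no genes are shared."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return 0.0
    return float(np.mean(np.abs(a[shared] - b[shared]) < tol))

print(subspace_similarity(v1, v2))  # agreement over the shared gene subspace
```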

    The shapes of an epidemic: using Functional Data Analysis to characterize COVID-19 in Italy

    Full text link
    We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and socio-demographic, infrastructural and environmental covariates. Notwithstanding limitations in the accuracy and resolution of the data available from public sources, we pinpoint significant trends by exploiting the information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics: an "exponential" one unfolding in Lombardia and the worst-hit areas of the north, and a milder, "flat(tened)" one in the rest of the country -- including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, even when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and workplaces to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.
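
    A minimal sketch of the Functional Data Analysis idea follows: treat each region's daily mortality series as a curve and extract the dominant "shapes" with functional PCA, here approximated by the SVD of the centered curve matrix. The synthetic curves stand in for the public regional series and are assumptions of this example.

```python
# Functional PCA of regional epidemic curves via SVD of centered data.
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(60)
# 20 synthetic regions: some "exponential" curves, some "flattened" ones.
exponential = np.exp(days / 15.0)[None, :] * rng.uniform(0.8, 1.2, (8, 1))
flattened = np.log1p(days)[None, :] * rng.uniform(0.8, 1.2, (12, 1))
curves = np.vstack([exponential, flattened])   # 20 regions x 60 days

centered = curves - curves.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * S                                 # region scores on each FPC

explained = S**2 / np.sum(S**2)
print("variance explained by FPC1:", round(explained[0], 3))
# Regions with large positive FPC1 scores follow the dominant (exponential)
# shape; low or negative scores correspond to the flatter epidemic curves.
print("FPC1 scores:", np.round(scores[:, 0], 1))
```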

    A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels

    Get PDF
    Background: Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using the mixture samples only. However, automatic selection of the extracted components to retain for classification analysis remains an open issue. Results: The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations. Conclusions: We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis; existing methods, by contrast, factorize the complete dataset simultaneously. The sample model is composed of a reference sample, representing the control and/or case (disease) groups, and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control-specific, case-specific or not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated directly from each sample. Due to the locality of the decomposition, the strength of expression of each feature can vary across samples, yet features are still allocated to the related disease- and/or control-specific component. Since label information is not used in the selection process, the case- and control-specific components can be used for classification, which is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease-specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers; unlike with standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from the disease- and control-specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.
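
    A hedged sketch of the sample-wise mixture idea: stack reference profiles with a test sample and factorize the small matrix into nonnegative components under a sparseness penalty, then label components by how strongly they load on the reference versus the test sample. The component count, penalty strength and labeling rule are illustrative assumptions; the paper's estimation and thresholding procedures differ in detail.

```python
# Decompose (control reference, case reference, test sample) into sparse
# nonnegative components and label them by their mixing weights.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
n = 500
control_ref = rng.gamma(2.0, 1.0, n)          # reference control profile
case_ref = control_ref.copy()
case_ref[:25] += rng.gamma(8.0, 1.0, 25)      # features elevated in disease
test = case_ref * rng.uniform(0.9, 1.1, n)    # test sample (a case here)

X = np.vstack([control_ref, case_ref, test])  # 3 "samples" x 500 features
model = NMF(n_components=3, init="nndsvda", alpha_W=0.1, l1_ratio=1.0,
            max_iter=2000, random_state=0)    # L1 penalty for sparseness
W = model.fit_transform(X)                    # 3 x 3 mixing weights
H = model.components_                         # 3 x 500 components

for k in range(3):
    w_ctrl, w_case, w_test = W[:, k]
    if w_case > 2 * w_ctrl and w_test > 2 * w_ctrl:
        label = "case-specific"
    elif w_ctrl > 2 * w_case:
        label = "control-specific"
    else:
        label = "neutral (not differentially expressed)"
    print(f"component {k}: {label} "
          f"(ctrl={w_ctrl:.2f}, case={w_case:.2f}, test={w_test:.2f})")
```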

    Comparative analysis of gene duplications and their impact on expression levels in nematode genomes

    Get PDF
    Gene duplication is a major mechanism that plays a vital role in evolutionary innovations, ranging from generating novel traits to phenotypic plasticity. The evolutionary impact of gene duplication and the fate of duplicated genes have been studied in detail; however, little is known about the impact of gene duplication on gene expression across different evolutionary time scales. Here, we study genome-wide patterns of gene duplication in nematodes and assess their effect on expression levels. This study encompasses macroevolutionary comparisons at different time scales and microevolutionary comparisons within the species Pristionchus pacificus. At the macroevolutionary level, by comparing species separated by more than 280 million years, we found various lineage-specific expansions in multiple gene families along the Pristionchus lineage. Moreover, we found that duplicated genes are highly enriched among developmentally regulated genes. Interestingly, the results also show evidence for selection on duplications that increase gene expression levels in a developmental stage-specific manner. To gain insight into the microevolution of gene expression levels after gene duplication, we compared different strains of P. pacificus and found that an additional gene copy usually does not increase gene expression levels in the different strains. Furthermore, we found a strong depletion of duplicated genes in large parts of the P. pacificus genome, pointing towards negative selection against gene duplication. This shows that the impact on gene expression levels following gene duplication differs dramatically: selection for increased gene dosage dominates at the macroevolutionary level, while negative selection on gene duplication dominates within species. This led us to wonder what happens at intermediate time scales. We compared recent duplicates of P. pacificus with their single-copy orthologs in two closely related species and found a pattern similar to the microevolutionary trend. Additionally, comparison of closely related species of the genus Strongyloides and its developmental transcriptome also shows an overall strong depletion of duplicated genes, similar to the observation at the microevolutionary level. At the same time, a strong enrichment of duplicated genes was found at a developmental stage associated with the parasitic activity of these nematodes. Mirroring the macroevolutionary picture in P. pacificus, we also found selection for higher gene dosage in parasitism-associated gene families of S. papillosus, indicating the adaptive potential of duplicated genes. Even though these studies show widespread selection against both duplication and changes in gene expression, duplications are favoured under some conditions, leading to adaptive changes in the organism. Overall, this indicates that the regulation of expression levels of duplicated genes has been subject to different selection processes at different time scales, reflecting a complex interplay between evolutionary processes such as natural selection, population dynamics, and genetic drift.
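
    One core comparison in this kind of study is whether genes with duplicates are expressed differently from single-copy genes. The sketch below runs that test on synthetic stand-in data (the expression values and copy-number calls are invented); the thesis works with real nematode transcriptomes and orthology assignments.

```python
# Compare expression of duplicated vs. single-copy genes with a rank test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
n_genes = 2000
copy_number = rng.choice([1, 2, 3], size=n_genes, p=[0.8, 0.15, 0.05])
# Synthetic log-expression; give duplicated genes a small upward shift.
log_expr = rng.normal(5.0, 1.5, n_genes) + 0.3 * (copy_number > 1)

dup = log_expr[copy_number > 1]
single = log_expr[copy_number == 1]
stat, p = mannwhitneyu(dup, single, alternative="two-sided")
print(f"duplicated median={np.median(dup):.2f}, "
      f"single-copy median={np.median(single):.2f}, p={p:.2e}")
```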

    Computational methods to study gene regulation in humans using DNA and RNA sequencing data

    Get PDF
    Genes work in a coordinated fashion to perform complex functions. Disruption of gene regulatory programs can result in disease, highlighting the importance of understanding them. We can leverage large-scale DNA and RNA sequencing data to decipher gene regulatory relationships in humans. In this thesis, we present three projects on the regulation of gene expression by other genes and by genetic variants, using two computational frameworks: co-expression networks and expression quantitative trait loci (eQTL). First, we investigate the effect of alignment errors in RNA sequencing on detecting trans-eQTLs and gene co-expression. We demonstrate that misalignment due to sequence similarity between genes may result in over 75% false positives in a standard trans-eQTL analysis, and it also produces a higher-than-background fraction of potential false positives in conventional co-expression studies. These false-positive associations are likely to misleadingly replicate between studies. We present a metric, cross-mappability, to detect and avoid such false positives. Next, we focus on the joint regulation of transcription and splicing in humans. We present a framework called transcriptome-wide networks (TWNs) for combining total expression of genes and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We build TWNs for 16 human tissues and show that hubs with multiple isoform neighbors in these networks are candidate alternative splicing regulators. We then study the tissue-specificity of network edges. Using these networks, we detect 20 genetic variants with distant regulatory impact. Finally, we present a novel network inference method, SPICE, to study the regulation of transcription. Using maximum spanning trees, SPICE prioritizes potential direct regulatory relationships between genes. We also formulate a comprehensive set of metrics using biological data to establish a standard for evaluating biological networks. According to most of these metrics, SPICE outperforms current popular network inference methods when applied to RNA-sequencing data from diverse human tissues.
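
    The sketch below illustrates how a cross-mappability-style metric can be used to filter candidate trans-eQTL gene pairs: if reads from gene A can mis-align to gene B, an apparent association between A's local variant and B's expression is suspect. The matrix values and threshold here are invented for illustration; the thesis defines cross-mappability precisely from k-mer alignability between gene pairs.

```python
# Flag trans-eQTL candidates whose gene pair has high cross-mappability.
import numpy as np

genes = ["GENE_A", "GENE_B", "GENE_C"]
# cross_map[i, j]: how strongly reads of gene i mis-map to gene j (assumed).
cross_map = np.array([[0.0, 42.0, 0.0],
                      [37.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]])

candidate_trans_pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]
THRESHOLD = 1.0  # illustrative cutoff

for cis_gene, trans_gene in candidate_trans_pairs:
    i, j = genes.index(cis_gene), genes.index(trans_gene)
    if max(cross_map[i, j], cross_map[j, i]) > THRESHOLD:
        print(f"{cis_gene} -> {trans_gene}: flagged (possible mis-mapping)")
    else:
        print(f"{cis_gene} -> {trans_gene}: kept")
```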

    Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2 Applications and Future Perspectives

    Full text link
    Part 2 of this monograph builds on the introduction to tensor networks and their operations presented in Part 1. It focuses on tensor network models for super-compressed higher-order representation of data/parameters and related cost functions, while providing an outline of their applications in machine learning and data analytics. A particular emphasis is on the tensor train (TT) and Hierarchical Tucker (HT) decompositions, and their physically meaningful interpretations which reflect the scalability of the tensor network approach. Through a graphical approach, we also elucidate how, by virtue of the underlying low-rank tensor approximations and sophisticated contractions of core tensors, tensor networks have the ability to perform distributed computations on otherwise prohibitively large volumes of data/parameters, thereby alleviating or even eliminating the curse of dimensionality. The usefulness of this concept is illustrated over a number of applied areas, including generalized regression and classification (support tensor machines, canonical correlation analysis, higher order partial least squares), generalized eigenvalue decomposition, Riemannian optimization, and in the optimization of deep neural networks. Part 1 and Part 2 of this work can be used either as stand-alone separate texts, or indeed as a conjoint comprehensive review of the exciting field of low-rank tensor networks and tensor decompositions.
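
    For concreteness, here is a compact sketch of the tensor-train (TT) decomposition emphasized in the monograph, computed with the standard TT-SVD procedure of sequential truncated SVDs. The tensor shape and rank cap are arbitrary choices for the demonstration.

```python
# TT-SVD: decompose a tensor into a train of 3-way cores via repeated SVDs.
import numpy as np

def tt_svd(tensor, max_rank):
    """Return TT cores of shape (r_prev, n_k, r_k) for `tensor`."""
    shape = tensor.shape
    cores = []
    r_prev = 1
    C = np.asarray(tensor)
    for k in range(len(shape) - 1):
        C = C.reshape(r_prev * shape[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(S))            # truncate to the rank cap
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = S[:r, None] * Vt[:r]             # carry the remainder forward
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract TT cores back into the full tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]                   # drop boundary ranks of size 1

T = np.random.default_rng(4).normal(size=(4, 5, 6))
cores = tt_svd(T, max_rank=10)               # full rank: exact reconstruction
print(np.allclose(T, tt_reconstruct(cores))) # True up to numerical precision
```

    With a smaller `max_rank` the same routine yields the low-rank compression the text describes, trading reconstruction error for storage that grows linearly rather than exponentially in the tensor order.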

    Characterizing the Huntington's disease, Parkinson's disease, and pan-neurodegenerative gene expression signature with RNA sequencing

    Get PDF
    Huntington's disease (HD) and Parkinson's disease (PD) are devastating neurodegenerative disorders that are characterized pathologically by degeneration of neurons in the brain and clinically by loss of motor function and cognitive decline in mid to late life. The cause of neuronal degeneration in these diseases is unclear, but both are histologically marked by aggregation of specific proteins in specific brain regions. In HD, fragments of a mutant Huntingtin protein aggregate and cause medium spiny interneurons of the striatum to degenerate. In contrast, PD brains exhibit aggregation of toxic fragments of the alpha-synuclein protein throughout the central nervous system, triggering degeneration of dopaminergic neurons in the substantia nigra. Considering the commonalities and differences between these diseases, identifying common biological patterns across HD and PD, as well as signatures unique to each, may provide significant insight into the molecular mechanisms underlying neurodegeneration as a general process. State-of-the-art high-throughput sequencing technology allows for unbiased, whole-genome quantification of RNA molecules within a biological sample that can be used to assess the level of activity, or expression, of thousands of genes simultaneously. In this thesis, I present three studies characterizing the RNA expression profiles of post-mortem HD and PD subjects using high-throughput mRNA sequencing data sets. The first study describes an analysis of differential expression between HD individuals and neurologically normal controls that indicates a widespread increase in immune, neuroinflammatory, and developmental gene expression. The second study expands upon the first by making methodological improvements and extends the differential expression analysis to include PD subjects, with the goal of comparing and contrasting HD and PD gene expression profiles. This study was designed to identify common mechanisms underlying the neurodegenerative phenotype, transcending those of each unique disease, and has revealed specific biological processes, in particular those related to NFkB inflammation, common to HD and PD. The last study describes a novel methodology for combining mRNA and miRNA expression that seeks to identify associations between mRNA-miRNA modules and continuous clinical variables of interest, including CAG repeat length and clinical age of onset in HD.
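
    A minimal sketch of the core differential-expression step used throughout these studies follows: per-gene tests between disease and control samples with Benjamini-Hochberg correction. The toy data and the simplified t-test are assumptions of this example; real RNA-seq analyses typically use count-based models such as the negative binomial.

```python
# Per-gene differential expression with FDR control on synthetic data.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
n_genes, n_hd, n_ctrl = 1000, 20, 20
hd = rng.normal(5.0, 1.0, (n_genes, n_hd))
ctrl = rng.normal(5.0, 1.0, (n_genes, n_ctrl))
hd[:50] += 1.0                       # 50 genes truly up-regulated in HD

_, pvals = ttest_ind(hd, ctrl, axis=1)
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes called differentially expressed at FDR 0.05")
```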

    Development and Validation of a Proof-of-Concept Prototype for Analytics-based Malicious Cybersecurity Insider Threat in a Real-Time Identification System

    Get PDF
    Insider threat has continued to be one of the most difficult cybersecurity threat vectors to detect with contemporary technologies. Most organizations apply standard technology-based practices to detect unusual network activity. While there have been significant advances in intrusion detection systems (IDS) as well as security incident and event management solutions (SIEM), these technologies fail to take into consideration the human aspects of personality and emotion in computer use and network activity, even though insider threats are human-initiated. External influencers impact how an end-user interacts with both colleagues and organizational resources. Taking into consideration external influencers, such as personality and changes in organizational policies and structure, along with analysis of unusual technical activity, would be an improvement over contemporary detection tools used for identifying at-risk employees. This would allow upper management or other organizational units to intervene before a malicious cybersecurity insider threat event occurs, or to mitigate it quickly once initiated. The main goal of this research study was to design, develop, and validate a proof-of-concept prototype for a malicious cybersecurity insider threat alerting system to assist in the rapid detection and prediction of human-centric precursors to malicious insider threat activity. Disgruntled employees or end-users wishing to cause harm to the organization may do so by abusing the trust given to them through their access to available network and organizational resources. Reports on malicious insider threat actions indicate that insider threat attacks make up roughly 23% of all cybercrime incidents, resulting in $2.9 trillion in employee fraud losses globally. The damage and negative impact that insider threats cause were reported to be higher than those of outsider or other types of cybercrime incidents. Consequently, this study utilized weighted indicators to measure and correlate simulated user activity with possible precursors to malicious cybersecurity insider threat attacks. The study followed a mixed-method approach combining an expert panel, developmental research, and quantitative data analysis using the developed tool on a simulated data set. To assure the validity and reliability of the indicators, a panel of subject matter experts (SMEs) reviewed the indicators and indicator categorizations collected from prior literature, following the Delphi technique. The SMEs' responses were incorporated into the development of the proof-of-concept prototype. Once the prototype was completed and fully tested, an empirical simulation study was conducted utilizing simulated user activity within a 16-month time frame. The results of the empirical simulation study were analyzed and presented, and recommendations resulting from the study are also provided.
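
    A hedged sketch of the weighted-indicator idea behind such an alerting prototype: combine technical and behavioral indicator scores with expert-derived weights into a per-user risk score and alert above a threshold. The indicator names, weights and threshold below are invented for illustration; the study derived its weights from a Delphi panel of subject matter experts.

```python
# Weighted insider-threat risk scoring over normalized indicator values.
import numpy as np

INDICATORS = ["after_hours_logins", "bulk_file_downloads",
              "policy_violations", "disgruntlement_signals"]
WEIGHTS = np.array([0.15, 0.35, 0.25, 0.25])  # assumed expert weights
ALERT_THRESHOLD = 0.6

def risk_score(indicator_values):
    """indicator_values: per-indicator scores normalized to [0, 1]."""
    return float(np.dot(WEIGHTS, indicator_values))

users = {
    "user_a": np.array([0.2, 0.1, 0.0, 0.1]),
    "user_b": np.array([0.8, 0.9, 0.5, 0.7]),
}
for user, values in users.items():
    score = risk_score(values)
    status = "ALERT" if score > ALERT_THRESHOLD else "ok"
    print(f"{user}: risk={score:.2f} [{status}]")
```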