
    Statistical significance of variables driving systematic variation

    There are a number of well-established methods such as principal components analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of principal components (PCs). The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be utilized to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify statistically significant genes that are cell-cycle regulated. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly-driven phenotype. We find a greater enrichment for inflammatory-related gene sets compared to using a clinically defined phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
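
    The jackstraw idea can be sketched compactly. The following Python snippet is a minimal illustration, under stated assumptions, of the resampling scheme described above: a small number of rows are permuted to break their association with the latent structure, the principal components are re-estimated, and the association statistics of the permuted rows form an empirical null distribution. The choice of an F-like statistic from regressing each variable on the top PCs, the parameter names, and the number of permutation rounds are illustrative assumptions; this sketch is not the authors' reference implementation (an R package).

```python
import numpy as np

def jackstraw_pvalues(Y, r=1, s=10, B=1000, seed=0):
    """Minimal jackstraw sketch: empirical p-values for association between
    each row of Y (variables x samples) and the top r principal components."""
    rng = np.random.default_rng(seed)

    def pc_scores(M):
        # Sample-space PC scores (right singular vectors) of row-centered data.
        Mc = M - M.mean(axis=1, keepdims=True)
        _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
        return Vt[:r].T                                   # shape: samples x r

    def f_stats(M, V):
        # F-like statistic from regressing each centered row of M on V.
        Mc = M - M.mean(axis=1, keepdims=True)
        beta, *_ = np.linalg.lstsq(V, Mc.T, rcond=None)   # r x variables
        fit = (V @ beta).T                                # variables x samples
        ess = (fit ** 2).sum(axis=1)
        rss = ((Mc - fit) ** 2).sum(axis=1)
        df1, df2 = r, M.shape[1] - r - 1
        return (ess / df1) / (rss / df2)

    observed = f_stats(Y, pc_scores(Y))

    null = []
    for _ in range(B):
        Ystar = Y.copy()
        rows = rng.choice(Y.shape[0], size=s, replace=False)
        for i in rows:                                    # permute s rows only
            Ystar[i] = rng.permutation(Ystar[i])
        null.append(f_stats(Ystar, pc_scores(Ystar))[rows])
    null = np.concatenate(null)

    # Empirical p-values against the pooled null statistics.
    return (1 + (null[None, :] >= observed[:, None]).sum(axis=1)) / (1 + null.size)
```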

    Concept Saliency Maps to Visualize Relevant Features in Deep Generative Models

    Evaluating, explaining, and visualizing high-level concepts in generative models, such as variational autoencoders (VAEs), is challenging in part due to a lack of known prediction classes that are required to generate saliency maps in supervised learning. While saliency maps may help identify relevant features (e.g., pixels) in the input for classification tasks of deep neural networks, similar frameworks are understudied in unsupervised learning. Therefore, we introduce a new method of obtaining saliency maps for latent representations of known or novel high-level concepts, often called concept vectors in generative models. Concept scores, analogous to class scores in classification tasks, are defined as dot products between concept vectors and encoded input data, which can be readily used to compute the gradients. The resulting concept saliency maps are shown to highlight input features deemed important for high-level concepts. Our method is applied to the latent space of a VAE trained on the CelebA dataset, in which known attributes such as "smiles" and "hats" are used to elucidate relevant facial features. Furthermore, our application to spatial transcriptomic (ST) data of a mouse olfactory bulb demonstrates the potential of latent representations of morphological layers and molecular features in advancing our understanding of complex biological systems. By extending the popular method of saliency maps to generative models, the proposed concept saliency maps help improve the interpretability of latent variable models in deep learning. Code to reproduce and implement concept saliency maps is available at https://github.com/lenbrocki/concept-saliency-maps.
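
    As a rough illustration of the mechanics described above, the PyTorch sketch below computes a concept saliency map: the concept score is the dot product between a concept vector and the encoded input, and its gradient with respect to the input highlights the relevant features. The encoder, image tensor, and concept vector names are placeholder assumptions; the authors' released code is in the repository linked above.

```python
import torch

def concept_saliency_map(encoder, x, concept_vector):
    """Gradient of the concept score (dot product between a concept vector
    and the encoded input) with respect to the input features."""
    x = x.clone().detach().requires_grad_(True)    # track gradients w.r.t. input
    z = encoder(x)                                 # latent representation, e.g. shape (1, d)
    score = torch.sum(z * concept_vector)          # concept score
    score.backward()
    return x.grad.detach()                         # same shape as the input

# Hypothetical usage: `smile_vector` could be the normalized difference between
# mean latent codes of images with and without the "smiles" attribute.
# saliency = concept_saliency_map(vae_encoder, image.unsqueeze(0), smile_vector)
# heatmap = saliency.abs().sum(dim=1)              # aggregate over color channels
```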

    Jaccard/Tanimoto similarity test and estimation methods

    Binary data are used in a broad area of biological sciences. Using binary presence-absence data, we can evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied. We introduce a hypothesis test for similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. We derive the exact and asymptotic solutions and develop bootstrap and measure concentration algorithms to compute the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard). We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient that enables straightforward incorporation of probabilistic measures into analyses of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
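
    To make the test concrete, below is a minimal Python sketch of a permutation-based version of the idea: the observed Jaccard/Tanimoto coefficient of two presence-absence vectors is compared against a null distribution generated by shuffling one vector, which mimics independent occurrences. This is an illustration only; the exact, asymptotic, bootstrap, and measure concentration procedures in the jaccard R package are not reproduced here, and the centering shown in the comments is an assumption.

```python
import numpy as np

def jaccard_coef(x, y):
    """Jaccard/Tanimoto coefficient of two binary vectors: |A ∩ B| / |A ∪ B|."""
    union = np.sum(x | y)
    return np.sum(x & y) / union if union > 0 else 0.0

def jaccard_permutation_test(x, y, B=10000, seed=0):
    """One-sided p-value for a larger-than-expected Jaccard coefficient,
    using random permutations of y to mimic independent occurrences."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    observed = jaccard_coef(x, y)
    null = np.array([jaccard_coef(x, rng.permutation(y)) for _ in range(B)])
    # Centered coefficient: observed minus its mean under the null of
    # independent occurrences (estimated here by the permutation mean).
    centered = observed - null.mean()
    p_value = (1 + np.sum(null >= observed)) / (1 + B)
    return observed, centered, p_value

# Example: co-occurrence of two species across 200 sites.
# x = np.random.rand(200) < 0.3; y = np.random.rand(200) < 0.4
# obs, centered, p = jaccard_permutation_test(x, y)
```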

    Integration of Radiomics and Tumor Biomarkers in Interpretable Machine Learning Models

    Despite the unprecedented performance of deep neural networks (DNNs) in computer vision, their practical application in the diagnosis and prognosis of cancer using medical imaging has been limited. One of the critical challenges for integrating diagnostic DNNs into radiological and oncological applications is their lack of interpretability, preventing clinicians from understanding the model predictions. Therefore, we study and propose the integration of expert-derived radiomics and DNN-predicted biomarkers in interpretable classifiers, which we call ConRad, for computed tomography (CT) scans of lung cancer. Importantly, the tumor biomarkers are predicted from a concept bottleneck model (CBM) such that, once trained, our ConRad models do not require labor-intensive and time-consuming biomarker annotations. In our evaluation and practical application, the only input to ConRad is a segmented CT scan. The proposed model is compared to convolutional neural networks (CNNs), which act as black-box classifiers. We further evaluated all combinations of radiomics features, predicted biomarkers, and CNN features in five different classifiers. We found that the ConRad models using a nonlinear SVM and logistic regression with the Lasso outperformed the others in five-fold cross-validation, although we highlight that interpretability is ConRad's primary advantage. The Lasso is used for feature selection, which substantially reduces the number of non-zero weights while increasing the accuracy. Overall, the proposed ConRad model combines CBM-derived biomarkers and radiomics features in an interpretable ML model that performs excellently for lung nodule malignancy classification.
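
    As a rough sketch of the final classification stage described above (not the released ConRad code), the snippet below concatenates radiomics features with CBM-predicted biomarker scores and fits a Lasso-penalized logistic regression, evaluated by five-fold cross-validation with scikit-learn. The array names and the regularization strength are placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_malignancy_classifier(radiomics, predicted_biomarkers, labels, C=1.0):
    """Fit and evaluate a Lasso-penalized logistic regression on the
    concatenation of radiomics features and CBM-predicted biomarkers."""
    X = np.hstack([radiomics, predicted_biomarkers])   # (n_nodules, n_features)
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000),
    )
    # Five-fold cross-validated accuracy, mirroring the evaluation described above.
    scores = cross_val_score(clf, X, labels, cv=5, scoring="accuracy")
    clf.fit(X, labels)
    # Non-zero weights indicate the features retained by the Lasso.
    n_selected = np.count_nonzero(clf[-1].coef_)
    return scores.mean(), n_selected
```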

    Feature Perturbation Augmentation for Reliable Evaluation of Importance Estimators

    Post-hoc explanation methods attempt to make the inner workings of deep neural networks more interpretable. However, since a ground truth is in general lacking, local post-hoc interpretability methods, which assign importance scores to input features, are challenging to evaluate. One of the most popular evaluation frameworks is to perturb features deemed important by an interpretability method and to measure the change in prediction accuracy. Intuitively, a large decrease in prediction accuracy would indicate that the explanation has correctly quantified the importance of features with respect to the prediction outcome (e.g., logits). However, the change in the prediction outcome may stem from perturbation artifacts, since perturbed samples in the test dataset are out of distribution (OOD) compared to the training dataset and can therefore potentially disturb the model in an unexpected manner. To overcome this challenge, we propose feature perturbation augmentation (FPA), which creates and adds perturbed images during model training. Through extensive computational experiments, we demonstrate that FPA makes deep neural networks (DNNs) more robust against perturbations. Furthermore, training DNNs with FPA demonstrates that the sign of importance scores may explain the model more meaningfully than has previously been assumed. Overall, FPA is an intuitive data augmentation technique that improves the evaluation of post-hoc interpretability methods.
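
    A minimal sketch of what a feature perturbation augmentation step could look like is shown below, assuming the perturbation can be approximated by masking a random fraction of pixels with a baseline value; the fraction, the baseline, and the tensor layout are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def perturb_random_pixels(images, fraction=0.1, baseline=0.0):
    """Mask a random fraction of pixels in each image with a baseline value,
    so that perturbed samples are seen during training (FPA-style augmentation)."""
    b, c, h, w = images.shape
    mask = torch.rand(b, 1, h, w, device=images.device) < fraction   # per-pixel mask
    return torch.where(mask, torch.full_like(images, baseline), images)

# During training, mix perturbed copies into each batch so the model no longer
# treats perturbed inputs as out of distribution:
# augmented = torch.cat([batch, perturb_random_pixels(batch)], dim=0)
# labels    = torch.cat([targets, targets], dim=0)
```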

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Deep Learning Mental Health Dialogue System

    Mental health counseling remains a major challenge in modern society due to cost, stigma, fear, and unavailability. We posit that generative artificial intelligence (AI) models designed for mental health counseling could help improve outcomes by lowering barriers to access. To this end, we have developed a deep learning (DL) dialogue system called Serena. The system consists of a core generative model and post-processing algorithms. The core generative model is a 2.7-billion-parameter Seq2Seq Transformer fine-tuned on thousands of transcripts of person-centered-therapy (PCT) sessions. A series of post-processing algorithms detects contradictions, improves coherence, and removes repetitive answers. Serena is implemented and deployed at https://serena.chat, which currently offers limited free services. While the dialogue system is capable of responding in a qualitatively empathetic and engaging manner, it occasionally displays hallucination and long-term incoherence. Overall, we demonstrate that a deep learning mental health dialogue system has the potential to provide a low-cost and effective complement to traditional human counselors, with fewer barriers to access.
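
    As one concrete example of the kind of post-processing mentioned above, the sketch below flags a candidate response that largely repeats the system's recent turns using n-gram overlap; the n-gram size, threshold, and history window are illustrative assumptions, not Serena's actual algorithms.

```python
def ngrams(text, n=3):
    """Set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_repetitive(candidate, history, n=3, threshold=0.5):
    """Flag a candidate reply whose n-grams overlap too much with recent replies."""
    cand = ngrams(candidate, n)
    if not cand:
        return False
    for previous in history[-5:]:                 # look at the last few system turns
        overlap = len(cand & ngrams(previous, n)) / len(cand)
        if overlap > threshold:
            return True
    return False

# Generated candidates that trip this check would be discarded and regenerated.
```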

    Evaluation of importance estimators in deep learning classifiers for Computed Tomography

    Deep learning has shown superb performance in detecting objects and classifying images, showing great promise for analyzing medical imaging. Translating the success of deep learning to medical imaging, in which doctors need to understand the underlying process, requires the capability to interpret and explain the predictions of neural networks. Interpretability of deep neural networks often relies on estimating the importance of input features (e.g., pixels) with respect to the outcome (e.g., class probability). However, a number of importance estimators (also known as saliency maps) have been developed, and it is unclear which ones are more relevant for medical imaging applications. In the present work, we investigated the performance of several importance estimators in explaining the classification of computed tomography (CT) images by a deep convolutional network, using three distinct evaluation metrics. First, the model-centric fidelity measures a decrease in the model accuracy when certain inputs are perturbed. Second, concordance between importance scores and expert-defined segmentation masks is measured at the pixel level by receiver operating characteristic (ROC) curves. Third, we measure the region-wise overlap between an XRAI-based map and the segmentation mask by the Dice similarity coefficient (DSC). Overall, two versions of SmoothGrad topped the fidelity and ROC rankings, whereas both Integrated Gradients and SmoothGrad excelled in the DSC evaluation. Interestingly, there was a critical discrepancy between the model-centric (fidelity) and human-centric (ROC and DSC) evaluations. The expert expectation and intuition embedded in segmentation maps do not necessarily align with how the model arrived at its prediction. Understanding this difference in interpretability would help harness the power of deep learning in medicine.
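
    For concreteness, the snippet below shows how the two human-centric metrics described above could be computed for a single image, assuming a saliency map and a binary expert segmentation mask of the same spatial size: pixel-level ROC AUC via scikit-learn, and a Dice similarity coefficient after binarizing the saliency map (here by thresholding at a top quantile, an assumption rather than the paper's exact XRAI-based region selection).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_roc_auc(saliency, mask):
    """Pixel-level ROC AUC: how well saliency scores rank mask pixels above background."""
    return roc_auc_score(mask.ravel().astype(int), saliency.ravel())

def dice_coefficient(saliency, mask, quantile=0.9):
    """Dice similarity between the expert mask and the top-saliency region,
    obtained here by thresholding the saliency map at the given quantile."""
    region = saliency >= np.quantile(saliency, quantile)
    mask = mask.astype(bool)
    intersection = np.logical_and(region, mask).sum()
    return 2.0 * intersection / (region.sum() + mask.sum())
```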

    Foundations for Open Scholarship Strategy Development, Version 2.1 [Pre-print]

    This document aims to agree on a broad, international strategy for the implementation of open scholarship that meets the needs of different national and regional communities but works globally. Scholarly research can be idealised as an inspirational process for advancing our collective knowledge to the benefit of all humankind. However, current research practices often struggle with a range of tensions, in part due to the fact that this collective (or “commons”) ideal conflicts with the competitive system in which most scholars work, and in part because much of the infrastructure of the scholarly world is becoming largely digital. What is broadly termed Open Scholarship is an attempt to realign modern research practices with this ideal. We do not propose a definition of Open Scholarship, but recognise that it is a holistic term that encompasses many disciplines, practices, and principles, sometimes also referred to as Open Science or Open Research. We choose the term Open Scholarship to be more inclusive of these other terms. When we refer to science in this document, we do so historically and use it as shorthand for more general scholarship. The purpose of this document is to provide a concise analysis of where the global Open Scholarship movement currently stands: what the common threads and strengths are, where the greatest opportunities and challenges lie, and how we can more effectively work together as a global community to recognise and address the top strategic priorities. This document was inspired by the Foundations for OER Strategy Development and work in the FORCE11 Scholarly Commons Working Group, and developed by an open contribution working group. Our hope is that this document will serve as a foundational resource for continuing discussions and initiatives about implementing effective strategies to help streamline the integration of Open Scholarship practices into a modern, digital research culture. Through this, we hope to extend the reach and impact of Open Scholarship into a global context, making sure that it is truly open for all. We also hope that this document will evolve as the conversations around Open Scholarship progress, and help to provide useful insight for both global co-ordination and local action. We believe this is a step forward in making Open Scholarship the norm. Ultimately, we expect the impact of widespread adoption of Open Scholarship to be diverse. We expect novel research practices to accelerate the pace of innovation, and therefore stimulate critical industries around the world. We could also expect to see an increase in public trust of science and scholarship, as transparency becomes more normative. As such, we expect interest in Open Scholarship to increase at multiple levels, due to its inherent influence on society and global economics.
