11 research outputs found
Machine Learning for Uncovering Biological Insights in Spatial Transcriptomics Data
Development and homeostasis in multicellular systems both require exquisite
control over spatial molecular pattern formation and maintenance. Advances in
spatially-resolved and high-throughput molecular imaging methods such as
multiplexed immunofluorescence and spatial transcriptomics (ST) provide
exciting new opportunities to augment our fundamental understanding of these
processes in health and disease. The large and complex datasets resulting from
these techniques, particularly ST, have led to rapid development of innovative
machine learning (ML) tools primarily based on deep learning techniques. These
ML tools are now increasingly featured in integrated experimental and
computational workflows to disentangle signals from noise in complex biological
systems. However, it can be difficult to understand and balance the different
implicit assumptions and methodologies of a rapidly expanding toolbox of
analytical tools in ST. To address this, we summarize major ST analysis goals
that ML can help address and current analysis trends. We also describe four
major data science concepts and related heuristics that can help guide
practitioners in their choices of the right tools for the right biological
questions.
Numerical Characterization of Support Recovery in Sparse Regression with Correlated Design
Sparse regression is frequently employed in diverse scientific settings as a
feature selection method. A pervasive aspect of scientific data that hampers
both feature selection and estimation is the presence of strong correlations
between predictive features. These fundamental issues are often not appreciated
by practitioners, and jeopardize conclusions drawn from estimated models. On
the other hand, theoretical results on sparsity-inducing regularized regression
such as the Lasso have largely addressed conditions for selection consistency
via asymptotics, and disregard the problem of model selection, whereby
regularization parameters are chosen. In this numerical study, we address these
issues through exhaustive characterization of the performance of several
regression estimators, coupled with a range of model selection strategies.
These estimators and selection criteria were examined across correlated
regression problems with varying degrees of signal to noise, distribution of
the non-zero model coefficients, and model sparsity. Our results reveal a
fundamental tradeoff between false positive and false negative control in all
regression estimators and model selection criteria examined. Additionally, we
are able to numerically explore a transition point modulated by the
signal-to-noise ratio and spectral properties of the design covariance matrix
at which the selection accuracy of all considered algorithms degrades. Overall,
we find that SCAD coupled with BIC or empirical Bayes model selection performs
best for feature selection across the regression problems considered.
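The selection pipeline characterized above can be caricatured in a few lines of NumPy: a coordinate-descent Lasso (standing in for the penalized estimators studied; SCAD itself requires a nonconvex solver) is fit over a penalty grid on an equicorrelated design, and BIC picks the penalty. The dimensions, correlation level, and penalty grid below are illustrative assumptions, not the study's actual design.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso; soft-thresholding update per coordinate."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norm2[j]
    return beta

def bic_select(X, y, lams):
    """Fit the Lasso path and keep the penalty minimizing BIC."""
    n = len(y)
    best = None
    for lam in lams:
        beta = lasso_cd(X, y, lam)
        rss = ((y - X @ beta) ** 2).sum()
        k = int((beta != 0).sum())
        bic = n * np.log(rss / n + 1e-12) + k * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]

rng = np.random.default_rng(0)
n, p, corr = 200, 20, 0.7
# Equicorrelated design: every pair of predictors has correlation 0.7
cov = corr * np.ones((p, p)) + (1 - corr) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                        # sparse ground truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam_bic, beta_hat = bic_select(X, y, lams=np.geomspace(0.1, 50.0, 20))
support = np.flatnonzero(beta_hat)
```

In this toy setting, pushing `corr` toward 1 lets one watch the false positive/false negative tradeoff degrade, the phenomenon the study maps systematically.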
Algorithmic advances in learning from large dimensional matrices and scientific data
University of Minnesota Ph.D. dissertation. May 2018. Major: Computer Science. Advisor: Yousef Saad. 1 computer file (PDF); xi, 196 pages.

This thesis is devoted to answering a range of questions in machine learning and data analysis related to large dimensional matrices and scientific data. Two key research objectives connect the different parts of the thesis: (a) development of fast, efficient, and scalable algorithms for machine learning which handle large matrices and high dimensional data; and (b) design of learning algorithms for scientific data applications. The work combines ideas from multiple, often non-traditional, fields, leading to new algorithms, new theory, and new insights in different applications.

The first of the three parts of this thesis explores numerical linear algebra tools to develop efficient algorithms for machine learning with reduced computation cost and improved scalability. Here, we first develop inexpensive algorithms, combining various ideas from linear algebra and approximation theory, for problems related to the matrix spectrum such as numerical rank estimation and matrix function trace estimation, including log-determinants, Schatten norms, and other spectral sums. We also propose a new method which simultaneously estimates the dimension of the dominant subspace of a covariance matrix and obtains an approximation to that subspace. Next, we consider matrix approximation problems such as low rank approximation, column subset selection, and graph sparsification. We present a new approach based on multilevel coarsening to compute these approximations for large sparse matrices and graphs. Lastly, on the linear algebra front, we devise a novel algorithm based on rank shrinkage for the dictionary learning problem: learning a small set of dictionary columns which best represent the given data.
The second part of this thesis focuses on exploring novel non-traditional applications of information theory and codes, particularly in solving problems related to machine learning and high dimensional data analysis. Here, we first propose new matrix sketching methods using codes for obtaining low rank approximations of matrices and solving least squares regression problems. Next, we demonstrate that codewords from certain coding schemes perform exceptionally well on the group testing problem. Lastly, we present a novel machine learning application of coding theory: solving large scale multilabel classification problems. We propose a new algorithm for multilabel classification based on group testing and codes. The algorithm has a simple, inexpensive prediction method, and the error correction capabilities of codes are exploited for the first time to correct prediction errors.

The third part of the thesis focuses on devising robust and stable learning algorithms which yield results that are interpretable from the viewpoint of a specific scientific application. We present Union of Intersections (UoI), a flexible, modular, and scalable framework for statistical machine learning problems. We then adapt this framework to develop new algorithms for matrix decomposition problems such as nonnegative matrix factorization (NMF) and CUR decomposition. We apply these new methods to data from neuroscience applications in order to obtain insights into the functionality of the brain. Finally, we consider materials informatics, learning from materials data. Here, we deploy regression techniques on materials data to predict physical properties of materials.
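As a flavor of the sketching ideas discussed above, here is a minimal randomized low-rank approximation: project the matrix onto a random sketch, orthonormalize, and take the SVD of the small projected matrix. The Gaussian test matrix is purely illustrative; the thesis proposes structured, code-based sketches instead.

```python
import numpy as np

def sketch_low_rank(A, k, oversample=5, seed=None):
    """Rank-k approximation via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))    # random sketching matrix
    Q, _ = np.linalg.qr(A @ Omega)                  # orthonormal basis for the range
    B = Q.T @ A                                     # small (k + oversample) x n matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 200))   # exact rank 6
U, s, Vt = sketch_low_rank(A, k=6, seed=2)
rel_err = np.linalg.norm(A - U @ (s[:, None] * Vt)) / np.linalg.norm(A)
```

Because `A` has exact rank 6 here, the sketch recovers it to numerical precision; for matrices with slowly decaying spectra, the oversampling parameter controls the approximation error.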
NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature
Objective: To determine the completeness of argumentative steps necessary to
conclude effectiveness of an algorithm in a sample of current ML/AI supervised
learning literature.
Data Sources: Papers published in the Neural Information Processing Systems
(NeurIPS, née NIPS) journal where the official record showed a 2017 year of
publication.
Eligibility Criteria: Studies reporting a (semi-)supervised model, or
pre-processing fused with (semi-)supervised models for tabular data.
Study Appraisal: Three reviewers applied the assessment criteria to determine
argumentative completeness. The criteria were split into three groups,
including: experiments (e.g. real and/or synthetic data), baselines (e.g.
uninformed and/or state-of-the-art) and quantitative comparison (e.g. performance
quantifiers with confidence intervals and formal comparison of the algorithm
against baselines).
Results: Of the 121 eligible manuscripts (from the sample of 679 abstracts),
99% used real-world data and 29% used synthetic data. 91% of manuscripts did
not report an uninformed baseline and 55% reported a state-of-the-art baseline.
32% reported confidence intervals for performance but none provided references
or exposition for how these were calculated. 3% reported formal comparisons.
Limitations: The use of one journal as the primary information source may not
be representative of all ML/AI literature. However, the NeurIPS conference is
recognised to be amongst the top tier concerning ML/AI studies, so it is
reasonable to consider its corpus to be representative of high-quality
research.
Conclusion: Using the 2017 sample of the NeurIPS supervised learning corpus
as an indicator for the quality and trustworthiness of current ML/AI research,
it appears that complete argumentative chains in demonstrations of algorithmic
effectiveness are rare.
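For the confidence intervals whose construction the review found undocumented, one standard recipe is the percentile bootstrap over test examples. The data below are synthetic and purely illustrative:

```python
import numpy as np

def bootstrap_ci(metric, y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a test-set metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample test cases with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), (lo, hi)

accuracy = lambda yt, yp: float(np.mean(yt == yp))
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
# Hypothetical classifier that is correct about 85% of the time
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)
acc, (lo, hi) = bootstrap_ci(accuracy, y_true, y_pred)
```

Reporting `acc` together with `(lo, hi)` and the resampling scheme used would be one way to satisfy the "references or exposition" criterion applied in the review.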
Estimation and Sparse Selection of Conditional Probability Models for Vector Time Series
Diverse scientific fields collect multiple time series data to investigate the dynamical behavior of complex systems: atmospheric and climate science, geophysics, neuroscience, epidemiology, ecology, and environmental science. Identifying patterns of mutual dependence among such data generates valuable knowledge that can be applied either for inferential or forecasting purposes. Vector autoregressive (VAR) processes provide a flexible class of statistical models for multiple time series that are easy to estimate using regression techniques. However, scaling to large data sets and extension to more general processes stretch the framework's capacity: due to the dense parametrization of VAR models, which have one parameter for every possible pairwise relationship between components (i.e., between each univariate time series in a collection), high-dimensional data generate difficulties associated with model selection and parametric regularization; and modeling more general processes requires data transformations that complicate inference, forecasting, and model interpretation. Fields such as epidemiology, neuroscience, and ecology generate high-dimensional time series of count vectors, which incur both sets of challenges at once. Autoregressive conditional probability models --- models in which the conditional means of a time series follow an autoregressive structure in the process history --- are natural generalizations that preserve ease of estimation and, in conjunction with selection methods in regression, promise to address challenges associated with modeling large multiple time series of count (and other discrete) data. This thesis focuses on developing empirical methodology for sparse selection of nonlinear VAR-type conditional Poisson models. Chapter 1 provides an overview of related existing work. Chapter 2 develops an empirical method for sparse selection in VAR models based on resampling methods. 
Chapter 3 presents a conditional probability generalization of the VAR model and analyzes the stability properties of Poisson generalized vector autoregressive (GVAR) processes. Chapter 4 combines the work of the preceding two chapters and develops a resampling-based method for sparse estimation of Poisson GVAR models. Finally, Chapter 5 summarizes key findings, challenges, and future work.
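The processes studied in Chapters 3 and 4 can be made concrete with a minimal simulator for a Poisson GVAR(1) with identity link, where each conditional mean is affine in the previous count vector. The intercepts and sparse transition matrix are illustrative assumptions; stability in this linear-link case corresponds to the transition matrix having spectral radius below one.

```python
import numpy as np

def simulate_poisson_gvar(A, nu, T, seed=None):
    """Poisson GVAR(1), identity link: y_t ~ Poisson(nu + A @ y_{t-1})."""
    rng = np.random.default_rng(seed)
    p = len(nu)
    y = np.zeros((T, p), dtype=np.int64)
    for t in range(1, T):
        lam = nu + A @ y[t - 1]                  # conditional mean, elementwise >= nu
        y[t] = rng.poisson(lam)
    return y

p = 3
A = 0.3 * np.eye(p)
A[0, 1] = 0.2                                    # sparse cross-series dependence
nu = np.ones(p)
assert np.max(np.abs(np.linalg.eigvals(A))) < 1  # stability condition

y = simulate_poisson_gvar(A, nu, T=5000, seed=7)
means = y[1000:].mean(axis=0)                    # discard burn-in
```

For this stable configuration, the empirical means settle near the stationary solution of m = nu + A m, i.e. (I - A)^{-1} nu.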
LIPIcs, Volume 277, GIScience 2023, Complete Volume
Union of Intersections (UoI) for interpretable data driven discovery and prediction
The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Methods based on UoI perform model selection and model estimation through intersection and union operations, respectively. We show that UoI-based methods achieve low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm (UoI-Lasso) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the UoI-L1Logistic and UoI-CUR variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on the UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields.
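A schematic of the intersection/union recipe, not the authors' implementation: bootstrap Lasso supports are intersected at each penalty (model selection), then OLS estimates on each surviving support are bagged and the best-fitting model kept, a simplified stand-in for the union/estimation step (UoI-Lasso proper scores candidates by held-out prediction). The problem sizes and penalty grid are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=60):
    """Small coordinate-descent Lasso used as the selection engine."""
    n, p = X.shape
    b = np.zeros(p)
    nrm = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / nrm[j]
    return b

def uoi_lasso_sketch(X, y, lams, n_boot=8, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Selection: a feature survives a penalty only if every bootstrap keeps it
    supports = []
    for lam in lams:
        keep = np.ones(p, dtype=bool)
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            keep &= lasso_cd(X[idx], y[idx], lam) != 0
        supports.append(keep)
    # Estimation: bagged (unpenalized, hence debiased) OLS on each support
    best = None
    for keep in supports:
        if not keep.any():
            continue
        betas = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            betas.append(np.linalg.lstsq(X[idx][:, keep], y[idx], rcond=None)[0])
        b = np.zeros(p)
        b[keep] = np.mean(betas, axis=0)
        mse = ((y - X @ b) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, b)
    return best[1]

rng = np.random.default_rng(3)
n, p = 150, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 3.0, -2.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)
b_hat = uoi_lasso_sketch(X, y, lams=[5.0, 15.0, 40.0])
```

Bagging OLS on an intersected support is what yields the low-variance, nearly unbiased estimates the abstract describes: intersection tightens selection, while the union/averaging step restores predictive power.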
12th International Conference on Geographic Information Science: GIScience 2023, September 12–15, 2023, Leeds, UK
No abstract available
Semantic discovery and reuse of business process patterns
Patterns currently play an important role in modern information systems (IS) development, and their use has mainly been restricted to the design and implementation phases of the development lifecycle. Given the increasing significance of business modelling in IS development, patterns have the potential to provide a viable solution for promoting reusability of recurrent generalized models in the very early stages of development. As a statement of research in progress, this paper focuses on business process patterns and proposes an initial methodological framework for the discovery and reuse of business process patterns within the IS development lifecycle. The framework borrows ideas from the domain engineering literature and proposes the use of semantics to drive both the discovery of patterns and their reuse.