11 research outputs found
Machine Learning for Uncovering Biological Insights in Spatial Transcriptomics Data
Development and homeostasis in multicellular systems both require exquisite
control over spatial molecular pattern formation and maintenance. Advances in
spatially-resolved and high-throughput molecular imaging methods such as
multiplexed immunofluorescence and spatial transcriptomics (ST) provide
exciting new opportunities to augment our fundamental understanding of these
processes in health and disease. The large and complex datasets resulting from
these techniques, particularly ST, have led to rapid development of innovative
machine learning (ML) tools primarily based on deep learning techniques. These
ML tools are now increasingly featured in integrated experimental and
computational workflows to disentangle signals from noise in complex biological
systems. However, it can be difficult to understand and balance the different
implicit assumptions and methodologies of a rapidly expanding toolbox of
analytical tools in ST. To address this, we summarize major ST analysis goals
that ML can help address and current analysis trends. We also describe four
major data science concepts and related heuristics that can help guide
practitioners in their choices of the right tools for the right biological
questions.
Numerical Characterization of Support Recovery in Sparse Regression with Correlated Design
Sparse regression is frequently employed in diverse scientific settings as a
feature selection method. A pervasive aspect of scientific data that hampers
both feature selection and estimation is the presence of strong correlations
between predictive features. These fundamental issues are often not appreciated
by practitioners, and jeopardize conclusions drawn from estimated models. On
the other hand, theoretical results on sparsity-inducing regularized regression
such as the Lasso have largely addressed conditions for selection consistency
via asymptotics, and disregard the problem of model selection, whereby
regularization parameters are chosen. In this numerical study, we address these
issues through exhaustive characterization of the performance of several
regression estimators, coupled with a range of model selection strategies.
These estimators and selection criteria were examined across correlated
regression problems with varying degrees of signal to noise, distribution of
the non-zero model coefficients, and model sparsity. Our results reveal a
fundamental tradeoff between false positive and false negative control in all
regression estimators and model selection criteria examined. Additionally, we
are able to numerically explore a transition point modulated by the
signal-to-noise ratio and spectral properties of the design covariance matrix
at which the selection accuracy of all considered algorithms degrades. Overall,
we find that SCAD coupled with BIC or empirical Bayes model selection performs
best for feature selection across the regression problems considered.
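The selection pipeline characterized above can be caricatured in a few lines of NumPy: a coordinate-descent Lasso (standing in for the penalized estimators studied; SCAD itself requires a nonconvex solver) is fit over a penalty grid on an equicorrelated design, and BIC picks the penalty. The dimensions, correlation level, and penalty grid below are illustrative assumptions, not the study's actual design.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso; soft-thresholding update per coordinate."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norm2[j]
    return beta

def bic_select(X, y, lams):
    """Fit the Lasso path and keep the penalty minimizing BIC."""
    n = len(y)
    best = None
    for lam in lams:
        beta = lasso_cd(X, y, lam)
        rss = ((y - X @ beta) ** 2).sum()
        k = int((beta != 0).sum())
        bic = n * np.log(rss / n + 1e-12) + k * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]

rng = np.random.default_rng(0)
n, p, corr = 200, 20, 0.7
# Equicorrelated design: every pair of predictors has correlation 0.7
cov = corr * np.ones((p, p)) + (1 - corr) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                        # sparse ground truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam_bic, beta_hat = bic_select(X, y, lams=np.geomspace(0.1, 50.0, 20))
support = np.flatnonzero(beta_hat)
```

In this toy setting, pushing `corr` toward 1 lets one watch the false positive/false negative tradeoff degrade, the phenomenon the study maps systematically.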
Algorithmic advances in learning from large dimensional matrices and scientific data
University of Minnesota Ph.D. dissertation. May 2018. Major: Computer Science. Advisor: Yousef Saad. 1 computer file (PDF); xi, 196 pages.

This thesis is devoted to answering a range of questions in machine learning and data analysis related to large dimensional matrices and scientific data. Two key research objectives connect the different parts of the thesis: (a) development of fast, efficient, and scalable algorithms for machine learning which handle large matrices and high dimensional data; and (b) design of learning algorithms for scientific data applications. The work combines ideas from multiple, often non-traditional, fields, leading to new algorithms, new theory, and new insights in different applications.

The first of the three parts of this thesis explores numerical linear algebra tools to develop efficient algorithms for machine learning with reduced computation cost and improved scalability. Here, we first develop inexpensive algorithms, combining various ideas from linear algebra and approximation theory, for problems related to the matrix spectrum such as numerical rank estimation and matrix function trace estimation, including log-determinants, Schatten norms, and other spectral sums. We also propose a new method which simultaneously estimates the dimension of the dominant subspace of a covariance matrix and obtains an approximation to that subspace. Next, we consider matrix approximation problems such as low rank approximation, column subset selection, and graph sparsification. We present a new approach based on multilevel coarsening to compute these approximations for large sparse matrices and graphs. Lastly, on the linear algebra front, we devise a novel algorithm based on rank shrinkage for the dictionary learning problem: learning a small set of dictionary columns which best represent the given data.
The second part of this thesis focuses on exploring novel non-traditional applications of information theory and codes, particularly in solving problems related to machine learning and high dimensional data analysis. Here, we first propose new matrix sketching methods using codes for obtaining low rank approximations of matrices and solving least squares regression problems. Next, we demonstrate that codewords from certain coding schemes perform exceptionally well on the group testing problem. Lastly, we present a novel machine learning application of coding theory: solving large scale multilabel classification problems. We propose a new algorithm for multilabel classification based on group testing and codes. The algorithm has a simple, inexpensive prediction method, and the error correction capabilities of codes are exploited for the first time to correct prediction errors.

The third part of the thesis focuses on devising robust and stable learning algorithms which yield results that are interpretable from the viewpoint of a specific scientific application. We present Union of Intersections (UoI), a flexible, modular, and scalable framework for statistical machine learning problems. We then adapt this framework to develop new algorithms for matrix decomposition problems such as nonnegative matrix factorization (NMF) and CUR decomposition. We apply these new methods to data from neuroscience applications in order to obtain insights into the functionality of the brain. Finally, we consider materials informatics, learning from materials data. Here, we deploy regression techniques on materials data to predict physical properties of materials.
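As a flavor of the sketching ideas discussed above, here is a minimal randomized low-rank approximation: project the matrix onto a random sketch, orthonormalize, and take the SVD of the small projected matrix. The Gaussian test matrix is purely illustrative; the thesis proposes structured, code-based sketches instead.

```python
import numpy as np

def sketch_low_rank(A, k, oversample=5, seed=None):
    """Rank-k approximation via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))    # random sketching matrix
    Q, _ = np.linalg.qr(A @ Omega)                  # orthonormal basis for the range
    B = Q.T @ A                                     # small (k + oversample) x n matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 200))   # exact rank 6
U, s, Vt = sketch_low_rank(A, k=6, seed=2)
rel_err = np.linalg.norm(A - U @ (s[:, None] * Vt)) / np.linalg.norm(A)
```

Because `A` has exact rank 6 here, the sketch recovers it to numerical precision; for matrices with slowly decaying spectra, the oversampling parameter controls the approximation error.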
NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature
Objective: To determine the completeness of argumentative steps necessary to
conclude effectiveness of an algorithm in a sample of current ML/AI supervised
learning literature.
Data Sources: Papers published in the Neural Information Processing Systems
(NeurIPS, née NIPS) journal where the official record showed a 2017 year of
publication.
Eligibility Criteria: Studies reporting a (semi-)supervised model, or
pre-processing fused with (semi-)supervised models for tabular data.
Study Appraisal: Three reviewers applied the assessment criteria to determine
argumentative completeness. The criteria were split into three groups,
including: experiments (e.g. real and/or synthetic data), baselines (e.g.
uninformed and/or state-of-the-art) and quantitative comparison (e.g. performance
quantifiers with confidence intervals and formal comparison of the algorithm
against baselines).
Results: Of the 121 eligible manuscripts (from the sample of 679 abstracts),
99% used real-world data and 29% used synthetic data. 91% of manuscripts did
not report an uninformed baseline and 55% reported a state-of-the-art baseline.
32% reported confidence intervals for performance but none provided references
or exposition for how these were calculated. 3% reported formal comparisons.
Limitations: The use of one journal as the primary information source may not
be representative of all ML/AI literature. However, the NeurIPS conference is
recognised to be amongst the top tier concerning ML/AI studies, so it is
reasonable to consider its corpus to be representative of high-quality
research.
Conclusion: Using the 2017 sample of the NeurIPS supervised learning corpus
as an indicator for the quality and trustworthiness of current ML/AI research,
it appears that complete argumentative chains in demonstrations of algorithmic
effectiveness are rare.
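For the confidence intervals whose construction the review found undocumented, one standard recipe is the percentile bootstrap over test examples. The data below are synthetic and purely illustrative:

```python
import numpy as np

def bootstrap_ci(metric, y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a test-set metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample test cases with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), (lo, hi)

accuracy = lambda yt, yp: float(np.mean(yt == yp))
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
# Hypothetical classifier that is correct about 85% of the time
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)
acc, (lo, hi) = bootstrap_ci(accuracy, y_true, y_pred)
```

Reporting `acc` together with `(lo, hi)` and the resampling scheme used would be one way to satisfy the "references or exposition" criterion applied in the review.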
Estimation and Sparse Selection of Conditional Probability Models for Vector Time Series
Diverse scientific fields collect multiple time series data to investigate the dynamical behavior of complex systems: atmospheric and climate science, geophysics, neuroscience, epidemiology, ecology, and environmental science. Identifying patterns of mutual dependence among such data generates valuable knowledge that can be applied either for inferential or forecasting purposes. Vector autoregressive (VAR) processes provide a flexible class of statistical models for multiple time series that are easy to estimate using regression techniques. However, scaling to large data sets and extension to more general processes stretch the framework's capacity: due to the dense parametrization of VAR models, which have one parameter for every possible pairwise relationship between components (i.e., between each univariate time series in a collection), high-dimensional data generate difficulties associated with model selection and parametric regularization; and modeling more general processes requires data transformations that complicate inference, forecasting, and model interpretation. Fields such as epidemiology, neuroscience, and ecology generate high-dimensional time series of count vectors, which incur both sets of challenges at once. Autoregressive conditional probability models --- models in which the conditional means of a time series follow an autoregressive structure in the process history --- are natural generalizations that preserve ease of estimation and, in conjunction with selection methods in regression, promise to address challenges associated with modeling large multiple time series of count (and other discrete) data. This thesis focuses on developing empirical methodology for sparse selection of nonlinear VAR-type conditional Poisson models. Chapter 1 provides an overview of related existing work. Chapter 2 develops an empirical method for sparse selection in VAR models based on resampling methods. 
Chapter 3 presents a conditional probability generalization of the VAR model and analyzes the stability properties of Poisson generalized vector autoregressive (GVAR) processes. Chapter 4 combines the work of the preceding two chapters and develops a resampling-based method for sparse estimation of Poisson GVAR models. Finally, Chapter 5 summarizes key findings, challenges, and future work.
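The processes studied in Chapters 3 and 4 can be made concrete with a minimal simulator for a Poisson GVAR(1) with identity link, where each conditional mean is affine in the previous count vector. The intercepts and sparse transition matrix are illustrative assumptions; stability in this linear-link case corresponds to the transition matrix having spectral radius below one.

```python
import numpy as np

def simulate_poisson_gvar(A, nu, T, seed=None):
    """Poisson GVAR(1), identity link: y_t ~ Poisson(nu + A @ y_{t-1})."""
    rng = np.random.default_rng(seed)
    p = len(nu)
    y = np.zeros((T, p), dtype=np.int64)
    for t in range(1, T):
        lam = nu + A @ y[t - 1]                  # conditional mean, elementwise >= nu
        y[t] = rng.poisson(lam)
    return y

p = 3
A = 0.3 * np.eye(p)
A[0, 1] = 0.2                                    # sparse cross-series dependence
nu = np.ones(p)
assert np.max(np.abs(np.linalg.eigvals(A))) < 1  # stability condition

y = simulate_poisson_gvar(A, nu, T=5000, seed=7)
means = y[1000:].mean(axis=0)                    # discard burn-in
```

For this stable configuration, the empirical means settle near the stationary solution of m = nu + A m, i.e. (I - A)^{-1} nu.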
LIPIcs, Volume 277, GIScience 2023, Complete Volume
Union of Intersections (UoI) for interpretable data driven discovery and prediction
The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Methods based on UoI perform model selection and model estimation through intersection and union operations, respectively. We show that UoI-based methods achieve low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm (UoI-Lasso) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the UoI-L1Logistic and UoI-CUR variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on the UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields.
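A schematic of the intersection/union recipe, not the authors' implementation: bootstrap Lasso supports are intersected at each penalty (model selection), then OLS estimates on each surviving support are bagged and the best-fitting model kept, a simplified stand-in for the union/estimation step (UoI-Lasso proper scores candidates by held-out prediction). The problem sizes and penalty grid are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=60):
    """Small coordinate-descent Lasso used as the selection engine."""
    n, p = X.shape
    b = np.zeros(p)
    nrm = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / nrm[j]
    return b

def uoi_lasso_sketch(X, y, lams, n_boot=8, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Selection: a feature survives a penalty only if every bootstrap keeps it
    supports = []
    for lam in lams:
        keep = np.ones(p, dtype=bool)
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            keep &= lasso_cd(X[idx], y[idx], lam) != 0
        supports.append(keep)
    # Estimation: bagged (unpenalized, hence debiased) OLS on each support
    best = None
    for keep in supports:
        if not keep.any():
            continue
        betas = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            betas.append(np.linalg.lstsq(X[idx][:, keep], y[idx], rcond=None)[0])
        b = np.zeros(p)
        b[keep] = np.mean(betas, axis=0)
        mse = ((y - X @ b) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, b)
    return best[1]

rng = np.random.default_rng(3)
n, p = 150, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 3.0, -2.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)
b_hat = uoi_lasso_sketch(X, y, lams=[5.0, 15.0, 40.0])
```

Bagging OLS on an intersected support is what yields the low-variance, nearly unbiased estimates the abstract describes: intersection tightens selection, while the union/averaging step restores predictive power.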
12th International Conference on Geographic Information Science: GIScience 2023, September 12–15, 2023, Leeds, UK
No abstract available
Semantic discovery and reuse of business process patterns
Patterns currently play an important role in modern information systems (IS) development, and their use has mainly been restricted to the design and implementation phases of the development lifecycle. Given the increasing significance of business modelling in IS development, patterns have the potential to provide a viable solution for promoting reusability of recurrent generalized models in the very early stages of development. As a statement of research in progress, this paper focuses on business process patterns and proposes an initial methodological framework for the discovery and reuse of business process patterns within the IS development lifecycle. The framework borrows ideas from the domain engineering literature and proposes the use of semantics to drive both the discovery of patterns and their reuse.