Markov field models of molecular kinetics
Computer simulations such as molecular dynamics (MD) provide a possible means to understand protein dynamics and mechanisms on an atomistic scale. The resulting simulation data can be analyzed with Markov state models (MSMs), yielding a quantitative kinetic model that, e.g., encodes state populations and transition rates. However, the larger an investigated system, the more data is required to estimate a valid kinetic model. In this work, we show that this scaling problem can be escaped by decomposing a system into smaller ones, leveraging weak couplings between local domains. Our approach, termed independent Markov decomposition (IMD), is a first-order approximation that neglects couplings, i.e., it represents a decomposition of the underlying global dynamics into a set of independent local ones. We demonstrate that for truly independent systems, IMD can reduce the required sampling by three orders of magnitude. IMD is applied to two biomolecular systems. First, synaptotagmin-1, a rapid calcium switch from the neurotransmitter release machinery, is analyzed. Within its C2A domain, local conformational switches are identified and modeled with independent MSMs, shedding light on the mechanism of its calcium-mediated activation. Second, the catalytic site of the serine protease TMPRSS2 is analyzed with a local drug-binding model. Equilibrium populations of different drug-binding modes are derived for three inhibitors, mirroring experimentally determined drug efficiencies. IMD is subsequently extended to an end-to-end deep learning framework called iVAMPnets, which learns a domain decomposition from simulation data and simultaneously models the kinetics in the local domains. We finally classify IMD and iVAMPnets as Markov field models (MFMs), which we define as a class of models that describe dynamics by decomposing systems into local domains. Overall, this thesis introduces a local approach to Markov modeling that enables quantitative assessment of the kinetics of large macromolecular complexes, opening up possibilities to tackle current and future questions in computational molecular biology.
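Illustrative sketch (not the thesis's implementation): the first-order IMD approximation amounts to estimating one MSM per local domain from its own discretized trajectory and treating the domains as statistically independent, so that global state probabilities factorize into products of local stationary probabilities. The helper names and the stand-in random trajectories below are assumptions for illustration only.

```python
import numpy as np

def estimate_msm(dtraj, n_states, lag=1):
    """Maximum-likelihood MSM transition matrix from one discrete trajectory."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def stationary_distribution(T):
    """Stationary distribution = leading left eigenvector of T."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

# Independent Markov decomposition (first-order approximation):
# one MSM per local domain, global probabilities as a product of local ones.
rng = np.random.default_rng(0)
dtrajs = {"domain_A": rng.integers(0, 2, 10_000),   # stand-ins for discretized
          "domain_B": rng.integers(0, 3, 10_000)}   # local coordinates
local_pis = {}
for name, dtraj in dtrajs.items():
    T = estimate_msm(dtraj, n_states=dtraj.max() + 1, lag=10)
    local_pis[name] = stationary_distribution(T)

# Under IMD, the global state (a, b) has probability pi_A[a] * pi_B[b].
global_pi = np.outer(local_pis["domain_A"], local_pis["domain_B"])
print(global_pi.shape, global_pi.sum())
```

The factorized global distribution is where the sampling advantage comes from: each local MSM only needs enough data to resolve its own few states rather than the combinatorially larger global state space.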
Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions
Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.
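A minimal sketch of the recipe suggested by the abstract, under the assumption that inner products of learned contrastive features (here replaced by a randomly initialized stand-in encoder) approximate the positive-pair kernel; applying Kernel PCA to that kernel then yields the eigenfunction-aligned representation. Function and variable names are hypothetical, not the paper's code.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA on a precomputed kernel matrix K (n x n):
    center K, then project onto its top eigenvectors."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:n_components]
    # scale eigenvectors by sqrt(eigenvalue) to get the embedding coordinates
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Stand-in "contrastive encoder": in practice this would be a trained network
# whose feature inner products approximate the positive-pair kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                   # raw inputs (stand-in data)
W = rng.normal(size=(32, 16)) / np.sqrt(32)      # stand-in for learned weights
features = np.tanh(X @ W)

K = features @ features.T                        # approximate positive-pair kernel
embedding = kernel_pca(K, n_components=8)        # rows ~ eigenfunction values
print(embedding.shape)                           # (500, 8)
```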
Transcriptional biomarkers of toxicity - powerful tools or random noise? An applied perspective from studies on bivalves
Aquatic organisms are constantly at risk of being exposed to potentially harmful chemical compounds of natural or anthropogenic origin. Biological life can for instance respond to chemical stressors by changes in gene expression, and thus, certain gene transcripts can potentially function as biomarkers, i.e. early warnings, of toxicity and chemical stress. A major challenge for biomarker application is the extrapolation of transcriptional data to potential effects at the organism level or above. Importantly, successful biomarker use also requires a basal understanding of how to distinguish actual responses from background noise. The aim of this thesis is, based on response magnitude and variation, to evaluate the biomarker potential in a set of putative transcriptional biomarkers of general toxicity and chemical stress.
Specifically, I addressed a selection of six transcripts involved in cytoprotection and oxidative stress: catalase (cat), glutathione-S-transferase (gst), heat shock proteins 70 and 90 (hsp70, hsp90), metallothionein (mt) and superoxide dismutase (sod). Moreover, I used metal exposures to serve as a proxy for general chemical stress, and due to their ecological relevance and nature as sedentary filter-feeders, I used bivalves as study organisms.
In a series of experiments, I tested transcriptional responses in the freshwater duck mussel, Anodonta anatina, exposed to copper or an industrial waste-water effluent, to address response robustness and sensitivity, and potential controlled (e.g. exposure concentration) and random (e.g. gravidness) sources of variation. In addition, I performed a systematic review and meta-analysis on transcriptional responses in metal-exposed bivalves to (1) evaluate what responses to expect from arbitrary metal exposures, (2) assess the influence of metal concentration (expressed as toxic unit), exposure time and analyzed tissue, and (3) address potential impacts from publication bias in the scientific literature.
Response magnitudes were generally small in relation to the observed variation, both for A. anatina and bivalves in general. The expected response to an arbitrary metal exposure would generally be close to zero, based both on experimental observations and on the estimated impact from publication bias. Although many of the transcripts demonstrated concentration-response relationships, large background noise might in practice obscure the small responses even at relatively high exposures. As demonstrated in A. anatina under copper exposure, this can be the case already for single species under high-resolution exposures to single pollutants. As demonstrated by the meta-regression, this problem can only be expected to increase further upon extrapolation between different species and exposure scenarios, due to increasing heterogeneity and random variation. Similar patterns can also be expected for time-dependent response variation, although the meta-regression revealed a general trend of slightly increasing response magnitude with increasing exposure times.
In A. anatina, gravidness was identified as a source of random variability that can potentially affect the baseline of most assessed biomarkers, particularly when quantified in gills. Response magnitudes and variability in this species were generally similar for the selected transcripts as for two biochemical biomarkers included for comparison (AChE, GST), suggesting that the transcripts might not capture early warnings more efficiently than other molecular endpoints that are more toxicologically relevant. Overall, high concentrations and long exposure durations presumably increase the likelihood of a detectable transcriptional response, but not to an extent that justifies universal application as biomarkers of general toxicity and chemical stress. Consequently, without a strictly defined and validated application, this approach on its own appears unlikely to be successful for future environmental risk assessment and monitoring. Ultimately, efficient use of transcriptional biomarkers might require additional implementation of complementary approaches offered by current molecular techniques.
Contributions in functional data analysis and functional-analytic statistics
Functional data analysis is the study of statistical algorithms which are applied in the scenario where the observed data is a collection of functions. Since this type of data is becoming cheaper and easier to collect, there is an increased need to develop statistical tools to handle it. The first part of this thesis focuses on deriving distances between distributions over function spaces and applying these to two-sample testing, goodness-of-fit testing and sample quality assessment. This presents a wide range of contributions, since there currently exist very few methods, or none at all, to tackle these problems for functional data. The second part of this thesis adopts the functional-analytic perspective on two statistical algorithms. This is a perspective where functions are viewed as living in specific function spaces and the toolbox of functional analysis is applied to identify and prove properties of the algorithms. The two algorithms are variational Gaussian processes, used widely throughout machine learning for function modelling with large observation data sets, and functional statistical depth, used widely as a means to evaluate outliers and perform testing for functional data sets. The results presented contribute a taxonomy of the variational Gaussian process methodology and multiple new results in the theory of functional depth, including the open problem of providing a depth which characterises distributions on function spaces.
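As a concrete stand-in for a distance between distributions over function spaces (not necessarily one of the distances derived in the thesis), the sketch below computes a kernel two-sample statistic (MMD) between two samples of curves observed on a common grid; a permutation test on this statistic would give a functional two-sample test.

```python
import numpy as np

def sqexp_kernel(F, G, lengthscale=1.0):
    """Squared-exponential kernel between functions observed on a common grid,
    using the (approximate, grid-averaged) L2 distance between curves."""
    # F: (n, T), G: (m, T) arrays of function evaluations
    d2 = ((F[:, None, :] - G[None, :, :]) ** 2).mean(axis=2)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def mmd2(F, G, lengthscale=1.0):
    """Biased estimate of squared MMD between two samples of functions."""
    Kff = sqexp_kernel(F, F, lengthscale)
    Kgg = sqexp_kernel(G, G, lengthscale)
    Kfg = sqexp_kernel(F, G, lengthscale)
    return Kff.mean() + Kgg.mean() - 2 * Kfg.mean()

# Two samples of curves on a shared grid: same shape, slightly shifted mean.
t = np.linspace(0, 1, 100)
rng = np.random.default_rng(0)
F = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=(50, t.size))
G = np.sin(2 * np.pi * t) + 0.2 + 0.1 * rng.normal(size=(50, t.size))

# A permutation test on mmd2(F, G) would yield the two-sample p-value.
print(mmd2(F, G))
```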
Spherical and Hyperbolic Toric Topology-Based Codes On Graph Embedding for Ising MRF Models: Classical and Quantum Topology Machine Learning
The paper introduces the application of information geometry to describe the ground states of Ising models by utilizing parity-check matrices of cyclic and quasi-cyclic codes on toric and spherical topologies. This approach establishes a connection between machine learning and error-correcting coding, and has implications for the development of new embedding methods based on trapping sets. Statistical physics and number geometry are applied to optimize error-correcting codes, leading to these embedding and sparse factorization methods. The paper establishes a direct connection between DNN architecture and error-correcting coding by demonstrating how state-of-the-art architectures (ChordMixer, Mega, Mega-chunk, CDIL, ...) from the long-range arena can be equivalent to block and convolutional LDPC codes (Cage-graph, Repeat Accumulate). QC codes correspond to certain types of chemical elements, with the carbon element being represented by the mixed automorphism Shu-Lin-Fossorier QC-LDPC code. The connections between Belief Propagation and the Permanent, Bethe-Permanent, Nishimori Temperature, and Bethe-Hessian Matrix are elaborated upon in detail. The Quantum Approximate Optimization Algorithm (QAOA) used in the Sherrington-Kirkpatrick Ising model can be seen as analogous to the back-propagation loss function landscape in training DNNs. This similarity creates a comparable problem with TS pseudo-codewords, resembling the belief propagation method. Additionally, the layer depth in QAOA correlates with the number of decoding belief propagation iterations in the Wiberg decoding tree. Overall, this work has the potential to advance multiple fields, from information theory, DNN architecture design (sparse and structured prior graph topology), and efficient hardware design for quantum and classical DPU/TPU (graph, quantize, and shift-register architectures) to materials science and beyond.
Histograms: An educational eye
Many high-school students are not able to draw justified conclusions from statistical data in histograms. A literature review showed that most misinterpretations of histograms are related to difficulties with two statistical key concepts: data and distribution. The review also pointed to a lack of knowledge about students' strategies when solving histogram tasks. As the literature provided little guidance for the design of lesson materials, several studies were conducted in preparation. In a first study, five solution strategies were found through qualitative analysis of students' gazes when solving histogram and case-value plot tasks. Quantitative analysis of several histogram tasks through a mathematical model and a machine learning algorithm confirmed these results, which implied that these strategies could be identified reliably and automatically. The literature also suggested that dotplot tasks can support students' learning to interpret histograms. Therefore, gazes on histogram tasks were compared before and after students solved dotplot tasks. The "after" tasks contained more gazes associated with correct strategies and fewer gazes associated with incorrect strategies. Although answers did not improve significantly, students' verbal descriptions suggest that some students changed to a correct strategy. Newly designed materials thus started with dotplot tasks. From the previous studies, we conjectured that students lacked embodied experiences with actions related to histograms. Designed from an embodied instrumentation perspective, the tested materials provide starting points for scaling up. Together, the studies address the knowledge gaps identified in the literature. The studies contribute to knowledge about learning histograms, as well as to the use of eye-tracking research, interpretable models and machine learning algorithms, and embodied instrumentation design in statistics education.
Going Deeper with Spectral Embeddings
To make sense of millions of raw data points and represent them efficiently, practitioners rely on representation learning. Recently, deep connections have been shown between these approaches and the spectral decompositions of some underlying operators. Historically, explicit spectral embeddings were built from graphs constructed on top of the data. In contrast, we propose two new methods to build spectral embeddings: one based on functional analysis principles and kernel methods, which leads to algorithms with theoretical guarantees, and the other based on deep networks trained to optimize principled variational losses, which yields practically efficient algorithms. Furthermore, we provide a new sampling algorithm that leverages learned representations to generate new samples in a single step.
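For contrast with the paper's two proposed methods, the sketch below shows the historical graph-based route mentioned in the abstract: build a k-nearest-neighbour graph on the data and take the bottom non-trivial eigenvectors of its normalized Laplacian as the spectral embedding. This is a generic Laplacian eigenmap, not the paper's algorithms.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, n_components=2):
    """Classical spectral embedding: k-NN graph -> normalized Laplacian
    -> bottom non-trivial eigenvectors."""
    n = X.shape[0]
    D = cdist(X, X)
    W = np.zeros((n, n))
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # skip self-distance
    for i in range(n):
        W[i, idx[i]] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize the graph
    d = W.sum(axis=1)
    L = np.eye(n) - W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    evals, evecs = eigh(L)                               # ascending eigenvalues
    return evecs[:, 1:n_components + 1]                  # drop the constant mode

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))          # stand-in data
emb = laplacian_eigenmap(X)
print(emb.shape)                        # (300, 2)
```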
Fast and Memory Efficient Algorithms for Structured Matrix Spectrum Approximation
Approximating the singular values or eigenvalues of a matrix, i.e. spectrum approximation, is a fundamental task in data science and machine learning applications. While approximation of the top singular values has received considerable attention in numerical linear algebra, provably efficient algorithms for other spectrum approximation tasks, such as spectral-sum estimation and spectrum density estimation, have started to emerge only recently. Two crucial components that have enabled efficient algorithms for spectrum approximation are access to randomness and structure in the underlying matrix. In this thesis, we study how randomization and the underlying structure of the matrix can be exploited to design fast and memory efficient algorithms for spectral-sum estimation and spectrum density estimation. In particular, we look at two classes of structure: sparsity and graph structure.
In the first part of this thesis, we show that sparsity can be exploited to give low-memory algorithms for spectral summarization tasks such as approximating some Schatten norms, the Estrada index and the logarithm of the determinant (log-det) of a sparse matrix. Surprisingly, we show that the space complexity of our algorithms is independent of the underlying dimension of the matrix. Complementing our result for sparse matrices, we show that matrices that satisfy a certain smooth definition of sparsity, but are potentially dense in the conventional sense, can be approximated to spectral-norm error by a truly sparse matrix. Our method is based on a simple sampling scheme that can be implemented in linear time. In the second part, we give the first truly sublinear time algorithm to approximate the spectral density of the (normalized) adjacency matrix of an undirected, unweighted graph in earth-mover distance. In addition to our sublinear time result, we give theoretical guarantees for a variant of the widely used Kernel Polynomial Method and propose a new moment-matching based method for spectrum density estimation of Hermitian matrices.
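A minimal sketch of the randomness-plus-structure idea behind spectral-sum estimation (not the thesis's streaming algorithms): a Hutchinson estimator of tr(A^p) for a sparse symmetric matrix, which for even p equals the p-th power of the Schatten-p norm and needs only matrix-vector products and a few vectors of memory. The function name and test matrix are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def hutchinson_schatten_p(A, p=4, n_probes=200, seed=0):
    """Matrix-free Hutchinson estimate of tr(A^p) for symmetric A and even p,
    i.e. the p-th power of the Schatten-p norm, using only matvecs."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        v = z
        for _ in range(p):                    # v = A^p z via p sparse matvecs
            v = A @ v
        total += z @ v                        # accumulate z^T A^p z
    return total / n_probes

# Sparse symmetric test matrix (e.g., a weighted graph adjacency matrix).
n = 2000
A = sp.random(n, n, density=1e-3, random_state=0, format="csr")
A = A + A.T                                   # symmetrize

print(hutchinson_schatten_p(A, p=4))
```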
CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning. Code available at https://github.com/wilson-labs/col
Robotics Approach in Mobile Laser Scanning Generation of Georeferenced Point-based Forest Models
A mobile laser scanning (MLS) system equipped with a lidar, inertial navigation system and satellite positioning can be used to reconstruct georeferenced point-based models of the surveyed environments. Ideal reconstruction requires accurate trajectories that are challenging to obtain in forests. Satellite signals are heavily degraded under the forest canopy, while lidar-based positioning is often inefficient due to the forest's unstructured and complex nature. Most forestry-related solutions compute or improve the trajectory in post-processing, focusing on accuracy rather than the possibility of real-time operation. On the other hand, real-time solutions exist, but they are primarily tested and evaluated in urban environments, and the forest's effect on them is less known.
In this study, high-quality, real-time point-based forest model generation was considered by applying techniques from the field of robotics. Forest data were collected with an MLS system mounted 1) on a stick carried by a person and 2) on a forest harvester performing thinning operations. The system's trajectory was computed using lidar-inertial-based smoothing and mapping algorithms under real-time constraints. In addition, satellite measurements were either fused into the smoothing algorithm, contributing to the trajectory estimation, or used to georeference the trajectory in a post-processing step.
Collecting reliable reference trajectories is difficult in forests; therefore, this study mainly relies on qualitative and relative evaluation. The results indicate that real-time, onboard processing of forest data is feasible with adequate accuracy. State-of-the-art edge- and planar-feature-based lidar odometry was the most accurate but could not fully maintain real-time operation. On the other hand, normal distributions transform-based odometry maintained fast and constant computation with slightly lower accuracy. Fusing satellite positioning into the mapping reduced the internal integrity of the reconstructed point cloud models, and it is therefore suggested to use satellite positioning for post-processed georeferencing instead.
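As an illustration of the post-processed georeferencing option mentioned above (an assumption about one reasonable implementation, not the study's pipeline), the sketch below aligns a locally consistent SLAM trajectory to sparse GNSS fixes with a closed-form least-squares rigid transform (Kabsch/Umeyama without scaling) and applies it to the whole trajectory.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rigid transform (R, t) mapping source points onto target points."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    S, T = source - mu_s, target - mu_t
    U, _, Vt = np.linalg.svd(T.T @ S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    t = mu_t - R @ mu_s
    return R, t

# Locally consistent SLAM trajectory (odometry frame) and sparse GNSS fixes
# (global frame) at matching timestamps -- stand-in data for illustration.
rng = np.random.default_rng(0)
slam_xyz = np.cumsum(rng.normal(size=(100, 3)), axis=0)
true_R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
gnss_xyz = slam_xyz[::10] @ true_R.T + np.array([4.5e5, 6.7e6, 120.0]) \
           + 0.5 * rng.normal(size=(10, 3))                  # noisy GNSS fixes

R, t = rigid_align(slam_xyz[::10], gnss_xyz)
georeferenced = slam_xyz @ R.T + t    # transform the full trajectory to the global frame
print(np.round(t, 1))
```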