EsaCL: Efficient Continual Learning of Sparse Models
A key challenge in the continual learning setting is to efficiently learn a
sequence of tasks without forgetting how to perform previously learned tasks.
Many existing approaches to this problem work by either retraining the model on
previous tasks or by expanding the model to accommodate new tasks. However,
these approaches typically suffer from increased storage and computational
requirements, a problem that is worsened in the case of sparse models due to
the need for expensive retraining after sparsification. To address this challenge,
we propose a new method for efficient continual learning of sparse models
(EsaCL) that can automatically prune redundant parameters without adversely
impacting the model's predictive power, and circumvent the need for retraining.
We conduct a theoretical analysis of loss landscapes with parameter pruning,
and design a sharpness-informed directional pruning (SDP) strategy that is guided by the
sharpness of the loss function with respect to the model parameters. SDP
ensures that the model is pruned with minimal loss of predictive accuracy, accelerating the
learning of sparse models at each stage. To accelerate model updates, we
introduce an intelligent data selection (IDS) strategy that can identify
critical instances for estimating loss landscape, yielding substantially
improved data efficiency. The results of our experiments show that EsaCL
achieves performance that is competitive with the state-of-the-art methods on
three continual learning benchmarks, while using substantially reduced memory
and computational resources. Comment: SDM 2024: SIAM International Conference on Data Mining.
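For readers unfamiliar with sharpness- or gradient-informed pruning, the sketch below illustrates the general idea of ranking parameters by their estimated first-order effect on the loss and zeroing the least important ones. It is a generic illustration, not the EsaCL SDP procedure; the saliency criterion, function name, and sparsity level are assumptions made for the example.

```python
import numpy as np

def taylor_saliency_prune(weights, grads, sparsity=0.5):
    """Prune parameters with the smallest estimated first-order loss impact.

    Hypothetical illustration of sharpness/gradient-informed pruning: the
    saliency |w * dL/dw| approximates the loss change from zeroing w.
    (Generic sketch, not the EsaCL SDP procedure itself.)
    """
    saliency = np.abs(weights * grads)
    k = int(sparsity * weights.size)          # number of weights to remove
    threshold = np.partition(saliency.ravel(), k)[k]
    mask = saliency >= threshold              # keep the most salient weights
    return weights * mask, mask

# toy usage with random weights and gradients
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
g = rng.normal(size=(4, 8))
w_sparse, mask = taylor_saliency_prune(w, g, sparsity=0.75)
print("kept fraction:", mask.mean())
```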
Learning DFA from Simple Examples
We present a framework for learning DFA from simple examples. We show that efficient PAC learning of DFA is possible if the class of distributions is restricted to simple distributions where a teacher might choose examples based on the knowledge of the target concept. This answers an open research question posed in Pitt's seminal paper: Are DFAs PAC-identifiable if examples are drawn from the uniform distribution, or some other known simple distribution? Our approach uses the RPNI algorithm for learning DFA from labeled examples. In particular, we describe an efficient learning algorithm for exact learning of the target DFA with high probability when a bound on the number of states (N) of the target DFA is known in advance. When N is not known, we show how this algorithm can be used for efficient PAC learning of DFAs.
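As background on the RPNI-style setup the abstract refers to, the sketch below builds the prefix-tree acceptor from labeled positive and negative strings that RPNI starts from; the subsequent state-merging phase, which does the actual generalization, is omitted, and all names are illustrative rather than the paper's notation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal prefix-tree acceptor (PTA) built from labeled strings.  RPNI starts
# from such a PTA and then merges states in a fixed order whenever the merge
# remains consistent with the negative examples; that merging phase is
# omitted here for brevity.

@dataclass
class State:
    accepting: Optional[bool] = None                 # True/False once labeled, None if unseen
    transitions: dict = field(default_factory=dict)  # symbol -> State

def build_pta(positives, negatives):
    root = State()
    labeled = [(w, True) for w in positives] + [(w, False) for w in negatives]
    for word, label in labeled:
        node = root
        for symbol in word:
            node = node.transitions.setdefault(symbol, State())
        node.accepting = label
    return root

def accepts(root, word):
    node = root
    for symbol in word:
        if symbol not in node.transitions:
            return False
        node = node.transitions[symbol]
    return node.accepting is True

pta = build_pta(positives=["ab", "abab"], negatives=["a", "b"])
print(accepts(pta, "abab"))   # True
print(accepts(pta, "a"))      # False
```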
3M-Diffusion: Latent Multi-Modal Diffusion for Text-Guided Generation of Molecular Graphs
Generating molecules with desired properties is a critical task with broad
applications in drug discovery and materials design. Inspired by recent
advances in large language models, there is a growing interest in using natural
language descriptions of molecules to generate molecules with the desired
properties. Most existing methods focus on generating molecules that precisely
match the text description. However, practical applications call for methods
that generate diverse, and ideally novel, molecules with the desired
properties. We propose 3M-Diffusion, a novel multi-modal molecular graph
generation method, to address this challenge. 3M-Diffusion first encodes
molecular graphs into a graph latent space aligned with text descriptions, and
reconstructs the molecular structure and atomic attributes from the given text
descriptions using the molecule decoder. It then learns a
probabilistic mapping from the text space to the latent molecular graph space
using a diffusion model. The results of our extensive experiments on several
datasets demonstrate that 3M-Diffusion can generate high-quality, novel and
diverse molecular graphs that semantically match the textual description
provided.
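The schematic sketch below shows the overall shape of text-conditioned sampling in a latent space: a text embedding conditions an iterative denoising loop, and the final latent is decoded into graph attributes. The module sizes, the simplified denoising update, and the decoder are placeholders, not the actual 3M-Diffusion architecture or training objective.

```python
import torch
import torch.nn as nn

# Schematic text-conditioned latent sampling (illustrative placeholders only).
latent_dim, text_dim, steps = 32, 64, 50

text_encoder = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
denoiser = nn.Sequential(nn.Linear(latent_dim * 2 + 1, 128), nn.ReLU(), nn.Linear(128, latent_dim))
graph_decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 100))  # -> graph logits

@torch.no_grad()
def sample(text_embedding):
    cond = text_encoder(text_embedding)            # align the text with the latent space
    z = torch.randn(1, latent_dim)                 # start from Gaussian noise
    for t in reversed(range(steps)):
        t_feat = torch.full((1, 1), t / steps)     # crude timestep feature
        eps = denoiser(torch.cat([z, cond, t_feat], dim=-1))
        z = z - eps / steps                        # simplified denoising update
    return graph_decoder(z)                        # decode the latent into graph attributes

logits = sample(torch.randn(1, text_dim))
print(logits.shape)
```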
Visual Methods for Examining Support Vector Machine Results, with Applications to Gene Expression Data Analysis
Support vector machines (SVMs) offer a theoretically well-founded approach to automated learning of pattern classifiers. They have been proven to give highly accurate results in complex classification problems, for example, gene expression analysis. The SVM algorithm is also quite intuitive, with a few inputs to vary in the fitting process and several outputs that are interesting to study. For many data mining tasks (e.g., cancer prediction), finding classifiers with good predictive accuracy is important, but understanding the classifier is equally important. By studying the classifier outputs we may be able to produce a simpler classifier, learn which variables are the important discriminators between classes, and find the samples that are problematic to classify. Visual methods for exploratory data analysis can help us to study the outputs and complement automated classification algorithms in data mining. We present the use of tour-based methods to plot aspects of the SVM classifier. This approach provides insights about the cluster structure in the data, the nature of boundaries between clusters, and problematic outliers. Furthermore, tours can be used to assess variable importance. We show how visual methods can be used as a complement to cross-validation methods in order to find good SVM input parameters for a particular data set.
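As a rough illustration of the kind of display inspected in such analyses, the sketch below projects a fitted SVM's data and support vectors onto a single random orthonormal 2-D basis; a real tour animates a smooth sequence of such projections, so this one static frame is only a crude stand-in. The dataset and kernel choice are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Fit a linear SVM on synthetic data, then project data and support vectors
# onto one random orthonormal 2-D basis (a single "frame" of a tour).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(X.shape[1], 2)))   # random orthonormal 2-D basis

proj = X @ basis                          # all samples in the projection plane
sv_proj = clf.support_vectors_ @ basis    # support vectors in the same plane
print(proj.shape, sv_proj.shape)
```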
Accelerating Science: A Computing Research Agenda
The emergence of "big data" offers unprecedented opportunities for not only
accelerating scientific advances but also enabling new modes of discovery.
Scientific progress in many disciplines is increasingly enabled by our ability
to examine natural phenomena through the computational lens, i.e., using
algorithmic or information processing abstractions of the underlying processes;
and our ability to acquire, share, integrate and analyze disparate types of
data. However, there is a huge gap between our ability to acquire, store, and
process data and our ability to make effective use of the data to advance
discovery. Despite successful automation of routine aspects of data management
and analytics, most elements of the scientific process currently require
considerable human expertise and effort. Accelerating science to keep pace with
the rate of data acquisition and data processing calls for the development of
algorithmic or information processing abstractions, coupled with formal methods
and tools for modeling and simulation of natural processes as well as major
innovations in cognitive tools for scientists, i.e., computational tools that
leverage and extend the reach of human intellect, and partner with humans on a
broad range of tasks in scientific discovery (e.g., identifying, prioritizing, and
formulating questions; designing, prioritizing, and executing experiments
designed to answer a chosen question; drawing inferences and evaluating the
results; and formulating new questions, in a closed-loop fashion). This calls
for a concerted research agenda aimed at: Development, analysis, integration,
sharing, and simulation of algorithmic or information processing abstractions
of natural processes, coupled with formal methods and tools for their analyses
and simulation; Innovations in cognitive tools that augment and extend human
intellect and partner with humans in all aspects of science. Comment: Computing Community Consortium (CCC) white paper, 17 pages.
MolBind: Multimodal Alignment of Language, Molecules, and Proteins
Recent advancements in biology and chemistry have leveraged multi-modal
learning, integrating molecules and their natural language descriptions to
enhance drug discovery. However, current pre-training frameworks are limited to
two modalities, and designing a unified network to process different modalities
(e.g., natural language, 2D molecular graphs, 3D molecular conformations, and
3D proteins) remains challenging due to inherent gaps among them. In this work,
we propose MolBind, a framework that trains encoders for multiple modalities
through contrastive learning, mapping all modalities to a shared feature space
for multi-modal semantic alignment. To facilitate effective pre-training of
MolBind on multiple modalities, we also build and collect a high-quality
dataset with four modalities, MolBind-M4, including graph-language,
conformation-language, graph-conformation, and conformation-protein paired
data. MolBind shows superior zero-shot learning performance across a wide range
of tasks, demonstrating its strong capability of capturing the underlying
semantics of multiple modalities.
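For context on contrastive multi-modal alignment, the sketch below shows a generic CLIP-style symmetric InfoNCE loss between two batches of paired embeddings, the kind of objective used to map different modalities into a shared feature space. The temperature and the symmetric form are common defaults, not MolBind's exact training recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Generic CLIP-style sketch: the i-th item of one modality should be most
    similar to the i-th item of the other modality in the shared space.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(a.size(0))           # i-th graph pairs with i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy usage: 16 paired (e.g., graph, text) embeddings of dimension 128
loss = contrastive_alignment_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```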
Advanced Cyberinfrastructure for Science, Engineering, and Public Policy
Progress in many domains increasingly benefits from our ability to view the
systems through a computational lens, i.e., using computational abstractions of
the domains; and our ability to acquire, share, integrate, and analyze
disparate types of data. These advances would not be possible without the
advanced data and computational cyberinfrastructure and tools for data capture,
integration, analysis, modeling, and simulation. However, despite, and perhaps
because of, advances in "big data" technologies for data acquisition,
management and analytics, the other largely manual, and labor-intensive aspects
of the decision making process, e.g., formulating questions, designing studies,
organizing, curating, connecting, correlating and integrating cross-domain data,
drawing inferences and interpreting results, have become the rate-limiting
steps to progress. Advancing the capability and capacity for evidence-based
improvements in science, engineering, and public policy requires support for
(1) computational abstractions of the relevant domains coupled with
computational methods and tools for their analysis, synthesis, simulation,
visualization, sharing, and integration; (2) cognitive tools that leverage and
extend the reach of human intellect, and partner with humans on all aspects of
the activity; (3) nimble and trustworthy data cyberinfrastructures that
connect and manage a variety of instruments, multiple interrelated data types and
associated metadata, data representations, processes, protocols and workflows;
and enforce applicable security and data access and use policies; and (4)
organizational and social structures and processes for collaborative and
coordinated activity across disciplinary and institutional boundaries. Comment: A Computing Community Consortium (CCC) white paper, 9 pages. arXiv admin note: text overlap with arXiv:1604.0200
Characterization of the retinal proteome during rod photoreceptor genesis
Background: The process of rod photoreceptor genesis, cell fate determination and differentiation is complex and multi-factorial. Previous studies have defined a model of photoreceptor differentiation that relies on intrinsic changes within the presumptive photoreceptor cells as well as changes in surrounding tissue that are extrinsic to the cell. We have used a proteomics approach to identify proteins that are dynamically expressed in the mouse retina during rod genesis and differentiation. Findings: A series of six developmental ages from E13 to P5 were used to define changes in retinal protein expression during rod photoreceptor genesis and early differentiation. Retinal proteins were separated by isoelectric focus point and molecular weight. Gels were analyzed for changes in protein spot intensity across developmental time. Protein spots that peaked in expression at E17, P0 and P5 were picked from gels for identification. There were 239 spots that were picked for identification based on their dynamic expression during the developmental period of maximal rod photoreceptor genesis and differentiation. Of the 239 spots, 60 were reliably identified and represented a single protein. Ten proteins were represented by multiple spots, suggesting they were post-translationally modified. Of the 42 unique dynamically expressed proteins identified, 16 had been previously reported to be associated with the developing retina. Conclusions: Our results represent the first proteomics study of the developing mouse retina that includes prenatal development. We identified 26 dynamically expressed proteins in the developing mouse retina whose expression had not been previously associated with retinal development.
Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models
Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data. Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
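As background, the sketch below implements only the plain first-order Markov model baseline mentioned in the comparison: class-conditional transition counts with additive smoothing, used to score a sequence by log-likelihood. The abstraction hierarchy over k-mers and the semi-supervised training that distinguish AAMMs are omitted, and the toy sequences and class names are invented for the example.

```python
from collections import defaultdict
import math

class MarkovModel:
    """Plain first-order Markov model over amino-acid sequences (baseline MM only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                                     # additive smoothing
        self.counts = defaultdict(lambda: defaultdict(float))  # prev residue -> next residue -> count

    def fit(self, sequences):
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.counts[prev][cur] += 1
        return self

    def log_likelihood(self, seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        ll = 0.0
        for prev, cur in zip(seq, seq[1:]):
            row = self.counts[prev]
            total = sum(row.values()) + self.alpha * len(alphabet)
            ll += math.log((row[cur] + self.alpha) / total)
        return ll

# toy usage: score a query under two class-conditional models (invented data)
cytosolic = MarkovModel().fit(["MKKLLPT", "MKTAYIAK"])
secreted = MarkovModel().fit(["MRVLLVAL", "MRALLLAL"])
query = "MKKAYIA"
print("cytosolic" if cytosolic.log_likelihood(query) > secreted.log_likelihood(query) else "secreted")
```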
On Evaluating MHC-II Binding Peptide Prediction Methods
The choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from the MHCPEP, MHCBN, and IEDB databases. Comparing the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former estimates can be rather optimistic relative to the performance of the same methods on the similarity-reduced datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods.
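The sketch below shows one simple way to build a similarity-reduced dataset: greedily keep a peptide only if it is not too similar to any peptide already kept. The 80% threshold and the use of difflib's SequenceMatcher as the similarity measure are illustrative assumptions made for the example; the paper's datasets are instead based on a sequence-identity criterion.

```python
from difflib import SequenceMatcher

def similarity_reduce(peptides, max_similarity=0.8):
    """Greedily keep peptides whose similarity to every kept peptide is below a threshold."""
    kept = []
    for pep in peptides:
        if all(SequenceMatcher(None, pep, k).ratio() < max_similarity for k in kept):
            kept.append(pep)
    return kept

# toy usage: the second peptide is nearly identical to the first and is dropped
peptides = ["PKYVKQNTLKLAT", "PKYVKQNTLKLATG", "AAGAEAGKATTEEQ", "GELIGILNAAKVPAD"]
print(similarity_reduce(peptides))
```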