EsaCL: Efficient Continual Learning of Sparse Models
A key challenge in the continual learning setting is to efficiently learn a
sequence of tasks without forgetting how to perform previously learned tasks.
Many existing approaches to this problem work by either retraining the model on
previous tasks or by expanding the model to accommodate new tasks. However,
these approaches typically suffer from increased storage and computational
requirements, a problem that is worsened in the case of sparse models due to
the need for expensive retraining after sparsification. To address this challenge,
we propose a new method for efficient continual learning of sparse models
(EsaCL) that can automatically prune redundant parameters without adversely
impacting the model's predictive power, and circumvent the need for retraining.
We conduct a theoretical analysis of loss landscapes with parameter pruning,
and design a sharpness-informed directional pruning (SDP) strategy that is guided by the
sharpness of the loss function with respect to the model parameters. SDP
ensures that the model is pruned with minimal loss of predictive accuracy, accelerating the
learning of sparse models at each stage. To accelerate model updates, we
introduce an intelligent data selection (IDS) strategy that can identify
critical instances for estimating loss landscape, yielding substantially
improved data efficiency. The results of our experiments show that EsaCL
achieves performance that is competitive with the state-of-the-art methods on
three continual learning benchmarks, while using substantially reduced memory
and computational resources. Comment: SDM 2024: SIAM International Conference on Data Mining.
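For readers unfamiliar with sharpness- or gradient-informed pruning, the sketch below illustrates the general idea of ranking parameters by their estimated first-order effect on the loss and zeroing the least important ones. It is a generic illustration, not the EsaCL SDP procedure; the saliency criterion, function name, and sparsity level are assumptions made for the example.

```python
import numpy as np

def taylor_saliency_prune(weights, grads, sparsity=0.5):
    """Prune parameters with the smallest estimated first-order loss impact.

    Hypothetical illustration of sharpness/gradient-informed pruning: the
    saliency |w * dL/dw| approximates the loss change from zeroing w.
    (Generic sketch, not the EsaCL SDP procedure itself.)
    """
    saliency = np.abs(weights * grads)
    k = int(sparsity * weights.size)          # number of weights to remove
    threshold = np.partition(saliency.ravel(), k)[k]
    mask = saliency >= threshold              # keep the most salient weights
    return weights * mask, mask

# toy usage with random weights and gradients
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
g = rng.normal(size=(4, 8))
w_sparse, mask = taylor_saliency_prune(w, g, sparsity=0.75)
print("kept fraction:", mask.mean())
```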
Learning DFA from Simple Examples
We present a framework for learning DFA from simple examples. We show that efficient PAC learning of DFA is possible if the class of distributions is restricted to simple distributions where a teacher might choose examples based on the knowledge of the target concept. This answers an open research question posed in Pitt's seminal paper: Are DFAs PAC-identifiable if examples are drawn from the uniform distribution, or some other known simple distribution? Our approach uses the RPNI algorithm for learning DFA from labeled examples. In particular, we describe an efficient learning algorithm for exact learning of the target DFA with high probability when a bound on the number of states (N) of the target DFA is known in advance. When N is not known, we show how this algorithm can be used for efficient PAC learning of DFAs.
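As background on the RPNI-style setup the abstract refers to, the sketch below builds the prefix-tree acceptor from labeled positive and negative strings that RPNI starts from; the subsequent state-merging phase, which does the actual generalization, is omitted, and all names are illustrative rather than the paper's notation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal prefix-tree acceptor (PTA) built from labeled strings.  RPNI starts
# from such a PTA and then merges states in a fixed order whenever the merge
# remains consistent with the negative examples; that merging phase is
# omitted here for brevity.

@dataclass
class State:
    accepting: Optional[bool] = None                 # True/False once labeled, None if unseen
    transitions: dict = field(default_factory=dict)  # symbol -> State

def build_pta(positives, negatives):
    root = State()
    labeled = [(w, True) for w in positives] + [(w, False) for w in negatives]
    for word, label in labeled:
        node = root
        for symbol in word:
            node = node.transitions.setdefault(symbol, State())
        node.accepting = label
    return root

def accepts(root, word):
    node = root
    for symbol in word:
        if symbol not in node.transitions:
            return False
        node = node.transitions[symbol]
    return node.accepting is True

pta = build_pta(positives=["ab", "abab"], negatives=["a", "b"])
print(accepts(pta, "abab"))   # True
print(accepts(pta, "a"))      # False
```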
3M-Diffusion: Latent Multi-Modal Diffusion for Text-Guided Generation of Molecular Graphs
Generating molecules with desired properties is a critical task with broad
applications in drug discovery and materials design. Inspired by recent
advances in large language models, there is a growing interest in using natural
language descriptions of molecules to generate molecules with the desired
properties. Most existing methods focus on generating molecules that precisely
match the text description. However, practical applications call for methods
that generate diverse, and ideally novel, molecules with the desired
properties. We propose 3M-Diffusion, a novel multi-modal molecular graph
generation method, to address this challenge. 3M-Diffusion first encodes
molecular graphs into a graph latent space aligned with text descriptions, and
reconstructs the molecular structure and atomic attributes from the given text
descriptions using the molecule decoder. It then learns a
probabilistic mapping from the text space to the latent molecular graph space
using a diffusion model. The results of our extensive experiments on several
datasets demonstrate that 3M-Diffusion can generate high-quality, novel and
diverse molecular graphs that semantically match the textual description
provided.
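The schematic sketch below shows the overall shape of text-conditioned sampling in a latent space: a text embedding conditions an iterative denoising loop, and the final latent is decoded into graph attributes. The module sizes, the simplified denoising update, and the decoder are placeholders, not the actual 3M-Diffusion architecture or training objective.

```python
import torch
import torch.nn as nn

# Schematic text-conditioned latent sampling (illustrative placeholders only).
latent_dim, text_dim, steps = 32, 64, 50

text_encoder = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
denoiser = nn.Sequential(nn.Linear(latent_dim * 2 + 1, 128), nn.ReLU(), nn.Linear(128, latent_dim))
graph_decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 100))  # -> graph logits

@torch.no_grad()
def sample(text_embedding):
    cond = text_encoder(text_embedding)            # align the text with the latent space
    z = torch.randn(1, latent_dim)                 # start from Gaussian noise
    for t in reversed(range(steps)):
        t_feat = torch.full((1, 1), t / steps)     # crude timestep feature
        eps = denoiser(torch.cat([z, cond, t_feat], dim=-1))
        z = z - eps / steps                        # simplified denoising update
    return graph_decoder(z)                        # decode the latent into graph attributes

logits = sample(torch.randn(1, text_dim))
print(logits.shape)
```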
Visual Methods for Examining Support Vector Machine Results, with Applications to Gene Expression Data Analysis
Support vector machines (SVMs) offer a theoretically well-founded approach to automated learning of pattern classifiers. They have been proven to give highly accurate results in complex classification problems, for example, gene expression analysis. The SVM algorithm is also quite intuitive, with a few inputs to vary in the fitting process and several outputs that are interesting to study. For many data mining tasks (e.g., cancer prediction), finding classifiers with good predictive accuracy is important, but understanding the classifier is equally important. By studying the classifier outputs we may be able to produce a simpler classifier, learn which variables are the important discriminators between classes, and find the samples that are problematic to classify. Visual methods for exploratory data analysis can help us to study the outputs and complement automated classification algorithms in data mining. We present the use of tour-based methods to plot aspects of the SVM classifier. This approach provides insights about the cluster structure in the data, the nature of boundaries between clusters, and problematic outliers. Furthermore, tours can be used to assess variable importance. We show how visual methods can be used as a complement to cross-validation methods in order to find good SVM input parameters for a particular data set.
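As a rough illustration of the kind of display inspected in such analyses, the sketch below projects a fitted SVM's data and support vectors onto a single random orthonormal 2-D basis; a real tour animates a smooth sequence of such projections, so this one static frame is only a crude stand-in. The dataset and kernel choice are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Fit a linear SVM on synthetic data, then project data and support vectors
# onto one random orthonormal 2-D basis (a single "frame" of a tour).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(X.shape[1], 2)))   # random orthonormal 2-D basis

proj = X @ basis                          # all samples in the projection plane
sv_proj = clf.support_vectors_ @ basis    # support vectors in the same plane
print(proj.shape, sv_proj.shape)
```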
Accelerating Science: A Computing Research Agenda
The emergence of "big data" offers unprecedented opportunities for not only
accelerating scientific advances but also enabling new modes of discovery.
Scientific progress in many disciplines is increasingly enabled by our ability
to examine natural phenomena through the computational lens, i.e., using
algorithmic or information processing abstractions of the underlying processes;
and our ability to acquire, share, integrate and analyze disparate types of
data. However, there is a huge gap between our ability to acquire, store, and
process data and our ability to make effective use of the data to advance
discovery. Despite successful automation of routine aspects of data management
and analytics, most elements of the scientific process currently require
considerable human expertise and effort. Accelerating science to keep pace with
the rate of data acquisition and data processing calls for the development of
algorithmic or information processing abstractions, coupled with formal methods
and tools for modeling and simulation of natural processes as well as major
innovations in cognitive tools for scientists, i.e., computational tools that
leverage and extend the reach of human intellect, and partner with humans on a
broad range of tasks in scientific discovery (e.g., identifying, prioritizing, and
formulating questions; designing, prioritizing, and executing experiments
designed to answer a chosen question; drawing inferences and evaluating the
results; and formulating new questions, in a closed-loop fashion). This calls
for a concerted research agenda aimed at: Development, analysis, integration,
sharing, and simulation of algorithmic or information processing abstractions
of natural processes, coupled with formal methods and tools for their analyses
and simulation; Innovations in cognitive tools that augment and extend human
intellect and partner with humans in all aspects of science. Comment: Computing Community Consortium (CCC) white paper, 17 pages.
MolBind: Multimodal Alignment of Language, Molecules, and Proteins
Recent advancements in biology and chemistry have leveraged multi-modal
learning, integrating molecules and their natural language descriptions to
enhance drug discovery. However, current pre-training frameworks are limited to
two modalities, and designing a unified network to process different modalities
(e.g., natural language, 2D molecular graphs, 3D molecular conformations, and
3D proteins) remains challenging due to inherent gaps among them. In this work,
we propose MolBind, a framework that trains encoders for multiple modalities
through contrastive learning, mapping all modalities to a shared feature space
for multi-modal semantic alignment. To facilitate effective pre-training of
MolBind on multiple modalities, we also build and collect a high-quality
dataset with four modalities, MolBind-M4, including graph-language,
conformation-language, graph-conformation, and conformation-protein paired
data. MolBind shows superior zero-shot learning performance across a wide range
of tasks, demonstrating its strong capability of capturing the underlying
semantics of multiple modalities.
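For context on contrastive multi-modal alignment, the sketch below shows a generic CLIP-style symmetric InfoNCE loss between two batches of paired embeddings, the kind of objective used to map different modalities into a shared feature space. The temperature and the symmetric form are common defaults, not MolBind's exact training recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Generic CLIP-style sketch: the i-th item of one modality should be most
    similar to the i-th item of the other modality in the shared space.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(a.size(0))           # i-th graph pairs with i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy usage: 16 paired (e.g., graph, text) embeddings of dimension 128
loss = contrastive_alignment_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```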
Advanced Cyberinfrastructure for Science, Engineering, and Public Policy
Progress in many domains increasingly benefits from our ability to view the
systems through a computational lens, i.e., using computational abstractions of
the domains; and our ability to acquire, share, integrate, and analyze
disparate types of data. These advances would not be possible without the
advanced data and computational cyberinfrastructure and tools for data capture,
integration, analysis, modeling, and simulation. However, despite, and perhaps
because of, advances in "big data" technologies for data acquisition,
management and analytics, the other largely manual, and labor-intensive aspects
of the decision making process, e.g., formulating questions, designing studies,
organizing, curating, connecting, correlating and integrating cross-domain data,
drawing inferences and interpreting results, have become the rate-limiting
steps to progress. Advancing the capability and capacity for evidence-based
improvements in science, engineering, and public policy requires support for
(1) computational abstractions of the relevant domains coupled with
computational methods and tools for their analysis, synthesis, simulation,
visualization, sharing, and integration; (2) cognitive tools that leverage and
extend the reach of human intellect, and partner with humans on all aspects of
the activity; (3) nimble and trustworthy data cyberinfrastructures that
connect and manage a variety of instruments, multiple interrelated data types and
associated metadata, data representations, processes, protocols and workflows;
and enforce applicable security and data access and use policies; and (4)
organizational and social structures and processes for collaborative and
coordinated activity across disciplinary and institutional boundaries. Comment: A Computing Community Consortium (CCC) white paper, 9 pages. arXiv admin note: text overlap with arXiv:1604.0200
Characterization of the retinal proteome during rod photoreceptor genesis
Background: The process of rod photoreceptor genesis, cell fate determination and differentiation is complex and multi-factorial. Previous studies have defined a model of photoreceptor differentiation that relies on intrinsic changes within the presumptive photoreceptor cells as well as changes in surrounding tissue that are extrinsic to the cell. We have used a proteomics approach to identify proteins that are dynamically expressed in the mouse retina during rod genesis and differentiation. Findings: A series of six developmental ages from E13 to P5 were used to define changes in retinal protein expression during rod photoreceptor genesis and early differentiation. Retinal proteins were separated by isoelectric focus point and molecular weight. Gels were analyzed for changes in protein spot intensity across developmental time. Protein spots that peaked in expression at E17, P0 and P5 were picked from gels for identification. There were 239 spots that were picked for identification based on their dynamic expression during the developmental period of maximal rod photoreceptor genesis and differentiation. Of the 239 spots, 60 were reliably identified and represented a single protein. Ten proteins were represented by multiple spots, suggesting they were post-translationally modified. Of the 42 unique dynamically expressed proteins identified, 16 had been previously reported to be associated with the developing retina. Conclusions: Our results represent the first proteomics study of the developing mouse retina that includes prenatal development. We identified 26 dynamically expressed proteins in the developing mouse retina whose expression had not been previously associated with retinal development.
Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models
Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data. Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
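As background, the sketch below implements only the plain first-order Markov model baseline mentioned in the comparison: class-conditional transition counts with additive smoothing, used to score a sequence by log-likelihood. The abstraction hierarchy over k-mers and the semi-supervised training that distinguish AAMMs are omitted, and the toy sequences and class names are invented for the example.

```python
from collections import defaultdict
import math

class MarkovModel:
    """Plain first-order Markov model over amino-acid sequences (baseline MM only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                                     # additive smoothing
        self.counts = defaultdict(lambda: defaultdict(float))  # prev residue -> next residue -> count

    def fit(self, sequences):
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.counts[prev][cur] += 1
        return self

    def log_likelihood(self, seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        ll = 0.0
        for prev, cur in zip(seq, seq[1:]):
            row = self.counts[prev]
            total = sum(row.values()) + self.alpha * len(alphabet)
            ll += math.log((row[cur] + self.alpha) / total)
        return ll

# toy usage: score a query under two class-conditional models (invented data)
cytosolic = MarkovModel().fit(["MKKLLPT", "MKTAYIAK"])
secreted = MarkovModel().fit(["MRVLLVAL", "MRALLLAL"])
query = "MKKAYIA"
print("cytosolic" if cytosolic.log_likelihood(query) > secreted.log_likelihood(query) else "secreted")
```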
On Evaluating MHC-II Binding Peptide Prediction Methods
The choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from the MHCPEP, MHCBN, and IEDB databases. Comparing the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former estimates can be rather optimistic relative to the performance of the same methods on the similarity-reduced datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods.
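The sketch below shows one simple way to build a similarity-reduced dataset: greedily keep a peptide only if it is not too similar to any peptide already kept. The 80% threshold and the use of difflib's SequenceMatcher as the similarity measure are illustrative assumptions made for the example; the paper's datasets are instead based on a sequence-identity criterion.

```python
from difflib import SequenceMatcher

def similarity_reduce(peptides, max_similarity=0.8):
    """Greedily keep peptides whose similarity to every kept peptide is below a threshold."""
    kept = []
    for pep in peptides:
        if all(SequenceMatcher(None, pep, k).ratio() < max_similarity for k in kept):
            kept.append(pep)
    return kept

# toy usage: the second peptide is nearly identical to the first and is dropped
peptides = ["PKYVKQNTLKLAT", "PKYVKQNTLKLATG", "AAGAEAGKATTEEQ", "GELIGILNAAKVPAD"]
print(similarity_reduce(peptides))
```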