11 research outputs found
Cluster Exploration using Informative Manifold Projections
Dimensionality reduction (DR) is one of the key tools for the visual
exploration of high-dimensional data and uncovering its cluster structure in
two- or three-dimensional spaces. The vast majority of DR methods in the
literature do not take into account any prior knowledge a practitioner may have
regarding the dataset under consideration. We propose a novel method to
generate informative embeddings which not only factor out the structure
associated with different kinds of prior knowledge but also aim to reveal any
remaining underlying structure. To achieve this, we employ a linear combination
of two objectives: firstly, contrastive PCA that discounts the structure
associated with the prior information, and secondly, kurtosis projection
pursuit which ensures meaningful data separation in the obtained embeddings. We
formulate this task as a manifold optimization problem and validate it
empirically across a variety of datasets considering three distinct types of
prior knowledge. Lastly, we provide an automated framework to perform iterative
visual exploration of high-dimensional data.
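The kurtosis projection pursuit idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that a low-kurtosis (bimodal-looking) projection is the one of interest, and it uses a plain grid search over angles in two dimensions rather than the manifold optimization the abstract describes. The contrastive PCA term is left out entirely. The toy data and the function names `kurtosis` and `min_kurtosis_angle` are hypothetical and do not come from the paper.

```python
import math

def kurtosis(values):
    """Standardized fourth moment m4 / m2**2 (equals 3 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    return m4 / (m2 ** 2)

def min_kurtosis_angle(points, n_angles=180):
    """Grid-search the projection angle (in degrees) with the lowest kurtosis."""
    best_angle, best_value = None, math.inf
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Project every 2-D point onto the unit direction at angle theta.
        projected = [x * cos_t + y * sin_t for x, y in points]
        value = kurtosis(projected)
        if value < best_value:
            best_angle, best_value = 180.0 * k / n_angles, value
    return best_angle, best_value

# Hypothetical toy data: two well-separated groups along x,
# and a single evenly spread group along y.
xs = [-3.5, -3.0, -2.5, 2.5, 3.0, 3.5]
ys = [-1.5, -0.5, 0.5, 1.5]
points = [(x, y) for x in xs for y in ys]

angle, value = min_kurtosis_angle(points)
print(angle, round(value, 4))
```

On this toy grid the search selects the x-axis direction, along which the two groups separate. A real implementation would work in higher dimensions and combine this term with the prior-knowledge objective.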
HypBO: Expert-Guided Chemist-in-the-Loop Bayesian Search for New Materials
Robotics and automation offer massive accelerations for solving intractable,
multivariate scientific problems such as materials discovery, but the available
search spaces can be dauntingly large. Bayesian optimization (BO) has emerged
as a popular sample-efficient optimization engine, thriving in tasks where no
analytic form of the target function/property is known. Here we exploit expert
human knowledge in the form of hypotheses to direct Bayesian searches more
quickly to promising regions of chemical space. Previous methods have used
underlying distributions derived from existing experimental measurements, which
is unfeasible for new, unexplored scientific tasks. Also, such distributions
cannot capture intricate hypotheses. Our proposed method, which we call HypBO,
uses expert human hypotheses to generate an improved seed of samples.
Unpromising seeds are automatically discounted, while promising seeds are used
to augment the surrogate model data, thus achieving better-informed sampling.
This process continues in a global versus local search fashion, organized in a
bilevel optimization framework. We validate the performance of our method on a
range of synthetic functions and demonstrate its practical utility on a real
chemical design task where the use of expert hypotheses significantly
accelerates the search.
Continuation Methods for Approximate Large Scale Object Sequencing
We propose a set of highly scalable algorithms for the combinatorial data analysis problem of seriating similarity matrices. Seriation consists of finding a permutation of data instances such that similar instances are nearby in the ordering. Applications of the seriation problem can be found in various disciplines, such as bioinformatics for genome sequencing, data visualization and exploratory data analysis. Our algorithms attempt to minimize certain p-SUM objectives, which also arise in the problem of envelope reduction of sparse matrices. In particular, we present a set of graduated non-convexity algorithms for vector-based relaxations of the general p-SUM problem for p ∈ {2,1,½} that can scale to very large problem sizes. Different choices of p emphasize global versus local similarity pattern structure. We conduct a number of experiments to compare our algorithms to various state-of-the-art combinatorial optimization methods on real and synthetic datasets. The experimental results demonstrate that, compared to other approaches, the proposed algorithms are very competitive and scale well with large problem sizes.
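To make the seriation objective concrete, here is a minimal sketch. It assumes the p-SUM cost takes the commonly used form of a sum over item pairs of similarity times the position gap raised to the power p; the abstract does not spell out the exact definition. The brute-force search is for illustration on a tiny input only and is not the graduated non-convexity approach the abstract proposes. The names `p_sum` and `brute_force_seriation` are hypothetical.

```python
from itertools import permutations

def p_sum(similarity, order, p):
    """Assumed p-SUM cost: sum over pairs of similarity * |position gap| ** p."""
    n = len(order)
    # Map each item to its position in the candidate ordering.
    position = {item: idx for idx, item in enumerate(order)}
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            cost += similarity[i][j] * abs(position[i] - position[j]) ** p
    return cost

def brute_force_seriation(similarity, p):
    """Exhaustive search over all orderings; feasible only for very small n."""
    n = len(similarity)
    return min(permutations(range(n)),
               key=lambda order: p_sum(similarity, order, p))

# Hypothetical similarity matrix hiding the chain 2 - 0 - 3 - 1:
# only consecutive items in that chain are similar.
similarity = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]

best = brute_force_seriation(similarity, p=2)
print(best, p_sum(similarity, best, 2))
```

The recovered ordering is the hidden chain or its reverse, since both place every similar pair in adjacent positions. Exhaustive search grows factorially with the number of items, which is why scalable relaxations are needed for large problems.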
Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules
We evaluate the effectiveness of fine-tuning GPT-3 for the prediction of electronic and functional properties of organic molecules. Our findings show that fine-tuned GPT-3 can successfully identify and distinguish between chemically meaningful patterns, and discern subtle differences among them, exhibiting robust predictive performance for the prediction of molecular properties. We focus on assessing the fine-tuned models' resilience to information loss, resulting from the absence of atoms or chemical groups, and to noise that we introduce via random alterations in atomic identities. We discuss the challenges and limitations inherent to the use of GPT-3 in molecular machine-learning tasks and suggest potential directions for future research and improvements to address these issues.
Domain Knowledge Injection in Bayesian Search for New Materials
In this paper we propose DKIBO, a Bayesian optimization (BO) algorithm that accommodates domain knowledge to tune exploration in the search space. Bayesian optimization has recently emerged as a sample-efficient optimizer for many intractable scientific problems. While various existing BO frameworks allow the input of prior beliefs to accelerate the search by narrowing down the space, incorporating such knowledge is not always straightforward and can often introduce bias and lead to poor performance. Here we propose a simple approach to incorporate structural knowledge in the acquisition function by utilizing an additional deterministic surrogate model to enrich the approximation power of the Gaussian process. This is suitably chosen according to structural information of the problem at hand and acts as a corrective term towards better-informed sampling. We empirically demonstrate the practical utility of the proposed method by successfully injecting domain knowledge in a materials design task. We further validate our method’s performance across different experimental settings and through ablation analyses.