2,990 research outputs found
Enrichment of limited training sets in machine-learning-based analog/RF Test
Abstract: This paper discusses the generation of information-rich, arbitrarily large synthetic data sets which can be used to (a) efficiently learn tests that correlate a set of low-cost measurements to a set of device performances and (b) grade such tests with parts per million (PPM) accuracy. This is achieved by sampling a non-parametric estimate of the joint probability density function of measurements and performances. Our case study is an ultra-high frequency receiver front-end, and the focus of the paper is to learn the mapping between a low-cost test measurement pattern and a single pass/fail test decision which reflects compliance with all performances. The small fraction of devices for which such a test decision is prone to error are identified and retested through standard specification-based test. The mapping can be set to explore thoroughly the tradeoff between test escapes, yield loss, and percentage of retested devices.
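The sampling step described in this abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel density estimate with Scott's rule-of-thumb bandwidth, which the abstract does not specify; the function name `sample_gaussian_kde` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_gaussian_kde(data, n_samples, rng=None):
    """Draw synthetic rows from a Gaussian kernel density estimate of `data`.

    data: array of shape (n_devices, n_features); each row concatenates one
    device's measurements and performances, so the joint density is modeled.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # Scott's rule-of-thumb bandwidth factor (an assumed choice).
    factor = n ** (-1.0 / (d + 4))
    kernel_cov = np.atleast_2d(np.cov(data, rowvar=False)) * factor ** 2
    # Sampling a KDE: pick a training row at random, then add kernel noise.
    centers = data[rng.integers(0, n, size=n_samples)]
    noise = rng.multivariate_normal(np.zeros(d), kernel_cov, size=n_samples)
    return centers + noise


# Toy data (made up): one measurement and one correlated performance.
toy_rng = np.random.default_rng(0)
measurement = toy_rng.normal(0.0, 1.0, size=200)
performance = 2.0 * measurement + toy_rng.normal(0.0, 0.3, size=200)
real = np.column_stack([measurement, performance])

# An arbitrarily large synthetic population drawn from the estimated density.
synthetic = sample_gaussian_kde(real, n_samples=100_000, rng=1)
```

Because each synthetic row is a perturbed copy of a real device, the correlation between measurements and performances carries over to the enlarged set, which is what makes it usable for estimating rare-event rates such as test escapes.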
Regression modeling for digital test of ΣΔ modulators
The cost of Analogue and Mixed-Signal circuit
testing is an important bottleneck in the industry, due to time-consuming
verification of specifications that require state-of-the-art
Automatic Test Equipment. In this paper, we apply
the concept of Alternate Test to achieve digital testing of
converters. By training an ensemble of regression models that
maps simple digital defect-oriented signatures onto Signal to
Noise and Distortion Ratio (SNDR), an average error of 1.7%
is achieved. Beyond the inference of functional metrics, we show
that the approach can provide interesting diagnosis information.
Funding: Ministerio de Educación y Ciencia TEC2007-68072/MIC; Junta de Andalucía TIC 5386, CT 30
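The idea of mapping digital signatures onto SNDR with an ensemble of regression models can be sketched as follows. The abstract does not say which regression models are used, so bootstrap-aggregated linear least squares stands in here as an assumption; the function names and the toy signatures are hypothetical.

```python
import numpy as np


def fit_bagged_linear(X, y, n_models=25, rng=None):
    """Fit `n_models` linear regressors, each on a bootstrap resample."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, len(A), size=len(A))  # sample with replacement
        w, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        coefs.append(w)
    return np.array(coefs)  # shape (n_models, n_features + 1)


def predict_bagged(coefs, X):
    """Average the predictions of all ensemble members."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])
    return (A @ coefs.T).mean(axis=1)


# Toy data (made up): three digital signatures per device, SNDR-like target.
toy_rng = np.random.default_rng(0)
signatures = toy_rng.normal(size=(300, 3))
sndr = 60.0 + signatures @ np.array([3.0, -2.0, 1.0]) + toy_rng.normal(0.0, 0.5, size=300)

coefs = fit_bagged_linear(signatures[:200], sndr[:200], rng=1)
predicted = predict_bagged(coefs, signatures[200:])
mean_rel_error = np.mean(np.abs(predicted - sndr[200:]) / sndr[200:])
```

Averaging over resampled models reduces the variance of the prediction, and the spread between ensemble members gives a rough indication of which devices are predicted least reliably.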
Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
Undetected overfitting can occur when there are significant redundancies
between training and validation data. We describe AVE, a new measure of
training-validation redundancy for ligand-based classification problems that
accounts for the similarity amongst inactive molecules as well as active. We
investigated seven widely-used benchmarks for virtual screening and
classification, and show that the amount of AVE bias strongly correlates with
the performance of ligand-based predictive methods irrespective of the
predicted property, chemical fingerprint, similarity measure, or
previously-applied unbiasing techniques. Therefore, it may be that the
previously-reported performance of most ligand-based methods can be explained
by overfitting to benchmarks rather than good prospective accuracy.
Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and nearly 100,000 ligands and decoys in the DUD database are performed
to test, respectively, the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform the modern machine learning based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination.
Machine Learning Methodologies for Interpretable Compound Activity Predictions
Machine learning (ML) models have gained attention for mining the pharmaceutical data that are currently being generated at unprecedented rates and for potentially accelerating the discovery of new drugs. The advent of deep learning (DL) has also raised expectations in pharmaceutical research. A central task in drug discovery is the initial search for compounds with desired biological activity. ML algorithms are able to find patterns in compound structures that are related to bioactivity, the so-called structure-activity relationships (SARs). ML-based predictions can complement biological testing to prioritize further experiments. Moreover, insights into model decisions are highly desired for further validation and identification of activity-relevant substructures. However, the interpretation of complex ML models remains essentially prohibitive. This thesis focuses on ML-based predictions of compound activity against multiple biological targets. Single-target and multi-target models are generated for relevant tasks including the prediction of profiling matrices from screening data and the discrimination between weak and strong inhibitors for more than a hundred kinases. Moreover, the relative performance of distinct modeling strategies is systematically analyzed under varying training conditions, and practical guidelines are reported. Since explainable model decisions are a clear requirement for the utility of ML bioactivity models in pharmaceutical research, methods for the interpretation and intuitive visualization of activity predictions from any ML or DL model are introduced. Taken together, this dissertation presents contributions that advance the application and rationalization of ML models for biological activity and SAR predictions.
Analysis of Multitarget Activities and Assay Interference Characteristics of Pharmaceutically Relevant Compounds
The availability of large amounts of data in public repositories provides a useful source of knowledge in the field of drug discovery. Given the increasing sizes of compound databases and volumes of activity data, computational data mining can be used to study different characteristics and properties of compounds on a large scale. One of the major sources for identifying new compounds in the early phase of drug discovery is high-throughput screening, where millions of compounds are tested against many targets. The screening data provide opportunities to assess activity profiles of compounds. This thesis aims at systematically mining activity data from publicly available sources in order to study the nature of growth of bioactive compounds and to analyze multitarget activities and assay interference characteristics of pharmaceutically relevant compounds in the context of polypharmacology. In the first study, growth of bioactive compounds against five major target families is monitored over time, and the compound-scaffold-CSK (cyclic skeleton) hierarchy is applied to investigate the structural diversity of active compounds and the topological diversity of their scaffolds. The next part of the thesis is based on the analysis of screening data. Initially, extensively assayed compounds are mined from the PubChem database and the promiscuity of these compounds is assessed by taking assay frequencies into account. Next, DCM (dark chemical matter), or consistently inactive compounds that have been extensively tested, are systematically extracted and their analog relationships with bioactive compounds are determined in order to derive target hypotheses for DCM. Further, PAINS (pan-assay interference compounds) are identified in the extensively tested set of compounds using substructure filters and their assay interference characteristics are studied. Finally, the limitations of PAINS filters are addressed using machine learning models that can distinguish between promiscuous and DCM PAINS. Structural context dependence of PAINS activities is studied by assessing predictions through feature weighting and mapping.
A systematic evaluation of deep learning methods for the prediction of drug synergy in cancer
One of the main obstacles to the successful treatment of cancer is the phenomenon of drug resistance. A common strategy to overcome resistance is the use of combination therapies. However, the space of possibilities is huge and efficient search strategies are required. Machine Learning (ML) can be a useful tool for the discovery of novel, clinically relevant anti-cancer drug combinations. In particular, deep learning (DL) has become a popular choice for modeling drug combination effects. Here, we set out to examine the impact of different methodological choices on the performance of multimodal DL-based drug synergy prediction methods, including the use of different input data types, preprocessing steps and model architectures. Focusing on the NCI ALMANAC dataset, we found that feature selection based on prior biological knowledge has a positive impact on performance. Drug features appeared to be more predictive of drug response. Molecular fingerprint-based drug representations performed slightly better than learned representations, and gene expression data of cancer or drug response-specific genes also improved performance. In general, fully connected feature-encoding subnetworks outperformed other architectures, with DL outperforming other ML methods. Using a state-of-the-art interpretability method, we showed that DL models can learn to associate drug and cell line features with drug response in a biologically meaningful way. The strategies explored in this study will help to improve the development of computational methods for the rational design of effective drug combinations for cancer therapy.
Author summary: Cancer therapies often fail because tumor cells become resistant to treatment. One way to overcome resistance is by treating patients with a combination of two or more drugs. Some combinations may be more effective than would be expected from the individual drug effects, a phenomenon called drug synergy. Computational drug synergy prediction methods can help to identify new, clinically relevant drug combinations. In this study, we developed several deep learning models for drug synergy prediction. We examined the effect of using different types of deep learning architectures, and different ways of representing drugs and cancer cell lines. We explored the use of biological prior knowledge to select relevant cell line features, and also tested data-driven feature reduction methods. We tested both precomputed drug features and deep learning methods that can directly learn features from raw representations of molecules. We also evaluated whether including genomic features, in addition to gene expression data, improves the predictive performance of the models. Through these experiments, we were able to identify strategies that will help guide the development of new deep learning models for drug synergy prediction in the future.
Competing Interest Statement: The authors have declared no competing interest.
Funding: This study was supported by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UIDB/04469/2020 unit and through a PhD scholarship (SFRH/BD/130913/2017) awarded to Delora Baptista.
Ligand-Protein Binding Affinity Prediction Using Machine Learning Scoring Functions.
In recent years, artificial intelligence has appeared in widely different fields,
with promising results that in some cases have produced enormous steps forward.
In chemoinformatics, machine learning techniques in particular allow the
scientific community to build apparently accurate scoring functions for computational
docking. These scoring functions can outperform the classical scoring functions
used until now. However, comparisons between classical and
machine learning scoring functions are based on particular tests that can favour the
latter, as highlighted by some studies. In particular, machine learning scoring
functions must, by definition, be trained on data: the model is given the
features chosen to describe the complexes together with the corresponding
ligand-protein affinities. Under these conditions, the scoring power of a machine
learning scoring function can be evaluated on different datasets, and the recorded
performance can differ depending on the dataset. In particular, datasets very similar
to the one used for training can make it easier to reach high
scoring power. The objective of the present study is to verify the real efficiency and
effective performance of the newly introduced machine learning scoring functions. Our
aim is to answer the scientific community's doubts about whether machine
learning scoring functions are the revolutionary road to follow in the fields
of chemoinformatics and drug discovery. To this end, many tests are conducted,
and a definitive test protocol for exhaustively validating a new machine
learning scoring function is proposed.
Here we investigate the circumstances in which a machine learning scoring
function produces overestimated performance, and why this can happen. As a possible
solution, we propose a test protocol to be followed in order to guarantee a realistic
description of the performance of machine learning scoring functions. Finally, an
effective and innovative solution in the field of machine learning scoring functions
is proposed. It consists of per-target scoring functions: machine learning scoring
functions built from complexes of a single protein and able to predict the
affinity of complexes involving that target. The data used to build the model are
synthetic and are therefore easy to generate. The performance on the chosen
target is better than that obtained with basic scoring functions and with
machine learning scoring functions trained on databases comprising more than one
protein.
Methods for the Analysis of Matched Molecular Pairs and Chemical Space Representations
Compound optimization is a complex process where different properties are optimized to increase the biological activity and therapeutic effects of a molecule. Frequently, the structure of molecules is modified in order to improve their property values. Therefore, computational analysis of the effects of structure modifications on property values is of great importance for the drug discovery process. It is also essential to analyze chemical space, i.e., the set of all chemically feasible molecules, in order to find subsets of molecules that display favorable property values. This thesis aims to expand the computational repertoire to analyze the effect of structure alterations and visualize chemical space. Matched molecular pairs are defined as pairs of compounds that share a large common substructure and only differ by a small chemical transformation. They have been frequently used to study property changes caused by structure modifications. These analyses are expanded in this thesis by studying the effect of chemical transformations on the ionization state and ligand efficiency, both measures of great importance in drug design. Additionally, novel matched molecular pairs based on retrosynthetic rules are developed to increase their utility for prospective use of chemical transformations in compound optimization. Further, new methods based on matched molecular pairs are described to obtain preliminary SAR information of screening hit compounds and predict the potency change caused by a chemical transformation. Visualizations of chemical space are introduced to aid compound optimization efforts. First, principal component plots are used to rationalize a matched molecular pair based multi-objective compound optimization procedure. Then, star coordinate and parallel coordinate plots are introduced to analyze drug-like subspaces, where compounds with favorable property values can be found. 
Finally, a novel network-based visualization of high-dimensional property space is developed. In conclusion, the applications developed in this thesis expand the methodological spectrum of computer-aided compound optimization.