
    Advanced Statistical Methods for Atomic-Level Quantification of Multi-Component Alloys

    This thesis comprises a collection of papers whose common theme is data analysis of high-entropy alloys (HEAs). The experimental technique used to view these alloys at the nano-scale produces a dataset that, while comprising approximately 10^7 atoms, is corrupted by observational noise and sparsity. Our goal is to develop statistical methods to quantify the atomic structure of these materials. Understanding the atomic structure involves three parts: (1) determining the crystal structure of the material; (2) finding the optimal transformation onto a reference structure; and (3) finding the optimal matching between structures and the lattice constant. From identifying these elements, we may map a noisy and sparse representation of an HEA onto its reference structure and determine the probabilities of different elemental types that are immediately adjacent (first neighbors) or one level removed (second neighbors). Having these elemental descriptors of a material, researchers may then develop interaction potentials for molecular dynamics simulations and make accurate predictions about these novel metallic alloys.
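    As a rough illustration of the neighbor-probability step, the sketch below maps noisy atomic positions onto an assumed FCC first-neighbor shell and tallies element-pair fractions. The function name, tolerance, and synthetic data are illustrative assumptions, not the thesis's actual pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def neighbor_pair_fractions(positions, species, lattice_constant, tol=0.2):
    """Estimate first-neighbor element-pair fractions for an FCC-like alloy.

    positions: (N, 3) array of (noisy) atomic coordinates
    species:   length-N array of element labels
    lattice_constant: assumed FCC lattice constant
    tol: fractional tolerance around the ideal first-neighbor distance
    """
    # The ideal first-neighbor distance in FCC is a / sqrt(2).
    d1 = lattice_constant / np.sqrt(2.0)
    tree = cKDTree(positions)
    pairs = tree.query_pairs(r=d1 * (1.0 + tol), output_type="ndarray")
    # Keep only pairs close to the first-neighbor shell (reject near-overlaps).
    d = np.linalg.norm(positions[pairs[:, 0]] - positions[pairs[:, 1]], axis=1)
    pairs = pairs[d > d1 * (1.0 - tol)]

    counts = {}
    for i, j in pairs:
        key = tuple(sorted((species[i], species[j])))
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

# Example with synthetic data (stand-in for the real atom-probe measurements):
rng = np.random.default_rng(0)
pos = rng.uniform(0, 50, size=(1000, 3))
spec = rng.choice(["Fe", "Ni", "Cr"], size=1000)
print(neighbor_pair_fractions(pos, spec, lattice_constant=3.6))
```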

    Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling

    Quantitative structure-activity relationship (QSAR) modeling is a powerful tool for drug discovery, yet the lack of interpretability of commonly used QSAR models hinders their application in molecular design. We propose a similarity-based regression framework, topological regression (TR), that offers a statistically grounded, computationally fast, and interpretable technique to predict drug responses. We compare the predictive performance of TR on 530 ChEMBL human target activity datasets against the predictive performance of deep-learning-based QSAR models. Our results suggest that our sparse TR model can achieve equal, if not better, performance than the deep-learning-based QSAR models and provide better intuitive interpretation by extracting an approximate isometry between the chemical space of the drugs and their activity space.
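    The sketch below conveys the general flavor of similarity-based activity prediction: a query molecule's response is estimated from its Tanimoto-nearest training neighbors. It is a generic illustration under assumed random fingerprints and activities, not the paper's actual topological regression formulation.

```python
import numpy as np

def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between binary fingerprint vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union else 0.0)

def similarity_regression(train_fps, train_y, query_fp, k=5):
    """Predict activity as a similarity-weighted average of the k nearest
    training molecules in chemical (fingerprint) space."""
    d = np.array([tanimoto_distance(query_fp, fp) for fp in train_fps])
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-6)          # closer molecules get larger weights
    return float(np.dot(w, train_y[idx]) / w.sum())

# Toy usage with random 64-bit fingerprints and placeholder activities:
rng = np.random.default_rng(1)
fps = rng.integers(0, 2, size=(100, 64))
y = rng.normal(6.0, 1.0, size=100)     # e.g. pIC50-like values
print(similarity_regression(fps, y, fps[0]))
```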

    Predicting Proteome-Early Drug Induced Cardiac Toxicity Relationships (Pro-EDICToRs) with Node Overlapping Parameters (NOPs) of a new class of Blood Mass-Spectra graphs

    The 11th International Electronic Conference on Synthetic Organic Chemistry, session Computational Chemistry. Blood Serum Proteome-Mass Spectra (SP-MS) may allow detecting Proteome-Early Drug Induced Cardiac Toxicity Relationships (called here Pro-EDICToRs). However, due to the thousands of proteins in the SP, identifying general Pro-EDICToRs patterns instead of a single protein marker may represent a more realistic alternative. In this sense, we first introduced a novel Cartesian 2D spectrum graph for SP-MS. Next, we introduced the graph node-overlapping parameters (nopk) to numerically characterize SP-MS, using them as inputs to seek a Quantitative Proteome-Toxicity Relationship (QPTR) classifier for Pro-EDICToRs with accuracy higher than 80%. Principal Component Analysis (PCA) on the nopk values present in the QPTR model explains 82.7% of the variance with one factor (F1). Next, these nopk values were used to construct for the first time a Pro-EDICToRs complex network having nodes (samples) linked by edges (similarity between two samples). We compared the topology of two sub-networks (cardiac toxicity and control samples), finding extreme relative differences for the re-linking (P) and Zagreb (M2) indices (9.5% and 54.2%, respectively) out of 11 parameters. We also compared the sub-networks with well-known ideal random networks including the Barabasi-Albert, Kleinberg Small World, Erdos-Renyi, and Epsstein Power Law models. Finally, we proposed Partial Order (PO) schemes of the 115 samples based on LDA probabilities, F1 scores and/or network node degrees. PCA-CN and LDA-PCA based POs with Tanimoto coefficients equal to or higher than 0.75 are promising for the study of Pro-EDICToRs. These results show that simple QPTR models based on MS graph numerical parameters are an interesting tool for proteome research. The authors thank projects funded by the Xunta de Galicia (PXIB20304PR and BTF20302PR) and the Ministerio de Sanidad y Consumo (PI061457). González-Díaz H. acknowledges a tenure-track research position funded by the Program Isidro Parga Pondal, Xunta de Galicia.
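    A minimal sketch of the graph-descriptor idea follows: a spectrum is binned onto a 2D grid, occupied cells become nodes, consecutive m/z bins are connected, and a degree-based topological index (the second Zagreb index M2 mentioned above) is computed. The graph construction and the toy spectrum are illustrative stand-ins; the paper's exact Cartesian spectrum graphs and nopk definitions are not reproduced here.

```python
import numpy as np
import networkx as nx

def spectrum_graph(mz, intensity, n_bins=50):
    """Toy graph from an MS spectrum: peaks are binned on a 2D (m/z, intensity)
    grid, each occupied cell becomes a node, and nodes in consecutive m/z bins
    are connected."""
    mz_bins = np.digitize(mz, np.linspace(min(mz), max(mz), n_bins))
    it_bins = np.digitize(intensity, np.linspace(min(intensity), max(intensity), n_bins))
    G = nx.Graph()
    nodes = sorted(set(zip(mz_bins, it_bins)))
    G.add_nodes_from(nodes)
    for u in nodes:
        for v in nodes:
            if v[0] == u[0] + 1:          # consecutive m/z bins
                G.add_edge(u, v)
    return G

def second_zagreb_index(G):
    """Second Zagreb index: M2 = sum over edges (u, v) of deg(u) * deg(v)."""
    return sum(G.degree[u] * G.degree[v] for u, v in G.edges())

# Toy spectrum: random m/z and intensity values standing in for serum SP-MS data.
rng = np.random.default_rng(0)
G = spectrum_graph(rng.uniform(200, 2000, 500), rng.exponential(1.0, 500))
print(G.number_of_nodes(), G.number_of_edges(), second_zagreb_index(G))
```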

    A Tale of Two Approaches: Comparing Top-Down and Bottom-Up Strategies for Analyzing and Visualizing High-Dimensional Data

    The proliferation of high-throughput and sensory technologies in various fields has led to a considerable increase in data volume, complexity, and diversity. Traditional data storage, analysis, and visualization methods are struggling to keep pace with the growth of modern data sets, necessitating innovative approaches to overcome the challenges of managing, analyzing, and visualizing data across various disciplines. One such approach is utilizing novel storage media, such as deoxyribonucleic acid (DNA), which presents an efficient, stable, compact, and energy-saving storage option. Researchers are exploring the potential use of DNA as a storage medium for long-term storage of significant cultural and scientific materials. In addition to novel storage media, scientists are also focusing on developing new techniques that can integrate multiple data modalities and leverage machine learning algorithms to identify complex relationships and patterns in vast data sets. These newly developed data management and analysis approaches have the potential to unlock previously unknown insights into various phenomena and to facilitate more effective translation of basic research findings to practical and clinical applications. Addressing these challenges necessitates different problem-solving approaches, and researchers are developing novel tools and techniques that require different viewpoints. Top-down and bottom-up approaches are essential techniques that offer valuable perspectives for managing, analyzing, and visualizing complex high-dimensional multi-modal data sets. This cumulative dissertation explores the challenges associated with handling such data and highlights top-down, bottom-up, and integrated approaches that are being developed to manage, analyze, and visualize this data. The work is conceptualized in two parts, each reflecting the two problem-solving approaches and their uses in published studies. The proposed work showcases the importance of understanding both approaches, the steps of reasoning about the problem within them, and their concretization and application in various domains.
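    To make the DNA-storage idea concrete, the sketch below shows the simplest possible mapping of bytes to nucleotides (2 bits per base) and back. It is a minimal illustration only; practical DNA storage codecs, and whatever scheme the dissertation may use, add error correction and sequence constraints such as avoiding long homopolymer runs.

```python
def bytes_to_dna(data: bytes) -> str:
    """Encode bytes as a DNA string, 2 bits per nucleotide."""
    bases = "ACGT"
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):        # most-significant bit pair first
            out.append(bases[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq: str) -> bytes:
    """Decode a DNA string produced by bytes_to_dna back into bytes."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4]:
            byte = (byte << 2) | index[ch]
        out.append(byte)
    return bytes(out)

# Round-trip check on a small payload.
assert dna_to_bytes(bytes_to_dna(b"hello")) == b"hello"
print(bytes_to_dna(b"hello"))
```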

    Machine Learning Small Molecule Properties in Drug Discovery

    Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. Here, we provide a comprehensive overview of various ML methods introduced for this purpose in recent years. We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. We also highlight the challenges of predicting and optimizing multiple properties during the hit-to-lead and lead optimization stages of drug discovery, and briefly explore possible multi-objective optimization techniques that can be used to balance diverse properties while optimizing lead candidates. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery, are assessed. Overall, this review provides insights into the landscape of ML models for small molecule property predictions in drug discovery. So far, there are multiple diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the availability of high-quality training data remains crucial for training accurate models, and that there is a need for standardized benchmarks, additional performance metrics, and best practices to enable richer comparisons between the different techniques and models and to shed better light on the differences between them. Comment: 46 pages, 1 figure
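    As one concrete instance of the fingerprint-plus-simple-model baseline discussed above, the sketch below featurizes SMILES strings with RDKit Morgan (ECFP-like) fingerprints and fits a random forest regressor. The SMILES and logS values are placeholders, not data from the review.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def morgan_features(smiles_list, radius=2, n_bits=2048):
    """ECFP-like Morgan fingerprints stacked into a feature matrix."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)

# Placeholder training data: SMILES strings and assumed solubility (logS) values.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
logS = np.array([0.0, -1.6, -2.2, -0.5])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(morgan_features(smiles), logS)
print(model.predict(morgan_features(["CCOC(=O)C"])))   # predict for a new molecule
```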

    Artificial Intelligence in Materials Science: Applications of Machine Learning to Extraction of Physically Meaningful Information from Atomic Resolution Microscopy Imaging

    Materials science is the cornerstone of the technological development of the modern world, which has been largely shaped by advances in the fabrication of semiconductor materials and devices. However, Moore's Law is expected to stop by 2025 due to reaching the limits of traditional transistor scaling, and the classical approach has proven unable to keep up with the needs of materials manufacturing, requiring more than 20 years to move a material from discovery to market. To adapt materials fabrication to the needs of the 21st century, it is necessary to develop methods for much faster processing of experimental data and for connecting the results to theory, with feedback flow in both directions. However, state-of-the-art analysis remains selective and manual, prone to human error and unable to handle the large quantities of data generated by modern equipment. Recent advances in scanning transmission electron and scanning tunneling microscopies have allowed imaging and manipulation of materials on the atomic level, and these capabilities require the development of automated, robust, reproducible methods. Artificial intelligence and machine learning have dealt with similar issues in applications to image and speech recognition, autonomous vehicles, and other projects that are beginning to change the world around us. However, materials science faces significant challenges preventing direct application of such models without taking physical constraints and domain expertise into account. Atomic resolution imaging can generate data that can lead to a better understanding of materials and their properties through the use of artificial intelligence methods. Machine learning, in particular combinations of deep learning and probabilistic modeling, can learn to recognize physical features in imaging, making this process automated and speeding up characterization. By incorporating knowledge from theory and simulations into such frameworks, it is possible to create the foundation for automated atomic-scale manufacturing.
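    A minimal sketch of the "recognize physical features in imaging" step: a tiny fully convolutional network that maps a microscopy frame to a per-pixel probability of an atomic column being present. The architecture, layer sizes, and random input are illustrative assumptions; published workflows use deeper U-Net-style models combined with probabilistic post-processing.

```python
import torch
import torch.nn as nn

class AtomFinder(nn.Module):
    """Tiny fully convolutional network: image in, atom-probability map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))   # per-pixel probability in [0, 1]

model = AtomFinder()
image = torch.randn(1, 1, 128, 128)          # stand-in for a STEM/STM frame
atom_probability_map = model(image)          # shape (1, 1, 128, 128)
print(atom_probability_map.shape)
```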

    From Knowledgebases to Toxicity Prediction and Promiscuity Assessment

    Polypharmacology marked a paradigm shift in drug discovery from the traditional 'one drug, one target' approach to a multi-target perspective, indicating that highly effective drugs favorably modulate multiple biological targets. This ability of drugs to show activity towards many targets is referred to as promiscuity, an essential phenomenon that may also lead to undesired side effects. While activity at therapeutic targets provides the desired biological response, toxicity often results from non-specific modulation of off-targets. Safety, efficacy and pharmacokinetics have been the primary concerns behind the failure of a majority of candidate drugs. Computer-based (in silico) models that can predict pharmacological and toxicological profiles complement the ongoing efforts to lower the high attrition rates. High-confidence bioactivity data are a prerequisite for the development of robust in silico models. Additionally, data quality has been a key concern when integrating data from publicly accessible bioactivity databases. A majority of the bioactivity data originates from high-throughput screening campaigns and the medicinal chemistry literature. However, large numbers of screening hits are considered false positives for a number of reasons. In stark contrast, many compounds do not demonstrate biological activity despite being tested in hundreds of assays. This thesis work employs cheminformatics approaches to contribute to the aforementioned diverse, yet highly related, aspects that are crucial in rationalizing and expediting drug discovery. Knowledgebase resources of approved and withdrawn drugs were established and enriched with information integrated from multiple databases. These resources are not only useful in small molecule discovery and optimization, but also in the elucidation of mechanisms of action and off-target effects. In silico models were developed to predict the effects of small molecules on nuclear receptor and stress response pathways and the human Ether-à-go-go-Related Gene (hERG)-encoded potassium channel. Chemical similarity and machine-learning-based methods were evaluated while highlighting the challenges involved in the development of robust models using public domain bioactivity data. Furthermore, the true promiscuity of potentially frequent hitter compounds was identified and their mechanisms of action were explored at the molecular level by investigating target-ligand complexes. Finally, the chemical and biological spaces of the extensively tested, yet inactive, compounds were investigated to reconfirm their potential to be promising candidates.
    Polypharmacology describes a paradigm shift from 'one drug, one target' to 'one drug, many targets' and shows that highly effective drugs can unfold their complete effect only through interaction with several targets. The biological activity of a drug is directly associated with its side effects, which can be explained by interactions with therapeutic targets and off-targets (promiscuity). An imbalance of these interactions often results in a lack of efficacy, toxicity, or unfavorable pharmacokinetics, which accounts for the failure of many potential drugs in their preclinical and clinical development phases. Early prediction of the pharmacological and toxicological profile from the chemical structure using computer-based (in silico) models can help to improve the drug development process. A prerequisite for successful prediction is reliable bioactivity data. However, data quality is often a central problem in data integration. The cause is the use of different bioassays and readouts, whose data are largely obtained from primary and confirmatory bioassays. While a large proportion of hits from primary assays are classified as false positives, some substances show no biological activity even though they were extensively tested in both assay types ("extensively assayed compounds"). In this work, various cheminformatics methods were developed and applied to address the aforementioned problems, to point out possible solutions and, ultimately, to accelerate drug discovery. To this end, non-redundant, manually validated knowledge bases for approved and withdrawn drugs were created and enriched with further information in order to advance the discovery and optimization of small organic molecules. A decisive tool here is the elucidation of their mechanisms of action and off-target interactions. For the further characterization of side effects, the focus was placed on nuclear receptors, pathways involving stress receptors, and the hERG channel, which were simulated with in silico models. These models were built using an integrative approach of state-of-the-art algorithms such as similarity comparisons and machine learning. To ensure a high level of predictive quality, data quality and chemical diversity were explicitly considered when evaluating the data sets. Furthermore, the in silico models were extended by examining substructure filters more closely in order to distinguish true mechanisms of action from unspecific binding behavior (false-positive substances). Finally, the chemical and biological space of extensively tested, yet inactive, small organic molecules ("extensively assayed compounds") was investigated and compared with currently approved drugs in order to confirm their potential as promising candidates.
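    As a small illustration of the chemical-similarity modeling mentioned above, the sketch below trains a k-nearest-neighbors classifier in the Jaccard metric, which on binary fingerprints is equivalent to 1 minus the Tanimoto similarity. The fingerprints and hERG labels are random placeholders, not the thesis's curated data, and the model choice is an assumption rather than the exact method used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder binary fingerprints and hERG labels (1 = blocker, 0 = inactive);
# in practice these would come from curated public bioactivity data.
rng = np.random.default_rng(7)
X_train = rng.integers(0, 2, size=(200, 1024)).astype(bool)
y_train = rng.integers(0, 2, size=200)

# Jaccard distance on binary fingerprints = 1 - Tanimoto similarity, so this
# k-NN model is a simple chemical-similarity classifier.
clf = KNeighborsClassifier(n_neighbors=5, metric="jaccard")
clf.fit(X_train, y_train)

X_query = rng.integers(0, 2, size=(3, 1024)).astype(bool)
print(clf.predict_proba(X_query))   # class probabilities per query compound
```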

    Quantitative structure-fate relationships for multimedia environmental analysis

    Key physicochemical properties for a wide spectrum of chemical pollutants are unknown. This thesis analyses the prospect of assessing the environmental distribution of chemicals directly from supervised learning algorithms using molecular descriptors, rather than from multimedia environmental models (MEMs) using several physicochemical properties estimated from QSARs. Dimensionless compartmental mass ratios of 468 validation chemicals were compared, in logarithmic units, between: a) SimpleBox 3, a Level III MEM, propagating random property values within statistical distributions of widely recommended QSARs; and b) Support Vector Regressions (SVRs), acting as Quantitative Structure-Fate Relationships (QSFRs), linking mass ratios to molecular weight and constituent counts (atoms, bonds, functional groups and rings) for training chemicals. The best predictions were obtained for test and validation chemicals found to be within the domain of applicability of the QSFRs, evidenced by low MAE and high q2 values (in air, MAE≤0.54 and q2≥0.92; in water, MAE≤0.27 and q2≥0.92).
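    A minimal sketch of the SVR-as-QSFR idea: descriptors such as molecular weight and constituent counts are regressed against a log-scale compartmental mass ratio. The descriptor rows, target values, and hyperparameters are placeholders, not the thesis's training set or tuned model.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row: [molecular weight, atom count, bond count, ring count, functional-group count].
# Values are placeholders standing in for the training chemicals.
X_train = np.array([
    [78.1, 12, 12, 1, 0],
    [94.1, 13, 13, 1, 1],
    [112.6, 13, 13, 1, 1],
    [128.2, 18, 19, 2, 0],
])
# Target: log10 of the dimensionless mass ratio in a compartment (e.g. air).
y_train = np.array([-0.8, -1.5, -1.2, -2.0])

qsfr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
qsfr.fit(X_train, y_train)
print(qsfr.predict([[106.2, 16, 16, 1, 0]]))   # prediction for a new chemical
```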

    Scalable Probabilistic Model Selection for Network Representation Learning in Biological Network Inference

    A biological system is a complex network of heterogeneous molecular entities and their interactions contributing to various biological characteristics of the system. Although biological networks not only provide an elegant theoretical framework but also offer a mathematical foundation to analyze, understand, and learn from complex biological systems, their reconstruction remains an important and unsolved problem. Current biological networks are noisy, sparse, and incomplete, limiting the ability to create a holistic view of biological reconstructions and thus failing to provide a system-level understanding of biological phenomena. Experimental identification of missing interactions is both time-consuming and expensive. Recent advancements in high-throughput data generation and significant improvements in computational power have led to novel computational methods to predict missing interactions. However, these methods still suffer from several unresolved challenges. It is challenging to extract information about interactions and incorporate that information into the computational model. Furthermore, biological data are not only heterogeneous but also high-dimensional and sparse, presenting the difficulty of modeling from indirect measurements. The heterogeneous nature and sparsity of biological data pose significant challenges to the design of deep neural network structures, which rely on essentially empirical or heuristic model selection methods. These unscalable methods depend heavily on expertise and experimentation, which is a time-consuming and error-prone process, and are prone to overfitting. Furthermore, complex deep networks tend to be poorly calibrated, with high confidence on incorrect predictions. In this dissertation, we describe novel algorithms that address these challenges. In Part I, we design novel neural network structures to learn representations for biological entities and further expand the model to integrate heterogeneous biological data for biological interaction prediction. In Part II, we develop a novel Bayesian model selection method to infer the most plausible network structures warranted by the data. We demonstrate that our methods achieve state-of-the-art performance on tasks across various domains, including interaction prediction. Experimental studies on various interaction networks show that our method makes accurate and calibrated predictions. Our novel probabilistic model selection approach enables the network structures to dynamically evolve to accommodate incrementally available data. In conclusion, we discuss the limitations and future directions for the proposed works.
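    To illustrate the model-selection flavor of Part II, the sketch below scores candidate latent dimensionalities of a low-rank factor model of an interaction matrix with a BIC-style criterion. This is a simple stand-in under assumed Gaussian-error and SVD-factorization simplifications, not the dissertation's Bayesian model selection method.

```python
import numpy as np

def bic_for_rank(A, k):
    """BIC-style score for a rank-k latent-factor model of an interaction matrix:
    a squared-error fit term (Gaussian surrogate) plus a penalty on the number
    of latent parameters. Lower is better."""
    n, m = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
    rss = np.sum((A - A_hat) ** 2)
    n_obs = n * m
    n_params = k * (n + m)
    return n_obs * np.log(rss / n_obs + 1e-12) + n_params * np.log(n_obs)

# Toy interaction matrix with planted low-rank structure plus noise.
rng = np.random.default_rng(3)
B = (rng.random((60, 2)) @ rng.random((2, 40)) + 0.1 * rng.random((60, 40))) > 0.5
A = B.astype(float)

scores = {k: bic_for_rank(A, k) for k in (1, 2, 4, 8, 16)}
print(min(scores, key=scores.get), scores)   # selected rank and all scores
```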