447 research outputs found

    Simplifying inverse material design problems for fixed lattices with alchemical chirality

    Massive brute-force compute campaigns relying on demanding ab initio calculations routinely search for novel materials in chemical compound space, the vast virtual set of all conceivable stable combinations of elements and structural configurations which form matter. Here we demonstrate that 4-dimensional chirality, arising from anti-symmetry of alchemical perturbations, dissects that space and defines approximate ranks which effectively reduce its formal dimensionality and enable us to break down its combinatorial scaling. The resulting distinct "alchemical" enantiomers must share the exact same electronic energy up to third order, independent of their respective covalent bond topology, which imposes relevant constraints on chemical bonding. Alchemical chirality deepens our understanding of chemical compound space and enables the "on-the-fly" establishment of new trends without empiricism for any materials with fixed lattices. We demonstrate its efficacy for three such cases: i) new formulas for estimating electronic energy contributions to chemical bonding; ii) analysis of the perturbed electron density of BN-doped benzene; and iii) ranking stability estimates for BN doping in over 2,000 naphthalene and over 400 million picene derivatives.

    Simplifying inverse materials design problems for fixed lattices with alchemical chirality

    Brute-force compute campaigns relying on demanding ab initio calculations routinely search for previously unknown materials in chemical compound space (CCS), the vast set of all conceivable stable combinations of elements and structural configurations. Here, we demonstrate that four-dimensional chirality arising from antisymmetry of alchemical perturbations dissects CCS and defines approximate ranks, which reduce its formal dimensionality and break down its combinatorial scaling. The resulting "alchemical" enantiomers have the same electronic energy up to the third order, independent of respective covalent bond topology, imposing relevant constraints on chemical bonding. Alchemical chirality deepens our understanding of CCS and enables the establishment of trends without empiricism for any materials with fixed lattices. We demonstrate the efficacy for three cases: (i) new rules for electronic energy contributions to chemical bonding; (ii) analysis of the electron density of BN-doped benzene; and (iii) ranking over 2000 and 4 million BN-doped naphthalene and picene derivatives, respectively.

    Structural Pattern Recognition for Chemical-Compound Virtual Screening

    Molecules are naturally shaped as networks, making them ideal for study through their graph representations, where nodes represent atoms and edges represent chemical bonds. An alternative to this straightforward representation is the extended reduced graph, which summarizes chemical structures using pharmacophore-type node descriptions to encode the relevant molecular properties. Once we have a suitable way to represent molecules as graphs, we need to choose the right tool to compare and analyze them. Graph edit distance is used to solve error-tolerant graph matching; this methodology estimates a distance between two graphs by determining the minimum number of modifications required to transform one graph into the other. These modifications (known as edit operations) have an associated edit cost (also known as transformation cost), which must be determined depending on the problem. This study investigates the effectiveness of a purely graph-driven molecular comparison employing extended reduced graphs and graph edit distance as a tool for ligand-based virtual screening applications. Those applications estimate the bioactivity of a chemical from the bioactivity of similar compounds. An essential part of this study focuses on using machine learning and natural language processing techniques to optimize the transformation costs used in the molecular comparisons with the graph edit distance. Overall, this work presents a framework that combines graph reduction and comparison with optimization tools and natural language processing to identify bioactivity similarities in a structurally diverse group of molecules. We confirm the efficiency of this framework with several chemoinformatic tests applied to regression and classification problems over different publicly available datasets.
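The graph edit distance described in the abstract above can be illustrated with a minimal sketch. This is not the study's implementation: it uses plain atom labels rather than extended reduced graphs, unit edit costs rather than learned ones, and an exhaustive search that is only feasible for a handful of nodes. The function name, the cost parameters and the toy molecules are all assumptions made for illustration.

```python
from itertools import permutations


def graph_edit_distance(nodes1, edges1, nodes2, edges2,
                        node_sub=1.0, node_indel=1.0, edge_indel=1.0):
    """Exact graph edit distance by exhaustive search (tiny graphs only).

    nodes*: dict mapping node id -> label; edges*: iterable of (u, v) pairs
    describing undirected edges without self-loops. Costs are hypothetical.
    """
    ids1, ids2 = list(nodes1), list(nodes2)
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Pad the targets with None so any node of graph 1 may be deleted.
    targets = ids2 + [None] * len(ids1)
    best = float("inf")
    for image in permutations(targets, len(ids1)):
        mapping = dict(zip(ids1, image))
        cost = 0.0
        for u, v in mapping.items():
            if v is None:
                cost += node_indel            # node deletion
            elif nodes1[u] != nodes2[v]:
                cost += node_sub              # label substitution
        used = {v for v in image if v is not None}
        cost += node_indel * (len(ids2) - len(used))  # node insertions
        kept = set()
        for e in e1:
            a, b = tuple(e)
            img = frozenset((mapping[a], mapping[b]))
            if None not in img and img in e2:
                kept.add(img)                 # edge preserved by the mapping
            else:
                cost += edge_indel            # edge deletion
        cost += edge_indel * (len(e2) - len(kept))    # edge insertions
        best = min(best, cost)
    return best


# Toy heavy-atom graphs: C-C-O versus C-C-N, and C-C.
chain = [(0, 1), (1, 2)]
cco = {0: "C", 1: "C", 2: "O"}
ccn = {0: "C", 1: "C", 2: "N"}
cc = {0: "C", 1: "C"}
print(graph_edit_distance(cco, chain, ccn, chain))
print(graph_edit_distance(cc, [(0, 1)], cco, chain))
```

Changing `node_sub`, `node_indel` and `edge_indel` changes which molecules count as close, which is why the abstract treats the choice of transformation costs as something to be optimized for the problem at hand.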

    SELFIES and the future of molecular string representations

    Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science. Comment: 34 pages, 15 figures; comments and suggestions for additional references are welcome.
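The remark that most symbol combinations are invalid in SMILES can be made concrete with a small, hedged sketch. The checker below is an assumption for illustration only: it tests two necessary syntactic conditions (balanced branch parentheses and paired single-digit ring closures) and says nothing about valence or chemical sense, so a string that passes may still be an invalid SMILES. The reduced alphabet and the function names are hypothetical.

```python
import random


def passes_basic_syntax(s):
    """Necessary (not sufficient) SMILES checks: branches and ring digits pair up."""
    depth = 0
    open_rings = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}     # toggle: first digit opens, second closes
    return depth == 0 and not open_rings


def fraction_passing(alphabet="CNO()=12", length=8, n=10_000, seed=0):
    """Share of random strings over a toy alphabet that pass the basic checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        hits += passes_basic_syntax(candidate)
    return hits / n


print(passes_basic_syntax("C1CCCCC1"))  # cyclohexane: ring digit is paired
print(passes_basic_syntax("C1CC"))      # ring opened but never closed
print(fraction_passing())
```

A generative model that emits symbols one at a time has to learn these pairing rules before it can produce any usable SMILES; a representation in which every symbol sequence decodes to a molecule removes that burden, which is the robustness property the abstract attributes to SELFIES.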


    Development of Computer-Aided Molecular Design Methods for Bioengineering Applications

    Computer-aided molecular design (CAMD) offers a methodology for rational product design. The CAMD procedure consists of pre-design, design and post-design phases. CAMD was used to address two bioengineering problems: design of excipients for lyophilized protein formulations and design of ionic liquids for use in bioseparations. Protein stability remains a major concern during protein drug development. Lyophilization, or freeze-drying, is often sought to improve chemical stability. However, lyophilization can result in protein aggregation. Excipients, or additives, are included to stabilize proteins in lyophilized formulations. CAMD was used to rationally select or design excipients for lyophilized protein formulations. The use of solvents to aid separation is common in chemical processes. Ionic liquids offer a class of molecules with tunable properties that can be altered to find optimal solvents for a given application. CAMD was used to design ionic liquids for extractive distillation and in situ extractive fermentation processes. The pre-design phase involves experimental data gathering and problem formulation. When available, data was obtained from literature sources. For excipient design, data of percent protein monomer remaining post-lyophilization was measured for a variety of protein-excipient combinations. In problem formulation, the objective was to minimize the difference between the properties of the designed molecule and the target property values. Problem formulations resulted in either mixed-integer linear programs (MILPs) or mixed-integer non-linear programs (MINLPs). The design phase consists of the forward problem and the reverse problem. In the forward problem, linear quantitative structure-property relationships (QSPRs) were developed using connectivity indices. Chiral connectivity indices were used for excipient property models to improve fit and incorporate three-dimensional structural information. 
Descriptor selection methods were employed to find models that minimized Mallows' Cp statistic, obtaining models with good fit while avoiding overfitting. Cross-validation was performed to assess predictive capabilities. Group contribution models and non-linear QSPRs were also developed. A UNIFAC model was developed to predict the thermodynamic properties of ionic liquids. In the reverse problem of the design phase, molecules were proposed with optimal property values. Deterministic methods were used to design ionic liquid entrainers for azeotropic distillation. Tabu search, a stochastic optimization method, was applied to both ionic liquid and excipient design to provide novel molecular candidates. Tabu search was also compared to a genetic algorithm for CAMD applications. Tuning was performed using a test case to determine parameter values for both methods. After tuning, both stochastic methods were used with design cases to provide optimal excipient stabilizers for lyophilized protein formulations. Results suggested that the genetic algorithm provided a faster time to solution while the tabu search provided quality solutions more consistently. The post-design phase provides solution analysis and verification. Process simulation was used to evaluate the energy requirements of azeotropic separations using designed ionic liquids. Results demonstrated that less energy was required than in processes using conventional entrainers or ionic liquids that were not optimally designed. Molecular simulation was used to guide protein formulation design and may prove to be a useful tool in post-design verification. Finally, prediction intervals were used for properties predicted from linear QSPRs to quantify the prediction error in the CAMD solutions. Overlapping prediction intervals indicate solutions with statistically similar property values. 
Prediction interval analysis showed that tabu search returns many results with statistically similar property values in the design of carbohydrate glass formers for lyophilized protein formulations. The best solutions from tabu search and the genetic algorithm were shown to be statistically similar for all design cases considered. Overall, the CAMD method developed here provides a comprehensive framework for the design of novel molecules for bioengineering approaches.
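The reverse problem described above, minimizing the difference between a predicted property and its target, can be sketched with a small tabu search. Everything here is an assumption for illustration: the three group contributions, the linear property model, the bounds and the tabu-list length are hypothetical, and this variant scans the whole neighbourhood deterministically rather than sampling it stochastically as in the thesis.

```python
from collections import deque

# Hypothetical group contributions for a linear structure-property model.
CONTRIB = (1.5, 2.0, 3.5)


def linear_property(counts):
    """Predicted property as a weighted sum of group counts (toy model)."""
    return sum(c * n for c, n in zip(CONTRIB, counts))


def tabu_search(start, property_model, target,
                n_iter=50, tabu_size=5, max_count=6):
    """Minimise |property_model(x) - target| over integer group-count vectors."""
    def objective(x):
        return abs(property_model(x) - target)

    current = tuple(start)
    best, best_val = current, objective(current)
    tabu = deque([current], maxlen=tabu_size)   # recently visited solutions
    for _ in range(n_iter):
        # Neighbourhood: change a single group count by +/-1 within bounds.
        neighbours = []
        for i in range(len(current)):
            for step in (-1, 1):
                c = current[i] + step
                if 0 <= c <= max_count:
                    cand = current[:i] + (c,) + current[i + 1:]
                    if cand not in tabu:
                        neighbours.append(cand)
        if not neighbours:
            break
        # Accept the best admissible neighbour even when it is worse than
        # the current point; this lets the search leave local minima.
        current = min(neighbours, key=objective)
        tabu.append(current)
        val = objective(current)
        if val < best_val:
            best, best_val = current, val
    return best, best_val


best, gap = tabu_search((0, 0, 0), linear_property, target=10.0)
print(best, linear_property(best), gap)
```

The tabu list is what separates this from plain hill climbing: by forbidding recently visited candidates, the search is forced through worse intermediate structures instead of cycling back to the same near-optimal point.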

    SELFIES and the future of molecular string representations

    Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencIng Embedded Strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
