    Machine learning dihydrogen activation in the chemical space surrounding Vaska’s complex

    Homogeneous catalysis with transition metal complexes is ubiquitous in organic synthesis and technologically relevant in applications such as water splitting and CO2 reduction. The key steps underlying homogeneous catalysis require a specific combination of electronic and steric effects from the ligands bound to the metal center. Finding the optimal combination of ligands is challenging because of the exceedingly large number of possibilities and the non-trivial ligand–ligand interactions. The classic example of Vaska's complex, trans-[Ir(PPh3)2(CO)(Cl)], illustrates this scenario. The ligands of this species activate iridium for the oxidative addition of hydrogen, yielding the dihydride complex cis-[Ir(H)2(PPh3)2(CO)(Cl)]. Despite the simplicity of this system, thousands of derivatives can be formulated for the activation of H2 from a limited number of ligands belonging to the same general categories found in the original complex. In this work, we show how DFT and machine learning (ML) methods can be combined to enable the prediction of reactivity within large chemical spaces containing thousands of complexes. In a space of 2574 species derived from Vaska's complex, data from DFT calculations are used to train and test ML models that predict the H2-activation barrier. In contrast to experiments and calculations that require several days to complete, the ML models were trained and used on a laptop on a time-scale of minutes. As a first approach, we combined Bayesian-optimized artificial neural networks (ANNs) with features derived from autocorrelation and deltametric functions. The resulting ANNs achieved high accuracies, with mean absolute errors (MAE) between 1 and 2 kcal mol−1, depending on the size of the training set. Accuracy was further enhanced by using a Gaussian process (GP) model trained with a set of selected features, including fingerprints. Remarkably, this GP model reduced the MAE below 1 kcal mol−1 while using only 20% or less of the available data for training. The gradient boosting (GB) method was also used to assess the relevance of the features, which served both feature selection and model interpretation. Features accounting for chemical composition, atom size and electronegativity were found to be the most determinant in the predictions. Further, the ligand fragments with the strongest influence on the H2-activation barrier were identified.
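
The autocorrelation and deltametric features named in this abstract can be illustrated with a minimal sketch. It assumes a simplified, metal-centered form: at each bond distance d from the metal, the metal's property is multiplied by (autocorrelation) or differenced from (deltametric) the property of every atom at that distance, and the terms are summed. The abstract does not give the exact definition the authors used, so this is an assumption. The molecular graph is a hypothetical skeleton of Vaska's complex in which each PPh3 is reduced to its P atom, and all function names are illustrative.

```python
from collections import deque

# Pauling electronegativities (standard tabulated values).
ELECTRONEGATIVITY = {"Ir": 2.20, "P": 2.19, "C": 2.55, "O": 3.44, "Cl": 3.16}


def graph_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def metal_centered_features(symbols, adjacency, metal, max_depth):
    """Return (autocorrelation, deltametric) lists indexed by bond distance."""
    prop = [ELECTRONEGATIVITY[s] for s in symbols]
    dist = graph_distances(adjacency, metal)
    auto = [0.0] * (max_depth + 1)
    delta = [0.0] * (max_depth + 1)
    for atom, d in dist.items():
        if d <= max_depth:
            auto[d] += prop[metal] * prop[atom]   # product term
            delta[d] += prop[metal] - prop[atom]  # difference term
    return auto, delta


# Hypothetical skeleton: Ir bonded to P, P, C and Cl; C bonded to O.
symbols = ["Ir", "P", "P", "C", "O", "Cl"]
adjacency = {0: [1, 2, 3, 5], 1: [0], 2: [0], 3: [0, 4], 4: [3], 5: [0]}

auto, delta = metal_centered_features(symbols, adjacency, metal=0, max_depth=2)
for d in range(3):
    print(f"d={d}: autocorrelation={auto[d]:.3f}, deltametric={delta[d]:.3f}")
```

Each complex in a chemical space would yield one such fixed-length vector per atomic property, which is what makes graph-based features convenient inputs for ANN or GP regressors.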

    Development of predictive models for catalyst development

    This work was done as part of the BioSPRINT project, which aims to improve biorefinery operations through process intensification and to replace fossil-based polymers with new bio-based products. The goal was to identify machine learning (ML) models that accelerate catalyst identification with high-throughput (HTP) screening methods, identify non-obvious formulations and allow catalyst tuning for different feedstock compositions. The desired outcome is maximum activity for the conversion of complex sugar mixtures with optimal selectivity towards the key products of interest. The literature part of the thesis covers ML in general, with a focus on variable selection methods and data-driven modeling techniques. It also discusses modeling in catalysis, with a focus on ML: catalyst screening and selection, descriptor modeling and selection, and predictive modeling. The experimental part focuses on developing ML models that predict catalyst performance from relevant descriptors. A dataset for the hydrogenation of 5-ethoxymethylfurfural with simple bimetallic catalysts, comprising main metals and promoters, was used as the ML model input, supplemented with catalyst descriptors found in the literature. Four responses were used in the experiments: selectivity and conversion, each with two different solvents. The methods used in the experimental part are discussed in detail, covering data collection, preprocessing, variable selection, modeling and model validation. Reference models without variable selection were identified first. Secondly, regularization algorithms were used to identify models. Finally, models were identified with the variable subsets obtained from the regularization algorithms. The effect of cross-validation was also studied. In general, good modeling results were obtained with boosted ensemble tree methods, support vector machine (SVM) methods and Gaussian process regression (GPR) methods. Lasso regression turned out to be the best variable selection method. Good results were obtained with the descriptors found in the literature. It was also shown that fairly good results can be obtained with only two variables in the studied case. The variable selection algorithms ranked promoter variables as far less important than main-metal variables. Even though the modeling results were good, the variable selection methods were almost purely data-driven, and the actual relevance of the variables cannot be guaranteed. In future work, optimization should be studied with the goal of finding catalysts that maximize the performance values predicted by the models. The extrapolation capabilities of the models also need to be studied and improved. The studied methods can easily be applied to other datasets. In the BioSPRINT project, experimental results on the dehydration of C5 and C6 sugars with simple metal catalysts will be obtained and used with the studied methods.
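
Lasso-based variable selection, as named in this abstract, can be sketched in a few lines. The example below is a minimal coordinate-descent Lasso written from the standard objective (1/(2n))·||y − Xw||² + α·||w||₁; it is not the thesis implementation, and the data are synthetic toy values, not the 5-ethoxymethylfurfural dataset. The descriptor names are hypothetical placeholders.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            shift = new_w - w[j]
            if shift != 0.0:
                residual = [r - shift * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


# Synthetic, already-centered toy descriptors (hypothetical names).
names = ["descriptor_a", "descriptor_b", "descriptor_c"]
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Response depends strongly on descriptor_a, weakly on descriptor_b.
y = [3.0 * a + 0.1 * b for a, b, _ in X]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [name for name, coef in zip(names, weights) if coef != 0.0]
print(weights, selected)
```

The variables with non-zero coefficients form the subset that would then be passed to a second model (for example a tree ensemble, SVM or GPR), mirroring the two-stage procedure the abstract describes. The penalty strength `alpha` is an assumption here; in practice it would be tuned, for instance by cross-validation.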

    Designing algorithms to aid discovery by chemical robots

    Automated robotic systems have recently become very efficient thanks to improved coupling between sensor systems and algorithms; the latter have gained significance with the increase in computing power over the past few decades. However, intelligent automated chemistry platforms for discovery-oriented tasks need to be able to cope with the unknown, which is a profoundly hard problem. In this Outlook, we describe how recent advances in the design and application of algorithms, coupled with the increased amount of chemical data available and with automation and control systems, may allow more productive chemical research and the development of chemical robots able to target discovery. This is shown through examples of workflow and data processing with automation and control, and through the use of both well-established and cutting-edge algorithms, illustrated using recent studies in chemistry. Finally, several algorithms are presented in relation to chemical robots and chemical intelligence for knowledge discovery.

    Biomass Gasification and Applied Intelligent Retrieval in Modeling

    Gasification technology often requires modeling approaches that account for the many intermediate reactions of a complex process. Traditional models are sometimes impractical, and they often struggle to establish reliable relations between performance parameters. This study therefore outlines solutions to these modeling challenges. Machine learning (ML) methods are a promising way to add intelligent retrieval to the traditional modeling approaches of gasification technology. Accordingly, this study charts applied ML-based artificial intelligence in the field of gasification research. It includes a summary of applied ML algorithms, including neural networks, support vector machines, decision trees, random forests, and gradient boosting, and their performance evaluations for gasification technologies.
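
Performance evaluations of regression models such as those listed in this abstract are commonly reported as mean absolute error (MAE), root-mean-square error (RMSE) and the coefficient of determination (R²). The abstract does not state which metrics the study used, so the following is only a minimal sketch of these three standard definitions; the numbers are made up for illustration and do not come from any gasification dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired observations and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up observed and predicted values of some process output.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

Reporting the same metrics on a held-out test set for every algorithm is what makes comparisons between, say, a random forest and a neural network meaningful.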

    Automated in Silico Design of Homogeneous Catalysts

    Catalyst discovery is increasingly relying on computational chemistry, and many of the computational tools are currently being automated. The state of this automation and the degree to which it may contribute to speeding up development of catalysts are the subject of this Perspective. We also consider the main challenges associated with automated catalyst design, in particular the generation of promising and chemically realistic candidates, the tradeoff between accuracy and cost in estimating the catalytic performance, the opportunities associated with automated generation and use of large amounts of data, and even how to define the objectives of catalyst design. Throughout the Perspective, we take a cross-disciplinary approach and evaluate the potential of methods and experiences from fields other than homogeneous catalysis. Finally, we provide an overview of software packages available for automated in silico design of homogeneous catalysts.