
    Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection

    The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric of the similarity between the training set molecules and a test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distance to model based on the standard deviation of the predicted toxicity calculated from an ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces, not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org.
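As a rough illustration of the best-performing distance described in this abstract, the sketch below computes the per-compound standard deviation of predictions across an ensemble and splits compounds by a threshold. It is a minimal sketch, not the authors' implementation: the function names, the input layout (one list of predictions per ensemble member) and the threshold are assumptions made for illustration, and the numbers are made up.

```python
from statistics import stdev


def ensemble_std_distance(predictions_by_model):
    """Return one distance-to-model value per compound.

    predictions_by_model: list of lists, one inner list per ensemble
    member, all in the same compound order (hypothetical layout).
    The distance is the sample standard deviation of the predictions
    that the ensemble members give for the same compound.
    """
    per_compound = zip(*predictions_by_model)
    return [stdev(values) for values in per_compound]


def split_by_distance(distances, threshold):
    """Split compound indices into 'inside' (small spread) and 'outside'.

    The threshold is a user-chosen assumption; the abstract does not
    specify one.
    """
    inside = [i for i, d in enumerate(distances) if d <= threshold]
    outside = [i for i, d in enumerate(distances) if d > threshold]
    return inside, outside


# Made-up predictions from three ensemble members for two compounds.
example_predictions = [
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 3.0],
]
example_distances = ensemble_std_distance(example_predictions)
example_inside, example_outside = split_by_distance(example_distances, threshold=0.5)
```

Compounds whose ensemble members disagree strongly land in the "outside" group, where larger prediction errors would be expected under the abstract's finding.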

    Applications of artificial neural networks (ANNs) in several different materials research fields

    PhD. In materials science, the traditional methodological framework is the identification of the composition-processing-structure-property causal pathways that link hierarchical structure to properties. However, all the properties of materials can ultimately be derived from structure and bonding, and so the properties of a material are interrelated to varying degrees. The work presented in this thesis employed artificial neural networks (ANNs) to explore the correlations between different material properties, with several examples in different fields. These include: 1) verifying and quantifying known correlations between physical parameters and solid solubility of alloy systems, which were first discovered by Hume-Rothery in the 1930s; 2) exploring unknown cross-property correlations without investigating complicated structure-property relationships, exemplified by i) predicting the structural stability of perovskites from bond-valence based tolerance factors tBV, and predicting the formability of perovskites from A-O and B-O bond distances, and ii) correlating polarizability with other properties, such as first ionization potential, melting point, heat of vaporization and specific heat capacity; 3) highlighting unusual data points in handbooks, tables and databases that deserve to have their veracity inspected, a use for ANNs that emerged in the process of discovering unanticipated relationships between combinations of material properties. By applying this method, numerous errors in handbooks were found, and a systematic, intelligent and potentially automatic method for detecting errors in handbooks was thus developed. By presenting these four distinct examples from three aspects of ANN capability, different ways in which ANNs can contribute to progress in materials science have been explored. These approaches are novel and deserve to be pursued as part of the newer methodologies that are beginning to underpin materials research.

    Prediction of the physical properties of pure chemical compounds through different computational methods.

    Ph. D. University of KwaZulu-Natal, Durban 2014. Liquid thermal conductivities, viscosities, thermal decomposition temperatures, electrical conductivities, normal boiling point temperatures, sublimation and vaporization enthalpies, saturated liquid speeds of sound, standard molar chemical exergies, refractive indices, and freezing point temperatures of pure organic compounds and ionic liquids are important thermophysical properties needed for the design and optimization of products and chemical processes. Since sufficient purification of pure compounds and experimental measurement of their thermophysical properties are costly and time consuming, predictive models are of great importance in engineering. The liquid thermal conductivity of pure organic compounds was the first property investigated in this study, for which a general model, a quantitative structure property relationship, and a group contribution method were developed. The novel gene expression programming mathematical strategy [1, 2], first introduced by our group for the development of non-linear models for thermophysical properties, was successfully implemented to develop an explicit model for determining the thermal conductivity of approximately 1600 liquids at different temperatures but at atmospheric pressure. The statistical parameters of the obtained correlation show an absolute average relative deviation of about 9% from the corresponding DIPPR 801 data [3]. It should be mentioned that the gene expression programming technique is a complicated mathematical algorithm that needs significant computing power, and that this is the largest database of a thermophysical property that has been successfully managed by this strategy. The quantitative structure property relationship was developed using the sequential search algorithm and the same database used in the previous step. 
The model shows an average absolute relative deviation (AARD%), standard deviation error, and root mean square error of 7.4%, 0.01, and 0.01, respectively, over the training, validation and test sets. The database used in the previous sections was used to develop a group contribution model for liquid thermal conductivity. The statistical analysis of the performance of the obtained model shows an absolute average relative deviation of approximately 7.1% from the corresponding DIPPR 801 [4] data. In the next stage, an extensive database of viscosities of 443 ionic liquids was initially compiled from the literature (more than 200 articles). Then, it was employed to develop a group contribution model. Using this model, a training set composed of 1336 experimental data points was correlated with a low AARD% of about 6.3. A test set consisting of 336 data points was used to validate this model; it shows an AARD% of 6.8. In the next part of this study, an extensive database of thermal decomposition temperatures of 586 ionic liquids was compiled from the literature. Then, it was used to develop a quantitative structure property relationship. The proposed quantitative structure property relationship produces an acceptable average absolute relative deviation (AARD) of less than 5.2 %, taking into consideration all 586 experimental data values. The updated database of thermal decomposition temperatures, including 613 ionic liquids, was subsequently used to develop a group contribution model. Using this model, a training set comprising 489 data points was correlated with a low AARD of 4.5 %. A test set consisting of 124 data points was employed to test its capability. The model shows an AARD of 4.3 % for the test set. Electrical conductivity of ionic liquids was the next property investigated in this study. Initially, a database of electrical conductivities of 54 ionic liquids was collected from the literature. 
Then, it was used to develop two models: a quantitative structure property relationship and a group contribution model. Since the electrical conductivities of ionic liquids have a complicated dependency on temperature and chemical structure, the least squares support vector machines strategy was used as a non-linear regression tool to correlate the electrical conductivity of ionic liquids. The deviation of the quantitative structure property relationship from the 783 experimental data points used in its development (training set) is 1.8%. The validity of the model was then evaluated using another experimental data set comprising 97 experimental data points (deviation: 2.5%). Finally, the reproducibility and reliability of the model were successfully assessed using the last experimental data set of 97 experimental data points (deviation: 2.7%). Using the group contribution model, a training set composed of 863 experimental data points was correlated with a low AARD of about 3.1% from the corresponding experimental data. Then, the model was validated using a data set composed of 107 experimental data points with a low AARD of 3.6%. Finally, a test set consisting of 107 data points was used for its validation. It shows an AARD of 4.9% for the test set. In the next stage, the most comprehensive database of normal boiling point temperatures of approximately 18000 pure organic compounds was provided and used to develop a quantitative structure property relationship. In order to develop the model, the sequential search algorithm was initially used to select the best subset of molecular descriptors. In the next step, a three-layer feed forward artificial neural network was used as a regression tool to develop the final model. It seems that this is the first time that the quantitative structure property relationship technique has successfully been used to handle a database as large as the one used for normal boiling point temperatures of pure organic compounds. 
Generally, handling large databases of compounds has always been a challenge in the quantitative structure property relationship world owing to the large number of chemical structures to be handled (particularly the optimization of the chemical structures), the high demand for computational power and the very high percentage of failures of the software packages. As a result, this study is regarded as a significant step forward in the quantitative structure property relationship world. A comprehensive database of sublimation enthalpies of 1269 pure organic compounds at 298.15 K was successfully compiled from the literature and used to develop an accurate group contribution model. The model is capable of predicting the sublimation enthalpies of organic compounds at 298.15 K with an acceptable average absolute relative deviation between predicted and experimental values of 6.4%. Vaporization enthalpies of organic compounds at 298.15 K were also studied. An extensive database of 2530 pure organic compounds was used to develop a comprehensive group contribution model. It demonstrates an acceptable AARD of 3.7% from experimental data. The speed of sound in the saturated liquid phase was the next property investigated in this study. Initially, a collection of 1667 experimental data points for 74 pure chemical compounds was extracted from the ThermoData Engine of the National Institute of Standards and Technology [5]. Then, a least squares support vector machines-group contribution model was developed. The model shows a low AARD of 0.5% from the corresponding experimental data. In the next part of this study, a simple group contribution model was presented for the prediction of the standard molar chemical exergy of pure organic compounds. It is capable of predicting the standard chemical exergy of pure organic compounds with an acceptable average absolute relative deviation of 1.6% from the literature data of 133 organic compounds. 
The largest ever reported databank for refractive indices of approximately 12 000 pure organic compounds was initially provided. A novel computational scheme based on coupling the sequential search strategy with the genetic function approximation (GFA) strategy was used to develop a model for refractive indices of pure organic compounds. It was determined that the strategy offers both the capability of handling large databases (the advantage of the sequential search algorithm over other subset variable selection methods) and that of choosing the most accurate subset of variables (the advantage of genetic algorithm-based subset variable selection methods such as GFA). The model shows a promising average absolute relative deviation of 0.9 % from the corresponding literature values. Subsequently, a group contribution model was developed based on the same database. The model shows an average absolute relative deviation of 0.83% from the corresponding literature values. Freezing point temperature of organic compounds was the last property investigated. Initially, the largest ever reported databank in the open literature for freezing points of more than 16 500 pure organic compounds was provided. Then, the sequential search algorithm was successfully applied to derive a model. The model shows an average absolute relative deviation of 12.6% from the corresponding literature values. The same database was used to develop a group contribution model. The model demonstrated an average absolute relative deviation of 10.76%, which is of adequate accuracy for many practical applications.
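The abstract above reports nearly all results as an average absolute relative deviation (AARD) without stating the formula. The sketch below assumes the conventional definition, AARD% = (100/N) Σ |predicted − experimental| / |experimental|; the function name and the sample values are hypothetical and are not taken from the thesis.

```python
def aard_percent(predicted, experimental):
    """Average absolute relative deviation, in percent.

    Assumes the conventional definition:
        AARD% = (100 / N) * sum(|pred_i - exp_i| / |exp_i|)
    Experimental values must be non-zero.
    """
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have equal length")
    if not experimental:
        raise ValueError("at least one data point is required")
    total = 0.0
    for pred, exp in zip(predicted, experimental):
        if exp == 0:
            raise ValueError("experimental value of zero makes AARD undefined")
        total += abs(pred - exp) / abs(exp)
    return 100.0 * total / len(experimental)


# Made-up values for illustration only (not data from the thesis).
example_aard = aard_percent([110.0, 90.0], [100.0, 100.0])
```

Because each deviation is divided by the experimental value, AARD is scale-free, which is what lets the abstract compare properties with very different units and magnitudes.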

    Quantitative Structure-Property Relationship Modeling & Computer-Aided Molecular Design: Improvements & Applications

    The objective of this work was to develop an integrated capability to design molecules with desired properties. An automated robust genetic algorithm (GA) module has been developed to facilitate the rapid design of new molecules. The generated molecules were scored for the relevant thermophysical properties using non-linear quantitative structure-property relationship (QSPR) models. The descriptor reduction and model development for the QSPR models were implemented using evolutionary algorithms (EA) and artificial neural networks (ANNs). QSPR models for octanol-water partition coefficients (Kow), melting points (MP), normal boiling points (NBP), Gibbs energy of formation, universal quasi-chemical (UNIQUAC) model parameters, and infinite-dilution activity coefficients of cyclohexane and benzene in various organic solvents were developed in this work. To validate the current design methodology, new chemical penetration enhancers (CPEs) for transdermal insulin delivery and new solvents for extractive distillation of the cyclohexane + benzene system were designed. In general, the use of non-linear QSPR models developed in this work provided predictions better than or as good as existing literature models. In particular, the current models for NBP, Gibbs energy of formation, UNIQUAC model parameters, and infinite-dilution activity coefficients have lower errors on external test sets than the literature models. The current models for MP and Kow are comparable with the best models in the literature. The GA-based design framework implemented in this work successfully identified new CPEs for transdermal delivery of insulin, with permeability values comparable to the best CPEs in the literature. Also, new solvents for extractive distillation of cyclohexane/benzene with selectivities two to four times that of the existing solvents were identified. 
These two case studies validate the ability of the current design framework to identify new molecules with desired target properties.

    Bioinspired Materials Design: A Text Mining Approach to Determining Design Principles of Biological Materials

    Biological materials are often more efficient and tend to have a wider range and combination of properties than present-day engineered materials. Despite the limited set of components, biological materials are able to achieve great diversity in their material properties by the arrangements of the material components, which form unique structures. The structure-property relationships are known as structural design principles. With the utilization of these design principles, materials designers can develop bioinspired engineered materials with similarly improved effectiveness. While considerable research has been conducted on biological materials, identifying beneficial structural design principles can be time-intensive. To aid materials designers, the research in this dissertation focuses on the development of a text mining algorithm that can quickly identify potential structural design principles of biological materials with respect to a chosen material property or combination of properties. The development of the text mining tool involves four separate stages. The first stage centers on the creation of a basic information retrieval algorithm to extract passages describing property-specific structural design principles from a corpus of materials journal articles. Although the Stage 1 tool identifies over 90% of the principles (recall), only 32% of the returned passages are relevant (precision). The second stage investigates text classification techniques to refine the program in order to improve precision. The classic techniques of machine learning classifiers, statistical features, and part-of-speech analyses, are evaluated for effectiveness in sorting passages into relevant and irrelevant classes. In the third stage, manual identification of patterns in the returned passages is employed to create a rule-based method. The resulting Stage 3 algorithm’s precision values increase to 45%. 
In the final stage of algorithm development, the manual rule-based classification method is revisited to identify stricter rules that further emphasize precision. The Stage 4 algorithm successfully improves overall precision to 65% and reduces the number of returned passages by 74%, which allows a materials designer to identify useful principles more quickly. Finally, the research concludes with a validation that the text mining tool effectively identifies structural design principles and that the principles can be used in the development of bioinspired materials.
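The stages in this abstract are compared by precision (the share of returned passages that are relevant) and recall (the share of relevant passages that are returned). The sketch below shows these two standard measures for a retrieval run; the function name and passage identifiers are hypothetical, and the example sets are made up and do not reproduce the dissertation's figures.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one retrieval run.

    returned: iterable of passage ids the tool returned.
    relevant: iterable of passage ids judged relevant by a human.
    An empty 'returned' or 'relevant' set yields 0.0 for the
    corresponding measure (a convention assumed here).
    """
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up passage ids for illustration only.
example_returned = {"p1", "p2", "p3", "p4"}
example_relevant = {"p1", "p2", "p5"}
example_precision, example_recall = precision_recall(example_returned, example_relevant)
```

Stricter filtering rules shrink the returned set, which tends to raise precision at some cost in recall; this is the trade-off the later stages pursue.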