172 research outputs found

    Evolutionary Computation and QSAR Research

    The successful high-throughput screening of molecule libraries for a specific biological property is one of the main advances in drug discovery. Virtual molecular filtering and screening rely greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with predicted toxic effects or poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning and artificial intelligence. QSAR modeling relies on three main steps: codification of the molecular structure into molecular descriptors, selection of the variables relevant to the analyzed activity, and search for the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. It explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, selection methods for high-dimensional data in QSAR, methods to build QSAR models, current evolutionary feature selection methods and applications in QSAR, and future trends in joint or multi-task feature selection methods. Funding: Instituto de Salud Carlos III (PIO52048, RD07/0067/0005); Ministerio de Industria, Comercio y Turismo (TSI-020110-2009-53); Galicia, Consellería de Economía e Industria (10SIN105004P).
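    As an illustration of the review's central theme, the sketch below shows one common way evolutionary feature selection is wired together in QSAR work: candidate descriptor subsets are encoded as bit masks, and a genetic algorithm evolves them using the cross-validated fit of a simple regression model as the fitness. The data, population settings, and model choice are hypothetical and not taken from the review.

```python
# Minimal sketch (not from the review) of GA-based descriptor selection for QSAR:
# individuals are bit masks over the descriptor columns, and fitness is the
# cross-validated R^2 of a linear model fitted on the selected descriptors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 200 molecules x 50 descriptors, plus an activity vector.
X = rng.normal(size=(200, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=200)

def fitness(mask):
    """Cross-validated R^2 of a linear QSAR model on the selected descriptors."""
    if mask.sum() == 0:
        return -np.inf
    model = LinearRegression()
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=5, scoring="r2").mean()

def evolve(pop_size=30, n_gen=20, p_mut=0.02):
    pop = rng.integers(0, 2, size=(pop_size, X.shape[1]))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # Tournament selection: keep the better of two randomly drawn individuals.
        parents = pop[[max(rng.integers(0, pop_size, 2), key=lambda i: scores[i])
                       for _ in range(pop_size)]]
        # One-point crossover between consecutive parents (same cut per generation).
        cut = rng.integers(1, X.shape[1])
        children = np.vstack([np.concatenate([parents[i, :cut],
                                              parents[(i + 1) % pop_size, cut:]])
                              for i in range(pop_size)])
        # Bit-flip mutation.
        flips = rng.random(children.shape) < p_mut
        pop = np.where(flips, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], scores.max()

best_mask, best_r2 = evolve()
print("selected descriptors:", np.flatnonzero(best_mask), "cv R2:", round(best_r2, 3))
```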

    Prediction of the physical properties of pure chemical compounds through different computational methods.

    Ph.D. University of KwaZulu-Natal, Durban, 2014. Liquid thermal conductivities, viscosities, thermal decomposition temperatures, electrical conductivities, normal boiling point temperatures, sublimation and vaporization enthalpies, saturated liquid speeds of sound, standard molar chemical exergies, refractive indices, and freezing point temperatures of pure organic compounds and ionic liquids are important thermophysical properties needed for the design and optimization of products and chemical processes. Since sufficient purification of pure compounds and experimental measurement of their thermophysical properties are costly and time consuming, predictive models are of great importance in engineering.

    The liquid thermal conductivity of pure organic compounds was the first property investigated in this study, for which a general model, a quantitative structure property relationship, and a group contribution method were developed. The novel gene expression programming mathematical strategy [1, 2], first introduced by our group for the development of non-linear models of thermophysical properties, was successfully implemented to develop an explicit model for the thermal conductivity of approximately 1600 liquids at different temperatures but atmospheric pressure. The statistical parameters of the obtained correlation show about 9% absolute average relative deviation of the results from the corresponding DIPPR 801 data [3]. It should be mentioned that the gene expression programming technique is a complicated mathematical algorithm requiring significant computing power, and this is the largest thermophysical property database that has been successfully managed by this strategy. The quantitative structure property relationship was developed using the sequential search algorithm and the same database used in the previous step. The model shows an average absolute relative deviation (AARD%), standard deviation error, and root mean square error of 7.4%, 0.01, and 0.01 over the training, validation, and test sets, respectively. The same database was then used to develop a group contribution model for liquid thermal conductivity; the statistical analysis of its performance shows approximately a 7.1% absolute average relative deviation of the results from the corresponding DIPPR 801 data [4].

    In the next stage, an extensive database of viscosities of 443 ionic liquids was compiled from the literature (more than 200 articles) and used to develop a group contribution model. Using this model, a training set composed of 1336 experimental data points was correlated with a low AARD% of about 6.3. A test set consisting of 336 data points was used to validate this model; it shows an AARD% of 6.8 for the test set. In the next part of this study, an extensive database of thermal decomposition temperatures of 586 ionic liquids was compiled from the literature and used to develop a quantitative structure property relationship. The proposed quantitative structure property relationship produces an acceptable average absolute relative deviation (AARD) of less than 5.2% over all 586 experimental data values. The updated database of thermal decomposition temperatures, including 613 ionic liquids, was subsequently used to develop a group contribution model. Using this model, a training set comprising 489 data points was correlated with a low AARD of 4.5%. A test set consisting of 124 data points was employed to test its capability; the model shows an AARD of 4.3% for the test set.

    Electrical conductivity of ionic liquids was the next property investigated. Initially, a database of electrical conductivities of 54 ionic liquids was collected from the literature and used to develop two models: a quantitative structure property relationship and a group contribution model. Since the electrical conductivity of ionic liquids has a complicated temperature and chemical structure dependency, the least squares support vector machines strategy was used as a non-linear regression tool to correlate the electrical conductivity of ionic liquids. The deviation of the quantitative structure property relationship from the 783 experimental data points used in its development (training set) is 1.8%. The validity of the model was then evaluated using another experimental data set comprising 97 experimental data points (deviation: 2.5%). Finally, the reproducibility and reliability of the model were successfully assessed using a last experimental data set of 97 experimental data points (deviation: 2.7%). Using the group contribution model, a training set composed of 863 experimental data points was correlated with a low AARD of about 3.1% from the corresponding experimental data. The model was then validated using a data set composed of 107 experimental data points with a low AARD of 3.6%. Finally, a test set consisting of 107 data points was used for its validation; it shows an AARD of 4.9% for the test set.

    In the next stage, the most comprehensive database of normal boiling point temperatures, covering approximately 18000 pure organic compounds, was compiled and used to develop a quantitative structure property relationship. To develop the model, the sequential search algorithm was initially used to select the best subset of molecular descriptors; a three-layer feed-forward artificial neural network was then used as a regression tool to develop the final model. This appears to be the first time that the quantitative structure property relationship technique has successfully been used to handle a database as large as the one used for normal boiling point temperatures of pure organic compounds. Handling large databases of compounds has always been a challenge in the quantitative structure property relationship field because of the large number of chemical structures to be handled (particularly the optimization of the chemical structures), the high demand for computational power, and the very high failure rates of the software packages. As a result, this study is regarded as a long step forward in the quantitative structure property relationship field.

    A comprehensive database of sublimation enthalpies of 1269 pure organic compounds at 298.15 K was successfully compiled from the literature and used to develop an accurate group contribution model. The model is capable of predicting the sublimation enthalpies of organic compounds at 298.15 K with an acceptable average absolute relative deviation between predicted and experimental values of 6.4%. Vaporization enthalpies of organic compounds at 298.15 K were also studied: an extensive database of 2530 pure organic compounds was used to develop a comprehensive group contribution model, which demonstrates an acceptable AARD% of 3.7% from experimental data.

    Speeds of sound in the saturated liquid phase were the next property investigated. Initially, a collection of 1667 experimental data points for 74 pure chemical compounds was extracted from the ThermoData Engine of the National Institute of Standards and Technology [5]. A least squares support vector machines group contribution model was then developed; it shows a low AARD% of 0.5% from the corresponding experimental data. In the next part of this study, a simple group contribution model was presented for the prediction of the standard molar chemical exergy of pure organic compounds. It is capable of predicting the standard chemical exergy of pure organic compounds with an acceptable average absolute relative deviation of 1.6% from the literature data for 133 organic compounds.

    The largest databank reported to date for refractive indices, covering approximately 12 000 pure organic compounds, was then compiled. A novel computational scheme based on coupling the sequential search strategy with the genetic function approximation (GFA) strategy was used to develop a model for the refractive indices of pure organic compounds. This strategy combines the capability of handling large databases (the advantage of the sequential search algorithm over other subset variable selection methods) with the ability to choose the most accurate subset of variables (the advantage of genetic-algorithm-based subset variable selection methods such as GFA). The model shows a promising average absolute relative deviation of 0.9% from the corresponding literature values. Subsequently, a group contribution model was developed on the same database; it shows an average absolute relative deviation of 0.83% from the corresponding literature values.

    Freezing point temperatures of organic compounds were the last property investigated. Initially, the largest databank reported in the open literature for freezing points, covering more than 16 500 pure organic compounds, was compiled. The sequential search algorithm was then successfully applied to derive a model, which shows an average absolute relative deviation of 12.6% from the corresponding literature values. The same database was used to develop a group contribution model, which demonstrates an average absolute relative deviation of 10.76%, an accuracy adequate for many practical applications.
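    The figure of merit quoted throughout this abstract is the average absolute relative deviation, AARD% = (100/N) Σ |y_pred − y_exp| / y_exp. The short sketch below computes it for a toy group contribution model; the group contributions and offset are hypothetical placeholders, not values from the thesis, while the reference values are approximate normal boiling points in kelvin.

```python
# Minimal sketch (hypothetical group contributions, not from the thesis) of a
# group contribution prediction and the AARD% metric used throughout the abstract.
import numpy as np

# Hypothetical contributions of structural groups to the predicted property.
contributions = {"CH3": 23.6, "CH2": 22.9, "OH": 92.9}

def predict(group_counts, offset=198.0):
    """Property = offset + sum over groups of (count * contribution)."""
    return offset + sum(n * contributions[g] for g, n in group_counts.items())

# Molecules described only by group counts, paired with approximate boiling points (K).
molecules = {
    "ethanol":    ({"CH3": 1, "CH2": 1, "OH": 1}, 351.4),
    "1-propanol": ({"CH3": 1, "CH2": 2, "OH": 1}, 370.3),
    "butane":     ({"CH3": 2, "CH2": 2}, 272.6),
}

pred = np.array([predict(groups) for groups, _ in molecules.values()])
exp = np.array([val for _, val in molecules.values()])

# AARD% = 100/N * sum(|pred - exp| / exp)
aard = 100.0 * np.mean(np.abs(pred - exp) / exp)
print(f"AARD% = {aard:.1f}")
```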

    (Q)SAR Modelling of Nanomaterial Toxicity - A Critical Review

    There is increasing recognition that nanomaterials pose a risk to human health, and the novel engineered nanomaterials (ENMs) of the nanotechnology industry, with their growing industrial usage, pose the most immediate problem for hazard assessment, as many of them remain untested. The large number of materials and their variants (different sizes and coatings, for instance) that require testing, together with ethical pressure towards non-animal testing, means that expensive animal bioassays are precluded, and the use of (quantitative) structure-activity relationship ((Q)SAR) models as an alternative source of hazard information should be explored. (Q)SAR modelling can be applied to fill critical knowledge gaps by making the best use of existing data, prioritizing the physicochemical parameters driving toxicity, and providing practical solutions to the risk assessment problems caused by the diversity of ENMs. This paper covers the core components required for successful application of (Q)SAR technologies to ENM toxicity prediction, summarizes the published nano-(Q)SAR studies, and outlines the challenges ahead for nano-(Q)SAR modelling. It provides a critical review of (1) the present availability of ENM characterization and toxicity data, (2) the characterization of nanostructures that meets the needs of (Q)SAR analysis, (3) a summary of published nano-(Q)SAR studies and their limitations, (4) the in silico tools for (Q)SAR screening of nanotoxicity, and (5) the prospective directions for the development of nano-(Q)SAR models.

    Quantitative Structure-Property Relationship Modeling & Computer-Aided Molecular Design: Improvements & Applications

    The objective of this work was to develop an integrated capability to design molecules with desired properties. An automated, robust genetic algorithm (GA) module was developed to facilitate the rapid design of new molecules. The generated molecules were scored for the relevant thermophysical properties using non-linear quantitative structure-property relationship (QSPR) models. Descriptor reduction and model development for the QSPR models were implemented using evolutionary algorithms (EAs) and artificial neural networks (ANNs). QSPR models for octanol-water partition coefficients (Kow), melting points (MP), normal boiling points (NBP), Gibbs energy of formation, universal quasi-chemical (UNIQUAC) model parameters, and infinite-dilution activity coefficients of cyclohexane and benzene in various organic solvents were developed in this work. To validate the design methodology, new chemical penetration enhancers (CPEs) for transdermal insulin delivery and new solvents for extractive distillation of the cyclohexane + benzene system were designed. In general, the non-linear QSPR models developed in this work provided predictions better than or as good as existing literature models. In particular, the models for NBP, Gibbs energy of formation, UNIQUAC model parameters, and infinite-dilution activity coefficients have lower errors on external test sets than the literature models, while the models for MP and Kow are comparable with the best models in the literature. The GA-based design framework implemented in this work successfully identified new CPEs for transdermal delivery of insulin, with permeability values comparable to the best CPEs in the literature. It also identified new solvents for extractive distillation of cyclohexane/benzene with selectivities two to four times those of existing solvents. These two case studies validate the ability of the design framework to identify new molecules with desired target properties.
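    A rough sketch of the generate-and-score loop that such a GA-based design framework relies on is given below: a non-linear QSPR surrogate (here a small feed-forward neural network) scores candidates, and the best scorers seed the next round. Everything in it, including the use of plain descriptor vectors instead of real molecular structures and feasibility constraints, is a simplifying assumption for illustration rather than the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of "generate and score" in
# molecular design: candidates are scored with a trained non-linear QSPR model
# and the best ones are carried forward to the next round.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Hypothetical training data: descriptor vectors with a known property value.
X_train = rng.normal(size=(500, 8))
y_train = X_train @ rng.normal(size=8) + 0.05 * rng.normal(size=500)

# Non-linear QSPR surrogate: a small feed-forward ANN.
qspr = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

def design_loop(n_candidates=200, n_keep=20, n_rounds=10, step=0.3):
    """Propose candidates, score them with the QSPR model, keep the best, perturb, repeat."""
    pool = rng.normal(size=(n_candidates, 8))
    for _ in range(n_rounds):
        scores = qspr.predict(pool)
        elite = pool[np.argsort(scores)[-n_keep:]]   # highest predicted property
        # "Mutate" the elite candidates to generate the next pool.
        pool = np.repeat(elite, n_candidates // n_keep, axis=0)
        pool = pool + step * rng.normal(size=pool.shape)
    return elite, qspr.predict(elite)

best, best_scores = design_loop()
print("best predicted property values:", np.round(np.sort(best_scores)[-5:], 2))
```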

    Evaluation of the availability and applicability of computational approaches in the safety assessment of nanomaterials: Final report of the Nanocomput project

    This is the final report of the Nanocomput project, the main aims of which were to review the current status of computational methods that are potentially useful for predicting the properties of engineered nanomaterials, and to assess their applicability in order to provide advice on the use of these approaches for the purposes of the REACH regulation. Since computational methods cover a broad range of models and tools, emphasis was placed on Quantitative Structure-Property Relationship (QSPR) and Quantitative Structure-Activity Relationship (QSAR) models and their potential role in predicting nanomaterial properties. In addition, the status of a diverse array of compartment-based mathematical models was assessed; these comprised toxicokinetic (TK), toxicodynamic (TD), in vitro and in vivo dosimetry, and environmental fate models. Finally, based on systematic reviews of the scientific literature as well as the outputs of EU-funded research projects, recommendations for further research and development were made. The Nanocomput project was carried out by the European Commission’s Joint Research Centre (JRC) for the Directorate-General (DG) for Internal Market, Industry, Entrepreneurship and SMEs (DG GROW) under the terms of an Administrative Arrangement between the JRC and DG GROW. The project lasted 39 months, from January 2014 to March 2017, and was supported by a steering group with representatives from DG GROW, DG Environment and the European Chemicals Agency (ECHA).

    SMARTS Approach to Chemical Data Mining and Physicochemical Property Prediction.

    The calculation of physicochemical and biological properties is essential to facilitate modern drug discovery. Chemical spaces dimensionalized by these descriptors have been used to scaffold-hop in order to discover new lead and drug-like molecules. Broadening the boundaries of structure-based drug design, such molecules are expected to share the same physiological target and have similar efficacy, as do known drug molecules sharing the same region of chemical property space. In the past few decades, physicochemical and ADMET (absorption, distribution, metabolism, elimination, and toxicity) property predictors have received increasing attention in academia and the pharmaceutical industry. Given this attention to data mining and property prediction, we first discuss the sources of experimental pKa values and the current methodologies used for pKa prediction in proteins and small molecules, with particular attention to the scope, statistical validity, overall accuracy, and predictive power of these methods; the concerns raised are not limited to predicting pKa but apply to all empirical predictive methodologies. In a bottom-up approach, we explored the influence of freely generated SMARTS string representations of molecular fragments on chelation and cytotoxicity. Later investigations, involving the derivation of predictive models, used stepwise regression to determine the optimal pool of SMARTS strings having the greatest influence on the property of interest. By applying a unique scoring system to sets of highly generalized SMARTS strings, we constructed well-balanced regression trees with predictive accuracy exceeding that of many published and commercially available models for cytotoxicity, pKa, and aqueous solubility. The methodology is robust, extremely adaptable, and can handle any molecular dataset with experimental data. This work details our efforts in data gathering and curation and the development of a machine learning methodology able to derive and validate highly accurate regression trees capable of extremely fast property predictions. Regression trees created by our method are well suited to calculating descriptors for large in silico molecular libraries, facilitating data mining of chemical spaces in search of new lead molecules in drug discovery. Ph.D., Medicinal Chemistry, University of Michigan, Horace H. Rackham School of Graduate Studies.
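    The core idea, counting SMARTS fragment matches and feeding the counts to a regression tree, can be sketched in a few lines with RDKit and scikit-learn. The SMARTS patterns, molecules, and property values below are illustrative placeholders, not the curated sets or the scoring system described in the thesis.

```python
# Minimal sketch (not the authors' pipeline) of SMARTS-based featurization:
# count SMARTS fragment matches per molecule and fit a regression tree
# against a measured property.
from rdkit import Chem
from sklearn.tree import DecisionTreeRegressor

# Placeholder SMARTS pool: hydroxyl, primary amine, benzene ring, carboxylic acid.
smarts_pool = ["[OX2H]", "[NX3;H2]", "c1ccccc1", "[CX3](=O)[OX2H1]"]
patterns = [Chem.MolFromSmarts(s) for s in smarts_pool]

def featurize(smiles):
    """Vector of SMARTS match counts for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [len(mol.GetSubstructMatches(p)) for p in patterns]

# Hypothetical training data: SMILES paired with a measured property (placeholder values).
train = [("CCO", -0.77), ("c1ccccc1O", -0.04), ("CC(=O)O", 1.22), ("CCCCCC", -3.96)]
X = [featurize(smi) for smi, _ in train]
y = [val for _, val in train]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(tree.predict([featurize("CCCO")]))   # predict for 1-propanol
```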

    Computational approaches to virtual screening in human central nervous system therapeutic targets

    Over the past several years of drug design, advanced high-throughput synthetic and analytical chemical technologies have been continuously producing large numbers of compounds. These large collections of chemical structures have resulted in many public and commercial molecular databases, and the availability of larger data sets has provided the opportunity to develop new knowledge mining or virtual screening (VS) methods. This research work is therefore motivated by one of the main interests of the modern drug discovery process: the development of new methods to predict compounds with broad therapeutic profiles (multi-targeting activity), which is essential for the discovery of novel drug candidates against complex multifactorial diseases such as central nervous system (CNS) disorders. This work aims to advance VS approaches by providing a deeper understanding of the relationship between chemical structure and pharmacological properties and by designing new, fast, and robust tools for drug design against different targets and pathways.

    To accomplish these goals, the first challenge is dealing with big data sets of diverse molecular structures to derive a correlation between structure and activity. An extendable, customizable, fully automated in silico quantitative structure-activity relationship (QSAR) modeling framework was therefore developed in the first phase of this work. QSAR models are computationally fast and powerful tools for screening huge databases of compounds to determine the biological properties of chemical molecules based on their chemical structure. The generated framework reliably implemented a full QSAR modeling pipeline from data preparation to model building and validation. Its main distinctive features are (a) efficient data curation, (b) prior estimation of data modelability, and (c) an optimized variable selection methodology able to identify the most biologically relevant features responsible for compound activity. Since the underlying principle of QSAR modeling is the assumption that the structures of molecules are mainly responsible for their pharmacological activity, the accuracy with which different structural representation approaches decode molecular structural information largely influences model predictability. To find the best approach for QSAR modeling, a comparative analysis of the two main categories of molecular representations, descriptor-based (vector space) and distance-based (metric space) methods, was carried out. Results obtained from five QSAR data sets showed that the distance-based method was superior at capturing the structural elements relevant for the accurate characterization of molecular properties in highly diverse data sets (remote chemical space regions). This finding further assisted the development of a novel tool for molecular space visualization, intended to increase the understanding of structure-activity relationships (SAR) in drug discovery projects by exploring the diversity of large heterogeneous chemical data. In the proposed visual approach, four nonlinear dimensionality reduction (DR) methods were tested to represent molecules in a lower-dimensional (2D projected) space, on which a non-parametric 2D kernel density estimation (KDE) was applied to map the most likely activity regions (activity surfaces). The analysis of the resulting probabilistic surfaces of molecular activities (PSMAs) for the four datasets showed that these maps have both descriptive and predictive power and can therefore be used as spatial classification models, tools to perform VS using only the structural similarity of molecules.

    The QSAR modeling approach described above was complemented with molecular docking, an approach that predicts the best mode of drug-target interaction. Both approaches were integrated to develop a rational and reusable polypharmacology-based VS pipeline with an improved hit identification rate. To validate the pipeline, a dual-targeting drug design model against Parkinson’s disease (PD) was derived to identify novel inhibitors for improving the motor functions of PD patients by enhancing the bioavailability of dopamine and avoiding neurotoxicity. The proposed approach can easily be extended to more complex multi-targeting disease models containing several targets and anti-targets/off-targets to achieve increased efficacy and reduced toxicity in multifactorial diseases such as CNS disorders and cancer. This thesis addresses several issues in cheminformatics methods (e.g., molecular structure representation, machine learning, and molecular similarity analysis) to improve and design new computational approaches for chemical data mining. Moreover, an integrative drug design pipeline is presented to improve the polypharmacology-based VS approach, one that can identify the most promising multi-targeting candidates for experimental validation of drug-target networks at the systems biology level in the drug discovery process.
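    A minimal sketch of the activity-surface idea follows: fingerprints are projected to 2D with a nonlinear dimensionality reduction method, and a kernel density estimate fitted on the active compounds turns the projection into a map of likely activity regions. The fingerprints, labels, choice of t-SNE, and bandwidth are assumptions for illustration, not the thesis's four DR methods or datasets.

```python
# Minimal sketch (assumptions noted, not the thesis implementation) of an "activity
# surface": project molecular fingerprints to 2D with a nonlinear DR method, then fit
# a 2D kernel density estimate on the active compounds so that density can be read as
# a likelihood of activity at each point of the projected space.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical data: 300 binary fingerprints (1024 bits) with activity labels.
fps = rng.integers(0, 2, size=(300, 1024)).astype(float)
active = rng.random(300) < 0.3

# Nonlinear dimensionality reduction to a 2D projected space.
xy = TSNE(n_components=2, random_state=0).fit_transform(fps)

# Kernel density estimate over the projections of the active molecules only.
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(xy[active])

def activity_likelihood(points_2d):
    """Relative likelihood that a 2D-projected molecule falls in an 'active' region."""
    return np.exp(kde.score_samples(points_2d))

# Score the projection of the first (hypothetical) query molecule.
print(activity_likelihood(xy[:1]))
```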

    Machine learning applications to essential oils and natural extracts

    Machine Learning (ML) is a branch of Artificial Intelligence (AI) that allows computers to learn without being explicitly programmed. ML has many applications in the pharmaceutical sciences, especially the prediction of chemical bioactivity and physical properties, and has become an integral component of the drug discovery process. ML is characterized by three learning paradigms that differ in the type of task or problem the algorithm is intended to solve: supervised, unsupervised, and reinforcement learning. In chapter 2, supervised learning methods were applied to extracts of Lycium barbarum L. fruits to develop a QSPR model that predicts zeaxanthin and carotenoid content from routine colorimetric analyses performed on homogenized samples, providing a useful tool for the food industry. In chapters 3 and 4, ML was applied to the chemical composition of essential oils and correlated with the experimentally determined biofilm modulation effect, which was either positive or negative. These two studies demonstrated that biofilm growth is influenced by the presence of essential oils extracted from different plants harvested in different seasons. ML classification techniques were used to develop a Quantitative Activity-Composition Relationship (QCAR) to discover the chemical components mainly responsible for the anti-biofilm activity. The derived models demonstrated that machine learning is a valuable tool for investigating complex chemical mixtures, enabling scientists to understand each component's contribution to the activity. These classification models can therefore describe and predict the activity of chemical mixtures and guide the composition of artificial essential oils with a desired biological activity. In chapter 5, unsupervised learning models were developed and applied to clinical strains of bacteria that cause infections in cystic fibrosis. The most severe recurring infections in cystic fibrosis are due to S. aureus and P. aeruginosa, and intensive use of antimicrobial drugs to fight lung infections leads to the development of antibiotic-resistant bacterial strains, so new antimicrobial compounds must be identified to overcome antibiotic resistance in patients. Sixty-one essential oils were studied against a panel of 40 clinical strains of S. aureus and P. aeruginosa isolated from cystic fibrosis patients, and unsupervised machine learning algorithms were applied to select a small number of representative strains (clusters of strains) from the panel of 40, rapidly identifying three essential oils that strongly inhibit antibiotic-resistant bacterial growth.
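    The strain-selection step in chapter 5 can be illustrated with a small clustering sketch: strains are clustered by their response profiles across the oils, and the strain nearest each cluster centre is kept as a representative. The susceptibility matrix, the choice of k-means, and the number of clusters are assumptions for illustration, not the protocol used in the thesis.

```python
# Minimal sketch (hypothetical data, not the thesis protocol) of using unsupervised
# clustering to pick a few representative bacterial strains: cluster the strains by
# their susceptibility profiles across the essential oils, then keep the strain
# closest to each cluster centre as the representative to test further.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical susceptibility matrix: 40 strains x 61 essential oils
# (e.g., growth inhibition values scaled to [0, 1]).
profiles = rng.random((40, 61))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(profiles)

representatives = []
for k in range(kmeans.n_clusters):
    members = np.flatnonzero(kmeans.labels_ == k)
    dists = np.linalg.norm(profiles[members] - kmeans.cluster_centers_[k], axis=1)
    representatives.append(members[dists.argmin()])   # strain nearest the centre

print("representative strain indices:", representatives)
```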

    The determination of petroleum reservoir fluid properties: application of robust modeling approaches.

    Doctor of Philosophy in Chemical Engineering, University of KwaZulu-Natal, Durban, 2016. Abstract available in PDF file.