11 research outputs found

    Basic approaches and applications of QSAR/QSPR methods

    The main objective of this paper is to describe briefly the applications and methodologies involved in QSAR/QSPR, and to relate and compare them to some of our previously published works. QSAR and QSPR form an intriguing and important field of activity for applying the results discussed in this work. Theoretical and practical results on the statistical analysis and modeling of molecular descriptors are presented, with particular emphasis on statistical methods for modeling data using molecular descriptors.
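
As a minimal illustration of the kind of statistical modeling the abstract refers to, the sketch below fits a property to a single molecular descriptor by ordinary least squares. The abstract names no specific method, so this choice is an assumption, and the descriptor and property values are made-up toy numbers rather than data from the paper.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (a, b, r2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical toy data: one descriptor value and one measured property per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
measured_property = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, r2 = fit_line(descriptor, measured_property)
print(f"property = {intercept:.3f} + {slope:.3f} * descriptor (R^2 = {r2:.3f})")
```

Real QSAR/QSPR work typically uses many descriptors and validates on held-out compounds; the single-descriptor case is only meant to show the basic idea.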

    Bayesian neural network learning for repeat purchase modelling in direct marketing.

    We focus on purchase incidence modelling for a European direct mail company. Response models based on statistical and neural network techniques are contrasted. The evidence framework of MacKay is used as an example implementation of Bayesian neural network learning, a method that is fairly robust with respect to problems typically encountered when implementing neural networks. The automatic relevance determination (ARD) method, an integrated feature of this framework, makes it possible to assess the relative importance of the inputs. The basic response models use operationalisations of the traditionally discussed Recency, Frequency and Monetary (RFM) predictor categories. In a second experiment, the RFM response framework is enriched by the inclusion of other (non-RFM) customer profiling predictors. We contribute to the literature by providing experimental evidence that: (1) Bayesian neural networks offer a viable alternative for purchase incidence modelling; (2) a combined use of all three RFM predictor categories is advocated by the ARD method; (3) the inclusion of non-RFM variables significantly augments the predictive power of the constructed RFM classifiers; (4) this rise is mainly attributed to the inclusion of customer/company interaction variables and a variable measuring whether a customer uses the credit facilities of the direct mailing company. Keywords: Marketing; Companies; Models; Model; Problems; Neural networks; Networks; Variables; Credit
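
The abstract does not say how the RFM predictors were operationalised. The sketch below shows one common choice, given here as an assumption: days since the last purchase, number of purchases, and total amount spent. The function name, field names, and the sample history are hypothetical.

```python
from datetime import date

def rfm_features(purchases, reference_date):
    """Compute simple RFM predictors for one customer.

    purchases: list of (purchase_date, amount) tuples.
    reference_date: the date at which the customer is scored.
    """
    if not purchases:
        raise ValueError("customer has no purchases")
    last_purchase = max(d for d, _ in purchases)
    return {
        "recency_days": (reference_date - last_purchase).days,  # Recency
        "frequency": len(purchases),                             # Frequency
        "monetary": sum(amount for _, amount in purchases),      # Monetary
    }

# Hypothetical purchase history for a single customer.
history = [(date(2020, 1, 10), 30.0), (date(2020, 3, 1), 45.5)]
features = rfm_features(history, reference_date=date(2020, 3, 31))
print(features)
```

Vectors of this kind, optionally extended with non-RFM variables, would then serve as inputs to a response model such as the Bayesian neural network discussed in the abstract.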

    Comprehensive ensemble in QSAR prediction for drug discovery

    Background: Quantitative structure-activity relationship (QSAR) is a computational modeling method for revealing relationships between structural properties of chemical compounds and biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, random forest, the most prevalent approach, and other ensemble approaches in QSAR prediction limit their model diversity to a single subject. Results: The proposed ensemble method consistently outperformed thirteen individual models on 19 bioassay datasets and demonstrated superiority over other ensemble approaches that are limited to a single subject. The comprehensive ensemble method is publicly available at http://data.snu.ac.kr/QSAR/. Conclusions: We propose a comprehensive ensemble method that builds multi-subject diversified models and combines them through second-level meta-learning. In addition, we propose an end-to-end neural network-based individual classifier that can automatically extract sequential features from a simplified molecular-input line-entry system (SMILES). The proposed individual classifier did not show impressive results as a single model, but it was considered the most important predictor when combined, according to the interpretation of the meta-learning. Publication costs were funded by Seoul National University. This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2014M3C9A3063541, 2018R1A2B3001628], the Brain Korea 21 Plus Project in 2018, and the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea [HI15C3224]. The funding bodies played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.
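
The second-level meta-learning idea can be sketched as stacking: the predictions of the first-level models become the input features of a small second-level model. The sketch below uses logistic regression trained by batch gradient descent as the meta-learner. This is an assumption for illustration, not the authors' implementation, and the first-level predictions and labels are made-up toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_meta_learner(level1_preds, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on first-level predictions (batch gradient descent).

    level1_preds: one row per compound, one predicted probability per first-level model.
    labels: 0/1 activity labels. Returns (weights, bias).
    """
    n_models = len(level1_preds[0])
    n = len(labels)
    w = [0.0] * n_models
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(level1_preds, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - y
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, row):
    """Combined probability from the meta-learner for one compound."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

# Hypothetical first-level outputs: column 0 is an informative model,
# column 1 is a nearly uninformative one.
level1 = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.5], [0.1, 0.6], [0.7, 0.5], [0.3, 0.5]]
labels = [1, 1, 0, 0, 1, 0]

w, b = train_meta_learner(level1, labels)
print("meta-learner weights:", w, "bias:", b)
```

Inspecting the learned weights is one simple way to see which first-level model the combiner relies on. In practice the meta-learner should be trained on held-out (for example cross-validated) first-level predictions to avoid leakage.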

    Intelligent Modelling of the Environmental Behaviour of Chemicals

    In view of the new European Union chemical policy REACH (Registration, Evaluation, and Authorization of Chemicals), interest in "non-animal" methods for assessing the risk potential of chemicals towards human health and the environment has increased. The inability of classical modelling approaches to handle the complex and ill-defined problems of modelling chemicals' environmental behavior, together with the availability of large computing power, has raised interest in applying computational models inspired by approaches from the area of artificial intelligence. This thesis promotes the application of neuro/fuzzy techniques in assessing the environmental behavior of chemicals. Some of the bottlenecks in the neuro/fuzzy modelling of chemicals' behavior towards the environment have been identified, and solutions have been provided based on the techniques of computational intelligence. The quality of the modelling techniques depends on several factors, such as the input and the structure. In many cases no suitable results are obtained, which leads to the development of a model with low generalization ability.

    Multi-tier framework for the inferential measurement and data-driven modeling

    A framework for inferential measurement and data-driven modeling has been proposed and assessed in several real-world application domains. The architecture of the framework has been structured in multiple tiers to facilitate extensibility and the integration of new components. Each of the four proposed tiers has been assessed independently to verify its suitability. The first tier, dealing with exploratory data analysis, has been assessed with the characterization of the chemical space related to the biodegradation of organic chemicals. This analysis has established relationships between physicochemical variables and biodegradation rates that have been used for model development. At the preprocessing level, a novel method for feature selection based on dissimilarity measures between Self-Organizing Maps (SOM) has been developed and assessed. The proposed method selected more features than others published in the literature but led to models with improved predictive power. Single and multiple data imputation techniques based on the SOM have also been used to recover missing data in a Waste Water Treatment Plant benchmark. A new dynamic method to adjust the centers and widths of Radial Basis Function networks has been proposed to predict water quality. The proposed method outperformed other neural networks. The proposed modeling components have also been assessed in the development of prediction and classification models for biodegradation rates in different media. The results obtained proved the suitability of this approach to develop data-driven models when the complex dynamics of the process prevents the formulation of mechanistic models. The use of rule generation algorithms and Bayesian dependency models has been preliminarily screened to provide the framework with interpretation capabilities. Preliminary results obtained from the classification of Modes of Toxic Action (MOA) indicate that using MOAs as proxy indicators of the human health effects of chemicals could be a promising approach. Finally, the complete framework has been applied to three different modeling scenarios. A virtual sensor system, capable of inferring product quality indices from primary process variables, has been developed and assessed. The system was integrated with the control system in a real chemical plant, outperforming the multi-linear correlation models usually adopted by chemical manufacturers. A model to predict carcinogenicity from molecular structure for a set of aromatic compounds has been developed and tested. Applying the SOM-dissimilarity feature selection method yielded better results than models published in the literature. Finally, the framework has been used to facilitate a new approach for environmental modeling and risk management within geographical information systems (GIS). The SOM has been successfully used to characterize exposure scenarios and to provide estimations of missing data through geographic interpolation. The combination of SOM and Gaussian Mixture models facilitated the formulation of a new probabilistic risk assessment approach.
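
For orientation, the sketch below shows where the centers and widths enter a standard Gaussian Radial Basis Function network. The thesis's dynamic method for adjusting them is not described in the abstract and is not reproduced here. The function name and the example parameters are hypothetical.

```python
import math

def rbf_output(x, centers, widths, weights, bias=0.0):
    """Output of a Gaussian RBF network for one input vector x.

    Each hidden unit j contributes
    weights[j] * exp(-||x - centers[j]||^2 / (2 * widths[j]^2)).
    """
    total = bias
    for center, sigma, weight in zip(centers, widths, weights):
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += weight * math.exp(-dist2 / (2.0 * sigma ** 2))
    return total

# Hypothetical two-unit network on two-dimensional inputs.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [0.5, 1.0]
weights = [2.0, -1.0]
print(rbf_output([0.2, 0.1], centers, widths, weights))
```

A center sets where a hidden unit responds most strongly and a width sets how quickly that response decays with distance, which is why adjusting both affects predictive quality.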

    Quantitative Structure-Property Relationships Modeling of Rate Constants of Selected Micropollutants in Drinking Water Treatment Using Ozonation and UV/H2O2

    Concern over the occurrence of micropollutants in drinking water and their health effects is increasing. Therefore, there is a growing interest in understanding micropollutant removal during drinking water treatment. Ozonation and advanced oxidation processes (AOPs) have been found to be effective in the degradation of many micropollutants. Ozonation involves reactions with both molecular ozone (direct pathway) and hydroxyl radicals (indirect pathway), while hydroxyl radicals are the main oxidants in advanced oxidation processes. Reaction rate constants of micropollutants with molecular ozone (kO3) and hydroxyl radicals (kOH) are indicators of their reactivity and are therefore useful in assessing their removal efficiency in ozonation and AOPs. However, to date, only a limited number of rate constants are available for micropollutants, especially emerging micropollutants such as endocrine disrupting chemicals (EDCs) and pharmaceuticals. Quantitative structure-property relationships (QSPR) are therefore desirable for predicting rate constants of numerous untested micropollutants without experimentation. The overall objective of this thesis was to develop predictive QSPR models which correlate the rate constants of a wide range of structurally diverse micropollutants to their structural characteristics. To ensure the wide applicability of the QSPR models, the selection of training set compounds is critical, and a group of heterogeneous compounds which are structurally representative of many others is preferred. A systematic compound selection approach which involves principal component analysis (PCA) and D-optimal onion design was applied for the first time in water treatment research. As a result, 22 micropollutants with diverse structures were selected as representatives from a large pool of micropollutants of interest (182 compounds). 
In addition, 12 molecular descriptors were identified which link relevant structural features to the removal mechanisms of oxidation processes. The kO3 and kOH values of the 22 selected micropollutants were then determined experimentally in bench-scale reactors at neutral pH using high performance liquid chromatography equipped with a photodiode array detector (HPLC-PDA). Three methods (competition kinetics, compound monitoring, and ozone monitoring) were used for kO3 measurement, and competition kinetics was used for kOH measurement. As expected, kO3 values span a wide range from 10⁻² to 10⁷ M⁻¹ s⁻¹ because of the selective nature of molecular ozone. The general trends of micropollutant reactivity with ozone can be explained by the micropollutant structures and the electrophilic nature of ozone reactions. The kOH values range from 10⁸ to 10¹⁰ M⁻¹ s⁻¹ because hydroxyl radicals are relatively non-selective in their reactions. For the majority of these micropollutants, kO3 and kOH values were not reported prior to this study. Thus they provide valuable information for the modeling and design of ozonation and AOP treatment. QSPR models for kO3 and kOH prediction were then developed with special attention to model validation, applicability domain and mechanistic interpretation. With the experimentally determined rate constants, QSPR models were developed for predicting kO3 values using the selected 22 micropollutants as the training set and the 12 identified descriptors as model variables. As a result, two QSPR models were developed using piecewise linear regression (PLR), both showing an excellent goodness-of-fit. Model 1 was governed by average molecular weight and number of phenolic functional groups, and Model 2 was dominated by two principal components extracted from the descriptor matrix. The models were then validated using an external validation set collected from the literature, showing good predictive power of both models. 
Before these models are applied to unknown micropollutants, the compounds need to be classified as high-reactive (log kO3 > 2 M⁻¹ s⁻¹) or low-reactive (log kO3 ≤ 2 M⁻¹ s⁻¹), so that the appropriate submodel of the PLR can be applied. A classification function using linear discriminant analysis (LDA) was therefore developed which worked very well for both training and validation sets. With the help of additional compounds collected from the literature, and DRAGON molecular descriptors, a QSPR model for kOH prediction in the aqueous phase was developed using multiple linear regression. As a result, 7 DRAGON descriptors were found to be significant in modeling kOH, which related kOH of micropollutants to their electronegativity, polarizability, presence of double bonds and H-bond acceptors. The model fitted the training set very well and showed great predictive power as assessed by the external validation set. In addition, the model is applicable to a wide range of micropollutants. The model's applicability domain was defined using a leverage approach. The main contributions of this thesis lie in the successful development of QSPR models for kO3 and kOH value prediction which, for the first time, can be used for a wide range of structurally diverse micropollutants. In addition, all QSPR models were externally validated to verify their predictive power, and the applicability domains were defined so that the applicability of the models to new compounds can be determined. Finally, the applicability of the model to natural water was explored by combining the QSPR models with the established Rct concept, which predicts micropollutant removals during ozone treatment of natural water but requires kinetic data as input. Results show that the kinetic data from the QSPR model predictions worked well in the Rct model, providing reliable estimations for most of the selected micropollutants. 
This approach can therefore be used in water treatment for initial assessment and estimation of ozonation efficiency.
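
The Rct concept is commonly written as ln(C/C0) = -(kO3 + kOH × Rct) × ∫[O3]dt, where Rct is the ratio of hydroxyl radical exposure to ozone exposure. The sketch below shows how predicted rate constants could feed that relation. The function name and all numerical inputs are hypothetical illustration values, not results from the thesis.

```python
import math

def fraction_remaining(k_o3, k_oh, rct, ozone_exposure):
    """Fraction C/C0 of a micropollutant remaining after ozonation.

    k_o3, k_oh: second-order rate constants in M^-1 s^-1 (for example QSPR predictions).
    rct: ratio of hydroxyl radical exposure to ozone exposure (dimensionless).
    ozone_exposure: integral of the ozone concentration over time, in M s.
    """
    return math.exp(-(k_o3 + k_oh * rct) * ozone_exposure)

# Hypothetical inputs chosen only to illustrate the calculation.
remaining = fraction_remaining(k_o3=1.0e3, k_oh=5.0e9, rct=1.0e-8, ozone_exposure=1.0e-3)
print(f"fraction remaining: {remaining:.3f}, removal: {1.0 - remaining:.1%}")
```

Because kOH is multiplied by the small Rct value, the hydroxyl radical pathway matters most for compounds with low kO3, which is consistent with the abstract's separate treatment of the two rate constants.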