    Optimization Algorithms for Chemoinformatics and Material-informatics

    Modeling complex phenomena in chemoinformatics and material-informatics can often be formulated as single-objective or multi-objective optimization problems (SOOPs or MOOPs). For example, the design of new drugs or new materials is inherently a MOOP since drugs/materials require the simultaneous optimization of multiple parameters

    RANdom SAmple Consensus (RANSAC) algorithm for material-informatics: application to photovoltaic solar cells

    An important aspect of chemoinformatics and material-informatics is the usage of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for cleaning datasets from noise. RANSAC could be used as a “one stop shop” algorithm for developing and validating QSAR models, performing outlier removal, descriptor selection, model development and predictions for test set samples using applicability domain. For “future” predictions (i.e., for samples not included in the original test set) RANSAC provides a statistical estimate for the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RANSAC in material informatics, focusing on the analysis of solar cells. We demonstrate that for three datasets representing different metal oxide (MO) based solar cell libraries RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cell libraries, highlighting interesting dependencies of PV properties on MO compositions
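
    The consensus idea behind RANSAC can be illustrated with a minimal sketch. It is a generic textbook variant for a one-descriptor linear model, not the QSAR workflow of the paper; the function names, the tolerance `tol`, the iteration count, and the toy data are all assumptions made for illustration.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def ransac_line(points, n_iter=200, sample_size=2, tol=0.5, seed=0):
    """Generic RANSAC loop (illustrative; parameters are hypothetical).

    Repeatedly fits a line to a small random sample, counts the points
    lying within `tol` of that line, and keeps the largest consensus set.
    The final model is refitted on that consensus set only.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        sample = rng.sample(points, sample_size)
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate sample: all x values identical
        a, b = fit_line(sample)
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    a, b = fit_line(best_inliers)
    return a, b, best_inliers


# Toy data: ten points on y = 2x + 1 plus two gross outliers.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(2.5, 30.0), (7.5, -20.0)]
slope, intercept, inliers = ransac_line(data)
```

    In this toy setting the two gross outliers fall outside the consensus set, so the refitted line is determined by the ten clean points alone.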

    Optimization of Molecular Representativeness

    Representative subsets selected from within larger data sets are useful in many chemoinformatics applications including the design of information-rich compound libraries, the selection of compounds for biological evaluation, and the development of reliable quantitative structure–activity relationship (QSAR) models. Such subsets can overcome many of the problems typical of diverse subsets, most notably the tendency of the latter to focus on outliers. Yet only a few algorithms for the selection of representative subsets have been reported in the literature. Here we report on the development of two algorithms for the selection of representative subsets from within parent data sets based on the optimization of a newly devised representativeness function either alone or simultaneously with the MaxMin function. The performances of the new algorithms were evaluated using several measures representing their ability to produce (1) subsets which are, on average, close to data set compounds; (2) subsets which, on average, span the same space as spanned by the entire data set; (3) subsets mirroring the distribution of biological indications in a parent data set; and (4) test sets which are well predicted by qualitative QSAR models built on data set compounds. We demonstrate that for three data sets (containing biological indication data, logBBB permeation data, and Plasmodium falciparum inhibition data), subsets obtained using the new algorithms are more representative than subsets obtained by hierarchical clustering, k-means clustering, or the MaxMin optimization in at least three of these measures
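
    The abstract does not define the new representativeness function, so the sketch below covers only the MaxMin diversity baseline it mentions, in a common greedy form. The function name, the Euclidean distance, the starting index, and the toy coordinates are assumptions for illustration.

```python
import math


def maxmin_select(points, k, start=0):
    """Greedy MaxMin selection (illustrative sketch).

    Starting from `points[start]`, repeatedly adds the point whose
    minimum distance to the already selected points is largest.
    Returns the indices of the selected points in selection order.
    """
    selected = [start]
    # Minimum distance from every point to the current selection.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy 2D "descriptor space": a dense cluster, a small pair, one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
subset = maxmin_select(points, 3)
```

    On this toy set the isolated point at (10.0, 0.0) is picked immediately after the starting point, which illustrates the tendency of diversity-driven selection to favor outliers noted in the abstract.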

    Textual sentiment analysis and description characteristics in crowdfunding success : the case of cybersecurity and IoT industries

    Crowdfunding platforms offer entrepreneurs the opportunity to evaluate their technologies, validate their market, and raise funding. Such platforms also provide technologies with an opportunity to rapidly transition from research to market, which is especially crucial in fast-changing industries. In this study, we investigated how the sentiments expressed in the text of the project campaigns and project characteristics influence the success of crowdfunding in innovative industries such as cybersecurity and the Internet of Things (IoT). We examined 657 cybersecurity and IoT projects that were promoted between 2010 and 2020 on Kickstarter and IndieGoGo, two rewards-based crowdfunding platforms. We extracted technological topic attributes that may influence project success and measured the sentiments of project descriptions using a Valence Aware Dictionary and sEntiment Reasoner (VADER) model. We found that the sentiment of the description and the textual topic characteristics are associated with the success of funding campaigns for cybersecurity and IoT projects

    Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category

    Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multidimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both models take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features or in case of visualization methods uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule

    Nucleophilic and Electrophilic Reactions of Polyynes Catalyzed by an Electric Field: Toward Barcoding of Carbon Nanotubes Like Long Homogeneous Substrates

    Computational studies at the B3LYP/6-31+G* level were carried out on the addition of pyridine to polyynes (C6–C18) and on the protonation of polyynes by methyl ammonium fluoride under electric fields of 2.5 and 5 MV/cm. The electric field in each case was oriented along the polyyne axis in a direction that enhances the reaction by stabilizing the incipient dipole. It was found that the reaction of pyridine addition is endothermic with a late transition state. The longer the polyyne and the stronger the field, the more efficient the electric field catalysis. Extrapolation of the data to long polyynes shows that at 1000 nm an electric field of 50 000 V/cm will reduce the barrier by 10 kcal/mol. This reduction is equivalent to 7 orders of magnitude in rate enhancement. A similar barrier reduction could be achieved with a 2.5 MV/cm field at a polyyne length of 20 nm. Protonation reactions were found to be much more affected by the electric field. A reduction of the reaction barrier by 10 kcal/mol using a 2.5 MV/cm electric field could be achieved at a polyyne length of 10 nm. Thus the electric field along the long axis of a substrate could induce a gradient of reactivity which could, in principle, enable the barcoding of substrates by using a sequence of reactants having different reactivities
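
    The link between a 10 kcal/mol barrier reduction and roughly 7 orders of magnitude in rate can be checked with a short calculation. This sketch assumes an Arrhenius-type dependence with an unchanged pre-exponential factor and a temperature of 298.15 K; the abstract states neither, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def rate_enhancement(barrier_drop_kcal, temperature_k=298.15):
    """Rate ratio k_new / k_old = exp(dE / (R*T)) for a barrier lowered by dE.

    Assumes the pre-exponential factor is unaffected (an assumption of
    this sketch, not a statement from the abstract).
    """
    return math.exp(barrier_drop_kcal / (R_KCAL * temperature_k))


factor = rate_enhancement(10.0)
orders = math.log10(factor)  # orders of magnitude in rate enhancement
```

    Under these assumptions `orders` comes out slightly above 7, in line with the figure quoted in the abstract.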

    Metamorphosis of a Transition State into a Stable Species

    Medium variations usually affect the shape of the bimolecular nucleophilic reaction profile at the reactants’ and products’ ends and, to a much lesser extent, the shape around the transition state. In water, the reactions of extended allylic systems such as have been computationally shown (for n = 2) to have a single transition state. As the polarity is decreased the transition state is gradually transformed into a double-humped profile that then changes smoothly through a triple-well profile into a single-well profile where the symmetric structure of the transition state is retained. The depth of the well is ca. 16 kcal/mol for n = 2 and reaches 40 kcal/mol for n = 7, resembling the stability of a weak chemical bond. This is traced to electrostatic effects as well as to the effect of an intermediate VB configuration. In the analogous polyynes, a stable adduct is already formed at n = 1. This is attributed to the formation of the relatively stable vinylic carbanion. As the number of acetylene units increases, the vinylic geometry (a CCC angle of 123°) is gradually lost until at n = 5 the adduct attains a linear geometry

    Dynamics and characteristics of misinformation related to earthquake predictions on Twitter

    The spread of misinformation on social media can lead to inappropriate behaviors that can make disasters worse. In our study, we focused on tweets containing misinformation about earthquake predictions and analyzed their dynamics. To this end, we retrieved 82,129 tweets over a period of 2 years (March 2020–March 2022) and hand-labeled 4157 tweets. We used RoBERTa to classify the complete dataset and analyzed the results. We found that (1) there are significantly more not-misinformation than misinformation tweets; (2) earthquake predictions are continuously present on Twitter with peaks after felt events; and (3) prediction misinformation tweets sometimes link or tag official earthquake notifications from credible sources. These insights indicate that official institutions present on social media should continuously address misinformation (even in quiet times when no event occurred), check that their institution is not tagged/linked in misinformation tweets, and provide authoritative sources that can be used to support their arguments against unfounded earthquake predictions

    A Multi-Objective Genetic Algorithm for Outlier Removal

    Quantitative structure activity relationship (QSAR) or quantitative structure property relationship (QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function was added to the algorithm (termed “preservation”), forcing it to remove certain compounds with low probability only. This option is highly useful when specific compounds should be preferably kept in the final data set either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications
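
    The abstract does not spell out the genetic algorithm or its objective functions, so the sketch below shows only a much simpler stand-in: scoring each compound by its mean distance to its k nearest neighbors in descriptor space. The function name, the Euclidean metric, the choice k = 3, and the toy coordinates are assumptions for illustration.

```python
import math


def knn_outlier_scores(points, k=3):
    """Mean distance from each point to its k nearest neighbors.

    A larger score means the point is more isolated. This is a simple
    kNN-distance heuristic, not the multi-objective genetic algorithm
    described in the abstract.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy descriptor space: five clustered points and one isolated point.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (4.0, 4.0)]
scores = knn_outlier_scores(points, k=3)
most_isolated = max(range(len(points)), key=lambda i: scores[i])
```

    Here the point at (4.0, 4.0) receives by far the highest score and would be the first removal candidate under this heuristic.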