Optimization Algorithms for Chemoinformatics and Material-informatics
Modeling complex phenomena in chemoinformatics and material-informatics can often be formulated as single-objective or multi-objective optimization problems (SOOPs or MOOPs). For example, the design of new drugs or new materials is inherently a MOOP, since drugs/materials require the simultaneous optimization of multiple parameters.
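The defining feature of a MOOP is that no single solution is "best" in all objectives at once; instead one looks for the Pareto front of non-dominated solutions. A minimal sketch of that concept (the candidate scores below are hypothetical, purely for illustration):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical drug candidates scored on (toxicity, -potency); lower is better.
candidates = [(0.2, -0.9), (0.5, -0.9), (0.1, -0.4), (0.3, -0.95)]
front = pareto_front(candidates)   # (0.5, -0.9) is dominated and drops out
```

Real MOOP solvers (e.g., evolutionary algorithms) build on exactly this dominance test to rank candidate solutions.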
RANdom SAmple Consensus (RANSAC) algorithm for material-informatics: application to photovoltaic solar cells
An important aspect of chemoinformatics and material-informatics is the use of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for cleaning datasets of noise. RANSAC can serve as a "one stop shop" algorithm for developing and validating QSAR models, performing outlier removal, descriptor selection, model development, and predictions for test set samples using an applicability domain. For "future" predictions (i.e., for samples not included in the original test set) RANSAC provides a statistical estimate of the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RANSAC in material-informatics, focusing on the analysis of solar cells. We demonstrate that for three datasets representing different metal oxide (MO) based solar cell libraries, RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic (PV) properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cell libraries, highlighting interesting dependencies of PV properties on MO compositions.
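The core RANSAC loop — fit a model to a minimal random sample, count inliers within a threshold, keep the best consensus — can be sketched in a few lines. This is the generic algorithm only, not the QSAR-specific pipeline described above; the data are synthetic:

```python
import random

def ransac_line(xs, ys, n_iter=200, thresh=1.0, seed=0):
    """Minimal RANSAC for a line y = a*x + b: repeatedly fit a model to a
    random minimal sample (two points) and keep the model with the most
    inliers, i.e., points whose residual is below the threshold."""
    rng = random.Random(seed)
    best_inliers, best_model = [], None
    n = len(xs)
    for _ in range(n_iter):
        i, j = rng.sample(range(n), 2)        # minimal sample: two points
        if xs[i] == xs[j]:
            continue                          # degenerate (vertical) sample
        a = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b = ys[i] - a * xs[i]
        inliers = [k for k in range(n) if abs(ys[k] - (a * xs[k] + b)) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (a, b)
    return best_model, best_inliers

# Clean line y = 2x + 1 with three gross outliers mixed in.
xs = list(range(10)) + [2, 5, 8]
ys = [2 * x + 1 for x in range(10)] + [30.0, -20.0, 50.0]
model, inliers = ransac_line(xs, ys)          # recovers a=2, b=1, 10 inliers
```

The same consensus idea generalizes from lines to any parametric model, which is what makes it attractive as an outlier-tolerant wrapper around QSAR model fitting.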
Optimization of Molecular Representativeness
Representative subsets selected from within larger data sets are useful in many chemoinformatics applications, including the design of information-rich compound libraries, the selection of compounds for biological evaluation, and the development of reliable quantitative structure–activity relationship (QSAR) models. Such subsets can overcome many of the problems typical of diverse subsets, most notably the tendency of the latter to focus on outliers. Yet only a few algorithms for the selection of representative subsets have been reported in the literature. Here we report on the development of two algorithms for the selection of representative subsets from within parent data sets, based on the optimization of a newly devised representativeness function either alone or simultaneously with the MaxMin function. The performances of the new algorithms were evaluated using several measures representing their ability to produce (1) subsets which are, on average, close to data set compounds; (2) subsets which, on average, span the same space as the entire data set; (3) subsets mirroring the distribution of biological indications in a parent data set; and (4) test sets which are well predicted by qualitative QSAR models built on data set compounds. We demonstrate that for three data sets (containing biological indication data, logBBB permeation data, and Plasmodium falciparum inhibition data), subsets obtained using the new algorithms are more representative than subsets obtained by hierarchical clustering, k-means clustering, or MaxMin optimization in at least three of these measures.
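The MaxMin function mentioned above is typically optimized greedily: repeatedly add the compound whose minimum distance to the already-selected subset is largest. A self-contained sketch of that baseline (toy 2-D points stand in for descriptor vectors):

```python
def maxmin_subset(points, k, dist):
    """Greedy MaxMin selection: start from the first point, then repeatedly
    add the point whose minimum distance to the chosen set is largest."""
    chosen = [0]
    while len(chosen) < k:
        best, best_d = None, -1.0
        for i in range(len(points)):
            if i in chosen:
                continue
            d = min(dist(points[i], points[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
pts = [(0, 0), (0.1, 0), (5, 5), (0, 0.1), (5, 0)]
sel = maxmin_subset(pts, 3, euclid)   # picks mutually distant points: 0, 2, 4
```

Note how MaxMin deliberately chases the extremes of the space — exactly the outlier-seeking behavior of diverse subsets that a representativeness function is designed to counterbalance.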
Textual sentiment analysis and description characteristics in crowdfunding success : the case of cybersecurity and IoT industries
Crowdfunding platforms offer entrepreneurs the opportunity to evaluate their technologies, validate their market, and raise funding. Such platforms also give technologies an opportunity to rapidly transition from research to market, which is especially crucial in fast-changing industries. In this study, we investigated how the sentiments expressed in the text of project campaigns and the project characteristics influence the success of crowdfunding in innovative industries such as cybersecurity and the Internet of Things (IoT). We examined 657 cybersecurity and IoT projects promoted between 2010 and 2020 on Kickstarter and IndieGoGo, two rewards-based crowdfunding platforms. We extracted technological topic attributes that may influence project success and measured the sentiments of project descriptions using the Valence Aware Dictionary and sEntiment Reasoner (VADER) model. We found that the sentiment of the description and the textual topic characteristics are associated with the success of funding campaigns for cybersecurity and IoT projects.
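VADER is a rule-based model: it sums valence scores from a curated lexicon and normalizes them to a compound score in [-1, 1], with extra rules for negation, intensifiers, punctuation, and capitalization. A deliberately tiny stand-in illustrating just the lexicon-scoring idea (the lexicon below is invented, not VADER's):

```python
# Toy valence lexicon; the real VADER lexicon holds ~7500 human-rated entries.
LEXICON = {"secure": 1.5, "innovative": 2.0, "breakthrough": 2.5,
           "risk": -1.0, "vulnerable": -1.8, "fail": -2.0}

def sentiment_score(text):
    """Average matched valences and squash into [-1, 1], loosely mimicking
    VADER's compound score (without its negation/intensity rules)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0
    return max(-1.0, min(1.0, sum(hits) / (len(hits) * 2.5)))

positive = sentiment_score("an innovative secure breakthrough device")  # > 0
negative = sentiment_score("vulnerable and likely to fail")             # < 0
```

For real use, the `vaderSentiment` package exposes this as `SentimentIntensityAnalyzer().polarity_scores(text)`.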
Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of mathematical models. Both take advantage of the different and deep relationships that can exist between the features of compounds, providing classification of compounds based on such features or, in the case of visualization methods, uncovering underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and representation of chemical space, and the use of different machine learning methods, separately and together, to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forest (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as the specific or multiple disease category(ies) or organ(s) of action of a molecule.
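The simplest way an ensemble like AL Boost reduces bias is by pooling independent votes, so that one model's systematic error is outvoted by the others. A minimal majority-vote sketch (the three stand-in "classifiers" and their cutoffs are hypothetical, not the six learners AL Boost actually combines):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class labels from several models by majority vote
    (ties broken by the order in which labels first appear)."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, sample):
    return majority_vote([m(sample) for m in models])

# Hypothetical stand-in classifiers labeling a compound "drug"/"non-drug".
m1 = lambda s: "drug" if s["mw"] < 500 else "non-drug"   # molecular-weight cutoff
m2 = lambda s: "drug" if s["logp"] < 5 else "non-drug"   # lipophilicity cutoff
m3 = lambda s: "drug" if s["hbd"] <= 5 else "non-drug"   # H-bond-donor cutoff
label = ensemble_predict([m1, m2, m3], {"mw": 430, "logp": 6.2, "hbd": 2})
# m2 disagrees, but the 2-of-3 majority still labels the compound "drug"
```

Production ensembles usually weight votes by validated model accuracy rather than counting them equally, but the combining step is the same shape.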
Nucleophilic and Electrophilic Reactions of Polyynes Catalyzed by an Electric Field: Toward Barcoding of Carbon Nanotubes Like Long Homogeneous Substrates
Computational studies at the B3LYP/6-31+G* level were carried out on the addition of pyridine to polyynes (C6–C18) and on the protonation of polyynes by methyl ammonium fluoride under electric fields of 2.5 and 5 MV/cm. The electric field in each case was oriented along the polyyne axis in a direction that enhances the reaction by stabilizing the incipient dipole. It was found that the pyridine addition reaction is endothermic with a late transition state. The longer the polyyne and the stronger the field, the more efficient the electric-field catalysis. Extrapolation of the data to long polyynes shows that at 1000 nm an electric field of 50 000 V/cm will reduce the barrier by 10 kcal/mol. This reduction is equivalent to 7 orders of magnitude in rate enhancement. A similar barrier reduction could be achieved with a 2.5 MV/cm field at a polyyne length of 20 nm. Protonation reactions were found to be much more affected by the electric field: a reduction of the reaction barrier by 10 kcal/mol using a 2.5 MV/cm electric field could be achieved at a polyyne length of 10 nm. Thus an electric field along the long axis of a substrate could induce a gradient of reactivity which could, in principle, enable the barcoding of substrates by using a sequence of reactants having different reactivities.
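The "7 orders of magnitude" figure follows directly from transition-state/Arrhenius kinetics: with k ∝ exp(-Ea/RT), lowering the barrier by ΔEa multiplies the rate by exp(ΔEa/RT). A quick check, assuming room temperature (298 K):

```python
import math

R = 1.987e-3          # gas constant, kcal/(mol*K)
T = 298.0             # assumed temperature, K
delta_Ea = 10.0       # barrier reduction, kcal/mol

# Rate ratio k_field / k_no_field = exp(delta_Ea / RT)
enhancement = math.exp(delta_Ea / (R * T))   # ~2e7
orders = math.log10(enhancement)             # ~7.3 orders of magnitude
```

So a 10 kcal/mol barrier reduction at 298 K indeed corresponds to roughly seven orders of magnitude in rate.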
Metamorphosis of a Transition State into a Stable Species
Medium variations usually affect the shape of the bimolecular nucleophilic reaction profile at the reactants' and products' ends and, to a much lesser extent, the shape around the transition state. In water, the reactions of extended allylic systems have been computationally shown (for n = 2) to have a single transition state. As the polarity is decreased, the transition state is gradually transformed into a double-humped profile that then changes smoothly through a triple-well profile into a single-well profile where the symmetric structure of the transition state is retained. The depth of the well is ca. 16 kcal/mol for n = 2 and reaches 40 kcal/mol for n = 7, resembling the stability of a weak chemical bond. This is traced to electrostatic effects as well as to the effect of an intermediate VB configuration. In the analogous polyynes, a stable adduct is already formed at n = 1. This is attributed to the formation of the relatively stable vinylic carbanion. As the number of acetylene units increases, the vinylic geometry (a CCC angle of 123°) is gradually lost until at n = 5 the adduct attains a linear geometry.
Dynamics and characteristics of misinformation related to earthquake predictions on Twitter
The spread of misinformation on social media can lead to inappropriate behaviors that can make disasters worse. In our study, we focused on tweets containing misinformation about earthquake predictions and analyzed their dynamics. To this end, we retrieved 82,129 tweets over a period of 2 years (March 2020–March 2022) and hand-labeled 4157 tweets. We used RoBERTa to classify the complete dataset and analyzed the results. We found that (1) there are significantly more not-misinformation than misinformation tweets; (2) earthquake predictions are continuously present on Twitter with peaks after felt events; and (3) prediction misinformation tweets sometimes link or tag official earthquake notifications from credible sources. These insights indicate that official institutions present on social media should continuously address misinformation (even in quiet times when no event occurred), check that their institution is not tagged/linked in misinformation tweets, and provide authoritative sources that can be used to support their arguments against unfounded earthquake predictions.
A Multi-Objective Genetic Algorithm for Outlier Removal
Quantitative structure–activity relationship (QSAR) or quantitative structure–property relationship (QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function (termed "preservation") was added to the algorithm, forcing it to remove certain compounds with low probability only. This option is highly useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
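A simple kNN-based outlier criterion — scoring each compound by its mean distance to its k nearest neighbors in descriptor space — conveys the intuition behind the method above. This is only the scoring kernel, not the multi-objective genetic algorithm itself, and the 2-D points are stand-ins for descriptor vectors:

```python
def knn_outlier_scores(points, k, dist):
    """Score each point by the mean distance to its k nearest neighbors;
    high scores flag candidates that differ from the rest of the data set."""
    scores = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(ds[:k]) / k)
    return scores

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]   # last point is an outlier
scores = knn_outlier_scores(pts, 2, euclid)
outlier_idx = scores.index(max(scores))            # index 4 stands out
```

In a GA setting, such scores can seed or evaluate candidate "remove" sets, with the multi-objective fitness trading outlier removal against diversity preservation.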