74 research outputs found
Evolutionary Computation and QSAR Research
[Abstract] The successful high-throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. Virtual molecular filtering and screening rely greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect and poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning or artificial intelligence. QSAR modeling relies on three main steps: codification of the molecular structure into molecular descriptors, selection of the variables relevant to the analyzed activity, and search for the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, the selection methods for high-dimensional data in QSAR, the methods to build QSAR models, the current evolutionary feature selection methods and applications in QSAR, and the future trend towards joint or multi-task feature selection methods.
Funding: Instituto de Salud Carlos III, PIO52048; Instituto de Salud Carlos III, RD07/0067/0005; Ministerio de Industria, Comercio y Turismo, TSI-020110-2009-53; Galicia. Consellería de Economía e Industria, 10SIN105004P
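As a minimal sketch of the evolutionary feature selection that this review covers, a genetic algorithm can evolve bit masks over the molecular descriptors. The function names, parameter values and toy fitness below are illustrative assumptions, not taken from the review; in practice the fitness would typically be something like the cross-validated performance of a QSAR model built on the selected descriptors.

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Evolve 0/1 masks over n_features (n_features >= 2); return (best_mask, history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # tournament selection of two parents (tournament size 3)
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # one-point crossover
            cut = rng.randint(1, n_features - 1)
            child = p1[:cut] + p2[cut:]
            # bit-flip mutation
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Hypothetical toy fitness: reward three "relevant" descriptors,
# penalise the total number of descriptors selected.
RELEVANT = {0, 3, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in RELEVANT) - 0.1 * sum(mask)

best_mask, history = ga_select(8, toy_fitness)
print(best_mask, history[-1])
```

Because of the elitism step, the best fitness recorded in `history` never decreases from one generation to the next.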
Recommended from our members
An Overview of the Use of Neural Networks for Data Mining Tasks
In recent years the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as research activity, aimed at developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights, from a data mining perspective, the implementation of NN using supervised and unsupervised learning for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.
Bio-AIMS collection of chemoinformatics web tools based on molecular graph information and artificial intelligence models
[Abstract] Encoding molecular information into molecular descriptors is the first step of in silico Chemoinformatics methods in Drug Design. Machine Learning methods are a complex solution for finding prediction models for specific biological properties of molecules. These models connect molecular structure information, such as atom connectivity (molecular graphs) or the physical-chemical properties of an atom or group of atoms, to the molecular activity (Quantitative Structure - Activity Relationship, QSAR). Due to the complexity of proteins, the prediction of their activity is a complicated task and the interpretation of the models is more difficult. The current review presents a series of 11 prediction models for proteins, implemented as free Web tools on an Artificial Intelligence Model Server in Biosciences, Bio-AIMS (http://bio-aims.udc.es/TargetPred.php). Six tools predict protein activity, two models evaluate drug - protein target interactions and the other three calculate protein - protein interactions. The input information is based on the protein 3D structure for nine models, the 1D peptide amino acid sequence for three tools and drug SMILES formulas for two servers. The molecular graph descriptor-based Machine Learning models could be useful tools for in silico screening of new peptides/proteins as future drug targets for specific treatments.
Funding: Red Gallega de Investigación y Desarrollo de Medicamentos, R2014/025; Instituto de Salud Carlos III, PI13/0028
How to find simple and accurate rules for viral protease cleavage specificities
Abstract
Background: Proteases of human pathogens are becoming increasingly important drug targets, hence it is necessary to understand their substrate specificity and to interpret this knowledge in practically useful ways. New methods are being developed that produce large amounts of cleavage information for individual proteases, and some have been applied to extract cleavage rules from data. However, the methods proposed so far for extracting rules have been neither easy to understand nor very accurate. To be practically useful, cleavage rules should be accurate, compact, and expressed in an easily understandable way.
Results: A new method is presented for producing cleavage rules for viral proteases with seemingly complex cleavage profiles. The method is based on orthogonal search-based rule extraction (OSRE) combined with spectral clustering. It is demonstrated on substrate data sets for human immunodeficiency virus type 1 (HIV-1) protease and hepatitis C virus (HCV) NS3/4A protease, showing excellent prediction performance for both HIV-1 cleavage and HCV NS3/4A cleavage, and agreeing with observed HCV genotype differences. New cleavage rules (consensus sequences) are suggested for HIV-1 and HCV NS3/4A cleavages. The practical usability of the method is also demonstrated by using it to predict the location of an internal cleavage site in the HCV NS3 protease and to correct the location of a previously reported internal cleavage site in the HCV NS3 protease. The method is fast to converge and yields accurate rules, on par with previous results for HIV-1 protease and better than the previous state of the art for HCV NS3/4A protease. Moreover, the rules are fewer and simpler than those previously obtained with rule extraction methods.
Conclusion: A rule extraction methodology that searches for multivariate low-order predicates yields results that significantly outperform existing rule bases on out-of-sample data and are also more transparent to expert users. The approach yields rules that are easy to use and useful for interpreting experimental data.
Cat Swarm based Optimization of Gene Expression Data Classification
Abstract: An Artificial Neural Network (ANN) is capable of providing solutions to various complex problems. The generalization ability of an ANN, which stems from its massively parallel processing capability, can be used to learn the patterns in a data set, and these patterns can be represented as a set of rules. The rules can then be used to solve a classification problem. The learning ability of an ANN is degraded by high dimensionality of the datasets. To minimize this risk, we have used Principal Component Analysis (PCA) and Factor Analysis (FA), which provide a feature-reduced dataset to the Multi Layer Perceptron (MLP), the classifier used. Furthermore, since the weight matrices are randomly initialized, in this paper we have used the Cat Swarm Optimization (CSO) method to update the weight values of the weight matrix. The experimental evaluation found that using CSO with the MLP classifier provides better classification accuracy than using the classifier alone.
An improved bees algorithm local search mechanism for numerical dataset
The Bees Algorithm (BA), a heuristic optimization procedure, is one of the fundamental search techniques and is based on the food foraging activities of bees. The algorithm performs a kind of exploitative neighbourhood search combined with random explorative search. However, the main issue with BA is that it requires a long computational time and numerous computational processes to obtain a good solution, especially for more complicated problems. The approach does not guarantee an optimum solution for the problem, mainly because of a lack of accuracy. To address this issue, the local search in BA is investigated. Simple swap, 2-Opt and 3-Opt were previously proposed as the Massudi methods for Bees Algorithm Feature Selection (BAFS). In this study, an extension that uses 4-Opt as the search neighbourhood is proposed. The proposal was implemented, and its performance was comprehensively compared and analysed with respect to accuracy and time. Furthermore, the feature selection algorithm was implemented and tested using the most popular datasets from the UCI Machine Learning Repository. The experimental results confirmed that the proposed extension of the search neighbourhood, including the 4-Opt approach, provides better accuracy in a suitable time than the Massudi methods.
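The simplest of the neighbourhoods named above, the swap move, can be illustrated with a generic hill-climbing sketch: one selected feature is exchanged for one unselected feature, and the best improving exchange is accepted until none remains. This is an assumption-laden illustration of a swap neighbourhood in general, not the Massudi methods or the 4-Opt extension from the paper; the function name, the weights and the additive scoring function are hypothetical.

```python
def swap_local_search(selected, n_features, score):
    """Best-improvement hill climbing over single-feature swaps."""
    current = set(selected)
    current_score = score(current)
    while True:
        best_neighbour, best_score = None, current_score
        for out in sorted(current):            # feature to drop
            for inn in range(n_features):      # feature to add
                if inn in current:
                    continue
                neighbour = (current - {out}) | {inn}
                s = score(neighbour)
                if s > best_score:
                    best_neighbour, best_score = neighbour, s
        if best_neighbour is None:             # no improving swap left
            return sorted(current), current_score
        current, current_score = best_neighbour, best_score

# Hypothetical scoring: each feature has a fixed integer "usefulness".
weights = [1, 9, 3, 8, 2, 7]

def toy_score(subset):
    return sum(weights[i] for i in subset)

print(swap_local_search([0, 2, 4], 6, toy_score))  # ([1, 3, 5], 24)
```

In a real feature selection setting the score would be a classifier's validation accuracy on the chosen subset, which is far more expensive to evaluate than this toy function.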
Investigating the structural diversity within a committee of classifiers and their generalization performance
This study investigates measures of diversity within ensembles of classifiers. Neural
networks are used, and ensemble diversity is measured with statistical and
ecological methods and, to some extent, information theory. A new way of looking at ensemble
diversity is proposed. This ensemble diversity is called ensemble structural diversity, for this
study is concerned with the diversity within the structure of the individual classifiers forming an
ensemble and not via the outcomes of the individual classifiers. Ensemble structural diversity
was also induced within the ensemble by varying the structural parameters (learning parameters)
of the artificial machines (classifiers). The importance or the use of these measures was judged
by comparing the measure of structural diversity and the ensemble generalization performance.
This was done so that comparisons can be made on the robustness of the idea of structural
diversity and its relationship with ensemble generalization performance. It was found that
diversity could be induced by having ensembles with different structural and implicit (e.g.
learning) parameters and that this diversity does influence the predictive ability of the ensemble.
This was consistent with the literature, even though in the literature ensemble diversity was viewed
from the output as opposed to the structure of the individual classifiers. As the structural
diversity increased so did the generalization performance. However there was a point where
structural diversity decreased the generalization performance of the ensemble, where from that
point onwards as the structural diversity increased the generalization performance decreased.
This makes sense because too much diversity within the ensemble might mean no consensus
is reached at all. A disadvantage of comparing structural diversity and the generalization
performance (accuracy) of the ensemble is that an ensemble can be structurally diverse even
though all the classifiers within the ensemble approximate the same function, in which
case structural diversity is meaningless in terms of improving the accuracy of the ensemble. The
use of ensemble structural diversity measures in developing efficient ensembles still remains to
be explored. This study, however, has also shown that diversity can be measured from the
structural parameters, which reduces the abstractness of diversity: being able to
quantify structural diversity makes it possible to map a relationship between structural diversity
and accuracy. It was observed that structural diversity does improve the accuracy of the
ensemble, but only within a limited region of structural diversity.
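One way to make the idea of quantifying structural diversity concrete is the Shannon index, which is used in ecology and comes from information theory. The sketch below is only an assumption about how such a measure could be computed: the abstract does not say which indices the study used, and describing each classifier by a tuple of structural parameters is hypothetical.

```python
import math
from collections import Counter

def shannon_diversity(structures):
    """Shannon index H = -sum(p_i * ln p_i) over distinct structures."""
    counts = Counter(structures)
    total = len(structures)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Each classifier is described by (hidden units, learning rate); hypothetical values.
uniform_ensemble = [(5, 0.1)] * 4
mixed_ensemble = [(3, 0.1), (5, 0.1), (7, 0.1), (9, 0.1)]

print(shannon_diversity(uniform_ensemble))  # no structural diversity
print(shannon_diversity(mixed_ensemble))    # maximal for four members: ln 4
```

An ensemble of identical structures scores zero, and an ensemble in which every member differs scores the natural logarithm of the ensemble size.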
Coarse-grained modelling of protein structure and internal dynamics: comparative methods and applications
The first chapter is devoted to a brief summary of the basic techniques commonly
used to characterise proteins' internal dynamics, and to perform those primary analyses
which are the basis for our further developments. To this purpose we recall the
basics of Principal Component Analysis of the covariance matrix of molecular dynamics
(MD) trajectories. The overview is aimed at motivating and justifying a posteriori the
introduction of coarse-grained models of proteins.
In the second chapter we shall discuss dynamical features shared by different conformers
of a protein. We'll review previously obtained results, concerning the universality
of the vibrational spectrum of globular proteins and the self-similar free energy
landscape of specific molecules, namely the G-protein and Adk. Finally, a novel technique
will be discussed, based on the theory of Random Matrices, to extract the robust
collective coordinates in a set of protein conformers by comparison with a stochastic
reference model.
The third chapter reports on an extensive investigation of protein internal dynamics
modelled in terms of the relative displacement of quasi-rigid groups of amino acids.
Making use of the results obtained in the previous chapters, we shall discuss the development
of a strategy to optimally partition a protein in units, or domains, whose
internal strain is negligible compared to their relative fluctuation. These partitions will
be used in turn to characterise the dynamical properties of proteins in the framework
of a simplified, coarse-grained, description of their motion.
In the fourth chapter we shall report on the possibility to use the collective fluctuations
of proteins as a guide to recognise relationships between them that may not be
captured as significant when sequence or structural alignment methods are used. We
shall review a method to perform the superposition of two proteins optimising the similarity
of the structures as well as the dynamical consistency of the aligned regions; then,
we shall discuss a generalisation of this scheme to accelerate the dynamics-based
alignment, in the perspective of dataset-wide applications.
Finally, the fifth chapter focuses on a different topic, namely the occurrence of
topologically-entangled states (knots) in proteins. Specifically, we shall investigate
the sequence and structural properties of knotted proteins, reporting on an exhaustive
dataset-wide comparison with unknotted ones. The correspondence, or the lack thereof,
between knotted and unknotted proteins allowed us to identify, in knotted chains, small
segments of the backbone whose 'virtual' excision results in an unknotted structure.
These 'knot-promoting' loops are thus hypothesised to be involved in the formation of
the protein knot, which in turn is likely to play some role in the biological function of
the knotted proteins
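The Principal Component Analysis of the covariance matrix of MD trajectories mentioned in the first chapter summary can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: frames are taken to be already superimposed coordinate vectors, only the leading component is extracted (by power iteration), and all function names are hypothetical rather than taken from the thesis.

```python
import math

def covariance_matrix(frames):
    """Sample covariance of coordinate vectors, one vector per trajectory frame."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    centred = [[f[j] - mean[j] for j in range(d)] for f in frames]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_principal_component(cov, iterations=200):
    """Leading eigenpair of a covariance matrix by power iteration.

    Starts from a uniform vector, so it assumes the leading mode is not
    exactly orthogonal to that vector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance: no motion to analyse
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Toy "trajectory": a single particle moving along x only.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
eigenvalue, mode = top_principal_component(covariance_matrix(frames))
print(eigenvalue, mode)
```

For real trajectories with thousands of coordinates, a linear algebra library would normally be used to obtain the full eigendecomposition instead of a single mode.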
Protein function and inhibitor prediction by statistical learning approach
Ph.D. (DOCTOR OF PHILOSOPHY)
The use of machine learning to improve the effectiveness of ANRS in predicting HIV drug resistance.
Master of TeleHealth in Medical Informatics. University of KwaZulu-Natal, Durban, 2016.
BACKGROUND
HIV has placed a large burden of disease in developing countries. HIV drug resistance is
inevitable due to selective pressure. Computer algorithms have been proven to help in
determining optimal treatment for HIV drug resistance patients. One such algorithm is the
ANRS gold standard interpretation algorithm developed by the French National Agency for
AIDS Research AC11 Resistance group.
OBJECTIVES
The aim of this study is to investigate the possibility of improving the accuracy of the ANRS
gold standard in predicting HIV drug resistance.
METHODS
Data consisting of genome sequence and an HIV drug resistance measure was obtained from
the Stanford HIV database. Machine learning factor analysis was performed to determine
sequence positions where mutations lead to drug resistance. Sequence positions not found
in ANRS were added to the ANRS rules and accuracy was recalculated.
RESULTS
The machine learning algorithm found sequence positions that are not associated with ANRS but
that the model suggests are important in the prediction of HIV drug resistance. Preliminary
results show that for IDV 10 sequence positions were found that were not associated
with ANRS rules, 4 for LPV, and 8 for NFV. For NFV, ANRS misclassified 74 resistant profiles
as being susceptible to the ARV. Sixty-eight of the 74 sequences (92%) were classified as
resistant with the inclusion of the eight new sequence positions. No change was found
for LPV and a 78% improvement was associated with IDV.
CONCLUSION
The study shows that there is a possibility of improving ANRS accuracy
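The NFV figure reported in the results can be checked with a line of arithmetic. The helper name below is hypothetical; only the counts 68 and 74 come from the abstract.

```python
def percent_reclassified(corrected, misclassified):
    """Share of previously misclassified profiles now classified correctly, in percent."""
    return 100.0 * corrected / misclassified

# NFV: 68 of the 74 profiles that ANRS had called susceptible
print(round(percent_reclassified(68, 74)))  # 92
```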
- …