260 research outputs found
Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data
Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. Classifying MS samples requires identifying the proteins or peptides that are expressed differently between the classes (biomarker detection).
However, due to the high dimensionality of the data and the small number of samples, classification of MS data is extremely challenging. Another important aspect of biomarker detection is the verification of the detected biomarkers, which acts as an intermediate step before these biomarkers are passed to the experimental validation stage.
Biomarker detection aims at altering the input space of the learning algorithm for improving classification of proteomic or metabolomic data. This task is performed through feature manipulation.
Feature manipulation consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm that has the intrinsic capability for the three aspects of feature manipulation. The ability of GP for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis, therefore, proposes an embedded methodology for these three aspects of feature manipulation in high dimensional MS data using GP. The thesis also presents a method for biomarker verification, using GP. The thesis investigates the use of GP for both single-objective and multi-objective feature selection and construction.
In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP capability to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of the features selected from the top ranked features. The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers.
In feature construction, this thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed new algorithm outperforms two feature selection algorithms.
In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. This thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimising the cardinality of the constructed new high-level features. The results show that GP can discover the complex relationships between the features and can significantly improve classification performance and reduce the cardinality.
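The trade-off between the two objectives named above (higher classification accuracy, fewer features) can be illustrated with a Pareto-dominance check. This is a minimal sketch and not the thesis's GP algorithm: the `Candidate` tuple, the function names and the sample values are all hypothetical, chosen only to show which candidate feature subsets survive as non-dominated.

```python
from typing import List, Tuple

# Hypothetical representation: (classification accuracy, number of selected features).
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one.

    Accuracy is maximised; the number of features is minimised.
    """
    acc_a, n_a = a
    acc_b, n_b = b
    no_worse = acc_a >= acc_b and n_a <= n_b
    strictly_better = acc_a > acc_b or n_a < n_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up candidate subsets, for illustration only.
candidates = [(0.90, 12), (0.88, 5), (0.90, 20), (0.80, 5), (0.92, 30)]
front = pareto_front(candidates)
print(front)
```

A multi-objective search returns such a front rather than a single best subset, leaving the choice between a smaller or a more accurate subset to the user.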
For biomarker verification, the thesis proposes the first GP biomarker verification method, which works by measuring peptide detectability. The method addresses the class imbalance problem in the data and shows improvement over the benchmark algorithms. The algorithm also outperforms a well-known peptide detection method. The thesis also introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help in improving the biomarker detection process.
Representation and decision making in the immune system
The immune system has long been attributed cognitive capacities such as "recognition" of pathogenic agents; "memory" of previous infections; "regulation" of a cavalry of detector and effector cells; and "adaptation" to a changing environment and evolving threats. Ostensibly, in preventing disease the immune system must be capable of discriminating states of pathology in the organism; identifying causal agents or "pathogens"; and correctly deploying lethal effector mechanisms. What is more, these behaviours must be learnt insomuch as the paternal genes cannot encode the pathogenic environment of the child. Insights into the mechanisms underlying these phenomena are of interest, not only to immunologists, but to computer scientists pushing the envelope of machine autonomy. This thesis approaches these phenomena from the perspective that immunological processes are inherently inferential processes. By considering the immune system as a statistical decision maker, we attempt to build a bridge between the traditionally distinct fields of biological modelling and statistical modelling. Through a mixture of novel theoretical and empirical analysis we assert the efficacy of competitive exclusion as a general principle that benefits both. For the immunologist, the statistical modelling perspective allows us to better determine that which is phenomenologically sufficient from the mass of observational data, providing quantitative insight that may offer relief from existing dichotomies. For the computer scientist, the biological modelling perspective results in a theoretically transparent and empirically effective numerical method that is able to finesse the trade-off between myopic greediness and intractability in domains such as sparse approximation, continuous learning and boosting weak heuristics.
Together, we offer this as a modern reformulation of the interface between computer science and immunology, established in the seminal work of Perelson and collaborators, over 20 years ago.
Computational studies of protein-ligand molecular recognition
Structure-based drug design is made possible by our understanding of molecular recognition. The utility of this approach was apparent in the development of the clinically effective HIV-1 PR inhibitors, where crystal structures of complexes of HIV-1 protease and inhibitors gave pivotal information. Computational methods drawing upon structural data are of increasing relevance to the drug design process. Nonetheless, these methods are quite rudimentary and significant improvements are needed. The aim of this thesis was to investigate techniques which may lead to improved modelling of molecular recognition and a better ability to make predictions about the binding affinity of ligands. The two main themes were the modelling of acid-base titration behaviour of ligand and receptor, and the application of the simulation technique of configurational bias Monte Carlo (CBMC). The studies were performed with HIV-1 PR and its inhibitors as a model system.
Biological processes are influenced by the pH of the medium in which they take place. Ligand-receptor binding equilibria are often thermodynamically linked to protonation changes in ligand and/or receptor, as seen in the binding of a number of HIV-1 PR inhibitors. In Chapter 2, a series of sixteen continuum electrostatics pKa calculations of HIV-1 PR inhibitor complexes was done, in order to characterize the nature and size of these linkages. The most important effects concern changes in the pKa of the enzyme active site aspartate dyad. Large pKa shifts were predicted in all cases, and at least one of the two dyad pKas became more basic on binding. At physiologically relevant pH, different ligands induced different protonation states, with different tautomeric forms favoured. The fully deprotonated form of the dyad was not significantly populated for any of the complexes. For about a third of the complexes, both singly and doubly protonated forms were predicted to be populated. The predicted predominant protonation states of MVT-101 and VX-478 were consistent with previous theoretical studies. The size of the predicted pKa shifts for MVT-101 and XK263 differed from a previous study using similar methods. The paucity and ambiguity of available experimental data make it difficult to evaluate the results fully; however, the tendency to exaggerate shifts, as observed in other studies, appears to be present.
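To see why a pKa shift on binding changes which protonation state is populated, the single-site Henderson-Hasselbalch relation is a useful reference point. The sketch below is a deliberate simplification: it treats one independent titratable site, whereas the aspartate dyad is a pair of coupled sites that the thesis handles with continuum electrostatics calculations. The function name and the pKa and pH values are hypothetical and are not results from the thesis.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of a single, independent titratable site that is protonated.

    Henderson-Hasselbalch: [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Hypothetical numbers for illustration: a site whose pKa rises from 4.0
# (free enzyme) to 6.0 (complex), evaluated at an assay pH of 5.0.
assay_ph = 5.0
free = protonated_fraction(4.0, assay_ph)
bound = protonated_fraction(6.0, assay_ph)
print(f"free: {free:.3f}  bound: {bound:.3f}")
```

With these made-up values the site is mostly deprotonated before binding and mostly protonated afterwards, so binding is accompanied by net proton uptake. This is the kind of linkage between binding and protonation that the paragraph above describes.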
Scoring is the prediction of binding affinity from the structure of the ligand-receptor complex, according to an empirical scheme. Scoring studies usually neglect or grossly simplify the contribution of protonation equilibria to affinity, so in Chapter 3 proton linkage data was included in a regression analysis of the HIV-1 PR complexes from Chapter 2. Parameters previously shown to correlate with binding, namely electrostatic free energy changes and buried surface areas, were the basis for the analysis, and terms describing proton linkage, in the form of a correction for assay pH and an indicator variable for predicted dyad pKa shift on binding, were also considered. The complex with MVT-101 was an outlier in the analysis and was excluded. Further analysis demonstrated that the correction for assay pH made a significant contribution to the regression equation. Amendment of the parameters for XK263 according to the available experimental data led to an improved regression in which the term for calculated pKa shifts also made a significant contribution. The regression equations obtained had the same form and similar coefficients to scoring functions of the master equation type, and fit the experimental data with comparable accuracy.
More physically realistic simulations of ligand-receptor binding using the techniques of molecular dynamics (MD) or Monte Carlo (MC) are potentially more accurate than scoring function approaches. These methods are slow, so the alternative of CBMC, which has been shown to give faster convergence for polymer simulations, was implemented for CHARMM22, an all-atom protein force field (Chapter 4). The correctness of the implementation was demonstrated by comparison with exact and stochastic dynamics (SD) results for individual terms in the force field. The algorithm is more complex than those typically used with alkane force fields, and this has possible consequences for the efficiency. CBMC was used to generate a Ramachandran plot for the alanine dipeptide, and the results were found to be in agreement with those generated by an SD simulation. Analysis of statistical errors suggests that CBMC should be competitive with umbrella sampling for simulating conformational equilibria, particularly when the cost of non-bonded energy evaluations dominates the simulation.
CBMC can be applied to ligand-receptor binding, as demonstrated in grand canonical simulations of alkane adsorption in zeolites. The more limited problem of finding the predominant bound conformation of a flexible ligand given a rigid protein receptor (i.e. docking) was treated in Chapter 5, using the example of a tripeptide inhibitor which binds to HIV-1 PR. Attempts to perform the docking using the Metropolis MC/simulated annealing and Lamarckian genetic algorithm methods implemented in the program AutoDock failed to reproduce the native configuration (with runs on the order of two days execution time). Docking using CBMC, combined with parallel tempering to further improve sampling, was successful in finding the native binding mode, although this success was dependent on ad hoc adjustments to the force field, and a priori knowledge of the ligand protonation state and binding site. The efficiency of the method was considerably lower than hoped, with problems due to the force field- and model-dependent coupling between terms in the potential energy function, and the greedy nature of the CBMC algorithm.
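Parallel tempering, mentioned above as the device used to improve sampling, runs replicas at several temperatures and periodically proposes exchanging configurations between them. The sketch below shows only the standard Metropolis acceptance probability for one such swap. It is a generic illustration and not the thesis's implementation: the function name, the temperatures and the energies are hypothetical, and energies are assumed to be in kcal/mol.

```python
import math

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
K_B = 0.0019872041


def swap_probability(temp_i: float, temp_j: float,
                     energy_i: float, energy_j: float) -> float:
    """Metropolis acceptance probability for exchanging two replicas.

    p = min(1, exp((beta_i - beta_j) * (E_i - E_j))), with beta = 1 / (k_B * T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # Clamping the exponent at zero gives min(1, exp(delta)) without overflow.
    return math.exp(min(0.0, delta))


# Hypothetical replicas: when the colder replica holds the higher-energy
# configuration, the swap is always accepted.
p_favourable = swap_probability(300.0, 400.0, -10.0, -20.0)
# The reverse arrangement is accepted only occasionally.
p_unfavourable = swap_probability(300.0, 400.0, -20.0, -10.0)
print(p_favourable, p_unfavourable)
```

Swaps let low-energy configurations found at high temperature migrate down to the temperature of interest, which is how the method helps a docking search escape local minima.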
Various conclusions can be drawn from these studies. Chapters 2 and 3 provide evidence of the importance of protonation equilibria in ligand-protein molecular recognition, and underline the sizable contribution of electrostatic interactions to binding energies. In the face of this finding, neglect of electrostatic terms, as often seen in past studies, appears to be counterproductive. The scoring study also shows how experimental data can be used more effectively if factors such as assay conditions are carefully taken into account. Implementation of CBMC for a widely-used protein force field and application of the algorithm to docking (Chapters 4 and 5) represents a proof of concept for a broadly useful simulation technique. Further work will be required to find the right niche for CBMC and fully explore the potential of this and related techniques. A final point is the demonstrated utility of the HIV-1 PR test system which formed the focus of the studies. Abundant structural data has enabled many new approaches to be tested, and further insights are expected from the analysis of unusual cases, such as the anomalous results for MVT-101. As well as the question of scoring, studies of mutation and resistance are likely to attract considerable interest in the future.
Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments
Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy, so it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar.
It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.
Evolutionary and deep mining models for effective biomarker discovery
With the advent of high-throughput biology, large amounts of molecular data are available for purposeful analysis and evaluation. Extracting relevant knowledge from high-throughput biomedical datasets has become a common goal of current approaches to personalised cancer medicine and understanding cancer genotype and phenotype. However, the datasets are characterised by high dimensionality and relatively small sample sizes with small signal-to-noise ratios. Extracting and interpreting relevant knowledge from such complex datasets therefore remains a significant challenge for the fields of machine learning and data mining. This is evidenced by the limited success these methods have had in detecting robust and reliable biomarkers for cancers and other complicated diseases. This could also explain the lack of finding generic biomarkers among the identified published genes for identical diseases or clinical conditions.
This thesis proposes and evaluates the efficacy of two novel feature mining models established on the basis of the evolutionary computation and deep learning paradigms to position and solve biomarker discovery as an optimisation problem. Deep learning methods lack the transparency and interpretability found in the evolutionary paradigm. To overcome the inherent issue of poor explanatory power associated with deep learning, this research also introduces a novel deep mining model that helps to deconstruct the internal state of such deep learning models to reveal key determinants underlying their latent representations to aid feature selection. As a result, salient biomarkers for breast cancer and the positivity of the Estrogen and Progesterone receptors are discovered robustly and validated reliably across a wide range of independently generated breast cancer data samples.
Personalisation of heart failure care using clinical trial data
Heart failure is a common, debilitating and life limiting disease, resulting in a large burden for both the individual patient and healthcare provision. Therefore, optimisation of treatments for these patients is of prime importance. Heart failure with reduced ejection fraction has a large evidence base for effective treatments, and more recently effective treatments have started to be identified for those with preserved ejection fraction. The effectiveness of these treatments is calculated at a population level, and there is a great deal of interest to try and identify if different patients may benefit more from certain treatments. In addition, we wish to understand more about different phenotypes in heart failure, to help understand what the patient might expect for the trajectory of their illness and potentially develop targeted treatments. To explore these issues further, this thesis presents several approaches using heart failure clinical trial data to try and further understand the patient journey and explore how treatment may be delivered in a more personalised fashion.
The first analyses look at the patterns of heart failure hospitalisations, including the timing of admissions, and the relationship with different modes of death. This was examined in both heart failure with preserved and reduced ejection fraction. The accepted trajectory of recurrent admissions falling closer together over time was confirmed, and admissions closer together were linked to a higher risk of cardiovascular death, particularly due to progressive pump failure. Sudden death did appear to be truly sudden and not strongly linked to hospitalisations.
The next approach was to perform latent class analysis to try and identify clusters of patients, or phenotypes, within heart failure with preserved and reduced ejection fraction separately using a data driven method. Phenotypes were identified with consistency across different data and using different approaches. These phenotypes were clinically recognisable. Identifying phenotypes in this way may be a route to looking for differential responses to treatments.
Lastly, supervised machine learning methods were used to predict outcomes in patients with heart failure and reduced ejection fraction. These techniques provide more analytical flexibility, but did not show performance benefit compared with prognostic models based on survival analysis methods. Overall, the predictive abilities were modest.
In conclusion, several avenues were explored to help understand the patient journey in heart failure, aiming to give more detail about the expected patient trajectory and exploring methods to examine differential treatment responses across phenotypes of patients with heart failure.
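Biological diversity of the benthic biota of the north-eastern Baltic Sea: mapping methodologies, spatial patterns and relationships with environmental variables (Läänemere kirdeosa põhjaelustiku bioloogiline mitmekesisus: kaardistamise metoodikad, ruumilised mustrid ja seosed keskkonnamuutujatega)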
The electronic version of the dissertation does not include the publications.
Biodiversity is important for maintaining marine ecosystem functioning under changing environmental conditions. Human use of marine areas is increasing worldwide, and intensively used marine areas are under growing pressure. A decrease in marine biodiversity has already taken place. Knowledge of the spatial patterns of biodiversity and its connections with environmental gradients is therefore crucial for detecting and following changes in biodiversity and for forming a well-informed basis for the protection and management of marine resources. Conservation and management decisions need biodiversity maps, but the conventional methodology based on sample collection is not suitable for mapping large marine areas. In this study, the distribution of species richness of seabed macrovegetation and macroinvertebrates was modelled in the Estonian marine area based on previous point-wise sampling data and map layers of environmental variables (water depth, salinity, etc.). The highest biodiversity values were detected in the West Estonian archipelago. Potential changes in species richness under future climate conditions were also estimated by modelling. It was found that the biodiversity of both seabed flora and fauna will probably decrease across most of the Estonian marine area. The modelled benthic biodiversity layers were further used to test the relationships between underwater biodiversity and shore geomorphology, and statistically significant differences in benthic biodiversity were found between geomorphological shore types. A methodology for mapping seabed substrate and biota using acoustic scanning (sonar), underwater video and mathematical modelling was developed. Compared to the previous point-wise mapping, the new sonar- and modelling-based methodology enables mapping of seabed substrate and biota at significantly higher resolution.
To facilitate the use of spatial data on biodiversity and other nature values (benthic species, seals, birds) in marine management, marine environmental vulnerability profiles (EVP) and environmental risk profiles (ERP) were developed as map layers. EVP indicates the potential vulnerability of the marine environment to disturbance, and ERP identifies areas where environmental risk is highest, owing both to the long recovery time of the biota and to high pressure from human activities.
https://www.ester.ee/record=b5251349~S