260 research outputs found

    Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data

    Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. To classify MS samples, the proteins or peptides that are expressed differently between the classes must be identified (biomarker detection). However, due to the high dimensionality of the data and the small number of samples, classification of MS data is extremely challenging. Another important aspect of biomarker detection is the verification of the detected biomarkers, which acts as an intermediate step before these biomarkers are passed to the experimental validation stage. Biomarker detection aims at altering the input space of the learning algorithm to improve classification of proteomic or metabolomic data. This task is performed through feature manipulation. Feature manipulation consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm that has the intrinsic capability for the three aspects of feature manipulation. The ability of GP for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis, therefore, proposes an embedded methodology for these three aspects of feature manipulation in high-dimensional MS data using GP. The thesis also presents a method for biomarker verification using GP. The thesis investigates the use of GP for both single-objective and multi-objective feature selection and construction. In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP's capability to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of features selected from the top-ranked features. 
The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers. In feature construction, this thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed new algorithm outperforms two feature selection algorithms. In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. This thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimising the cardinality of the constructed new high-level features. The results show that GP can discover the complex relationships between the features and can significantly improve classification performance and reduce the cardinality. For biomarker verification, the thesis proposes the first GP biomarker verification method through measuring the peptide detectability. The method solves the imbalance problem in the data and shows improvement over the benchmark algorithms. Also, the algorithm outperforms a well-known peptide detection method. The thesis also introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help in improving the biomarker detection process.

    Representation and decision making in the immune system

    The immune system has long been attributed cognitive capacities such as "recognition" of pathogenic agents; "memory" of previous infections; "regulation" of a cavalry of detector and effector cells; and "adaptation" to a changing environment and evolving threats. Ostensibly, in preventing disease the immune system must be capable of discriminating states of pathology in the organism; identifying causal agents or "pathogens"; and correctly deploying lethal effector mechanisms. What is more, these behaviours must be learnt insomuch as the paternal genes cannot encode the pathogenic environment of the child. Insights into the mechanisms underlying these phenomena are of interest, not only to immunologists, but to computer scientists pushing the envelope of machine autonomy. This thesis approaches these phenomena from the perspective that immunological processes are inherently inferential processes. By considering the immune system as a statistical decision maker, we attempt to build a bridge between the traditionally distinct fields of biological modelling and statistical modelling. Through a mixture of novel theoretical and empirical analysis we assert the efficacy of competitive exclusion as a general principle that benefits both. For the immunologist, the statistical modelling perspective allows us to better determine that which is phenomenologically sufficient from the mass of observational data, providing quantitative insight that may offer relief from existing dichotomies. For the computer scientist, the biological modelling perspective results in a theoretically transparent and empirically effective numerical method that is able to finesse the trade-off between myopic greediness and intractability in domains such as sparse approximation, continuous learning and boosting weak heuristics. 
Together, we offer this as a modern reformulation of the interface between computer science and immunology, established in the seminal work of Perelson and collaborators, over 20 years ago.

    Computational studies of protein-ligand molecular recognition

    Structure-based drug design is made possible by our understanding of molecular recognition. The utility of this approach was apparent in the development of the clinically effective HIV-1 PR inhibitors, where crystal structures of complexes of HIV-1 protease and inhibitors gave pivotal information. Computational methods drawing upon structural data are of increasing relevance to the drug design process. Nonetheless, these methods are quite rudimentary and significant improvements are needed. The aim of this thesis was to investigate techniques which may lead to improved modelling of molecular recognition and a better ability to make predictions about the binding affinity of ligands. The two main themes were the modelling of acid-base titration behaviour of ligand and receptor, and the application of the simulation technique of configurational bias Monte Carlo (CBMC). The studies were performed with HIV-1 PR and its inhibitors as a model system. Biological processes are influenced by the pH of the medium in which they take place. Ligand-receptor binding equilibria are often thermodynamically linked to protonation changes in ligand and/or receptor, as seen in the binding of a number of HIV-1 PR inhibitors. In Chapter 2, a series of sixteen continuum electrostatics pKa calculations of HIV-1 PR inhibitor complexes was done, in order to characterize the nature and size of these linkages. The most important effects concern changes in the pKa of the enzyme active site aspartate dyad. Large pKa shifts were predicted in all cases, and at least one of the two dyad pKas became more basic on binding. At physiologically relevant pH, different ligands induced different protonation states, with different tautomeric forms favoured. The fully deprotonated form of the dyad was not significantly populated for any of the complexes. For about a third of the complexes, both singly and doubly protonated forms were predicted to be populated. 
The predicted predominant protonation states of MVT-101 and VX-478 were consistent with previous theoretical studies. The size of the predicted pKa shifts for MVT-101 and XK263 differed from a previous study using similar methods. The paucity and ambiguity of available experimental data makes it difficult to evaluate the results fully; however the tendency to exaggerate shifts, as observed in other studies, appears to be present. Scoring is the prediction of binding affinity from the structure of the ligand-receptor complex, according to an empirical scheme. Scoring studies usually neglect or grossly simplify the contribution of protonation equilibria to affinity, so in Chapter 3 proton linkage data was included in a regression analysis of the HIV-1 PR complexes from Chapter 2. Parameters previously shown to correlate with binding, namely electrostatic free energy changes and buried surface areas, were the basis for the analysis, and terms describing proton linkage, in the form of a correction for assay pH and an indicator variable for predicted dyad pKa shift on binding, were also considered. The complex with MVT-101 was an outlier in the analysis and was excluded. Further analysis demonstrated that the correction for assay pH made a significant contribution to the regression equation. Amendment of the parameters for XK263 according to the available experimental data led to an improved regression in which the term for calculated pKa shifts also made a significant contribution. The regression equations obtained had the same form and similar coefficients to scoring functions of the master equation type, and fit the experimental data with comparable accuracy. More physically realistic simulations of ligand-receptor binding using the techniques of molecular dynamics (MD) or Monte Carlo (MC) are potentially more accurate than scoring function approaches. 
These methods are slow, so the alternative of CBMC, which has been shown to give faster convergence for polymer simulations, was implemented for CHARMM 22, an all-atom protein force field (Chapter 4). The correctness of the implementation was demonstrated by comparison with exact and stochastic dynamics (SD) results for individual terms in the force field. The algorithm is more complex than those typically used with alkane force fields, and this has possible consequences for the efficiency. CBMC was used to generate a Ramachandran plot for the alanine dipeptide, and the results were found to be in agreement with those generated by a SD simulation. Analysis of statistical errors suggests that CBMC should be competitive with umbrella sampling for simulating conformational equilibria, particularly when the cost of non-bonded energy evaluations dominates the simulation. CBMC can be applied to ligand-receptor binding, as demonstrated in grand canonical simulations of alkane adsorption in zeolites. The more limited problem of finding the predominant bound conformation of a flexible ligand given a rigid protein receptor (i.e. docking) was treated in Chapter 5, using the example of a tripeptide inhibitor which binds to HIV-1 PR. Attempts to perform the docking using the Metropolis MC/simulated annealing and Lamarckian genetic algorithm methods implemented in the program AutoDock failed to reproduce the native configuration (with runs on the order of two days execution time). Docking using CBMC, combined with parallel tempering to further improve sampling, was successful in finding the native binding mode, although this success was dependent on ad hoc adjustments to the force field, and a priori knowledge of the ligand protonation state and binding site. The efficiency of the method was considerably lower than hoped, with problems due to the force field- and model-dependent coupling between terms in the potential energy function, and the greedy nature of the CBMC algorithm. 
Various conclusions can be drawn from these studies. Chapters 2 and 3 provide evidence of the importance of protonation equilibria in ligand-protein molecular recognition, and underline the sizable contribution of electrostatic interactions to binding energies. In the face of this finding, neglect of electrostatic terms, as often seen in past studies, appears to be counterproductive. The scoring study also shows how experimental data can be used more effectively if factors such as assay conditions are carefully taken into account. Implementation of CBMC for a widely-used protein force field and application of the algorithm to docking (Chapters 4 and 5) represents a proof of concept for a broadly useful simulation technique. Further work will be required to find the right niche for CBMC and fully explore the potential of this and related techniques. A final point is the demonstrated utility of the HIV-1 PR test system which formed the focus of the studies. Abundant structural data has enabled many new approaches to be tested, and further insights are expected from the analysis of unusual cases, such as the anomalous results for MVT-101. As well as the question of scoring, studies of mutation and resistance are likely to attract considerable interest in the future.

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. 
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Personalisation of heart failure care using clinical trial data

    Heart failure is a common, debilitating and life limiting disease, resulting in a large burden for both the individual patient and healthcare provision. Therefore, optimisation of treatments for these patients is of prime importance. Heart failure with reduced ejection fraction has a large evidence base for effective treatments, and more recently effective treatments have started to be identified for those with preserved ejection fraction. The effectiveness of these treatments is calculated at a population level, and there is a great deal of interest to try and identify if different patients may benefit more from certain treatments. In addition, we wish to understand more about different phenotypes in heart failure, to help understand what the patient might expect for the trajectory of their illness and potentially develop targeted treatments. To explore these issues further, this thesis presents several approaches using heart failure clinical trial data to try and further understand the patient journey and explore how treatment may be delivered in a more personalised fashion. The first analyses look at the patterns of heart failure hospitalisations, including the timing of admissions, and the relationship with different modes of death. This was examined in both heart failure with preserved and reduced ejection fraction. The accepted trajectory of recurrent admissions falling closer together over time was confirmed, and admissions closer together were linked to a higher risk of cardiovascular death, particularly due to progressive pump failure. Sudden death did appear to be truly sudden and not strongly linked to hospitalisations. The next approach was to perform latent class analysis to try and identify clusters of patients, or phenotypes, within heart failure with preserved and reduced ejection fraction separately using a data driven method. Phenotypes were identified with consistency across different data and using different approaches. 
These phenotypes were clinically recognisable. Identifying phenotypes in this way may be a route to looking for differential responses to treatments. Lastly, supervised machine learning methods were used to predict outcomes in patients with heart failure and reduced ejection fraction. These techniques provide more analytical flexibility, but did not show a performance benefit compared with prognostic models based on survival analysis methods. Overall, the predictive abilities were modest. In conclusion, several avenues were explored to help understand the patient journey in heart failure, aiming to give more detail about the expected patient trajectory and exploring methods to examine for differential treatment responses in phenotypes of patients in heart failure.

    Biodiversity of benthic biota in the north-eastern Baltic Sea: mapping methodologies, spatial patterns and relationships with environmental variables

    The electronic version of the thesis does not include the publications. Biological diversity ensures ecosystem functioning under changing environmental conditions. Human use of marine and coastal areas has, however, steadily intensified, and marine areas are under ever-increasing pressure from human activities. Decisions on marine protection and management require biodiversity maps, but the conventional methodology based on sample collection is not suitable for mapping large marine areas. In this study, the spatial distribution of species richness of seabed macrovegetation and macroinvertebrates in Estonia was modelled using point-sample biodiversity data and map layers of various environmental variables as input. The highest benthic species richness values were recorded in the West Estonian archipelago. Possible changes in benthic species richness under future climate conditions were also assessed by modelling, and it was found that the species richness of both flora and fauna decreases across most of the Estonian marine area. The modelled species richness map layers were used to study the relationships between shore geomorphology and benthic species richness; statistically significant differences in benthic species richness were found between geomorphological shore types. A methodology was developed for mapping seabed substrate and biota using sonar, underwater video and mathematical modelling. Compared with point-based mapping, the methodology enables the distribution of seabed substrate and biota to be mapped in greater detail. In addition, to facilitate the use of distribution data on benthic species richness and other marine nature values (benthic species, birds, seals) in marine management, dedicated methods were developed: map layers of marine environmental vulnerability (EVP) and risk (ERP) profiles. 
EVP indicates the potential vulnerability of the marine environment to disturbance, and ERP makes it possible to identify areas where the risk to the environment is highest owing to both the slow recovery of biota and high pressure from human activities.
Biodiversity is important for keeping marine ecosystem functionality under changing environmental conditions. The human use of marine areas is increasing worldwide and intensively used marine areas are under increasing pressures. A decrease in marine biodiversity has already taken place. Therefore, knowledge about spatial patterns of biodiversity and its connections with environmental gradients is crucial to detect and follow changes in biodiversity and to form a well-informed basis for the protection and management of marine resources. In this study, the distribution of species richness of seabed macrovegetation and macroinvertebrates was modeled in the Estonian marine area based on previous point-wise sampling data and map layers of environmental variables (water depth, salinity, etc.). The highest biodiversity values were detected in the western Estonia archipelago. Potential changes of species richness in the conditions of future climate change were also estimated by modeling. It was found that biodiversity of both seabed flora and fauna will probably decrease across the Estonian sea area. Modeled benthic biodiversity layers were further used to test the relationships between underwater biodiversity and shore geomorphology, and it was shown that the benthic biodiversity values differ close to different geomorphological shore types. A methodology for mapping seabed substrate and biota using acoustic scanning (sonar), underwater video and mathematical modeling was developed. Compared to the previous point-wise mapping, the new sonar- and modelling-based methodology enables mapping of seabed substrate and biota with significantly higher resolution. 
To facilitate the use of spatial data of biodiversity and other nature values (benthic species, seals, birds) in marine management, marine environmental vulnerability (EVP) and risk (ERP) profiles were developed. EVP identifies environmentally vulnerable areas and ERP identifies areas where environmental risks are highest. https://www.ester.ee/record=b5251349~S