    Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression

    Background: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to be able to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. Results: The DynDom database of protein domain movements comprises sequences annotated to indicate whether the amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, this data was used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though it is the most rigid amino acid. As hinge-bending regions have been previously shown to occur frequently at the terminal regions of the secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that improvement in performance occurs up to very long window lengths (eighty residues) indicating long-range correlations. 
Conclusion: In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that, in the prediction of hinge-bending regions, a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk
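The abstract does not say how the statistical propensities were computed. As a minimal sketch, assuming the common definition of propensity as the frequency of an amino acid among hinge residues divided by its frequency among all residues, the calculation could look like the following. The function name, the "H" label convention, and the toy sequences are hypothetical and are not taken from DynDom.

```python
from collections import Counter


def hinge_propensities(sequences, labels, hinge_mark="H"):
    """Return {amino_acid: propensity} for hinge-bending regions.

    Assumed definition (not confirmed by the abstract):
        propensity(a) = (fraction of hinge residues that are a)
                        / (fraction of all residues that are a)
    Values above 1 suggest the residue favours hinge regions,
    values below 1 suggest it disfavours them.
    """
    hinge_counts = Counter()
    all_counts = Counter()
    for seq, lab in zip(sequences, labels):
        if len(seq) != len(lab):
            raise ValueError("sequence and label strings must have equal length")
        for residue, mark in zip(seq, lab):
            all_counts[residue] += 1
            if mark == hinge_mark:
                hinge_counts[residue] += 1

    n_hinge = sum(hinge_counts.values())
    n_all = sum(all_counts.values())
    if n_hinge == 0:
        raise ValueError("no hinge residues in the input")

    # Counter returns 0 for residues never seen in a hinge region.
    return {
        aa: (hinge_counts[aa] / n_hinge) / (all_counts[aa] / n_all)
        for aa in all_counts
    }


# Toy data, invented for illustration only.
toy_sequences = ["ACPPA", "CAPAC"]
toy_labels = ["--HH-", "--H--"]  # "H" = hinge residue, "-" = intradomain
propensities = hinge_propensities(toy_sequences, toy_labels)
```

On real data the hinge class is much smaller than the intradomain class, so propensities for rare residues rest on few counts and should be read with that in mind.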

    New strategy for the identification of prostate cancer: The combination of Proclarix and the prostate health index

    Prostate health index (PHI) and, more recently, Proclarix have been proposed as serum biomarkers for prostate cancer (PCa). In this study, we aimed to evaluate Proclarix and PHI for predicting clinically significant prostate cancer (csPCa)

    Evaluation of Intelligent Medical Systems

    This thesis presents novel, robust, analytic and algorithmic methods for calculating Bayesian posterior intervals of receiver operating characteristic (ROC) curves and confusion matrices used for the evaluation of intelligent medical systems tested with small amounts of data. Intelligent medical systems are potentially important in encapsulating rare and valuable medical expertise and making it more widely available. The evaluation of intelligent medical systems must make sure that such systems are safe and cost effective. To ensure systems are safe and perform at expert level, they must be tested against human experts. Human experts are rare and busy, which often severely restricts the number of test cases that may be used for comparison. The performance of an expert, human or machine, can be represented objectively by ROC curves or confusion matrices. ROC curves and confusion matrices are complex representations and it is sometimes convenient to summarise them as a single value. In the case of ROC curves, this is given as the Area Under the Curve (AUC), and for confusion matrices by kappa, or weighted kappa statistics. While there is extensive literature on the statistics of ROC curves and confusion matrices, the existing methods are not applicable to the measurement of intelligent systems when tested with small data samples, particularly when the AUC or kappa statistic is high. A fundamental Bayesian study has been carried out, and new methods devised, to provide better statistical measures for ROC curves and confusion matrices at low sample sizes. They enable exact Bayesian posterior intervals to be produced for: (1) the individual points on a ROC curve; (2) comparison between matching points on two uncorrelated curves; (3) the AUC of a ROC curve, using both parametric and nonparametric assumptions; (4) the parameters of a parametric ROC curve; and (5) the weight of a weighted confusion matrix.
These new methods have been implemented in software to provide a powerful and accurate tool for developers and evaluators of intelligent medical systems in particular, and to a much wider audience using ROC curves and confusion matrices in general. This should enhance the ability to prove intelligent medical systems safe and effective and should lead to their widespread deployment. The mathematical and computational methods developed in this thesis should also provide the basis for future research into determination of posterior intervals for other statistics at small sample sizes
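To illustrate what a posterior interval for a single point on a ROC curve means at small sample sizes, here is a minimal sketch. It is not the thesis's exact method. It assumes a binomial likelihood for the true-positive rate with a Beta(1, 1) prior, which gives a Beta posterior, and it approximates an equal-tailed interval by numerical integration on a grid. The function name and the example counts are hypothetical.

```python
def beta_posterior_interval(successes, failures, level=0.95,
                            prior_a=1.0, prior_b=1.0, grid_points=20001):
    """Approximate equal-tailed posterior interval for a proportion.

    Posterior is Beta(prior_a + successes, prior_b + failures).
    Assumes prior_a >= 1 and prior_b >= 1 so the density is finite at 0 and 1.
    The interval is found by trapezoidal integration of the unnormalised
    density, so it is a numerical approximation, not an exact result.
    """
    a = prior_a + successes
    b = prior_b + failures

    # i / (n - 1) keeps every grid point inside [0, 1] exactly.
    xs = [i / (grid_points - 1) for i in range(grid_points)]
    density = [x ** (a - 1) * (1.0 - x) ** (b - 1) for x in xs]

    # Cumulative trapezoidal integral of the unnormalised density.
    step = 1.0 / (grid_points - 1)
    cdf = [0.0]
    for i in range(1, grid_points):
        cdf.append(cdf[-1] + 0.5 * (density[i - 1] + density[i]) * step)
    total = cdf[-1]

    tail = (1.0 - level) / 2.0
    lower = next(x for x, c in zip(xs, cdf) if c / total >= tail)
    upper = next(x for x, c in zip(xs, cdf) if c / total >= 1.0 - tail)
    return lower, upper


# Hypothetical test set: 10 diseased cases, 9 detected by the system.
sens_low, sens_high = beta_posterior_interval(successes=9, failures=1)
```

With only ten positive cases the interval is wide even though nine were detected, which is the small-sample situation the thesis addresses.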

    Genetic ancestry inference from cancer-derived molecular data across genomic and transcriptomic platforms

    Genetic ancestry-oriented cancer research requires the ability to perform accurate and robust genetic ancestry inference from existing cancer-derived data, including whole exome sequencing, transcriptome sequencing, and targeted gene panels, very often in the absence of matching cancer-free genomic data. Here we examined the feasibility and accuracy of computational inference of genetic ancestry relying exclusively on cancer-derived data. A data synthesis framework was developed to optimize and assess the performance of the ancestry inference for any given input cancer-derived molecular profile. In its core procedure, the ancestral background of the profiled patient is replaced with one of any number of individuals with known ancestry. The data synthesis framework is applicable to multiple profiling platforms, making it possible to assess the performance of inference specifically for a given molecular profile and separately for each continental-level ancestry; this ability extends to all ancestries, including those without statistically sufficient representation in the existing cancer data. The inference procedure was demonstrated to be accurate and robust in a wide range of sequencing depths. Testing of the approach in four representative cancer types and across three molecular profiling modalities showed that continental-level ancestry of patients can be inferred with high accuracy, as quantified by its agreement with the gold standard of deriving ancestry from matching cancer-free molecular data. This study demonstrates that vast amounts of existing cancer-derived molecular data are potentially amenable to ancestry-oriented studies of the disease without requiring matching cancer-free genomes or patient self-reported ancestry

    Development of a minimally invasive molecular biomarker for early detection of lung cancer

    The diagnostic evaluation of ever smokers with pulmonary nodules represents a growing clinical challenge due to the implementation of lung cancer screening. The high false-positive rate of screening frequently results in the use of unnecessary invasive procedures in patients who are ultimately diagnosed as benign, clearly highlighting the need for additional diagnostic approaches. We previously derived and validated a bronchial epithelial gene-expression biomarker to detect lung cancer in ever smokers. However, bronchoscopy is not always chosen as a diagnostic modality. Given that bronchial and nasal epithelial gene-expression are similarly altered by cigarette smoke exposure, we sought to determine if cancer-associated gene-expression might also be detectable in the more readily accessible nasal epithelium. Nasal epithelial brushings were prospectively collected from ever smokers undergoing diagnostic evaluation for lung cancer in the AEGIS-1 (n=375) and AEGIS-2 (n=130) clinical trials and gene-expression profiled using microarrays. The computational framework used to discover biomarkers in these data was formalized and implemented in an open-source R-package. We identified 535 genes in the nasal epithelium of AEGIS-1 patients whose expression was associated with lung cancer status. Using matched bronchial gene-expression data from a subset of these patients, we found significantly concordant cancer-associated gene-expression alterations between the two airway sites. A nasal lung cancer classifier derived in the AEGIS-1 cohort that combined clinical factors and nasal gene-expression had significantly higher AUC (0.81) and sensitivity (0.91) than the clinical-factor model alone in independent samples from the AEGIS-2 cohort. These results support the idea that the airway epithelial field of lung cancer-associated injury extends to the nose and demonstrate the potential of using nasal gene-expression as a non-invasive biomarker for lung cancer detection.
The framework for deriving this biomarker was generalized and implemented in an open-source R-package. The package provides a computational pipeline to compare biomarker development strategies using microarray data. The results from this pipeline can be used to highlight the optimal model development parameters for a given dataset leading to more robust and accurate models. This package provides the community with a novel and powerful tool to facilitate biomarker discovery in microarray data
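The two metrics quoted for the nasal classifier, AUC and sensitivity, can be computed from classifier scores as in the minimal sketch below. This is illustrative Python, not the authors' R package. The function names, the decision threshold, and the toy scores are assumptions made for the example.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive case outscores a negative case.

    Uses the pairwise (Mann-Whitney) formulation; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity(pos_scores, threshold):
    """Fraction of positive cases scored at or above the threshold."""
    return sum(score >= threshold for score in pos_scores) / len(pos_scores)


# Toy scores, invented for illustration only.
cancer_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2]

toy_auc = auc_from_scores(cancer_scores, benign_scores)
toy_sensitivity = sensitivity(cancer_scores, threshold=0.5)
```

AUC does not depend on a threshold, whereas sensitivity does, so a comparison between two models on sensitivity is only meaningful once the operating point is fixed.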

    BRAIN CONNECTIVITY AND TREATMENT RESPONSE IN ADULT ADHD: understanding the relationship between individual differences in fronto-parietal and fronto-striatal brain networks and response to chronic treatment with methylphenidate

    Attention-deficit/hyperactivity disorder (ADHD) is a common neurodevelopmental disorder, characterised by disrupted anatomical and/or functional connectivity, mainly in the fronto-striatal and fronto-parietal networks. Stimulants, such as methylphenidate (MPH), represent a first-line treatment in ADHD, but one third of patients fail to respond, with severe consequences for the individual and society at large. Hence, a comprehensive understanding of the relationship between individual differences in brain abnormalities and treatment response is needed. This thesis focused on two main brain networks: the fronto-striatal network, a central theme in ADHD research, and the fronto-parietal attentive network, formed by the three branches of the superior longitudinal fasciculus (SLF). The SLF branches have been only recently described in humans, and there is no detailed analysis of their distinct functional roles and involvement in disorders such as ADHD. Therefore, I first investigated the functional anatomy of the SLF branches by combining a meta-analytic approach with tractography, and revealed novel findings about the anatomical and functional segregation and integration of brain functions within fronto-parietal networks. Then, I showed, for the first time, that the three SLF branches are all significantly right-lateralised in ADHD patients but not in controls, and provided preliminary evidence that the pattern of lateralisation of the SLF I may be related to poor attentive performance in ADHD patients. Finally, I conducted functional and structural connectivity analysis to test whether a relationship exists between brain abnormalities and treatment response in adult ADHD. I employed a longitudinal crossover follow-up design. 60 non-medicated adult ADHD patients were recruited and underwent behavioural assessment (Qb test) and magnetic resonance imaging (MRI) scanning twice, once under placebo and once under a clinically effective dose of MPH.
Clinical and behavioural response was measured after two months of treatment with MPH. I demonstrated for the first time that there is a relationship between ‘connectivity’ abnormalities within fronto-parietal networks and treatment response in adult ADHD, both at the anatomical and functional level. Ultimately, my investigation contributed towards the identification of potential biomarkers of treatment response, which in the future may help clinicians deliver more individualised treatments.

    Machine Learning based Protein Sequence to (un)Structure Mapping and Interaction Prediction

    Proteins are the fundamental macromolecules within a cell that carry out most of the biological functions. The computational study of protein structure and its functions, using machine learning and data analytics, is essential to advancing life-science research, given the fast-growing biological data and the complexity involved in analysing them to discover meaningful insights. Mapping of a protein’s primary sequence is not limited to its structure; we extend it to the disordered component known as Intrinsically Disordered Proteins or Regions (IDPs/IDRs), and hence to the dynamics involved, which helps us explain complex interactions within a cell that are otherwise obscured. The objective of this dissertation is to develop effective machine learning based tools to predict disordered proteins, their properties and dynamics, and their interaction paradigm by systematically mining and analyzing large-scale biological data. In this dissertation, we propose a robust framework to predict disordered proteins given only sequence information, using an optimized SVM with RBF kernel. Through appropriate reasoning, we highlight the structure-like behavior of IDPs in disease-associated complexes. Further, we develop a fast and effective predictor of Accessible Surface Area (ASA) of protein residues, a useful structural property that defines a protein’s exposure to partners, using regularized regression with a 3rd-degree polynomial kernel function and a genetic algorithm. As a key outcome of this research, we then introduce a novel method to extract position specific energy (PSEE) of protein residues by modeling the pairwise thermodynamic interactions and hydrophobic effect. PSEE is found to be an effective feature in identifying the enthalpy-gain of the folded state of a protein and otherwise the neutral state of the unstructured proteins.
Moreover, we study the transient peptide-protein interactions that involve the induced folding of short peptides, through disorder-to-order conformational changes, to bind to an appropriate partner. Using the stacked generalization ensemble technique, a suite of predictors is developed to identify, from protein sequence, the residue patterns of Peptide-Recognition Domains that can recognize and bind to peptide motifs and phospho-peptides with post-translational modifications (PTMs) of amino acids, which are responsible for critical human diseases. The biologically relevant case studies demonstrate the possibility of discovering new knowledge using the developed tools.

    Probability and Statistics for Data Science

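    This note is a published collection of the lecture slides for the course ‘Probability and Statistics for Data Science’, which the author taught in the fall semester of 2020 at the Graduate School of Data Science, Seoul National University.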

    Artificial Intelligence for the prediction of weaning readiness outcome in a multi-centrical clinical cohort of mechanically ventilated patients

    When a patient suffers from acute respiratory failure, mechanical ventilation (MV) is performed until they can breathe on their own again. The intensive care physician checks every day whether MV can be stopped. This screening consists of a first phase, the Readiness Test (RT), which comprises various clinical parameters. If this test is passed, 30 minutes of spontaneous breathing (SBT) is attempted. If the SBT is also passed, MV is stopped. Otherwise, if the RT or the SBT fails, the patient remains on MV and is re-evaluated the next day. So, every day, three mutually exclusive scenarios may occur: the SBT will not be attempted, the SBT will fail, or the SBT will succeed (leading to extubation of the patient). Our artificial intelligence model is designed to infer, early in the morning, which of the three scenarios will probably occur during the day, starting from the patient's clinical data, the information collected in the previous day’s clinical diary, and the whole minute-by-minute recording history of the various parameters of the mechanical ventilator. The data come from a retrospective, observational, multicentre study conducted in Italy over 27 months.
These data are processed with a deep learning approach, through a multi-source neural network topology fed by multiple recurrent architectures. Hyper-parameters are optimized through cross-validation to select the desired model, setting aside 36 out of 182 patients for testing final model performance on a variety of metrics, including a custom score designed to highlight clinical impact. The final AI model had an accuracy of 79% [74%, 83%], a custom score of 0.01 [-0.04, 0.05], and an MCC of 0.28 [0.17, 0.39], scoring better than the other comparison models, including XGBoost trained only on the daily and baseline clinical data of the previous day, which had an accuracy of 61% [56%, 66%], an MCC of 0.14 [0.06, 0.2], and a custom score of -0.05 [-0.08, -0.01]. Overall, the AI model could approximate the current day-by-day clinical management well, providing suggestions early in the morning. Moreover, there is still room to improve the clinical utility of the model by considering additional tailored training data.
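Because the task has three mutually exclusive outcomes, the reported MCC is presumably the multiclass generalisation of the Matthews correlation coefficient; the abstract does not state which variant was used. The sketch below computes the standard multiclass form from true and predicted labels. The label strings and the toy predictions are hypothetical.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    MCC = (c*s - sum_k p_k*t_k)
          / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct predictions,
    t_k the number of true samples of class k, and p_k the number of
    predictions of class k. Returns 0.0 when the denominator is zero.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)

    sum_pt = sum(pred_counts[k] * true_counts[k] for k in classes)
    sum_pp = sum(v * v for v in pred_counts.values())
    sum_tt = sum(v * v for v in true_counts.values())

    denominator = sqrt((s * s - sum_pp) * (s * s - sum_tt))
    if denominator == 0:
        return 0.0
    return (c * s - sum_pt) / denominator


# Toy daily outcomes, invented for illustration only.
observed = ["no_sbt", "no_sbt", "sbt_fail", "sbt_success", "sbt_success", "no_sbt"]
predicted = ["no_sbt", "no_sbt", "no_sbt", "sbt_success", "sbt_fail", "no_sbt"]
toy_mcc = multiclass_mcc(observed, predicted)
```

Unlike accuracy, this coefficient stays near zero for a model that always predicts the most frequent scenario, which may explain why the abstract reports both.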