    Machine learning on normalized protein sequences

    Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and to predict protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths on 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy of up to 14%. Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus is a promising alternative to existing methods, especially for protein sequences of variable length.
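
The length normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how residues are turned into numbers, so the use of the Kyte-Doolittle hydropathy scale is an assumption, and the helper names `encode` and `normalize_length` are hypothetical.

```python
# Kyte-Doolittle hydropathy values. Using this scale as the numeric
# encoding is an assumption; the abstract does not name a descriptor.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map an amino acid string to a list of hydropathy values."""
    return [KD[aa] for aa in seq]


def normalize_length(values, target_len):
    """Resample a numeric profile to target_len points by linear interpolation."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalize an empty sequence")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first value.
        return [values[0]] * target_len
    out = []
    for j in range(target_len):
        # Position of output point j on the original index axis [0, n - 1].
        pos = j * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# Two sequences of different length end up as vectors of equal length.
short_vec = normalize_length(encode("PQITLW"), 10)
long_vec = normalize_length(encode("PQITLWQRPLVTIKIGG"), 10)
```

Vectors produced this way all have the same length and could be passed to any fixed-input classifier such as a random forest; the first and last residues of each sequence are preserved exactly because the resampling grid includes both endpoints.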

    HIV drug resistance prediction with weighted categorical kernel functions

    Background: Antiretroviral drugs are a very effective therapy against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that can be resistant to the drug treatment. Predicting the drug resistance of previously unobserved variants is therefore very important for optimal medical treatment. In this paper, we propose the use of weighted categorical kernel functions to predict drug resistance from virus sequence data. These kernel functions are very simple to implement and are able to take into account HIV data particularities, such as allele mixtures, and to weigh the importance of each protein residue, as it is known that not all positions contribute equally to the resistance. Results: We analyzed 21 drugs of four classes: protease inhibitors (PI), integrase inhibitors (INI), nucleoside reverse transcriptase inhibitors (NRTI) and non-nucleoside reverse transcriptase inhibitors (NNRTI). We compared two categorical kernel functions, Overlap and Jaccard, against two well-known non-categorical kernel functions (Linear and RBF) and Random Forest (RF). Weighted versions of these kernels were also considered, where the weights were obtained from the RF decrease in node impurity. The Jaccard kernel was the best method, either in its weighted or unweighted form, for 20 out of the 21 drugs. Conclusions: Results show that kernels that take into account both the categorical nature of the data and the presence of mixtures consistently result in the best prediction model. The advantage of including weights depended on the protein targeted by the drug. In the case of reverse transcriptase, weights based on the relative importance of each position clearly increased the prediction performance, while the improvement in the protease was much smaller. This seems to be related to the distribution of weights, as measured by the Gini index.
All methods described, together with documentation and examples, are freely available at https://bitbucket.org/elies_ramon/catkern.
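
The Overlap and Jaccard similarities named above can be sketched in a few lines. This is an illustrative reading of the abstract, not the API of the catkern package: representing each position as a set of residues (so that allele mixtures are sets with more than one element), treating a mixture as its own category in the Overlap kernel, and the function names and example weights are all assumptions.

```python
def _normalized_weights(n, weights):
    """Return position weights summing to 1 (uniform if none are given)."""
    if weights is None:
        return [1.0 / n] * n
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def overlap_kernel(x, y, weights=None):
    """Weighted fraction of positions whose residue sets are identical.

    Assumption: a mixture such as {"K", "R"} is treated as a category of
    its own, so it only matches the same mixture.
    """
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    w = _normalized_weights(len(x), weights)
    return sum(wi for wi, a, b in zip(w, x, y) if a == b)


def jaccard_kernel(x, y, weights=None):
    """Weighted mean of per-position Jaccard indices |a & b| / |a | b|."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    w = _normalized_weights(len(x), weights)
    total = 0.0
    for wi, a, b in zip(w, x, y):
        union = a | b
        if union:  # skip positions that are empty in both sequences
            total += wi * len(a & b) / len(union)
    return total


# Three aligned positions; the second position of x is an allele mixture.
x = [{"L"}, {"K", "R"}, {"V"}]
y = [{"L"}, {"K"}, {"I"}]

k_overlap = overlap_kernel(x, y)
k_jaccard = jaccard_kernel(x, y)
# Hypothetical position weights, e.g. derived from random forest importances.
k_weighted = jaccard_kernel(x, y, weights=[2, 1, 1])
```

In this toy pair the Jaccard kernel gives partial credit to the mixture position, which the Overlap kernel scores as a plain mismatch; that difference is the property the abstract credits for the Jaccard kernel's performance on data with mixtures.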

    Design and modelling in the early phase of drug development for HIV-1 reverse transcriptase and malaria

    The electronic version of the thesis does not include the publications. This thesis focuses on two highly prevalent infections affecting many regions of the world: malaria and human immunodeficiency virus type 1 (HIV-1). Developing a new drug from scratch is a time-consuming and costly process that can be divided into five stages: basic research, lead target and lead compound discovery, preclinical development, clinical development, and filing with the drug regulatory agency. The thesis concentrates on the first two stages, i.e. early drug discovery. For HIV-1, the focus was two-fold. First, building on an earlier multi-objective in silico screening, novel s-triazine derivatives were discovered, designed, and synthesized; the findings were validated by biological evaluation and analysed with protein-ligand interaction models. The most potent HIV-1 non-nucleoside reverse transcriptase inhibitor has a low molecular weight, good ligand efficiency parameters, and low measured toxicity, permitting further exploration and modification. Second, the discovery of the new bioactive s-triazines motivated a broader analysis of the chemical landscape of HIV-1 RT inhibitors, to determine whether the novel s-triazines form a unique group of compounds within it. For this purpose, datasets of HIV-1 non-nucleoside reverse transcriptase inhibitors (NNRTIs) and nucleoside reverse transcriptase inhibitors (NRTIs) were systematically compiled and curated from the ChEMBL database, with the discovered s-triazine derivatives added. Hierarchical classification of the scaffold structures in the curated datasets revealed common chemical parent types for the compounds and a hierarchy among the chemical structures, and showed that the discovered s-triazines form a separate structural parent type group. Each group of compounds belonging to a parent type was analysed together with the corresponding binding affinity equilibrium constants (Ki), and the structural fragments affecting the potency and stability of the compounds were highlighted. The structurally diverse NNRTI and NRTI datasets with binding affinity equilibrium constants allowed, for the first time, the development of descriptive and predictive QSAR models for log Ki, which can support the design of new compounds in future work. To discover promising new antimalarial compounds, experimental anti-Plasmodium data were gathered from in-house experimental studies, expanded with data from the ChEMBL database, and systematically curated. The extracted data were extensively curated, fused, filtered, and grouped into thirty datasets for modelling. Classification models distinguishing active from inactive compounds were built for each dataset, and the seventeen consensus models with promising predictive ability were used to make consensus predictions for an in-house series of curcuminoids and their structural analogues as potential antimalarial agents. A selection of the compounds was validated experimentally in vitro, revealing seventeen potentially active compounds for further testing and modification. The validation also showed that compounds predicted computationally to be inactive were inactive in experiment as well, which provides additional support for the quality of the data curation and dataset assembly process underlying the modelling.

    HIV Drug Resistant Prediction and Featured Mutants Selection using Machine Learning Approaches

    HIV/AIDS is widespread and ranks as the sixth biggest killer worldwide. Moreover, because HIV replicates rapidly and lacks a proofreading mechanism, drug resistance is common and is one of the causes of treatment failure. Although drug resistance tests are available to patients and help in choosing more effective drugs, such tests may take up to two weeks to complete and are expensive. With the rapid growth of computing power, drug resistance prediction using machine learning has become feasible. To predict HIV drug resistance accurately, two main questions need to be answered: how to encode the protein structure so that the most useful information is extracted and fed into the machine learning tools, and which machine learning tools to choose. In our research, we first proposed a new protein encoding algorithm that converts proteins of various sizes into a fixed-size vector. This algorithm makes it possible to feed protein structure information to most state-of-the-art machine learning algorithms. In the next step, we proposed a new classification algorithm based on sparse representation. Following that, mean shift and quantile regression were used to help extract feature information from the data. Our results show that encoding protein structure with our newly proposed method is very efficient and gives consistently higher accuracy regardless of the type of machine learning tool. Furthermore, our new classification algorithm based on sparse representation is the first application of sparse representation to biological data, and its results are comparable to those of other state-of-the-art classification algorithms, for example ANN, SVM and multiple regression. Finally, mean shift and quantile regression identified the potentially most important drug-resistant mutants, and these results might help biologists and chemists to determine which mutants are the most representative candidates for further research.

    A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome

    Reverse transcriptase (RT) is a viral enzyme crucial for HIV-1 replication. Currently, 12 drugs target RT. The low fidelity of RT-mediated transcription leads to the rapid accumulation of drug-resistance mutations. The sequence-resistance relationship remains only partially understood. Using publicly available data collected from over 15 years of HIV proteome research, we have created a general and predictive rule-based model of HIV-1 resistance to eight RT inhibitors. Our rough set-based model considers changes in the physicochemical properties of a mutated sequence as compared to the wild-type strain. Thanks to the application of the Monte Carlo feature selection method, the model takes into account only the properties that significantly contribute to the resistance phenomenon. The obtained results show that drug resistance is determined in a more complex way than previously believed. We confirmed the importance of many resistance-associated sites, found some sites to be less relevant than formerly postulated and, more importantly, identified several previously neglected sites as potentially relevant. By mapping some of the newly discovered sites onto the 3D structure of RT, we were able to suggest possible molecular mechanisms of drug resistance. Importantly, our model is able to generalize its predictions to previously unseen cases. The study is an example of how computational biology methods can increase our understanding of the HIV-1 resistome.

    A Prognostic Model for Estimating the Time to Virologic Failure in HIV-1 Infected Patients Undergoing a New Combination Antiretroviral Therapy Regimen

    Background: HIV-1 genotypic susceptibility scores (GSSs) have been proven to be significant prognostic factors of fixed time-point virologic outcomes after combination antiretroviral therapy (cART) switch/initiation. However, their relative hazard for the time to virologic failure has not been thoroughly investigated, and an expert system that is able to predict how long a new cART regimen will remain effective has never been designed. Methods: We analyzed patients of the Italian ARCA cohort starting a new cART from 1999 onwards, either after virologic failure or as treatment-naïve. The endpoint was the time to virologic failure, defined as the first HIV-1 RNA > 400 copies/ml from the 90th day after treatment start, with censoring at the last available HIV-1 RNA measurement before treatment discontinuation. We assessed the relative hazard/importance of GSSs according to distinct interpretation systems (Rega, ANRS and HIVdb) and of other covariates by means of Cox regression and random survival forests (RSF). Prediction models were validated via the bootstrap and the c-index measure. Results: The dataset included 2337 regimens from 2182 patients, of which 733 were previously treatment-naïve. We observed 1067 virologic failures over 2820 person-years. Multivariable analysis revealed that low GSSs of cART were independently associated with the hazard of a virologic failure, along with several other covariates. Evaluation of predictive performance yielded a modest ability of the Cox regression to predict the virologic endpoint (c-index≈0.70), while RSF showed a better performance (c-index≈0.73, p < 0.0001 vs. Cox regression). Variable importance according to RSF was concordant with the Cox hazards. Conclusions: GSSs of cART and several other covariates were investigated using linear and non-linear survival analysis. RSF models are a promising approach for the development of a reliable system that predicts time to virologic failure better than Cox regression. Such models might represent a significant improvement over the current methods for monitoring and optimization of cART.
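
The c-index used above to compare Cox regression and RSF can be illustrated with a small sketch of Harrell's concordance index for right-censored data. The abstract does not state which c-index variant or software was used, so this is an assumed, textbook formulation; the function name and the toy data are hypothetical and are not taken from the ARCA cohort.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored survival data.

    times       -- observed time (failure or censoring) per regimen
    events      -- 1 if virologic failure was observed, 0 if censored
    risk_scores -- model output where a higher score means earlier failure

    A pair (i, j) is comparable when i had an observed failure strictly
    before j's observed time. Ties in the risk score count as 0.5.
    """
    if not (len(times) == len(events) == len(risk_scores)):
        raise ValueError("inputs must have the same length")
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): days to failure or censoring.
times = [100, 200, 300, 400]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported values of roughly 0.70 for Cox regression and 0.73 for RSF in context.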