
14 research outputs found

Exploring the animal GPS system: a machine learning approach to study the hippocampal function

Author: Bzhalava Zurab
Publication date: 01/01/2015

The 2014 Nobel Prize in Physiology or Medicine was awarded to Dr. John M. O’Keefe, Dr. May-Britt Moser and Dr. Edvard I. Moser for discovering particular cells in the brain that provide the sense of place and navigation. These discoveries suggest that the brain creates an internal map-like representation of the environment, which helps us recognize familiar places and navigate well. In this thesis, we used a computational approach to study the animal "GPS" system. In particular, we set out to compare how well different machine learning algorithms can predict a rat's position based only on its hippocampal neural activity. The methods compared included Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors, and several sparse linear regression algorithms. We used multi-neuron electrophysiological data recorded in the Buzsaki lab in New York and focused on the activity of the rat hippocampus, the brain region where most place cells have been identified. As a first step, we divided the experimental arena into 4 blocks and tried to classify which block the rat was in at a given time. In this case, Random Forest gave the best accuracy, 57.8%, well above chance level. However, in some regions of the arena, Support Vector Machine was sometimes better than Random Forest. As a next step, we made the classification problem harder by dividing the arena into 16 blocks. Random Forest and SVM produced highly significant results, with 38% and 37% accuracy respectively (a random classifier would reach approximately 11%). We also applied k-Nearest Neighbors to both classification problems, but its accuracy was noticeably lower than that of the algorithms mentioned above. Since the rat's position is a continuous variable, we also treated the task as a continuous prediction problem. Most of the regression algorithms we analyzed (Ridge Regression, LASSO, Elastic Net) gave results near chance level; only Random Forest clearly outperformed the other methods in this setting.
Furthermore, we analyzed data recorded in an experiment where rats were trained to choose the left or right direction in a figure-8 maze while running in a wheel. Here we first performed a dimensionality reduction of the neuronal data to visualize its dynamics around the time of the decision. We also identified and plotted episodic cells (neurons that are more active at particular times in the task), which might contribute to the sense of time and to the formation of episodic memory. In addition, we visualized neuronal trajectories while the animal makes decisions, in order to predict its upcoming decision. In conclusion, of the algorithms we analyzed, Random Forest gave the best accuracy in predicting a rat's location. This might indicate that the information about the rat's location is contained in non-linear patterns of neuronal activity, which linear regression methods were unable to extract. In future research we plan to decode a rat's position using methods more similar to the brain's own mechanisms, such as neural networks, which, like Random Forest, can detect non-linear patterns. More generally, the pipelines developed during this thesis to handle the complex pre-processing, feature extraction, and visualization of the dataset lay the basis for future studies of hippocampal dynamics by the computational neuroscience group at the University of Tartu.
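The block-classification setup in this abstract can be illustrated with a minimal sketch. It assumes a rectangular arena split into an equal n-by-n grid; the abstract does not say how the blocks were laid out, so the grid layout, the arena dimensions, and the function names `block_index` and `majority_baseline` are all hypothetical. The second function shows one common way to define a chance-level baseline (always predicting the most visited block); the abstract does not state how its own chance level was computed.

```python
def block_index(x, y, width, height, n_side):
    """Map a position inside a width x height arena to one of n_side**2 blocks.

    Blocks are numbered row by row, starting at 0. The equal-sized grid is an
    assumption made for illustration only.
    """
    # Positions on the far edge are clamped into the last row/column.
    col = min(int(x / width * n_side), n_side - 1)
    row = min(int(y / height * n_side), n_side - 1)
    return row * n_side + col


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent block."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)


# Example: label a short, made-up trajectory on a 100 x 100 arena with 4 blocks.
trajectory = [(10, 10), (20, 15), (80, 10), (30, 90)]
labels = [block_index(x, y, 100, 100, 2) for x, y in trajectory]
baseline = majority_baseline(labels)
```

Labels produced this way would then serve as classification targets, with features derived from the recorded neural activity. When the animal spends unequal time in each block, this baseline exceeds 1 divided by the number of blocks.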

DSpace at Tartu University Library

Machine learning and data-parallel processing for viral metagenomics

Author: Bzhalava Zurab
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 03/04/2020

More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, the identification of highly divergent and yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern Next Generation Sequencing (NGS) technologies can be used to directly sequence biospecimens from clinical cohorts with unprecedented speed and depth. These technologies can generate billions of bases at rapidly decreasing cost, but current bioinformatics tools are too inefficient to process these massive datasets effectively. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among generated sequences and the large-scale analysis of human metagenomic datasets. To re-analyze human sample-derived sequences that were classified as being of “unknown” origin by conventional alignment-based methods, we used a methodology based on profile Hidden Markov Models (HMM), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologous to large viruses such as Herpesviridae and Mimiviridae, but some were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMMs can extend the classification of previously unknown sequences and consequently the detection of viruses in biospecimens from humans. Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of viruses in metagenomic sequencing data originating from human samples, we trained Random Forest and Artificial Neural Networks based on Relative Synonymous Codon Usage (RSCU) frequency. 
Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification. For identification of viral genomes among raw metagenomic sequences, we developed the tool ViraMiner, a deep learning-based method which uses Convolutional Neural Networks with two convolutional branches. Using sequences 300 base pairs long, ViraMiner achieved an area under the ROC curve of 0.923, a considerable improvement over previous machine learning methods for virus sequence classification. The proposed architecture is, to the best of our knowledge, the first deep learning tool that can detect viral genomes in raw metagenomic sequences originating from a variety of human samples. To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. In the metagenome analysis, ViraPipe (executed on 23 nodes) was 11 times faster than the sequential pipeline (executed on a single node). The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by drawing on more computing power from the nodes. To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most common HPV type, followed by HPV 33 as the second most common infection. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36,000 head and neck cancer cases in the United States every year. 
In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help in exploring infectious causes of human disease.
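Two figures quoted in this abstract can be put in context with simple arithmetic: the ViraPipe speedup (11 times faster on 23 nodes) and the HPV detection rate (92 of 500 cancers). The sketch below is only a worked calculation; the helper names are hypothetical, and parallel efficiency is taken here in its usual sense of speedup divided by the number of nodes, a quantity the abstract itself does not report.

```python
def parallel_efficiency(speedup, n_nodes):
    """Fraction of ideal linear scaling achieved: speedup / number of nodes."""
    return speedup / n_nodes


def proportion(positives, total):
    """Share of samples in which the signal was detected."""
    return positives / total


# Figures quoted in the abstract above.
efficiency = parallel_efficiency(11, 23)   # about 0.48 of ideal linear scaling
hpv_rate = proportion(92, 500)             # 0.184, i.e. 18.4% of cancers
```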

Publications from Karolinska Institutet

ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

Author: Ardi Tampuu
Joakim Dillner
Raul Vicente
Zurab Bzhalava
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

Despite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as "unknown" since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern-frequencies in raw metagenomic contigs. The training dataset included sequences obtained from 19 metagenomic experiments, which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs, ViraMiner achieves an area under the ROC curve of 0.923. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information about genome composition and can be used as a recommendation system to further investigate sequences labeled as "unknown" by conventional alignment methods. Exploring these highly divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.
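The idea of detecting "patterns" versus "pattern-frequencies" on raw sequences can be sketched in a few lines. This is not the ViraMiner implementation: it uses a single hand-written filter instead of learned ones, and it assumes that a pattern detector corresponds to taking the maximum filter response along the contig while a frequency detector corresponds to taking the average. All function names are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (unknown bases become all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def filter_responses(encoded, kernel):
    """Slide a (k x 4) kernel along the sequence and return one response per window."""
    k = len(kernel)
    responses = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        responses.append(
            sum(x * w
                for row_x, row_w in zip(window, kernel)
                for x, w in zip(row_x, row_w))
        )
    return responses


def pooled_features(seq, kernel):
    """Return (max response, mean response) for one filter over one contig."""
    responses = filter_responses(one_hot(seq), kernel)
    return max(responses), sum(responses) / len(responses)


# A filter that matches the motif "ACG" exactly (assumed for illustration).
motif_kernel = one_hot("ACG")
max_response, mean_response = pooled_features("ACGTACGT", motif_kernel)
```

The maximum says whether the motif occurs anywhere in the contig; the mean reflects how often it occurs relative to the contig length. A trained network would learn many such filters and combine the pooled values in later layers.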

Directory of Open Access Journals

Extension of the viral ecology in humans using viral profile hidden Markov models

Author: Emilie Hultin (8702)
Joakim Dillner (305089)
Zurab Bzhalava (3812458)
Publication date: 19/01/2018

When human samples are sequenced, many assembled contigs are “unknown”, as conventional alignments find no similarity to known sequences. Hidden Markov models (HMM) exploit the positions of specific nucleotides in protein-encoding codons in various microbes. The algorithm HMMER3 implements HMM using a reference set of sequences encoding viral proteins, “vFam”. We used HMMER3 analysis of “unknown” human sample-derived sequences and identified 510 contigs distantly related to viruses (Anelloviridae (n = 1), Baculoviridae (n = 34), Circoviridae (n = 35), Caulimoviridae (n = 3), Closteroviridae (n = 5), Geminiviridae (n = 21), Herpesviridae (n = 10), Iridoviridae (n = 12), Marseillevirus (n = 26), Mimiviridae (n = 80), Phycodnaviridae (n = 165), Poxviridae (n = 23), Retroviridae (n = 6) and 89 contigs related to described viruses not yet assigned to any taxonomic family). In summary, we find that analysis using the HMMER3 algorithm and the “vFam” database greatly extended the detection of viruses in biospecimens from humans.
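As a quick consistency check, the per-family counts listed in this abstract can be tallied against the stated total of 510 contigs. The numbers below are copied from the abstract; the dictionary layout and variable names are only an illustrative choice.

```python
# Contig counts per virus family, as listed in the abstract above.
family_counts = {
    "Anelloviridae": 1,
    "Baculoviridae": 34,
    "Circoviridae": 35,
    "Caulimoviridae": 3,
    "Closteroviridae": 5,
    "Geminiviridae": 21,
    "Herpesviridae": 10,
    "Iridoviridae": 12,
    "Marseillevirus": 26,
    "Mimiviridae": 80,
    "Phycodnaviridae": 165,
    "Poxviridae": 23,
    "Retroviridae": 6,
    "Unassigned to a family": 89,
}

total_contigs = sum(family_counts.values())

# Families sorted from most to fewest contigs.
ranked = sorted(family_counts.items(), key=lambda item: item[1], reverse=True)
largest_family, largest_count = ranked[0]
```

The counts add up to the 510 contigs reported, with Phycodnaviridae as the largest single group.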

Directory of Open Access Journals

FigShare

Number of contigs, classified as virus-related by HMM, stratified by related virus family and types of samples.

Author: Emilie Hultin (8702)
Joakim Dillner (305089)
Zurab Bzhalava (3812458)

FFPE: Formalin-fixed paraffin-embedded tissue specimens.

FigShare

Maximum likelihood phylogenetic tree (PhyML v3.0 www.atgc-montpellier.fr/phyml/) based on the RCR Rep proteins from genbank and 21 previously not described Rep proteins related to Circoviridae, that were found in the present study (shown in black color with the prefix SE).

Author: Emilie Hultin (8702)
Joakim Dillner (305089)
Zurab Bzhalava (3812458)

Maximum likelihood phylogenetic tree (PhyML v3.0 www.atgc-montpellier.fr/phyml/) based on the RCR Rep proteins from genbank and 21 previously not described Rep proteins related to Circoviridae, that were found in the present study (shown in black color with the prefix SE).

FigShare

Viruses in case series of tumors: Consistent presence in different cancers in the same subject.

Author: Davit Bzhalava
Emilie Hultin
Joakim Dillner
Laila Sara Arroyo Mühr
Maria Hortlund
Sara Nordqvist Kleppe
Zurab Bzhalava
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2017

Studies investigating the presence of viruses in cancer often analyze case series of cancers, resulting in the detection of many viruses that are not etiologically linked to the tumors where they are found. The incidence of virus-associated cancers is greatly increased in immunocompromised individuals. The incidence of non-melanoma skin cancer (NMSC) is also greatly increased in these individuals, and a variety of viruses have been detected in NMSC. As immunosuppressed patients often develop multiple independent NMSCs, we reasoned that viruses consistently present in independent tumors might be more likely to be involved in tumorigenesis. We sequenced 8 different NMSCs from 1 patient and compared them with 8 different NMSCs from 8 different patients. Among the latter, 12 different virus sequences were detected, but none in more than 1 tumor. In contrast, the patient with multiple NMSCs had human papillomavirus type 15 and type 38 present in 6 out of 8 NMSCs.

Directory of Open Access Journals

PubMed Central

Machine Learning for detection of viral sequences in human metagenomic datasets

Author: Ardi Tampuu
Joakim Dillner
Piotr Bała
Raul Vicente
Zurab Bzhalava
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2018

Background: Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as “unknown”, as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data. Results: We trained Random Forest and Artificial Neural Network using metagenomic sequences taxonomically classified into virus and non-virus classes. The algorithms achieved accuracies well beyond chance level, with area under ROC curve 0.79. Two codons (TCG and CGC) were found to have a particularly strong discriminative capacity. Conclusion: RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and provide an addition to conventional methods for taxonomic classification.
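The RSCU features mentioned in this abstract follow a standard definition: for each codon, the observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it for a single in-frame coding sequence using the standard genetic code. It is an illustration of the definition only; how the authors extracted reading frames from contigs and fed the values to their classifiers is not described here, and the function name `rscu` is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(coding_seq):
    """Relative Synonymous Codon Usage for an in-frame DNA sequence.

    RSCU(codon) = count(codon) * n_synonyms / total count for that amino acid.
    Amino acids that never occur in the sequence are left out of the result.
    """
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families, one per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values


# Four alanine codons: GCT twice, GCC once, GCA once, GCG never.
example = rscu("GCTGCTGCCGCA")
```

A value of 1.0 means a codon is used exactly as often as expected under uniform usage; values above or below 1.0 indicate over- or under-representation. A feature vector for a classifier could be built from these per-codon values.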

Directory of Open Access Journals

Number of contigs classified into different taxonomy groups by blastn and blastx.

Author: Emilie Hultin (8702)
Joakim Dillner (305089)
Zurab Bzhalava (3812458)

Number of contigs classified into different taxonomy groups by blastn and blastx.

FigShare

Classification of viral reads.

Author: Davit Bzhalava (422179)
Emilie Hultin (8702)
Joakim Dillner (305089)
Laila Sara Arroyo Mühr (3812461)
Maria Hortlund (520128)
Sara Nordqvist Kleppe (3812455)
Zurab Bzhalava (3812458)

Classification of viral reads.

FigShare