14 research outputs found

    Exploring the animal GPS system: a machine learning approach to study the hippocampal function

    The 2014 Nobel Prize in Physiology or Medicine was awarded to Dr. John M. O’Keefe, Dr. May-Britt Moser and Dr. Edvard I. Moser for discovering particular cells in the brain that provide the sense of place and navigation. These discoveries suggest that the brain creates an internal map-like representation of the environment, which helps us recognize familiar places and navigate well. In this thesis, we used a computational approach to study the animal "GPS" system. In particular, we set out to compare how well different machine learning algorithms can predict a rat's position based only on its hippocampal neural activity. The methods compared included Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors, and several sparse linear regression algorithms. We used multi-neuron electrophysiological data recorded in the Buzsaki lab in New York and focused on the activity of the rat hippocampus, the brain region where most place cells have been identified.
As a first step, we divided the experimental arena into 4 blocks and tried to classify which block the rat was in at a given time. Here Random Forest gave the best accuracy, 57.8%, well above chance level. In some particular regions of the arena, however, SVM was sometimes better than Random Forest. As a next step, we made the classification problem harder by dividing the arena into 16 blocks. Random Forest and SVM produced highly significant results, with 38% and 37% accuracy respectively (a random classifier would reach approximately 11%). We also applied k-Nearest Neighbors to both classification problems, but its accuracy was lower than that of the other two algorithms in both cases. Since the rat's position is a continuous variable, we also considered the continuous prediction problem. Most of the regression algorithms we analyzed (Ridge Regression, LASSO, Elastic Net) gave results near chance level, while Random Forest outperformed them and gave the best results.
Furthermore, we analyzed data recorded in an experiment where rats were trained to choose the left or right direction in an 8-shaped maze while running in a wheel. For these data we performed a dimensionality reduction of the neuronal activity to visualize its dynamics at decision time. We also identified and plotted episodic cells (neurons that are more active at particular times in the task), which might contribute to the sense of time and to the creation of episodic memory. In addition, we visualized neuronal trajectories while the animal makes decisions in order to predict its upcoming decision.
In conclusion, of the algorithms we analyzed, Random Forest gave the best accuracy in predicting a rat's location. This might indicate that the information about the rat's location is contained in non-linear patterns of neuronal activity, which linear regression methods were unable to extract. In future research we plan to decode a rat's position using a method more similar to the brain's own mechanisms, such as neural networks, which, like Random Forest, can detect non-linear patterns. More generally, the pipelines developed during this thesis to handle the complex pre-processing, feature extraction, and visualization of the dataset will set the basis for future studies on hippocampal dynamics by the computational neuroscience group at the University of Tartu.
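The block classification setup described in the abstract can be illustrated with a minimal sketch. It assumes a rectangular arena split into a regular grid (2 x 2 for 4 blocks, 4 x 4 for 16 blocks); the function names, the grid layout, and the majority-class baseline are illustrative assumptions, since the abstract does not say how the blocks were laid out or how its chance level was computed.

```python
import numpy as np

def position_to_block(x, y, width, height, n_side):
    """Map (x, y) positions to block labels on an n_side x n_side grid.

    Assumption: the arena spans [0, width] x [0, height] and blocks are
    numbered row by row, starting at 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Clip so that positions exactly on the far edge fall in the last block.
    col = np.clip((x / width * n_side).astype(int), 0, n_side - 1)
    row = np.clip((y / height * n_side).astype(int), 0, n_side - 1)
    return row * n_side + col

def majority_baseline(labels):
    """Accuracy of always guessing the most frequently visited block."""
    counts = np.bincount(labels)
    return counts.max() / counts.sum()
```

With `n_side=2` this yields the 4-block labels and with `n_side=4` the 16-block labels; either label vector can then serve as the target of a classifier trained on binned spike counts. If the animal visits some blocks more often than others, a majority-class baseline of this kind is higher than 1 divided by the number of blocks.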

    Machine learning and data-parallel processing for viral metagenomics

    More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, the identification of highly divergent and yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern Next Generation Sequencing (NGS) technologies can be used to directly sequence biospecimens from clinical cohorts with unprecedented speed and depth. These technologies can generate billions of bases at rapidly decreasing cost, but current bioinformatics tools cannot process these massive datasets efficiently. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among generated sequences and the large-scale analysis of human metagenomic datasets.
To re-analyze human sample-derived sequences that were classified as being of “unknown” origin by conventional alignment-based methods, we used a methodology based on profile Hidden Markov Models (HMM), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologs to large viruses such as Herpesviridae and Mimiviridae, but some of them were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMM is capable of extending the classification of previously unknown sequences and consequently the detection of viruses in biospecimens from humans.
Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of virus in metagenomic sequencing data originating from human samples, we trained Random Forest and Artificial Neural Networks based on Relative Synonymous Codon Usage (RSCU) frequency. Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification.
For identification of viral genomes among raw metagenomic sequences, we developed the tool ViraMiner, a deep learning-based method which uses Convolutional Neural Networks with two convolutional branches. Using sequences of 300 base pairs, ViraMiner achieved an area under the ROC curve of 0.923, a considerable improvement over previous machine learning methods for virus sequence classification. The proposed architecture is, to the best of our knowledge, the first deep learning tool that can detect viral genomes in raw metagenomic sequences originating from a variety of human samples.
To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. In the metagenome analysis, ViraPipe (executed on 23 nodes) was 11 times faster than the sequential pipeline (executed on a single node). The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by accessing more computing power from the nodes.
To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most common HPV type, followed by HPV 33. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36 000 head and neck cancer cases in the United States every year.
In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help in exploring infectious causes of human disease.

    ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

    No full text
    Despite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as "unknown" since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern-frequencies on raw metagenomic contigs. The training dataset included sequences obtained from 19 metagenomic experiments which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs, ViraMiner achieves an area under the ROC curve of 0.923. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information about genome composition, and can be used as a recommendation system to further investigate sequences labeled as "unknown" by conventional alignment methods. Exploring these highly divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.
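The distinction between detecting a pattern and detecting its frequency can be illustrated with a small NumPy sketch. It is not the ViraMiner architecture, which the abstract does not specify; it assumes one-hot encoded contigs, a single hand-set filter (the motif `TCG` is a hypothetical choice), and max pooling versus average pooling as two ways to summarize the filter's activations.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; other symbols (e.g. N) stay all-zero."""
    encoded = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded

def scan(encoded, kernel):
    """Slide one filter of shape (k, 4) along the sequence (a valid 1D convolution)."""
    k = kernel.shape[0]
    return np.array([np.sum(encoded[i:i + k] * kernel)
                     for i in range(len(encoded) - k + 1)])

def pooled_features(seq, kernel):
    """Return (max-pooled, average-pooled) activations of one filter over a sequence."""
    activations = scan(one_hot(seq), kernel)
    return activations.max(), activations.mean()

# Hypothetical filter that responds most strongly to the motif "TCG".
motif_filter = one_hot("TCG")
```

For two contigs that both contain the motif, the max-pooled value is the same, while the average-pooled value grows with the number of matching positions, so the two summaries carry different information about the sequence.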

    Extension of the viral ecology in humans using viral profile hidden Markov models

    No full text
    When human samples are sequenced, many assembled contigs are “unknown”, as conventional alignments find no similarity to known sequences. Hidden Markov models (HMM) exploit the positions of specific nucleotides in protein-encoding codons in various microbes. The algorithm HMMER3 implements HMM using a reference set of sequences encoding viral proteins, “vFam”. We used HMMER3 analysis of “unknown” human sample-derived sequences and identified 510 contigs distantly related to viruses (Anelloviridae (n = 1), Baculoviridae (n = 34), Circoviridae (n = 35), Caulimoviridae (n = 3), Closteroviridae (n = 5), Geminiviridae (n = 21), Herpesviridae (n = 10), Iridoviridae (n = 12), Marseillevirus (n = 26), Mimiviridae (n = 80), Phycodnaviridae (n = 165), Poxviridae (n = 23), Retroviridae (n = 6) and 89 contigs related to described viruses not yet assigned to any taxonomic family). In summary, we find that analysis using the HMMER3 algorithm and the “vFam” database greatly extended the detection of viruses in biospecimens from humans.

    Maximum likelihood phylogenetic tree (PhyML v3.0 www.atgc-montpellier.fr/phyml/) based on the RCR Rep proteins from GenBank and 21 previously not described Rep proteins related to Circoviridae that were found in the present study (shown in black color with the prefix SE).

    No full text

    Viruses in case series of tumors: Consistent presence in different cancers in the same subject.

    No full text
    Studies investigating the presence of viruses in cancer often analyze case series of cancers, resulting in the detection of many viruses that are not etiologically linked to the tumors where they are found. The incidence of virus-associated cancers is greatly increased in immunocompromised individuals. The incidence of non-melanoma skin cancer (NMSC) is also greatly increased, and a variety of viruses have been detected in NMSC. As immunosuppressed patients often develop multiple independent NMSCs, we reasoned that viruses consistently present in independent tumors might be more likely to be involved in tumorigenesis. We sequenced 8 different NMSCs from 1 patient in comparison to 8 different NMSCs from 8 different patients. Among the latter, 12 different virus sequences were detected, but none in more than 1 tumor each. In contrast, the patient with multiple NMSCs had human papillomavirus type 15 and type 38 present in 6 out of 8 NMSCs.

    Machine Learning for detection of viral sequences in human metagenomic datasets

    No full text
    Background: Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as “unknown”, as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data. Results: We trained Random Forest and Artificial Neural Network using metagenomic sequences taxonomically classified into virus and non-virus classes. The algorithms achieved accuracies well beyond chance level, with area under ROC curve 0.79. Two codons (TCG and CGC) were found to have a particularly strong discriminative capacity. Conclusion: RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and provide an addition to conventional methods for taxonomic classification.
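RSCU features of the kind mentioned in this abstract can be computed as in the following sketch. It uses the usual definition, in which the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, that is $\mathrm{RSCU}_{ij} = x_{ij} \big/ \bigl(\tfrac{1}{n_i}\sum_{k=1}^{n_i} x_{ik}\bigr)$ for codon $j$ of amino acid $i$ with $n_i$ synonymous codons. Reading codons in frame 0 with the standard genetic code, skipping stop codons and the single-codon amino acids, and returning 0 for unobserved amino acids are assumptions of this sketch, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def rscu(sequence):
    """Return a dict of RSCU values for the 59 codons that have synonyms."""
    seq = sequence.upper()
    # Assumption: codons are read in frame 0; codons with ambiguous bases are ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families, one per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # skip stop codons and Met/Trp, which have no synonyms
        total = sum(counts[c] for c in family)
        for codon in family:
            # Observed count relative to equal usage within the family.
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values
```

The resulting 59-value vector per contig is the kind of fixed-length input that a Random Forest or a neural network can be trained on; a value above 1 means a codon is used more often than its synonyms on average.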

    Number of contigs classified into different taxonomy groups by blastn and blastx.

    No full text