70 research outputs found

    Domain adaptation algorithms for biological sequence classification

    Doctor of Philosophy. Department of Computing and Information Sciences. Doina Caragea. The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides present within a DNA or RNA fragment faster and more cheaply. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain (the source domain), if any such data is available, to train a classifier for the domain of interest (the target domain). However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. Another alternative is to label some data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. This work considers a further alternative, domain adaptation, in which the goal is to train an accurate classifier for a target domain with limited labeled data and abundant unlabeled data by leveraging labeled data from a related source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.

    Inferring latent task structure for Multitask Learning by Multiple Kernel Learning

    Background: The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm. Results: We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with a given hierarchical task structure by refining the task relationships. Second, we consider an MHC-I dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform the baseline methods that we compare against. Conclusions: We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general: it allows prior knowledge about task relationships to be incorporated if available, but is also able to identify task similarities in the absence of such prior information. Both variants show promising results in applications from Computational Biology.

    Automatic detection of exonic splicing enhancers (ESEs) using SVMs

    Background: Exonic splicing enhancers (ESEs) activate nearby splice sites and promote the inclusion (vs. exclusion) of the exons in which they reside, while serving as binding sites for SR proteins. To study the impact of ESEs on alternative splicing, it would be useful to be able to detect them in exons. Identifying SR protein-binding sites in human DNA sequences by machine learning techniques is a formidable task, since exon sequences are also constrained by their functional role in coding for proteins. Results: Choosing training examples for machine learning approaches is difficult, since only a few exact locations of human ESEs that could be considered positive examples are described in the literature. Additionally, it is unclear which sequences are suitable as negative examples. Therefore, we developed a motif-oriented data-extraction method that extracts exon sequences around experimentally or theoretically determined ESE patterns. Positive examples are restricted by heuristics based on known properties of ESEs, e.g. location in the vicinity of a splice site, whereas negative examples are taken in the same way from the middle of long exons. We show that a suitably chosen SVM using optimized sequence kernels (e.g., the combined oligo kernel) can extract meaningful properties from these training examples. Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and readily interpretable parameters. Conclusion: The motif-oriented data-extraction method seems to produce consistent training and test data, leading to good classification rates, and thus allows verification of potential ESE motifs. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels with oligomers of a certain length could be used to extract relevant features.
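As a rough illustration of what a sequence kernel computes, the sketch below implements the plain spectrum (k-mer count) kernel. This is a simpler technique swapped in for illustration; it is not the combined oligo kernel reported in the abstract, and the function names and the default `k` are assumptions made here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (oligomer of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of two sequences.

    Hypothetical helper for illustration only: it ignores where in the
    sequence each oligomer occurs.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * cb[kmer] for kmer, n in ca.items())


if __name__ == "__main__":
    # Two short toy sequences sharing the 3-mer "GAA".
    print(spectrum_kernel("GAAGAA", "TGAAT", k=3))
```

A matrix of such pairwise values could be handed to any SVM implementation that accepts a precomputed kernel; which implementation and which kernel parameters the authors used is not stated in the abstract.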

    A knowledge engineering approach to the recognition of genomic coding regions

    Supported by a research grant from Suranaree University of Technology, fiscal years B.E. 2556-255

    A novel approach to infer orthologs and produce gene annotations at scale

    Owing to advances in DNA sequencing, the number of available genomes has increased rapidly over the past decades. The thousands of genomes already available today enable detailed comparative analyses, which are essential for answering relevant questions. These concern the association of genotype and phenotype, the study of the particular features of complex proteins, and the further development of medical applications. To answer all of these questions, it is necessary to annotate protein-coding genes in newly sequenced genomes and to determine their homology relationships. However, existing methods of genome analysis are not designed for the volume of data generated today. The central challenge in comparative genomics is therefore not the number of available genomes but the development of new methods for high-throughput data analysis. To address these problems, I propose a new paradigm for genome annotation and homology inference that is based on the alignment of whole genomes. Whereas the methods currently used for gene annotation and homology determination rely exclusively on coding sequences, including the surrounding, neutrally evolving genomic context could yield better and more complete annotations. Using genome alignments allows the proposed methodology to be scaled arbitrarily to thousands of genomes. In this thesis I present TOGA (Tool to infer Orthologs from Genome Alignments), a bioinformatics method that implements this concept and combines homology classification and gene annotation in a single pipeline. TOGA uses machine learning to distinguish orthologs from paralogs based on the alignment of intronic and intergenic regions. The benchmarking results show that TOGA outperforms conventional approaches within Placentalia. TOGA classifies homology relationships with high precision and reliably identifies inactivated genes as such. Earlier versions of TOGA were applied in several studies and used in two publications. In addition, TOGA was successfully used to annotate 500 mammalian genomes, the most extensive such dataset to date. These results show that TOGA has the potential to become an established method for gene annotation and to complement the techniques currently in use.

    Annotation of marine eukaryotic genomes


    Identification of co-regulated candidate genes by promoter analysis.

    EThOS - Electronic Theses Online Service. GB. United Kingdom.

    Comparative genomics of Dothideomycete fungi

    Fungi are a diverse group of eukaryotic micro-organisms particularly suited for comparative genomics analyses. Fungi are important to industry and fundamental science, and many of them are notorious pathogens of crops, thereby endangering the global food supply. Dozens of fungi have been sequenced in the last decade and, with the advances in next-generation sequencing, thousands of new genome sequences will become available in the coming years. In this thesis I have used bioinformatics tools to study different biological and evolutionary processes in various genomes, with a focus on the genomes of the Dothideomycete fungi Cladosporium fulvum, Dothistroma septosporum and Zymoseptoria tritici. Chapter 1 introduces the scientific disciplines of mycology and bioinformatics from a historical perspective. It exemplifies a typical whole-genome sequence analysis of a fungal genome, and focusses in particular on structural gene annotation and detection of transposable elements. In addition, it briefly reviews the microRNA pathway as known in animals and plants in the context of the putative existence of similar yet subtly different small RNA pathways in other branches of the eukaryotic tree of life. Chapter 2 addresses the newly sequenced genomes of the closely related Dothideomycete plant-pathogenic fungi Cladosporium fulvum and Dothistroma septosporum. Remarkably, it revealed a surprisingly high similarity at the protein level combined with striking differences at the DNA level, in gene repertoire and in gene expression. Most noticeably, the genome of C. fulvum appears to be at least twice as large, which is solely attributable to a much larger content of repetitive sequences. Chapter 3 describes a novel alignment-based fungal gene prediction method (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It shows excellent performance when benchmarked on a dataset of 7,000 unigene-supported gene models from ten different fungi. Applicability of the method was shown by revisiting the annotations of C. fulvum and D. septosporum and of various other fungal genomes from the first-generation sequencing era. Thousands of gene models were revised in each of the gene catalogues, indeed revealing a correlation to the quality of the genome assembly, and to sequencing strategies used in the sequencing centres, highlighting different types of errors in different annotation pipelines. Chapter 4 focusses on the unexpectedly high number of gene models identified by ABFGP that align well to informant genes, but only when frame shifts and in-frame stop codons are tolerated. These discordances could represent sequence errors (SEs) and/or disruptive mutations (DMs) that caused these truncated and erroneous gene models. We revisited the same fungal gene catalogues as in chapter 3, confirmed SEs by resequencing and subsequently removed them, yielding a large, high-confidence dataset of nearly 1,000 pseudogenes caused by DMs. This dataset of fungal pseudogenes, containing genes listed as bona fide genes in current gene catalogues, does not correspond to various observations previously made on fungal pseudogenes. Moreover, the degree of pseudogenization, which shows up to a ten-fold variation between the least and the most affected species, is generally higher in species that reproduce asexually than in those that also reproduce sexually. Chapter 5 describes explorative genomics and comparative genomics analyses revealing the presence of introner-like elements (ILEs) in various Dothideomycete fungi, including Zymoseptoria tritici, in which they had not yet been identified although its genome sequence has been publicly available for several years. ILEs combine hallmark intron properties with the apparent capability of multiplying themselves as repetitive sequences. ILEs strongly associate with events of intron gain, thereby delivering in silico proof of their mobility. Phylogenetic analyses at the intra- and inter-species level showed that most ILEs are related and likely share common ancestry. Chapter 6 provides additional evidence that ILE multiplication strongly dominates over other types of intron duplication in fungi. The observed high rate of ILE multiplication followed by rapid sequence degeneration led us to hypothesize that multiplication of ILEs has been the major cause and mechanism of intron gain in fungi, and we speculate that this could be generalized to all eukaryotes. Chapter 7 describes a new strategy for miRNA hairpin prediction using statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. We show that the method outperforms previous, conventional miRNA prediction methods, which usually apply threshold filtering. Using this method, several novel candidate miRNAs were assigned in the genomes of Caenorhabditis elegans and two human viruses. Although this chapter does not concern fungi, the study does provide a flexible method to find evidence for the existence of a putative miRNA-like pathway in fungi. Chapter 8 provides a general discussion on the advent of bioinformatics in mycological research and its implications. It highlights the necessity of a priori planning and integration of functional analysis and bioinformatics in order to achieve scientific excellence, and describes possible scenarios for the near future of fungal (comparative) genomics research. Moreover, it discusses the intrinsic error rate in large-scale, automatically inferred datasets and the implications of using and comparing those.