Search CORE

9,407 research outputs found

FEATURE GENERATION AND ANALYSIS APPLIED TO SEQUENCE CLASSIFICATION FOR SPLICE-SITE PREDICTION

Author: Islamaj Rezarta
Publication venue
Publication date: 27/11/2007
Field of study

Sequence classification is an important problem in many real-world applications. Sequence data often contain no explicit "signals," or features, to enable the construction of classification algorithms. Extracting and interpreting the most useful features is challenging, and hand construction of good features is the basis of many classification algorithms. In this thesis, I address this problem by developing a feature-generation algorithm (FGA). FGA is a scalable method for automatic feature generation for sequences; it identifies sequence components and uses domain knowledge, systematically constructs features, explores the space of possible features, and identifies the most useful ones. In the domain of biological sequences, splice-sites are locations in DNA sequences that signal the boundaries between genetic information and intervening non-coding regions. Only when splice-sites are identified with nucleotide precision can the genetic information be translated to produce functional proteins. In this thesis, I address this fundamental process by developing a highly accurate splice-site prediction model that employs our sequence feature-generation framework. The FGA model shows statistically significant improvements over state-of-the-art splice-site prediction methods. So that biologists can understand and interpret the features FGA constructs, I developed SplicePort, a web-based tool for splice-site prediction and analysis. With SplicePort the user can explore the relevant features for splicing, and can obtain splice-site predictions for the sequences based on these features. For an experimental biologist trying to identify the critical sequence elements of splicing, SplicePort offers flexibility and a rich motif exploration functionality, which may help to significantly reduce the amount of experimentation needed. In this thesis, I present examples of the observed feature groups and describe efforts to detect biological signals that may be important for the splicing process. Naturally, FGA can be generalized to other biologically inspired classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, as well as other sequence classification problems, provided we have sufficient knowledge of the new domain

Digital Repository at the University of Maryland

Recommended from our members

The Expanding Landscape of Alternative Splicing Variation in Human Populations.

Author: Lin Lan
Pan Zhicheng
Park Eddie
Xing Yi
Zhang Zijun
Publication venue: eScholarship, University of California
Publication date: 01/01/2018
Field of study

Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine

eScholarship - University of California

Features generated for computational splice-site prediction correspond to functional elements

Author: A Goren
AJ McCullough
AJ McCullough
AL Blum
C Gooding
C Mathe
D Koller
G Kol
G Yeo
GE Crooks
H Liu
J Královicová
K Chua
K Han
KK Nelson
L Cartegni
L Mariño-Ramírez
Lise Getoor
LP Lim
LR Coulter
M Pertea
M Pertea
MB Stadler
ML Hastings
R Guigo
R Islamaj
R Islamaj Dogan
R Kohavi
R Singh
Rezarta Islamaj Dogan
S Degroeve
S Degroeve
Stephen M Mount
T Zhang
W John Wilbur
WG Fairbrother
XH Zhang
XH Zhang
XH Zhang
Y Yang
Z Wang
ZM Zheng
Publication venue: BioMed Central
Publication date: 01/10/2007
Field of study

Abstract Background Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. Results We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Conclusion Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Recommended from our members

VarSight: prioritizing clinically reported variants with binary classification algorithms.

Author: Anderson Julie A
Birch Camille L
Brown Donna M
Gajapathy Manavalan
Harris Jeremy M
Holt James M
Kelly Jacob M
Moss Alexander C
Shaterferdosian Fariba
Sosonkina Nadiya
Undiagnosed Diseases Network
Uno-Antonison Angelina E
Weborg Arthur
Wilk Brandon
Wilk Melissa A
Worthey Elizabeth A
Publication venue: eScholarship, University of California
Publication date: 01/10/2019
Field of study

BackgroundWhen applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient's phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.MethodsWe tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.ResultsWe treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods with the best classifiers ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.ConclusionsWe demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets

eScholarship - University of California

SplicePort—An interactive splice-site analysis tool

Author: Dogan Rezarta Islamaj
Getoor Lise
Mount Stephen M.
Wilbur W. John
Publication venue: Oxford University Press
Publication date: 01/01/2007
Field of study

SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters and frequency plot WebLogos can be generated for each cluster. The user can browse the identified clusters and their contributing elements, looking for new interesting signals, or can validate previously observed signals. The SplicePort web server can be accessed at http://www.cs.umd.edu/projects/SplicePort and http://www.spliceport.org

CiteSeerX

PubMed Central

NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

Author: Zhang Yi
Publication venue: UKnowledge
Publication date: 01/01/2019
Field of study

Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimension and acquisition rate, which is challenging the conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them, and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms the state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. To uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach to reconstruct relationships between transcription factors and their target genes in the second work. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images

University of Kentucky

Machine learning and mapping algorithms applied to proteomics problems

Author: Sanders William Shane
Publication venue: Scholars Junction
Publication date: 30/04/2011
Field of study

Proteins provide evidence that a given gene is expressed, and machine learning algorithms can be applied to various proteomics problems in order to gain information about the underlying biology. This dissertation applies machine learning algorithms to proteomics data in order to predict whether or not a given peptide is observable by mass spectrometry, whether a given peptide can serve as a cell penetrating peptide, and then utilizes the peptides observed through mass spectrometry to aid in the structural annotation of the chicken genome. Peptides observed by mass spectrometry are used to identify proteins, and being able to accurately predict which peptides will be seen can allow researchers to analyze to what extent a given protein is observable. Cell penetrating peptides can possibly be utilized to allow targeted small molecule delivery across cellular membranes and possibly serve a role as drug delivery peptides. Peptides and proteins identified through mass spectrometry can help refine computational gene models and improve structural genome annotations

Scholars Junction - Mississippi State University Institutional Repository

Using machine learning to predict pathogenicity of genomic variants throughout the human genome

Author: Rentzsch Philipp
Publication venue: Humboldt-Universität zu Berlin
Publication date: 14/04/2023
Field of study

Geschätzt mehr als 6.000 Erkrankungen werden durch Veränderungen im Genom verursacht. Ursachen gibt es viele: Eine genomische Variante kann die Translation eines Proteins stoppen, die Genregulation stören oder das Spleißen der mRNA in eine andere Isoform begünstigen. All diese Prozesse müssen überprüft werden, um die zum beschriebenen Phänotyp passende Variante zu ermitteln. Eine Automatisierung dieses Prozesses sind Varianteneffektmodelle. Mittels maschinellem Lernen und Annotationen aus verschiedenen Quellen bewerten diese Modelle genomische Varianten hinsichtlich ihrer Pathogenität. Die Entwicklung eines Varianteneffektmodells erfordert eine Reihe von Schritten: Annotation der Trainingsdaten, Auswahl von Features, Training verschiedener Modelle und Selektion eines Modells. Hier präsentiere ich ein allgemeines Workflow dieses Prozesses. Dieses ermöglicht es den Prozess zu konfigurieren, Modellmerkmale zu bearbeiten, und verschiedene Annotationen zu testen. Der Workflow umfasst außerdem die Optimierung von Hyperparametern, Validierung und letztlich die Anwendung des Modells durch genomweites Berechnen von Varianten-Scores. Der Workflow wird in der Entwicklung von Combined Annotation Dependent Depletion (CADD), einem Varianteneffektmodell zur genomweiten Bewertung von SNVs und InDels, verwendet. Durch Etablierung des ersten Varianteneffektmodells für das humane Referenzgenome GRCh38 demonstriere ich die gewonnenen Möglichkeiten Annotationen aufzugreifen und neue Modelle zu trainieren. Außerdem zeige ich, wie Deep-Learning-Scores als Feature in einem CADD-Modell die Vorhersage von RNA-Spleißing verbessern. Außerdem werden Varianteneffektmodelle aufgrund eines neuen, auf Allelhäufigkeit basierten, Trainingsdatensatz entwickelt. Diese Ergebnisse zeigen, dass der entwickelte Workflow eine skalierbare und flexible Möglichkeit ist, um Varianteneffektmodelle zu entwickeln. Alle entstandenen Scores sind unter cadd.gs.washington.edu und cadd.bihealth.org frei verfügbar.More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many possible ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. It is necessary to investigate all of these processes in order to evaluate which variant may be causal for the deleterious phenotype. A great help in this regard are variant effect scores. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity. Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment for the model's application. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants. The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that is scoring SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep-neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data that is based on variants selected by allele frequency. In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org

Dokumenten-Publikationsserver der Humboldt-Universität zu Berlin

On the Investigation of Biological Phenomena through Computational Intelligence

Author: Dr. Bipin kumar Tripathi
Jyotsana Pandey
Publication venue: Global Journals Inc. (US)
Publication date: 15/01/2014
Field of study

This paper is largely devoted for building a novel approach which is able to explain biological phenomena like splicing promoter gene identification disease and disorder identification and to acquire and exploit biological data This paper also presents an overview on the artificial neural network based computational intelligence technique to infer and analyze biological information from wide spectrum of complex problems Bioinformatics and computational intelligence are new research area which integrates many core subjects such as chemistry biology medical science mathematics computer and information science Since most of the problems in bioinformatics are inherently hard ill defined and possesses overlapping boundaries Neural networks have proved to be effective in solving those problems where conventional com-putation tools failed to provide solution Our experiments demonstrate the endeavor of biological phenomena as an effec-tive description for many intelligent applications Having a computational tool to predict genes and other meaningful in-formation is therefore of great value and can save a lot of expensive and time consuming experiments for biologists This paper will focus on issues related to design methodology comprising neural network to analyze biological information and investigate them for powerful application

Global Journal of Computer Science and Technology (GJCST)

Automatic intron detection in metagenomes using neural networks.

Author: Martin Indra
Publication venue: Czech Technical University in Prague. Computing and Information Centre.
Publication date: 21/01/2021
Field of study

Tato práce se zabývá detekcí intronů v metagenomech hub pomocí hlubokých neuronových sítí. Přesné biologické mechanizmy rozpoznávání a vyřezávání intronů nejsou zatím plně známy a jejich strojová detekce není považovaná za vyřešený problém. Rozpoznávání a vyřezávání intronů z DNA sekvencí je důležité pro identifikaci genů v metagenomech a hledání jejich homologií mezi známými DNA sekvencemi,které jsou dostupné ve veřejných databázích. Rozpoznání genů a nalezení jejich případných homologů umožňuje identifikaci jak již známých tak i nových druhů a jejich taxonomické zařazení. V rámci práce vznikly dva modely neuronových sítí, které detekují začátky a konce intronů, takzvaná donorová a akceptorová místa sestřihu. Detekovaná místa sestřihu jsou následně zkombinována do kandidátních intronů. Překrývající se kandidátní introny jsou poté odstraněny pomocí jednoduchého skórovacího algoritmu. Práce navazuje na existující řešení, které využívá metody podpůrných vektorů (SVM). Výsledné neuronové sítě dosahují lepších výsledků než SVM a to při více než desetinásobně nižším výpočetním čase na zpracování stejně obsáhlého genomu.This work is concerned with the detection of introns in metagenomes with deep neural networks. Exact biological mechanisms of intron recognition and splicing are not fully known yet and their automated detection has remained unresolved. Detection and removal of introns from DNA sequences is important for the identification of genes in metagenomes and for searching for homologs among the known DNA sequences available in public databases. Gene prediction and the discovery of their homologs allows the identification of known and new species and their taxonomic classification. Two neural network models were developed as part of this thesis. The models' aim is the detection of intron starts and ends with the so-called donor and acceptor splice sites. The splice sites are later combined into candidate introns which are further filtered by a simple score-based overlap resolving algorithm. The work relates to an existing solution based on support vector machines (SVM). The resulting neural networks achieve better results than SVM and require more than order of magnitude less computational resources in order to process equally large genome

Digital Library of the Czech Technical University in Prague