2,688 research outputs found

Machine learning models towards elucidating the plant intron retention code

Author: Sneham Swapnil
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2017

2017 Fall. Includes bibliographical references.

Alternative Splicing is a process that allows a single gene to encode multiple proteins. Intron Retention (IR) is a type of alternative splicing which is mainly prevalent in plants, but has been shown to regulate gene expression in various organisms and is often involved in rare human diseases. Despite its important role, not much research has been done to understand IR. The motivation behind this research work is to better understand IR and how it is regulated by various biological factors. We designed a combination of 137 features, forming an "intron retention code", to reveal the factors that contribute to IR. Using random forest and support vector machine classifiers, we show the usefulness of these features for the task of predicting whether an intron is subject to IR or not. An analysis of the top-ranking features for this task reveals a high level of similarity of the most predictive features across the three plant species, demonstrating the conservation of the factors that determine IR. We also found a high level of similarity to the top features contributing to IR in mammals. The task of predicting the response to drought stress proved more difficult, with lower levels of accuracy and lower levels of similarity across species, suggesting that additional features need to be considered for predicting condition-specific IR.

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Alu Exonization Events Reveal Features Required for Precise Recognition of Exons by the Splicing Machinery

Author: A Corvelo
A Goren
A Grover
A Levy
AF Muro
AJ Mighell
B Clouet d'Orval
C-C Chang
D Libri
D Solnick
DL Black
E Buratti
E Dimitriadou
E Kim
Eddo Kim
FU Nasim
G Ast
G Dreyfuss
G Dror
G Lev-Maor
Gil Ast
IL Hofacker
IM Meyer
J Jurka
J Kralovicova
J Wang
JB Kruskal
JO Kriegs
KL Fox-Walsh
L Cartegni
L Katz
LP Eperon
M Blanchette
M Hiller
M Krull
M Roy
MB Shapiro
ML Hastings
N Gal-Mark
N Sela
Nir Kfir
NN Singh
Nurit Gal-Mark
O Ram
PJ Shepard
R Sorek
Ram Oren
RF Roscigno
Roderic Guigó
S Jacquenet
S Washietl
Schraga Schwartz
SH Nagaraj
SH Schwartz
SM Berget
T Sing
WG Fairbrother
X Roca
XH Zhang
Y Xing
Z Wang
Publication venue: Public Library of Science
Publication date: 01/01/2009

Despite decades of research, the question of how the mRNA splicing machinery precisely identifies short exonic islands within the vast intronic oceans remains to a large extent obscure. In this study, we analyzed Alu exonization events, aiming to understand the requirements for correct selection of exons. Comparison of exonizing Alus to their non-exonizing counterparts is informative because Alus in these two groups have retained high sequence similarity but are perceived differently by the splicing machinery. We identified and characterized numerous features used by the splicing machinery to discriminate between Alu exons and their non-exonizing counterparts. Of these, the most novel is secondary structure: Alu exons in general and their 5′ splice sites (5′ss) in particular are characterized by decreased stability of local secondary structures with respect to their non-exonizing counterparts. We detected numerous further differences between Alu exons and their non-exonizing counterparts, among others in terms of exon–intron architecture and strength of splicing signals, enhancers, and silencers. Support vector machine analysis revealed that these features allow a high level of discrimination (AUC = 0.91) between exonizing and non-exonizing Alus. Moreover, the computationally derived probabilities of exonization significantly correlated with the biological inclusion level of the Alu exons, and the model could also be extended to general datasets of constitutive and alternative exons. This indicates that the features detected and explored in this study provide the basis not only for precise exon selection but also for the fine-tuned regulation thereof, manifested in cases of alternative splicing

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Automatic detection of exonic splicing enhancers (ESEs) using SVMs

Author: Gepperth Alexander
Hotz-Wagenblatt Agnes
Mersch Britta
Suhai Sándor
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Exonic splicing enhancers (ESEs) activate nearby splice sites and promote the inclusion (vs. exclusion) of exons in which they reside, while being a binding site for SR proteins. To study the impact of ESEs on alternative splicing it would be useful to have a possibility to detect them in exons. Identifying SR protein-binding sites in human DNA sequences by machine learning techniques is a formidable task, since the exon sequences are also constrained by their functional role in coding for proteins. Results: The choice of training examples needed for machine learning approaches is difficult since there are only few exact locations of human ESEs described in the literature which could be considered as positive examples. Additionally, it is unclear which sequences are suitable as negative examples. Therefore, we developed a motif-oriented data-extraction method that extracts exon sequences around experimentally or theoretically determined ESE patterns. Positive examples are restricted by heuristics based on known properties of ESEs, e.g. location in the vicinity of a splice site, whereas negative examples are taken in the same way from the middle of long exons. We show that a suitably chosen SVM using optimized sequence kernels (e.g., combined oligo kernel) can extract meaningful properties from these training examples. Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and well interpretable parameters. Conclusion: The motif-oriented data-extraction method seems to produce consistent training and test data leading to good classification rates and thus allows verification of potential ESE motifs. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels with oligomers of a certain length could be used to extract relevant features.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Feature selection for splice site prediction: A new method using EDA-based feature ranking

Author: Aeyels Dirk
Degroeve Sven
Rouzé Pierre
Saeys Yvan
Van de Peer Yves
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. RESULTS: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. CONCLUSION: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features
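
The idea of deriving a feature ranking from the estimated distribution can be illustrated with a minimal sketch. It assumes the simplest estimation of distribution algorithm, a univariate marginal distribution algorithm (UMDA), which may differ from the variant used in the paper. The function name `umda_feature_ranking`, the parameter values, and the toy fitness function are all hypothetical; in a real setting the fitness would be the performance of a splice site classifier trained on the selected feature subset.

```python
import random

def umda_feature_ranking(fitness, n_features, pop_size=60, n_select=20,
                         generations=30, seed=0):
    """Rank features by the marginal inclusion probabilities of a UMDA."""
    rng = random.Random(seed)
    probs = [0.5] * n_features  # start with no preference for any feature
    for _ in range(generations):
        # Sample candidate feature subsets (bit masks) from the current distribution.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # Keep the fittest subsets.
        selected = sorted(population, key=fitness, reverse=True)[:n_select]
        # Re-estimate each marginal; clip so no feature is fixed permanently.
        probs = [min(0.95, max(0.05, sum(m[j] for m in selected) / n_select))
                 for j in range(n_features)]
    # Features with a higher inclusion probability are ranked first.
    ranking = sorted(range(n_features), key=lambda j: probs[j], reverse=True)
    return ranking, probs

# Toy fitness (hypothetical): features 0-2 help, every other feature costs 0.5.
INFORMATIVE = {0, 1, 2}

def toy_fitness(mask):
    return sum(1.0 if j in INFORMATIVE else -0.5
               for j, bit in enumerate(mask) if bit)

ranking, probs = umda_feature_ranking(toy_fitness, n_features=6)
print(ranking)  # features ordered from most to least favoured
```

The ranking could then drive the iterative elimination the abstract describes: drop the lowest-ranked feature, re-evaluate the classifier, and repeat.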

Springer - Publisher Connector

Directory of Open Access Journals

Ghent University Academic Bibliography

PubMed Central

Genome-Wide Association between Branch Point Properties and Alternative Splicing

Author: A Corvelo
A Deirdre
A Loytynoja
André Corvelo
B Modrek
B Patterson
B Rhead
B Ruskin
BR Graveley
C Burge
C Gooding
CF Bourgeois
Christopher W. J. Smith
CJ Coolidge
CW Smith
D Libri
DD Licatalosi
DL Black
DM Helfman
DM Kupfer
E Blanco
E Bon
Eduardo Eyras
F Clark
G Kol
G Yeo
GJ Mulligan
HX Liu
IL Hofacker
Irmtraud M. Meyer
J Southby
K Gao
M Goux-Pelletan
M Hallegger
M Plass
M Stanke
MA Garcia-Blanco
Martina Hallegger
MB Stadler
MC Wollerton
MR Green
MS Jurica
N Bellora
NA Faustino
R Castelo
R Reed
SH Schwartz
T Joachims
T Maniatis
TW Nilsen
WG Fairbrother
WJ Kent
X Xiao
XH Zhang
Z Wang
Publication venue: Public Library of Science
Publication date: 01/01/2010

The branch point (BP) is one of the three obligatory signals required for pre-mRNA splicing. In mammals, the degeneracy of the motif combined with the lack of a large set of experimentally verified BPs complicates the task of modeling it in silico, and therefore of predicting the location of natural BPs. Consequently, BPs have been disregarded in a considerable fraction of the genome-wide studies on the regulation of splicing in mammals. We present a new computational approach for mammalian BP prediction. Using sequence conservation and positional bias we obtained a set of motifs with good agreement with U2 snRNA binding stability. Using a Support Vector Machine algorithm, we created a model complemented with polypyrimidine tract features, which considerably improves the prediction accuracy over previously published methods. Applying our algorithm to human introns, we show that BP position is highly dependent on the presence of AG dinucleotides in the 3′ end of introns, with distance to the 3′ splice site and BP strength strongly correlating with alternative splicing. Furthermore, experimental BP mapping for five exons preceded by long AG-dinucleotide exclusion zones revealed that, for a given intron, more than one BP can be chosen throughout the course of splicing. Finally, the comparison between exons of different evolutionary ages and pseudo exons suggests a key role of the BP in the pathway of exon creation in human. Our computational and experimental analyses suggest that BP recognition is more flexible than previously assumed, and it appears highly dependent on the presence of downstream polypyrimidine tracts. The reported association between BP features and the splicing outcome suggests that this, so far disregarded but yet crucial, element buries information that can complement current acceptor site models

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

UCL Discovery

Oxford University Research Archive

UPF Digital Repository

Identify Alternative Splicing Events Based on Position-Specific Evolutionary Conservation

Author: Chen Liang
Zheng Sika
Publication venue: Public Library of Science
Publication date: 30/07/2008

The evolution of eukaryotes is accompanied by the increased complexity of alternative splicing which greatly expands genome information. One of the greatest challenges in the post-genome era is a complete revelation of human transcriptome with consideration of alternative splicing. Here, we introduce a comparative genomics approach to systemically identify alternative splicing events based on the differential evolutionary conservation between exons and introns and the high-quality annotation of the ENCODE regions. Specifically, we focus on exons that are included in some transcripts but are completely spliced out for others and we call them conditional exons. First, we characterize distinguishing features among conditional exons, constitutive exons and introns. One of the most important features is the position-specific conservation score. There are dramatic differences in conservation scores between conditional exons and constitutive exons. More importantly, the differences are position-specific. For flanking intronic regions, the differences between conditional exons and constitutive exons are also position-specific. Using the Random Forests algorithm, we can classify conditional exons with high specificities (97% for the identification of conditional exons from intron regions and 95% for the classification of known exons) and fair sensitivities (64% and 32% respectively). We applied the method to the human genome and identified 39,640 introns that actually contain conditional exons and classified 8,813 conditional exons from the current RefSeq exon list. Among those, 31,673 introns containing conditional exons and 5,294 conditional exons classified from known exons cannot be inferred from RefSeq, UCSC or Ensembl annotations. Some of these de novo predictions were experimentally verified

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Use of genomic and transcriptomic approaches in the diagnosis of rare inherited disease linked to splicing mutations

Author: Rowlands Charles
Publication venue
Publication date: 01/08/2022

The University of Manchester - Institutional Repository

Exploiting mid-range DNA patterns for sequence classification: binary abstraction Markov models

Author: Alexei Fedorov
Allen
Andrew McSweeny
Bechtel
Bernardi
Borodovsky
Consortium
Costantini
Do
Fedorov
Fedorova
Flicek
Grosse
Guigo
Gursel Serpen
Han
Hsu
Kennedy
Lee
Lukashin
Picardi
Provost
Ruvinsky
Samuel S. Shepard
Sboner
Schweikert
Shepard
Shepelev
Sonnenburg
Ter-Hovhannisyan
Publication venue: Oxford University Press
Publication date: 01/01/2011

Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. In order to analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogenous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best MM classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code such as ncRNAs and 5′-untranslated regions
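
The two core steps described above, binary abstraction followed by a homogeneous Markov model, can be sketched as follows. This is an illustration under stated assumptions, not the published BAMM implementation: it uses a single hypothetical abstraction scheme (purine versus pyrimidine), a low-order model with add-one smoothing, and tiny made-up training strings. The scheme optimization and the SVM combination of several classifiers are omitted, and all function names are invented for this example.

```python
import math
from collections import defaultdict

# Hypothetical abstraction scheme: purines (A, G) -> "1", pyrimidines (C, T) -> "0".
PURINE_SCHEME = {"A": "1", "G": "1", "C": "0", "T": "0"}

def abstract(seq, scheme=PURINE_SCHEME):
    """Reduce a nucleotide sequence to a binary string."""
    return "".join(scheme[base] for base in seq.upper())

def train_markov(binary_seqs, order=3):
    """Fit a homogeneous Markov model; returns log P(next bit | context)."""
    counts = defaultdict(lambda: [1, 1])  # add-one pseudocounts for bits 0 and 1
    for s in binary_seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][int(s[i])] += 1
    return {ctx: (math.log(c0 / (c0 + c1)), math.log(c1 / (c0 + c1)))
            for ctx, (c0, c1) in counts.items()}

def log_likelihood(model, s, order=3):
    """Sum transition log-probabilities; unseen contexts count as 50/50."""
    default = (math.log(0.5), math.log(0.5))
    return sum(model.get(s[i - order:i], default)[int(s[i])]
               for i in range(order, len(s)))

def classify(seq, exon_model, intron_model, order=3):
    """Label a sequence by the log-likelihood ratio of the two models."""
    b = abstract(seq)
    score = (log_likelihood(exon_model, b, order)
             - log_likelihood(intron_model, b, order))
    return "exon" if score > 0 else "intron"

# Toy training data (made up, not biologically meaningful).
exon_model = train_markov([abstract("AGAGAGAGAGAG")])
intron_model = train_markov([abstract("CTCTCTCTCTCT")])
print(classify("AGGAAG", exon_model, intron_model))
```

Because each base collapses to one bit, a context of a given order covers the same number of bases with far fewer possible states, which is what makes longer patterns tractable.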

CiteSeerX

Crossref

PubMed Central

Derivation of Context-free Stochastic L-Grammar Rules for Promoter Sequence Modeling Using Support Vector Machine

Author: Damaševičius Robertas
Publication venue: Institute of Information Theories and Applications FOI ITHEA
Publication date: 01/01/2008

Formal grammars can be used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and model the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which makes promoter recognition a complex problem. We replace the problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and vertebrate promoter datasets using a genetic programming technique and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.

Bulgarian Digital Mathematics Library at IMI-BAS