Fast splice site detection using information content and feature reduction

Author: AKMA Baten
BCH Chang
C Burge
C Cortes
CE Shannon
D Cai
G Dror
G Ratsch
G Yeo
H Drucker
H Itoh
H Liu
JCaHLS Rajapakse
JSaRD Chuang
L Zhang
M Burset
M Pertea
M Zhang
MB Shapiro
MG Reese
N Cristianini
P Waddell
R Castelo
S Brunak
S Buckingham
S Degroeve
S Salzberg
S Sonnenburg
S Washietl
SA Marashi
SK Halgamuge
SM Hebsgaard
T Golub
T-M Chen
TD Schneider
V Vapnik
XH-F Zhang
Y Saeys
YF Sun
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Many computational methods have already been proposed for the detection of splice sites, and some of them show high prediction accuracy. However, most of these methods are limited by their long computation time when applied to whole-genome sequence data. Results: In this paper we propose a hybrid algorithm that combines several effective and informative input features with a state-of-the-art support vector machine (SVM). To obtain the input features we employ an information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to remove the less informative features from the input data. Conclusion: In this study we propose a new feature-based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when its performance is compared with various state-of-the-art and well-known methods.
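As a rough illustration of the information content idea mentioned in this abstract, the following Python sketch computes per-position Shannon information content (in bits) for a set of aligned sequences. It is a generic textbook calculation, not the authors' feature pipeline: the function name `information_content`, the uniform-background assumption, and the toy sequences are all hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(sequences, alphabet="ACGT"):
    """Return per-position information content (bits) for aligned sequences.

    Assumes a uniform background, so IC = log2(|alphabet|) - H(position),
    where H is the Shannon entropy of the observed base frequencies.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    if any(base not in alphabet for s in sequences for base in s):
        raise ValueError("sequences contain symbols outside the alphabet")

    max_entropy = math.log2(len(alphabet))
    result = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sequences)
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / len(sequences)
            if p > 0:
                entropy -= p * math.log2(p)
        result.append(max_entropy - entropy)
    return result


# Toy example (made-up windows, not real splice-site data):
# position 0 varies freely, positions 1-3 are fully conserved.
toy_windows = ["AGGT", "CGGT", "TGGT", "GGGT"]
ic = information_content(toy_windows)
```

Positions with high information content are strongly conserved and are natural candidates for input features; positions near zero bits carry little signal and are the kind of feature an elimination step might drop.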

ePublications@SCU

University of Melbourne Institutional Repository

Vertebrate gene finding from multiple-species alignments using a two-level strategy

Author: Carter David
Durbin Richard
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: One way in which the accuracy of gene structure prediction in vertebrate DNA sequences can be improved is by analyzing alignments with multiple related species, since functional regions of genes tend to be more conserved. RESULTS: We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features, using sequence alignments between multiple vertebrate species, while the structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure. This also identifies and assigns confidence scores to possible additional exons. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high confidence new coding exons not in the Ensembl gene set. CONCLUSION: We present a practical multiple species gene prediction method. Accuracy improves as additional species, up to at least eight, are introduced. The novel predictions of the whole-genome scan should support efficient experimental verification

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Author: A Bernal
A Culotta
A Feelders
AE Kel
AL Berger
AY Ng
C Burge
CM Bishop
D Cai
D Grossman
D Heckerman
D Klein
E Redhead
E Segal
F Pernkopf
G Yeo
GD Stormo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Cerquides
J Davis
J Goodman
J Grau
J Keilwagen
Jan Grau
Jens Keilwagen
L Narlikar
M Arita
M Meila-Predoviciu
M Tompa
M Zhang
MI Jordan
NK Kim
O Schulte
O Yakhnenko
P Grünwald
R Castelo
R Greiner
R Staden
S Chen
S Sonnenburg
SL Salzberg
Stefan Posch
T Fawcett
TH Kim
TM Chen
WL Buntine
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions. Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.
Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable easy application to comparative studies of different learning principles in the field of sequence analysis.

Genome-Wide Association between Branch Point Properties and Alternative Splicing

Author: A Corvelo
A Deirdre
A Loytynoja
André Corvelo
B Modrek
B Patterson
B Rhead
B Ruskin
BR Graveley
C Burge
C Gooding
CF Bourgeois
Christopher W. J. Smith
CJ Coolidge
CW Smith
D Libri
DD Licatalosi
DL Black
DM Helfman
DM Kupfer
E Blanco
E Bon
Eduardo Eyras
F Clark
G Kol
G Yeo
GJ Mulligan
HX Liu
IL Hofacker
Irmtraud M. Meyer
J Southby
K Gao
M Goux-Pelletan
M Hallegger
M Plass
M Stanke
MA Garcia-Blanco
Martina Hallegger
MB Stadler
MC Wollerton
MR Green
MS Jurica
N Bellora
NA Faustino
R Castelo
R Reed
SH Schwartz
T Joachims
T Maniatis
TW Nilsen
WG Fairbrother
WJ Kent
X Xiao
XH Zhang
Z Wang
Publication venue: Public Library of Science
Publication date: 01/01/2010

The branch point (BP) is one of the three obligatory signals required for pre-mRNA splicing. In mammals, the degeneracy of the motif combined with the lack of a large set of experimentally verified BPs complicates the task of modeling it in silico, and therefore of predicting the location of natural BPs. Consequently, BPs have been disregarded in a considerable fraction of the genome-wide studies on the regulation of splicing in mammals. We present a new computational approach for mammalian BP prediction. Using sequence conservation and positional bias we obtained a set of motifs with good agreement with U2 snRNA binding stability. Using a Support Vector Machine algorithm, we created a model complemented with polypyrimidine tract features, which considerably improves the prediction accuracy over previously published methods. Applying our algorithm to human introns, we show that BP position is highly dependent on the presence of AG dinucleotides in the 3′ end of introns, with distance to the 3′ splice site and BP strength strongly correlating with alternative splicing. Furthermore, experimental BP mapping for five exons preceded by long AG-dinucleotide exclusion zones revealed that, for a given intron, more than one BP can be chosen throughout the course of splicing. Finally, the comparison between exons of different evolutionary ages and pseudo exons suggests a key role of the BP in the pathway of exon creation in human. Our computational and experimental analyses suggest that BP recognition is more flexible than previously assumed, and it appears highly dependent on the presence of downstream polypyrimidine tracts. The reported association between BP features and the splicing outcome suggests that this, so far disregarded but yet crucial, element buries information that can complement current acceptor site models

CiteSeerX

Oxford University Research Archive

UCL Discovery

UPF Digital Repository

Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions

Author: A Hoglund
AE Kel
AE Vinogradov
B Efron
B Jaruga
BJ Deroo
C Burge
CD Schmid
CR Calladine
D Cai
D GuhaThakurta
DM Graunke
E Fayard
Elena A Ananko
Elena V Ignatieva
FA Wright
GD Stormo
HP Ko
I Abnizova
I Ben-Gal
IA Udalova
Igor I Turnaev
J Duarte
J Hu
JV Ponomarenko
K Ellrott
K Morohashi
K Quandt
KJ Campbell
L Quintana-Murci
LC Platanias
LG Cowell
M Beato
M Blanchette
M Costantini
M Ganapathi
M Lohoff
M Stepanova
M-LT Lee
ML Bulyk
MP Ponomarenko
MQ Zhang
NA Kolchanov
NI Gershenzon
Nikolay A Kolchanov
NV Klimova
O Kel-Margoulis
OA Podkolodnaia
OD King
OG Berg
P Val
PV Benos
Q Zhou
R Castelo
R Kiyama
R Osada
R Pudimat
RV Davuluri
S Kamalakaran
Tatyana I Merkulova
TC Hodgman
TK Man
TM Chen
TV Busygina
VG Levitskii
VG Levitsky
Victor G Levitsky
VV Solovyev
W Huang
WH Shen
WW Wasserman
X Xie
Y Barash
Publication venue: BioMed Central
Publication date: 01/12/2007

Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. Results: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within the TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure higher confidence in our approach, we applied resampling-jackknife and bootstrap tests for the comparison; it appears that optimized PWM and SiteGA show similar recognition performance. Then we applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for BSs using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs, and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions.
When SiteGA and optimized PWM models were applied together, false positives were substantially reduced, at least at higher stringencies. Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
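For readers unfamiliar with position weight matrices, the sketch below shows one common way to build a log-odds PWM from aligned binding sites and scan a sequence with it. It is a minimal illustration only: it is neither SiteGA nor the optimized PWM procedure described in the abstract, and the function names (`build_pwm`, `score_window`, `best_hit`), the pseudocount of 1, the uniform background, and the toy sites are all assumptions made for the example.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PWM (list of per-position dicts) from aligned sites.

    Uses a uniform background and an additive pseudocount; both are
    illustrative choices, not taken from the paper.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to equal length")
    background = 1.0 / len(alphabet)
    denominator = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        column = {}
        for base in alphabet:
            count = sum(1 for s in sites if s[pos] == base)
            freq = (count + pseudocount) / denominator
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (start_index, score) of the highest-scoring window."""
    width = len(pwm)
    hits = [
        (i, score_window(pwm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[1])


# Toy example with made-up sites; the consensus is GGAA.
toy_sites = ["GGAA", "GGAA", "GGAT", "GGAA"]
toy_pwm = build_pwm(toy_sites)
position, score = best_hit(toy_pwm, "TTTGGAACCC")
```

A PWM of this kind scores each position independently, which is why the abstract's point about correlations between positions (captured there by dinucleotide frequencies) can add specificity beyond what a PWM alone provides.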

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

Author: Ana Stanescu
Doina Caragea
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

MotifAdjuster: a tool for computational reassessment of transcription factor binding site annotations

Author: Baumbach Jan
Grosse Ivo
Keilwagen Jens
Kohl Thomas A
Publication venue: BioMed Central
Publication date: 01/01/2009

MotifAdjuster helps to detect errors in binding site annotations

Springer

Publications at Bielefeld University

Identifying Phenotypes Based on TCR Repertoire Using Machine Learning Methods

Author: Qin Qianqian
Publication date: 15/06/2020

The adaptive immune system can protect human beings from infection by pathogens. T cells, a kind of lymphocyte in the adaptive immune system, recognise antigens through T cell receptors (TCRs) and then generate cell-mediated immune responses. After primary immune responses, the adaptive immune system can generate corresponding immunological memory. TCRs are generated by a process of somatic gene rearrangement and therefore have high diversity. An individual's TCR repertoire can reveal their pathogen exposure history, which can assist in biological studies such as disease diagnosis. This master's thesis aims to predict phenotype status from high-throughput TCR sequencing data using machine learning approaches, to see how accurate phenotype identification based on the TCR repertoire can be. The raw TCR data is preprocessed in three different ways, and each version then proceeds through the subsequent steps separately. Several feature selection approaches are applied to obtain the most important TCRs. Machine learning algorithms including a Beta-binomial model (baseline), Logistic regression, Random forest, and the boosting algorithm LightGBM are trained and evaluated. Two datasets, Cytomegalovirus (CMV) and rheumatoid arthritis (RA), are explored. For the CMV dataset, Random forest performs best, although only slightly better than the baseline model. However, the classification results on the RA dataset are poor regardless of the model used, and the best classifier is LightGBM. The results imply that the TCR data needs to be large enough to make powerful predictions. With a sufficiently large dataset, the baseline model predicts well, and certain algorithms such as Random forest may outperform it.

Aaltodoc Publication Archive

Insights into Protein–Protein Interfaces using a Bayesian Network Prediction Method

Author: Abbate
Ahmad
Aloy
Andrew J. Bulpitt
Ansari
Arkin
Arkin
Bahadur
Baker
Baseman
Beaumont
Ben-Gal
Berman
Bogan
Bordner
Bradford
Bradford
Brannon
Burgoyne
Caffrey
Cai
Castelo
Chakrabarti
Chothia
Chris J. Needham
Chung
Clackson
Cohen
Crowley
David R. Westhead
De
Djinovic Carugo
Drawid
Duncan
Fariselli
Fauchère
Fernandez-Recio
Friedman
Friedman
Frishman
Glaser
Glaser
Grishin
Guharoy
Gunasekaran
Hartemink
Hartemink
Hu
Husmeier
Husmeier
Ifuku
James R. Bradford
Jansen
Jones
Jones
Jones
Keskin
Kim
Kimura
Klinger
Koenderink
Krissinel
Larsen
Lee
Lejeune
Lichtarge
Liu
Lo Conte
Lu
Ma
Matthews
Michalopoulos
Mintseris
Mintseris
Murphy
Murzin
Nariai
Needham
Neuvirth
Nooren
Nooren
Ofran
Oki
Pagliaro
Pe'er
Pudimat
Pupko
Reš
Rocchia
Rocchia
Ryan
Salzberg
Sanner
Smith
Stamper
Stewart
Stewart
Tamada
Troyanskaya
Tsai
Valdar
Valdar
Vetter
Wang
White
Wu
Xu
Yan
Yoo
You
Young
Zhao
Zhou
Zhou
zur Hausen
Publication venue: Elsevier BV