6,423 research outputs found

Exon-phase symmetry and intrinsic structural disorder promote modular evolution in the human genome

Author: Adams
Balazs
Buljan
Burra
Corvelo
Daughdrill
Davey
Davey
Diella
Dosztanyi
Dosztanyi
Dyson
Eva Schad
Fedorov
Fisher
Fujita
Fuxreiter
Fuxreiter
Gilbert
Greaser
Grover
Hernandez
Kaessmann
Kalmar
Kaplon
Kato
Kawasaki
Kiss
Kiss
Kovacs
Lajos Kalmar
Lee
Li
Long
Meszaros
Mittag
Modrek
Mosca
Oliver
Pancsa
Patthy
Patthy
Patthy
Pentony
Peter Tompa
Punta
Romero
Sarkar
Seet
Sire
Tompa
Tompa
Tompa
Tompa
Tompa
Tompa
Uversky
Van Roey
Vucetic
Ward
Weatheritt
Weatheritt
Zhang
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2013

A key signature of module exchange in the genome is phase symmetry of exons, suggestive of exon shuffling events that occurred without disrupting translation reading frame. At the protein level, intrinsic structural disorder may be another key element because disordered regions often serve as functional elements that can be effectively integrated into a protein structure. Therefore, we asked whether exon-phase symmetry in the human genome and structural disorder in the human proteome are connected, signalling such evolutionary mechanisms in the assembly of multi-exon genes. We found an elevated level of structural disorder of regions encoded by symmetric exons and a preferred symmetry of exons encoding for mostly disordered regions (>70% predicted disorder). Alternatively spliced symmetric exons tend to correspond to the most disordered regions. The genes of mostly disordered proteins (>70% predicted disorder) tend to be assembled from symmetric exons, which often arise by internal tandem duplications. Preponderance of certain types of short motifs (e.g. SH3-binding motif) and domains (e.g. high-mobility group domains) suggests that certain disordered modules have been particularly effective in exon-shuffling events. Our observations suggest that structural disorder has facilitated modular assembly of complex genes in evolution of the human genome. © 2013 The Author(s)
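The notion of exon-phase symmetry in the abstract above can be illustrated with a minimal sketch. It assumes the usual convention that an intron has phase 0, 1 or 2 according to how many nucleotides of a codon precede it, so that an internal exon is symmetric when its flanking introns share the same phase. The function names are hypothetical and are not taken from the paper.

```python
def downstream_phase(upstream_phase: int, exon_length: int) -> int:
    """Phase of the intron following an exon, given the phase of the
    intron preceding it and the exon length in nucleotides."""
    if upstream_phase not in (0, 1, 2):
        raise ValueError("intron phase must be 0, 1 or 2")
    if exon_length <= 0:
        raise ValueError("exon length must be positive")
    # The exon carries over `upstream_phase` nucleotides of a split codon,
    # so the reading-frame offset at its 3' end is the sum modulo 3.
    return (upstream_phase + exon_length) % 3


def phase_class(upstream_phase: int, exon_length: int) -> str:
    """Label such as '1-1' (symmetric) or '2-0' (asymmetric)."""
    return f"{upstream_phase}-{downstream_phase(upstream_phase, exon_length)}"


def is_symmetric(upstream_phase: int, exon_length: int) -> bool:
    """True when both flanking introns have the same phase, which holds
    exactly when the exon length is a multiple of three."""
    return downstream_phase(upstream_phase, exon_length) == upstream_phase
```

Under this convention a symmetric exon can be inserted, duplicated or skipped without shifting the downstream reading frame, which is why symmetry is read as a signature of exon shuffling.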

Crossref

Repository of the Academy's Library

N-terminal proteomics assisted profiling of the unexplored translation initiation landscape in Arabidopsis thaliana

Author: Gevaert Kris
Jonckheere Veronique
Martens Lennart
Ndah Elvis
Stael Simon
Sticker Adriaan
Van Breusegem Frank
Van Damme Petra
Willems Patrick
Publication venue: American Society for Biochemistry & Molecular Biology (ASBMB)
Publication date: 01/01/2017

Proteogenomics is an emerging research field that still lacks a uniform method of analysis. Proteogenomic studies in which N-terminal proteomics and ribosome profiling are combined suggest that a high number of protein start sites are currently missing in genome annotations. We constructed a proteogenomic pipeline specific for the analysis of N-terminal proteomics data, with the aim of discovering novel translational start sites outside annotated protein-coding regions. In summary, unidentified MS/MS spectra were matched to a specific N-terminal peptide library encompassing protein N termini encoded in the Arabidopsis thaliana genome. After stringent false discovery rate filtering, 117 protein N termini compliant with N-terminal methionine excision specificity and indicative of translation initiation were found. These include N-terminal protein extensions and translation from transposable elements and pseudogenes. Gene prediction provided supporting protein-coding models for approximately half of the protein N termini. Besides the prediction of functional domains (partially) contained within the newly predicted ORFs, further supporting evidence of translation was found in the recently released Araport11 genome re-annotation of Arabidopsis and computational translations of sequences stored in public repositories. Most interestingly, complementary evidence by ribosome profiling was found for 23 protein N termini. Finally, by analyzing protein N-terminal peptides, an in silico analysis demonstrates the applicability of our N-terminal proteogenomics strategy in revealing protein-coding potential in species with well- and poorly annotated genomes.

Ghent University Academic Bibliography

Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana

Author: Antonio Baltazar A.
Aono Hideo
Apweiler Rolf
Barrero Roberto A.
Bruskiewich Richard
Bureau Thomas
Burr Benjamin
Burr Frances
Costa de Oliveira Antonio
Fujii Yasuyuki
Fuks Galina
Gojobori Takashi
Habara Takuya
Haberer Georg
Han Bin
Harada Erimi
Higo Kenichi
Hilton Phillip B.
Hiraki Aiko T.
Hirochika Hirohiko
Hoen Douglas
Hokari Hiroki
Hosokawa Satomi
Hsing Yue
Ikawa Hiroshi
Ikeo Kazuho
Imanishi Tadashi
Ito Yukiyo
Itoh Takeshi
Jaiswal Pankaj
Kanno Masako
Kawahara Yosihiro
Kawamura Toshiyuki
Kawashima Hiroaki
Khurana Jitendra P.
Kikuchi Shoshi
Komatsu Setsuko
Koyanagi Kanako O.
Kubooka Hiromi
Liberherr Damien
Lin Yao-Cheng
Lonsdale David
Matsumoto Takashi
Matsuya Akihiro
McCombie W. Richard
Messing Joachim
Miyao Akio
Mulder Nicola
Nagamura Yoshiaki
Nam Jongmin
Namiki Nobukazu
Numa Hisataka
Nurimoto Shin
O'Donovan Claire
Ohyanagi Hajimi
Okido Toshihisa
OOta Satoshi
Osato Naoki
Palmer Lance E.
Quetier Francis
Raghuvanshi Surabh
Saichi Naomi
Sakai Hiroaki
Sakai Yasumichi
Sakata Katsumi
Sakurai Tetsuya
Saski Takuji
Sato Fumihiko
Sato Yoshiharu
Schoof Heiko
Seki Motoaki
Shibata Katsumi
Shibata Michie
Shimizu Yuji
Shinozaki Kazuo
Shinso Yuji
Singh Nagendra K.
Smith-White Brian
Takeda Jun-ichi
Tanaka Tsuyoshi
Tanino Motohiko
Tatusova Tatiana
Thongjuea Supat
Todokoro Fusano
Tsugane Mika
Tyagi Akhilesh K.
Vanavichit Apichart
Wang Aihui
Wing Rod A.
Yamaguchi Kaori
Yamamoto Mayu
Yamamoto Naoyuki
Yamasaki Chisato
Yu Yeisoo
Zhang Hao
Zhao Qiang
Publication venue: Cold Spring Harbor Laboratory Press
Publication date: 01/01/2007

We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ~32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene

Crossref

PubMed Central

Queensland University of Technology ePrints Archive

Caltech Authors

University of Queensland eSpace

Unsupervised and semi-supervised training methods for eukaryotic gene prediction

Author: Ter-Hovhannisyan Vardges
Publication venue: Georgia Institute of Technology
Publication date: 17/11/2008

This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. These training sets often require time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence with. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing. Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species that possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns. The results indicate that the developed unsupervised training methods perform well compared to other training methods, as estimated from the set of genes supported by EST-to-genome alignments. Analysis of novel genomes reveals interesting biological findings and shows that several candidates for under-annotated and over-annotated fungal species are present in the current set of annotated fungal genomes. Ph.D. Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernof

Scholarly Materials And Research @ Georgia Tech

RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts.

Author: DeRisi Joseph L
Dimon Michelle T
Sorber Katherine
Publication venue: eScholarship, University of California
Publication date: 17/01/2011

Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data in conjunction with publicly available RNA-Seq data with HMMSplicer, an in-house developed splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome.

PubMed Central

eScholarship - University of California

Discrete wavelet transform de-noising in eukaryotic gene splicing

Author: AS Nair
D Anastassiou
EN Trifonov
JG Proakis
KP Soman
PP Vaidyanathan
R Kakumani
S Tiwari
Tessamma Thomas
Tina P George
TW Fox
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: This paper compares the most common digital signal processing methods of exon prediction in eukaryotes, and also proposes a technique for noise suppression in exon prediction. The specimen used here, which has relevance in medical research, has been taken from the public genomic database GenBank. Methods: Exon prediction has been done using digital signal processing methods, namely the binary method, the EIIP (electron-ion interaction pseudopotential) method and filter methods. Under the filter method, two filter designs, and two approaches using these two designs, have been tried. The discrete wavelet transform has been used for de-noising of the exon plots. Results: Results of exon prediction based on the methods mentioned above, which give values closest to the ones found in the NCBI database, are given here. The exon plot de-noised using the discrete wavelet transform is also given. Conclusion: Alterations to the proven methods, as made by the authors, improve the performance of exon prediction algorithms. It has also been shown that the discrete wavelet transform is an effective tool for de-noising which can be used with exon prediction algorithms.
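As a rough illustration of the EIIP approach named in the abstract above, the sketch below maps a DNA sequence to its EIIP values and measures spectral power at period 3 in a sliding window, which is the property that DSP-based exon predictors commonly exploit. The EIIP constants are the commonly quoted ones; the function names, window size and step are assumptions for illustration, and this is not claimed to be the authors' exact procedure (their filter designs and wavelet de-noising are not reproduced here).

```python
import cmath

# Commonly quoted EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_signal(seq: str) -> list[float]:
    """Numerical representation of a DNA string (A, C, G, T only)."""
    return [EIIP[base] for base in seq.upper()]


def period3_power(signal: list[float]) -> float:
    """Squared magnitude of the Fourier component at frequency 1/3."""
    total = sum(
        x * cmath.exp(-2j * cmath.pi * i / 3) for i, x in enumerate(signal)
    )
    return abs(total) ** 2


def sliding_period3(seq: str, window: int = 351, step: int = 3) -> list[float]:
    """Period-3 power for each window; peaks hint at coding regions."""
    signal = eiip_signal(seq)
    return [
        period3_power(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, step)
    ]
```

Plotting the output of `sliding_period3` against window position gives the kind of exon plot the abstract refers to; a de-noising step would then be applied to that curve.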

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Transcriptome Analysis of the Model Protozoan, Tetrahymena thermophila, Using Deep RNA Sequencing

Background: The ciliated protozoan Tetrahymena thermophila is a well-studied single-celled eukaryote model organism for cellular and molecular biology. However, the lack of extensive T. thermophila cDNA libraries or a large expressed sequence tag (EST) database limited the quality of the original genome annotation. Methodology/Principal Findings: This RNA-seq study describes the first deep sequencing analysis of the T. thermophila transcriptome during the three major stages of the life cycle: growth, starvation and conjugation. Uniquely mapped reads covered more than 96 % of the 24,725 predicted gene models in the somatic genome. More than 1,000 new transcribed regions were identified. The great dynamic range of RNA-seq allowed detection of a nearly six order-of-magnitude range of measurable gene expression orchestrated by this cell. RNA-seq also allowed the first prediction of transcript untranslated regions (UTRs) and an updated (larger) size estimate of the T. thermophila transcriptome: 57 Mb, or about 55 % of the somatic genome. Our study identified nearly 1,500 alternative splicing (AS) events distributed over 5.2 % of T. thermophila genes. This percentage represents a two order-of-magnitude increase over previous EST-based estimates in Tetrahymena. Evidence of stage-specific regulation of alternative splicing was also obtained. Finally, our study allowed us to completely confirm about 26.8 % of the genes originally predicted by the gene finder, to correct coding sequence boundaries an

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Warwick Research Archives Portal Repository

Institute of Hydrobiology, Chinese Academy Of Sciences

Evolution of protein domain architectures

Author: A Heger
A Marchler-Bauer
A Nagy
A Nagy
A Nagy
A Nasir
A Rijk van
A Rzhetsky
A-L Barabási
AD Moore
AD Moore
AD Moore
AH Brivanlou
AR Kersting
B Lee
B Snel
C Bru
C Chothia
C Feschotte
C Haider
C Vogel
C Vogel
C-H Hsu
C-H Hsu
CM Zmasek
D Ekman
D Wilson
DP Syamaladevi
E Bornberg-Bauer
E Dohmen
E Gogvadze
E Nimwegen van
EE Schmidt
EM Marcotte
EV Koonin
G Apic
G Apic
GP Karev
H Tordai
I Cohen-Gihon
I Letunic
I Yanai
J Gough
J Qian
J Weiner
J Weiner
J Weiner III
J Wiedenhoeft
J-M Chandonia
JAG Ranea
JH Fong
JM Eirin-Lopez
JP Demuth
JS Farris
K Forslund
L Grassi
L Leclère
L Li
L Patthy
LY Geer
M Bashton
M Buljan
M Buljan
M d C Orozco-Mosqueda
M Itoh
M Liu
M Sharma
M Stolzer
M Toll-Riera
MA Huynen
MK Basu
MK Basu
N Terrapon
N Vera-Parra
NC Brissett
NL Dawson
NM Luscombe
R Cordaux
RD Finn
RD Finn
RF Doolittle
S Wuchty
S Yang
SD Lam
SK Kummerfeld
SK Kummerfeld
T Bitard-Feildel
T Doğan
T Koestler
T Przytycka
TE Lewis
UniProt Consortium
V Hollich
VA Kuznetsov
W-D Heyer
X Xie
X-C Zhang
Y-C Wu
ÅK Björklund
ÅK Björklund
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

This chapter reviews current research on how protein domain architectures evolve. We begin by summarizing work on the phylogenetic distribution of proteins, as this will directly impact which domain architectures can be formed in different species. Studies relating domain family size to occurrence have shown that they generally follow power law distributions, both within genomes and larger evolutionary groups. These findings were subsequently extended to multi-domain architectures. Genome evolution models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. Next, we study the principles of protein domain architecture evolution and how these have been inferred from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly). We end by a discussion of some available tools for computational analysis or exploitation of protein domain architectures and their evolution

Crossref

MDC Repository

Human Promoter Prediction Using DNA Numerical Representation

Author: Arniker Swarna Bai
Publication venue: University of Windsor Leddy Library
Publication date: 01/01/2010

With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results and possible future developments have also been included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences with specificities of 92%, 94%, and 99% for human, Drosophila, and Arabidopsis sequences, respectively, with reconfigurable capability compared to the existing system.
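The sensitivity and specificity figures quoted in this abstract follow the standard confusion-matrix definitions, sketched below. The counts in the usage note are hypothetical and chosen only so that the arithmetic is easy to follow; they are not data from the dissertation.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real promoters that the classifier recovers."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-promoter sequences correctly rejected."""
    return true_neg / (true_neg + false_pos)


def false_positive_rate(true_neg: int, false_pos: int) -> float:
    """Complement of specificity: non-promoters wrongly called promoters."""
    return 1.0 - specificity(true_neg, false_pos)
```

For example, with hypothetical counts of 908 true positives and 92 false negatives, `sensitivity(908, 92)` evaluates to 0.908; with 904 true negatives and 96 false positives, `specificity(904, 96)` evaluates to 0.904.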

Scholarship at UWindsor

Genomics and Phylogeny of Cytoskeletal Proteins: Tools and Analyses

Author: Hammesfahr Björn
Publication venue
Publication date: 05/11/2011

Georg-August-University Göttingen

MPG.PuRe