Search CORE

arXiv.org e-Print Archive

Languages cool as they expand: Allometric scaling and the decreasing need for new words

Author: A Clauset
A Gnedin
A Vespignani
AA Tsonis
AL Barabási
AM Petersen
B Mandelbrot
B Podobnik
B Podobnik
D Fu
D Helbing
D Lazer
DC van Leijenhorst
E Alvarez-Lacalle
EA Altmann
EG Altmann
EG Altmann
GB West
GW Oehlert
HA Makse
HAJrJSA Makse
HD Rozenfeld
HD Rozenfeld
J Gao
J-B Michel
JA Evans
L Lü
LAN Amaral
LAN Amaral
LMA Bettencourt
M Batty
M Kleiber
M Markosova
M Riccaboni
M Sigman
M Steyvers
MA Montemurro
MEJ Newman
MHR Stanley
MÁ Serrano
R Ferrer i Cancho
R Ferrer i Cancho
R Ferrer i Cancho
RN Mantegna
S Bernhardsson
S Bernhardsson
S Karlin
SK Baek
X Gabaix
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/12/2012
Field of study

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This ‘‘cooling pattern’’ forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature

Boston University Institutional Repository (OpenBU)

Digital library of University of Maribor

IMT Institutional Repository

Considering scores between unrelated proteins in the search database improves profile comparison

Author: AA Schaffer
DT Jones
G Yona
J Soding
L Rychlewski
M Frenkel-Morgenstern
M Madera
Nick V Grishin
R Sadreyev
RI Sadreyev
Ruslan I Sadreyev
S Karlin
S Pietrokovski
S Shi
SF Altschul
SF Altschul
Y Qi
Y Wang
Y Zhang
YK Yu
Yong Wang
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.</p

Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison

Author: A Dembo
A Kryshtafovych
A Zemla
A Šali
AA Schaffer
AG Murzin
AY Mitrophanov
G Yona
H Cheng
J Söding
JC Wootton
JM Chandonia
L Holm
L Rychlewski
Mindaugas Margelevičius
MO Dayhoff
N Siew
PZ Kozbial
R Arratia
R Bundschuh
R Kolodny
R Sadreyev
RC Edgar
RI Sadreyev
RL Tatusov
S Henikoff
S Henikoff
S Karlin
S Karlin
S Pietrokovski
SF Altschul
SF Altschul
TF Smith
TT Lee
Y Qi
Y Wang
Y Zhang
Česlovas Venclovas
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Detection of common evolutionary origin (homology) is a primary means of inferring protein structure and function. At present, comparison of protein families represented as sequence profiles is arguably the most effective homology detection strategy. However, finding the best way to represent evolutionary information of a protein sequence family in the profile, to compare profiles and to estimate the biological significance of such comparisons, remains an active area of research. Results Here, we present a new homology detection method based on sequence profile-profile comparison. The method has a number of new features including position-dependent gap penalties and a global score system. Position-dependent gap penalties provide a more biologically relevant way to represent and align protein families as sequence profiles. The global score system enables an analytical solution of the statistical parameters needed to estimate the statistical significance of profile-profile similarities. The new method, together with other state-of-the-art profile-based methods (HHsearch, COMPASS and PSI-BLAST), is benchmarked in all-against-all comparison of a challenging set of SCOP domains that share at most 20% sequence identity. For benchmarking, we use a reference ("gold standard") free model-based evaluation framework. Evaluation results show that at the level of protein domains our method compares favorably to all other tested methods. We also provide examples of the new method outperforming structure-based similarity detection and alignment. The implementation of the new method both as a standalone software package and as a web server is available at <url>http://www.ibt.lt/bioinformatics/coma</url>. Conclusion Due to a number of developments, the new profile-profile comparison method shows an improved ability to match distantly related protein domains. Therefore, the method should be useful for annotation and homology modeling of uncharacterized proteins.</p

Large introns in relation to alternative splicing and gene evolution: a case study of Drosophila bruno-3

Author: A Marchler-Bauer
A Mortazavi
A Nekrutenko
A Rambaut
A Resch
AA Patel
AB Carvalho
AE Vinogradov
AG Clark
AM McGuire
AN Ladd
AR Hatton
AS Chang
AV Alekseyenko
AV Philips
B Budagyan
B Modrek
B Modrek
B Prud'homme
BR Graveley
BR Graveley
BR Graveley
C Lee
C Notredame
CI Castillo-Davis
CL Fitzpatrick
CS Thummel
D Babushok
D Baek
D Gatfield
D Monroe
D Ortíz-Barrientos
DA Petrov
DA Petrov
DG Gilbert
DI Nurminsky
DJ Kenan
DL Black
DL Swofford
E Betran
E Kim
E Kim
E Wagner
EA Glazov
F-C Chen
FA Kondrashov
G Ast
G Lev-Maor
G Marais
GD Schuler
H Itoh
IA Swinburne
International Human Genome Sequencing Consortium
J Delaunay
J Felsenstein
J Rozas
JM Burnette
JM Comeron
JM Johnson
K Tamura
KL Fox-Walsh
KM Neugebauer
LF Lareau
M Clamp
M Guo
M Labrador
M Lynch
M Roy
M Talerico
MA Noor
MD Adams
MD Adams
MF Wilkinson
Mohamed AF Noor
MT Levine
Nikolai P Kandul
NJ Proudfoot
NM Kopelman
P Haddrill
PA Sharp
PJ Good
PM O'Grady
Q Pan
R Sorek
R Sorek
R Sorek
R Sorek
R Sorek
S Karlin
S Misra
S Richards
S-T Chen
SF Altschul
SM Berget
TE Royce
V Stolc
W Wang
WG Hill
Y Xing
Y Xing
Z Kan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Background: Alternative splicing (AS) of maturing mRNA can generate structurally and functionally distinct transcripts from the same gene. Recent bioinformatic analyses of available genome databases inferred a positive correlation between intron length and AS. To study the interplay between intron length and AS empirically and in more detail, we analyzed the diversity of alternatively spliced transcripts (ASTs) in the Drosophila RNA-binding Bruno-3 (Bru-3) gene. This gene was known to encode thirteen exons separated by introns of diverse sizes, ranging from 71 to 41,973 nucleotides in D. melanogaster. Although Bru-3's structure is expected to be conducive to AS, only two ASTs of this gene were previously described. Results: Cloning of RT-PCR products of the entire ORF from four species representing three diverged Drosophila lineages provided an evolutionary perspective, high sensitivity, and long-range contiguity of splice choices currently unattainable by high-throughput methods. Consequently, we identified three new exons, a new exon fragment and thirty-three previously unknown ASTs of Bru-3. All exon-skipping events in the gene were mapped to the exons surrounded by introns of at least 800 nucleotides, whereas exons split by introns of less than 250 nucleotides were always spliced contiguously in mRNA. Cases of exon loss and creation during Bru-3 evolution in Drosophila were also localized within large introns. Notably, we identified a true de novo exon gain: exon 8 was created along the lineage of the obscura group from intronic sequence between cryptic splice sites conserved among all Drosophila species surveyed. Exon 8 was included in mature mRNA by the species representing all the major branches of the obscura group. To our knowledge, the origin of exon 8 is the first documented case of exonization of intronic sequence outside vertebrates. Conclusion: We found that large introns can promote AS via exon-skipping and exon turnover during evolution likely due to frequent errors in their removal from maturing mRNA. Large introns could be a reservoir of genetic diversity, because they have a greater number of mutable sites than short introns. Taken together, gene structure can constrain and/or promote gene evolution

Public Library of Science (PLOS)

Caltech Authors

A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Author: A Krogh
A Marchler-Bauer
A Milosavljević
A Pertsemlidis
AA Schäffer
AY Mitrophanov
BJ Webb
Burkhard Rost
C Barrett
C Webber
D Drasdo
D Metzler
D Siegmund
DJC MacKay
EJ Gumbel
EP Nawrocki
ET Jaynes
I Letunic
J Park
JD Storey
JF Lawless
JS Liu
K Karplus
K Karplus
K Sjölander
M Madera
MG Kann
MQ Zhang
MS Waterman
N Chia
P Bucher
R Bundschuh
R Durbin
R Mott
R Mott
R Mott
R Olsen
RC Edgar
RD Finn
S Johnson
S Karlin
S Karlin
S Miyazawa
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
SR Eddy
TF Smith
WR Pearson
Y-K Yu
Y-K Yu
Y-K Yu
Y-K Yu
Publication venue: Public Library of Science
Publication date: 01/05/2008
Field of study

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments

Digital Repository @ Iowa State University (ISU)

Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty

Author: A Agrawal
A Agrawal
AA Schäffer
AK Hartmann
Ankit Agrawal
AY Mitrophanov
CA Orengo
J Rocha
M Kschischo
M Pagni
ML Sierk
MS Waterman
P Bucher
PH Sellers
R Mott
R Mott
R Mott
R Olsen
RF Mott
S Grossmann
S Karlin
S Kotz
S Sheetlin
S Wolfsheimer
SE Brenner
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
TF Smith
WR Pearson
WR Pearson
WR Pearson
WR Pearson
WR Pearson
X Huang
X Huang
Xiaoqiu Huang
YK Yu
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Background: Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating with it the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable and at times significantly better than SSEARCH. Using non-zero parameter set change penalty values give better performance than zero penalty. Conclusion: The fact that the homology detection performance does not degrade when using multiple parameter sets is a strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of a (or a few) pair(s) of sequences without performing a time-consuming database search

Public Library of Science (PLOS)

Fine-Tuning Translation Kinetics Selection as the Driving Force of Codon Usage Bias in the Hepatitis A Virus Capsid

Author: A Pfrunder
AA Komar
Albert Bosch
AO Urrutia
C Kimchi-Sarfaty
CC Burns
CM Ruiz-Jarabo
DC Krakauer
E Domingo
E Reich
EN Moriyama
Enric Ribes
G Sanchez
G Sanchez
G Sanchez
H Chiapello
H Musto
IK Ali
J Vandesompele
J Zhou
JB de Kok
JB Plotkin
JL Bennetzen
JL Parmley
JR Buchan
JR Coleman
JV Chamary
JV Chamary
L Aragones
LE Whetter
Lluís Aragonès
M dos Reis
M Semon
M Stenico
ME Ramos-Nino
R Geller
R Grantham
R Grantham
Raul Andino
RM Pinto
Rosa M. Pintó
S Karlin
S Mueller
Susana Guix
T Ikemura
VR Racaniello
Y Lavner
Z Yang
Publication venue: Public Library of Science
Publication date: 01/03/2010
Field of study

Hepatitis A virus (HAV), the prototype of genus Hepatovirus, has several unique biological characteristics that distinguish it from other members of the Picornaviridae family. Among these, the need for an intact eIF4G factor for the initiation of translation results in an inability to shut down host protein synthesis by a mechanism similar to that of other picornaviruses. Consequently, HAV must inefficiently compete for the cellular translational machinery and this may explain its poor growth in cell culture. In this context of virus/cell competition, HAV has strategically adopted a naturally highly deoptimized codon usage with respect to that of its cellular host. With the aim to optimize its codon usage the virus was adapted to propagate in cells with impaired protein synthesis, in order to make tRNA pools more available for the virus. A significant loss of fitness was the immediate response to the adaptation process that was, however, later on recovered and more associated to a re-deoptimization rather than to an optimization of the codon usage specifically in the capsid coding region. These results exclude translation selection and instead suggest fine-tuning translation kinetics selection as the underlying mechanism of the codon usage bias in this specific genome region. Additionally, the results provide clear evidence of the Red Queen dynamics of evolution since the virus has very much evolved to re-adapt its codon usage to the environmental cellular changing conditions in order to recover the original fitness