Information Theory and the Length Distribution of all Discrete Systems
We begin with the extraordinary observation that the length distribution of
80 million proteins in UniProt, the Universal Protein Resource, measured in
amino acids, is qualitatively identical to the length distribution of large
collections of computer functions measured in programming language tokens, at
all scales. That two such disparate discrete systems share important structural
properties suggests that yet other apparently unrelated discrete systems might
share the same properties, and certainly invites an explanation.
We demonstrate that this is inevitable for all discrete systems of components
built from tokens or symbols. Departing from existing work by embedding the
Conservation of Hartley-Shannon information (CoHSI) in a classical statistical
mechanics framework, we identify two kinds of discrete system, heterogeneous
and homogeneous. Heterogeneous systems contain components built from a unique
alphabet of tokens and yield an implicit CoHSI distribution with a sharp
unimodal peak asymptoting to a power-law. Homogeneous systems contain
components each built from just one kind of token unique to that component and
yield a CoHSI distribution corresponding to Zipf's law.
This theory is applied to heterogeneous systems (proteome, computer
software, music); homogeneous systems (language texts, abundance of the
elements); and to systems in which both heterogeneous and homogeneous behaviour
co-exist (word frequencies and word length frequencies in language texts). In
each case, the predictions of the theory are tested and supported to high
levels of statistical significance. We also show that in the same heterogeneous
system, different but consistent alphabets must be related by a power-law. We
demonstrate this on a large body of music by excluding and including note
duration in the definition of the unique alphabet of notes.
Comment: 70 pages, 53 figures, inc. 30 pages of Appendices
CoHSI I; Detailed properties of the Canonical Distribution for Discrete Systems such as the Proteome
The CoHSI (Conservation of Hartley-Shannon Information) distribution is at
the heart of a wide class of discrete systems, defining the length distribution
of their components amongst other global properties. Discrete systems such as
the known proteome (where components are proteins), computer software (where
components are functions) and texts (where components are books) are all known
to fit this distribution accurately. In this short paper, we explore its solution
and its resulting properties and lay the foundation for a series of papers
which will demonstrate, amongst other things, why the average length of
components is so highly conserved and why long components occur so frequently
in these systems. These properties are not amenable to local arguments such as
natural selection in the case of the proteome or human volition in the case of
computer software, and indeed turn out to be inevitable global properties of
discrete systems devolving directly from CoHSI and shared by all. We will
illustrate this using examples from the UniProt protein database as a prelude
to subsequent studies.
Comment: 13 pages, 11 figures
CoHSI IV: Unifying Horizontal and Vertical Gene Transfer - is Mechanism Irrelevant ?
In previous papers we have described, with strong experimental support, the
organising role that CoHSI (Conservation of Hartley-Shannon Information) plays
in determining important global properties of all known proteins, from defining
the length distribution, to the natural emergence of very long proteins and
their relationship to evolutionary time. Here we consider the insight that
CoHSI might bring to a different problem, the distribution of identical
proteins across species. Horizontal and Vertical Gene Transfer (HGT/VGT) both
lead to the replication of protein sequences across species through a diversity
of mechanisms, some of which remain unknown. In contrast, CoHSI predicts from
fundamental theory that such systems will demonstrate power law behavior
independently of any mechanisms, and using the UniProt database we show that
the global pattern of protein re-use is emphatically linear on a log-log plot
(adj. over 4 decades); i.e. it is
extremely close to the predicted power law. Specifically we show that over 6.9
million proteins in TrEMBL 18-02 are re-used, i.e. their sequence appears
identically in between 2 and 9,812 species, with re-used proteins varying in
length from 7 to as long as 14,596 amino acids. Using (DL+V) to denote the
three domains of life plus viruses, 21,676 proteins are shared between two
(DL+V); 22 between three (DL+V) and 5 are shared in all four (DL+V). Although
the majority of protein re-use occurs between bacterial species, those proteins
most frequently re-used occur disproportionately in viruses, which play a
fundamental role in this distribution.
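
The phrase "linear on a log-log plot" above can be made concrete with a small sketch. The function below fits a straight line to log10(y) against log10(x) by ordinary least squares and reports the adjusted R-squared for a single predictor. It is an illustration only: the function name and the synthetic data are assumptions, and nothing here is taken from the authors' own fitting procedure or from the TrEMBL data.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    Hypothetical helper for illustration. Returns
    (slope, intercept, adjusted_r2). Needs at least 3 positive points.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n

    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2, then the adjustment for one predictor: (n - 1) / (n - 2).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


# Synthetic example spanning 4 decades: y = 1e6 * x^-2 exactly.
xs = [1, 10, 100, 1000, 10000]
ys = [1e6 * x ** -2 for x in xs]
slope, intercept, adj_r2 = fit_power_law(xs, ys)
```

On real count data a straight-line fit in log-log space is only a first check; the slope estimates the power-law exponent, and the adjusted R-squared summarises how close the points lie to that line.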
These results suggest that diverse mechanisms of gene transfer (including
traditional inheritance) are irrelevant in determining the global distribution
of protein re-use.
Comment: 16 pages, 8 figures, 8 tables, 37 references
CoHSI V: Identical multiple scale-independent systems within genomes and computer software
A mechanism-free and symbol-agnostic conservation principle, the Conservation
of Hartley-Shannon Information (CoHSI) is predicted to constrain the structure
of discrete systems regardless of their origin or function. Despite their
distinct provenance, genomes and computer software share a simple structural
property; they are linear symbol-based discrete systems, and thus they present
an opportunity to test in a comparative context the predictions of CoHSI. Here,
without any consideration of, or relevance to, their role in specifying
function, we identify that 10 representative genomes (from microbes to human)
and a large collection of software contain identically structured nested
subsystems. In the case of base sequences in genomes, CoHSI predicts that if we
split the genome into n-tuples (a 2-tuple is a pair of consecutive bases; a
3-tuple is a trio and so on), without regard for whether or not a region is
coding, then each collection of n-tuples will constitute a homogeneous discrete
system and will obey a power-law in frequency of occurrence of the n-tuples. We
consider 1-, 2-, 3-, 4-, 5-, 6-, 7- and 8-tuples of ten species and demonstrate
that the predicted power-law behavior is emphatically present, and furthermore
as predicted, is insensitive to the start window for the tuple extraction i.e.
the reading frame is irrelevant.
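
The tuple extraction described above can be sketched as follows. This is a minimal illustration that assumes the split is into non-overlapping n-tuples starting at a chosen offset (the "start window"); the function names and the toy sequence are hypothetical and not taken from the paper.

```python
from collections import Counter


def tuple_counts(sequence, n, offset=0):
    """Count non-overlapping n-tuples read from `offset`.

    Assumption: tuples do not overlap, and a trailing partial tuple
    is dropped. Changing `offset` shifts the reading frame.
    """
    last_start = len(sequence) - n
    return Counter(
        sequence[i:i + n] for i in range(offset, last_start + 1, n)
    )


def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent tuple first."""
    ordered = counts.most_common()
    return [(rank, freq) for rank, (_, freq) in enumerate(ordered, start=1)]


# Toy sequence; a real test would use a whole genome.
toy = "ACGTACGTAC"
frame0 = tuple_counts(toy, 2, offset=0)   # AC, GT, AC, GT, AC
frame1 = tuple_counts(toy, 2, offset=1)   # CG, TA, CG, TA
ranked = rank_frequency(frame0)
```

Comparing the rank-frequency lists obtained with different `offset` values is one way to check the claim that the reading frame does not matter. The same functions apply unchanged to any string of symbols, for example a hexadecimal dump of machine code.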
We go on to provide a proof of Chargaff's second parity rule and on the basis
of this proof, predict higher order tuple parity rules which we then identify
in the genome data.
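
Chargaff's second parity rule states that, within a single DNA strand, the counts of A and T are approximately equal, as are the counts of C and G. A commonly stated higher-order form compares each k-tuple with its reverse complement. The sketch below checks that form; treating it as the higher-order rule meant in the abstract is an assumption, as are the function names and the use of overlapping k-tuples.

```python
from collections import Counter

# Translation table for the four standard bases. Other symbols (e.g. N)
# are left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the result."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_counts(strand, k):
    """Count overlapping k-tuples on one strand (an assumption)."""
    return Counter(strand[i:i + k] for i in range(len(strand) - k + 1))


def parity_pairs(strand, k):
    """Pair each k-tuple's count with its reverse complement's count.

    Each tuple / reverse-complement pair is reported once, keyed by
    whichever member sorts first.
    """
    counts = kmer_counts(strand, k)
    pairs = {}
    for kmer in sorted(counts):
        rc = reverse_complement(kmer)
        if rc in pairs:
            continue  # already reported under its partner
        pairs[kmer] = (counts[kmer], counts[rc])
    return pairs


# Toy strand chosen to be its own reverse complement.
toy = "AACGTT"
singles = parity_pairs(toy, 1)
doubles = parity_pairs(toy, 2)
```

On genome-scale data the two numbers in each pair would be expected to be close rather than identical; the toy strand gives exact equality only because it is its own reverse complement.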
CoHSI predicts precisely the same behavior in computer software. This
prediction was tested and confirmed using 2-, 3- and 4-tuples of the
hexadecimal representation of machine code in multiple computer programs,
underlining the fundamental role played by CoHSI in defining the landscape in
which discrete symbol-based systems must operate.
Comment: 22 pages, 13 figures, 35 references
CoHSI III: Long proteins and implications for protein evolution
The length distribution of proteins measured in amino acids follows the CoHSI
(Conservation of Hartley-Shannon Information) probability distribution. In
previous papers we have verified various predictions of this using the UniProt
database but here we explore a novel predicted relationship between the longest
proteins and evolutionary time. We demonstrate from both theory and experiment
that the longest protein and the total number of proteins are intimately
related by Information Theory and we give a simple formula for this. We stress
that no evolutionary explanation is necessary; it is an intrinsic property of a
CoHSI system. While the CoHSI distribution favors the appearance of proteins
with fewer than 750 amino acids (characteristic of most functional proteins or
their constituent domains) its intrinsic asymptotic power-law also favors the
appearance of unusually long proteins; we predict that there are as yet
undiscovered proteins longer than 45,000 amino acids. In so doing, we draw an
analogy between the process of protein folding driven by favorable pathways (or
funnels) through the energy landscape of protein conformations, and the
preferential information pathways through which CoHSI exerts its constraints in
discrete systems.
Finally, we show that CoHSI predicts the recent appearance in evolutionary
time of the longest proteins, specifically in eukaryotes because of their
richer unique alphabet of amino acids, and by merging with independent
phylogenetic data, we confirm a predicted consistent relationship between the
longest proteins and documented and potential undocumented mass extinctions.
Comment: 20 pages, 12 figures, 3 tables, 37 references
CoHSI II; The average length of proteins, evolutionary pressure and eukaryotic fine structure
The CoHSI (Conservation of Hartley-Shannon Information) distribution is at
the heart of a wide class of discrete systems, defining (amongst other
properties) the length distribution of their components. Discrete systems such
as the known proteome, computer software and texts are all known to fit this
distribution accurately. In a previous paper, we explored the properties of
this distribution in detail. Here we will use these properties to show why the
average length of components in general and proteins in particular is highly
conserved, howsoever measured, demonstrating this on various aggregations of
proteins taken from the UniProt database. We will go on to define departures
from this equilibrium state, identifying fine structure in the average length
of eukaryotic proteins that results from evolutionary processes.
Comment: 14 pages, 14 figures