Overlapping genes exist in all domains of life and are much more abundant
than expected at their first discovery in the late 1970s. Assuming that the
reference gene is read in frame +0, an overlapping gene can be encoded in two
reading frames in the sense strand, denoted by +1 and +2, and in three reading
frames in the opposite strand, denoted by -0, -1 and -2. This motivated
numerous researchers to study the constraints induced by the genetic code on
the various overlapping frames, mostly based on information theory. Our focus
in this paper is on the constraints induced on two overlapping genes in terms
of amino acids, as well as polypeptides. We show that simple linear constraints
bind the amino acid composition of two proteins encoded by overlapping genes.
Novel constraints are revealed when polypeptides are considered, and not just
single amino acids. For example, in double-coding sequences with an overlapping
reading frame -2, each Tyrosine (denoted as Tyr or Y) in the overlapping frame
overlaps a Tyrosine in the reference frame +0 (and reciprocally), whereas
specific words (e.g. YY) never occur. We thus distinguish between null
constraints (YY = 0 in frame -2) and non-null constraints (Y in frame +0 ⇔ Y
in frame -2). Our equivalence-based constraints are symmetrical and thus enable
the characterization of the joint composition of overlapping proteins. We
describe several formal frameworks and a graph algorithm to characterize and
compute these constraints. These results yield support for understanding the
mechanisms and evolution of overlapping genes, and for developing novel
overlapping gene detection methods.

Gascuel, Olivier

Lèbre, Sophie

English

arXiv

Overlapping genes exist in all domains of life and are much more abundant than expected upon their first discovery in the late 1970s. Assuming that the reference gene is read in frame +0, an overlapping gene can be encoded in two reading frames in the sense strand, denoted by +1 and +2, and in three reading frames in the opposite strand, denoted by -0, -1, and -2. This motivated numerous researchers to study the constraints induced by the genetic code on the various overlapping frames, mostly based on information theory. Our focus in this paper is on the constraints induced on two overlapping genes in terms of amino acids, as well as polypeptides. We show that simple linear constraints bind the amino-acid composition of two proteins encoded by overlapping genes. Novel constraints are revealed when polypeptides are considered, and not just single amino acids. For example, in double-coding sequences with an overlapping reading frame -2, each Tyrosine (denoted as Tyr or Y) in the overlapping frame overlaps a Tyrosine in the reference frame +0 (and reciprocally), whereas specific words (e.g. YY) never occur. We thus distinguish between null constraints (YY = 0 in frame -2) and non-null constraints (Y in frame +0 ⇔ Y in frame -2). Our equivalence-based constraints are symmetrical and thus enable the characterization of the joint composition of overlapping proteins. We describe several formal frameworks and a graph algorithm to characterize and compute these constraints. As expected, the degrees of freedom left by these constraints vary drastically among the different overlapping frames. Interestingly, the biological meaning of constraints induced on two overlapping proteins (hydropathy, forbidden di-peptides, expected overlap length, etc.) is also specific to the reading frame.
We study the combinatorics of these constraints for overlapping polypeptides of length n, pointing out that (i) except for frame -2, non-null constraints are deduced from the amino-acid (length = 1) constraints and (ii) null constraints are deduced from the di-peptide (length = 2) constraints. These results yield support for understanding the mechanisms and evolution of overlapping genes, and for developing novel overlapping gene detection methods.
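
The Tyrosine example for frame -2 can be checked by brute force. The sketch below is an illustration only, not the formal frameworks or the graph algorithm described in the paper. It assumes the standard genetic code (stop codons TAA, TAG, TGA; Tyr codons TAT, TAC) and assumes that frame -2 corresponds to antisense codons whose window on the sense strand starts at offset 2 (0-based) relative to the reference codons; that offset is inferred from the stated constraint rather than taken from the paper's definitions. The function name and its return values are hypothetical.

```python
from itertools import product

# Assumption: standard genetic code.
STOPS = {"TAA", "TAG", "TGA"}
TYR = {"TAT", "TAC"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def check_tyrosine_constraints(offset=2):
    """Enumerate all 6-nt sense sequences (two reference codons).

    The antisense codon is read from the sense window
    [offset, offset + 3). A sequence is double-coding when neither
    reference codon nor the antisense codon is a stop codon.

    Returns (equivalence_holds, yy_count, n_valid):
      equivalence_holds: Tyr in the second reference codon <=> Tyr in
                         the antisense codon, for every valid sequence
      yy_count:          valid sequences whose reference reads Tyr-Tyr
      n_valid:           number of valid double-coding sequences
    """
    equivalence_holds = True
    yy_count = 0
    n_valid = 0
    for nts in product("ACGT", repeat=6):
        seq = "".join(nts)
        ref1, ref2 = seq[0:3], seq[3:6]
        anti = reverse_complement(seq[offset:offset + 3])
        if ref1 in STOPS or ref2 in STOPS or anti in STOPS:
            continue  # not a double-coding sequence
        n_valid += 1
        if (ref2 in TYR) != (anti in TYR):
            equivalence_holds = False
        if ref1 in TYR and ref2 in TYR:
            yy_count += 1
    return equivalence_holds, yy_count, n_valid


if __name__ == "__main__":
    print(check_tyrosine_constraints(offset=2))
```

Under these assumptions, the equivalence holds because the two overlapping codons share the dinucleotide TA, and any TA-initial codon that is not a stop encodes Tyr. The word YY is excluded because TAY TAY on the sense strand forces the antisense codon TAA or TAG. Passing `offset=1` shows that the equivalence is specific to the chosen frame.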


Archive Ouverte en Sciences de l'Information et de la Communication

The combinatorics of overlapping genes

