Tandem repeat copy-number variation in protein-coding regions of human genes

By Colm T. O'Dushlaine, Richard Edwards, Stephen D. Park and Denis C. Shields


BACKGROUND: Tandem repeat variation in protein-coding regions will alter protein length and may introduce frameshifts. Tandem repeat variants are associated with variation in pathogenicity in bacteria and with human disease. We characterized tandem repeat polymorphism in human proteins, using the UniGene database, and tested whether these were associated with host defense roles.

RESULTS: Protein-coding tandem repeat copy-number polymorphisms were detected in 249 tandem repeats found in 218 UniGene clusters; observed length differences ranged from 2 to 144 nucleotides, with unit copy lengths ranging from 2 to 57. This corresponded to 1.59% (218/13,749) of proteins investigated carrying detectable polymorphisms in the copy-number of protein-coding tandem repeats. We found no evidence that tandem repeat copy-number polymorphism was significantly elevated in defense-response proteins (p = 0.882). An association with the Gene Ontology term 'protein-binding' remained significant after covariate adjustment and correction for multiple testing. Combining this analysis with previous experimental evaluations of tandem repeat polymorphism, we estimate the approximate mean frequency of tandem repeat polymorphisms in human proteins to be 6%. Because 13.9% of the polymorphisms were not a multiple of three nucleotides, up to 1% of proteins may contain frameshifting tandem repeat polymorphisms.

CONCLUSION: Around 1 in 20 human proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions. Such polymorphisms are not more frequent among defense-response proteins; their prevalence among protein-binding proteins may reflect lower selective constraints on their structural modification. The impact of frameshifting and longer copy-number variants on protein function and disease merits further investigation.
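The percentages quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and multiplying the 6% estimate by the 13.9% non-triplet share is an assumed reading of how the "up to 1%" frameshift figure arises, not a method stated in the abstract.

```python
# Figures quoted in the abstract.
polymorphic_clusters = 218              # UniGene clusters with a detected polymorphism
proteins_investigated = 13_749          # proteins investigated
estimated_polymorphic_fraction = 0.06   # combined estimate for human proteins
non_triplet_fraction = 0.139            # polymorphisms not a multiple of 3 nucleotides


def is_frameshifting(length_difference_nt: int) -> bool:
    """A length difference that is not a multiple of three shifts the reading frame."""
    return length_difference_nt % 3 != 0


# Share of investigated proteins with a detectable polymorphism.
detected_fraction = polymorphic_clusters / proteins_investigated

# Assumption: the frameshift estimate is the product of the two fractions.
frameshift_fraction = estimated_polymorphic_fraction * non_triplet_fraction

print(f"Detected polymorphic proteins: {detected_fraction:.2%}")
print(f"Estimated frameshifting share: {frameshift_fraction:.2%}")
```

Under that assumption the product comes to a little under 1%, which is consistent with the abstract's "up to 1%". The helper `is_frameshifting` applies the same multiple-of-three rule to a single length difference: the smallest observed difference (2 nucleotides) would shift the frame, while the largest (144 nucleotides) would not.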

Year: 2005
OAI identifier: oai:eprints.soton.ac.uk:151149
Provided by: e-Prints Soton

