11 research outputs found

    Toric ideals of homogeneous phylogenetic models

    We consider the phylogenetic tree model in which every node of the tree is observed and binary and the transitions are given by the same matrix on each edge of the tree. We are able to compute the Gröbner basis and Markov basis of the toric ideal of invariants for trees with up to 11 nodes. These are perhaps the first non-trivial Gröbner basis calculations in 2^11 indeterminates. We conjecture that there is a quadratic Gröbner basis for binary trees. Finally, we give an explicit description of the polytope associated to this toric ideal for an infinite family of binary trees and conjecture that there is a universal bound on the number of vertices of this polytope for binary trees. Comment: 6 pages, 17 figures

    Phylogenetic Algebraic Geometry

    Phylogenetic algebraic geometry is concerned with certain complex projective algebraic varieties derived from finite trees. Real positive points on these varieties represent probabilistic models of evolution. For small trees, we recover classical geometric objects, such as toric and determinantal varieties and their secant varieties, but larger trees lead to new and largely unexplored territory. This paper gives a self-contained introduction to this subject and offers numerous open problems for algebraic geometers. Comment: 15 pages, 7 figures

    Classifying and counting linear phylogenetic invariants for the Jukes Cantor model

    Linear invariants are useful tools for testing phylogenetic hypotheses from aligned DNA/RNA sequences, particularly when the sites evolve at different rates. Here we give a simple, graph-theoretic classification, for each phylogenetic tree T, of its associated vector space I(T) of linear invariants under the Jukes-Cantor one-parameter model of nucleotide substitution. We also provide an easily described basis for I(T), and show that if T is a binary (fully resolved) phylogenetic tree with n sequences at its leaves, then $\dim[I(T)] = 4^n - F_{2n-2}$, where $F_n$ is the n-th Fibonacci number. Our method applies a recently developed Hadamard-matrix-based technique to describe elements of I(T) in terms of edge-disjoint packings of subtrees in T, and thereby complements earlier, more algebraic treatments.
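
The dimension formula in this abstract can be evaluated directly for small trees. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the Fibonacci indexing is an assumption, since the abstract does not state it. The sketch assumes $F_1 = 1$, $F_2 = 2$, which gives $16 - 2 = 14$ for two leaves. That is consistent with the two distinct pattern probabilities (identical versus different states) that a two-leaf Jukes-Cantor model produces over 16 site patterns.

```python
def fibonacci(k: int) -> int:
    """k-th Fibonacci number under the ASSUMED convention F_1 = 1, F_2 = 2."""
    if k < 1:
        raise ValueError("k must be >= 1")
    a, b = 1, 2  # F_1, F_2 (indexing is an assumption, not stated in the abstract)
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def jc_linear_invariant_dim(n: int) -> int:
    """Evaluate dim I(T) = 4^n - F_{2n-2} for a binary tree with n >= 2 leaves."""
    if n < 2:
        raise ValueError("n must be >= 2")
    return 4 ** n - fibonacci(2 * n - 2)


# Tabulate the dimension for a few small leaf counts.
for n in range(2, 6):
    print(n, jc_linear_invariant_dim(n))
```

Under a different indexing convention the values shift, so the numbers printed here should be checked against the paper before being relied on.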

    Reconstructing phylogenies from nucleotide pattern probabilities : a survey and some new results

    The variations between homologous nucleotide sequences representative of various species are, in part, a consequence of the evolutionary history of these species. Determining the evolutionary tree from patterns in the sequences depends on inverting the stochastic processes governing the substitutions from their ancestral sequence. We present a number of recent (and some new) results which allow for a tree to be reconstructed from the expected frequencies of patterns in its leaf colorations generated under various Markov models. We summarise recent work using Hadamard conjugation, which provides an analytic relation between the parameters of Kimura's 3ST model on a phylogenetic tree and the sequence patterns produced. We give two applications of the theory by describing new properties of the popular "maximum parsimony" method for tree reconstruction.

    Estimation of genetic marker effects using regression models in population genomics

    Doctoral thesis (Ph.D.), Seoul National University Graduate School: Interdisciplinary Program in Bioinformatics, February 2017. Kim Heebal. After various DNA (deoxyribonucleic acid) markers at the genomic level were discovered, scientists turned their attention to DNA sequencing and genotyping. Genotyping uncovers genetic variants, which serve as molecular markers. Single nucleotide polymorphisms (SNPs) are among the most important of these markers. In particular, population-based SNPs can capture the characteristics that distinguish one individual from others. One way to reveal the causes of an individual's characteristics is to employ established statistical models. Regression analysis is frequently used in bioinformatics. I analyzed the data using regression models, including linear regression, nonlinear regression, and mixed models. This doctoral dissertation comprises five chapters. Chapter 1 gives an overview of the required population genetics theory, effective population size estimation, best linear unbiased prediction (BLUP), and genome-wide association studies (GWAS). To estimate the effective population size, two approaches were employed: the classical Sved's equation, and the Kimura 2-Parameter (K2P) model combined with the Watterson theta estimator. Computationally, Sved's equation is based on nonlinear regression, whereas the K2P approach uses the number of SNPs. BLUP is used to estimate the random effects in linear mixed models, and GWAS is used to find causal genetic variants associated with a trait. As a method for predicting random marker effects, I propose Single Nucleotide Polymorphism-Genomic Best Linear Unbiased Prediction (SNP-GBLUP). In theory, this new BLUP is based on the Genomic Relationship Matrix (GRM). In chapter 2, the effective population size of Korean Thoroughbred horses (TB horses) was estimated. The TB breed is prized for its great racing capability. 
I examined the genetic diversity and stability of the Korean TB population by estimating its effective population size. I used the two approaches mentioned earlier: Sved's equation as the basic approach, and K2P with the Watterson theta estimator as the second. I estimated the effective population size of TB horses as 79 (Sved's equation) and 77 (K2P). This is rather small compared with the TB effective population sizes reported for other countries. For instance, Corbin et al. estimated the Irish TB effective population size as 100, using Sved's equation, which is based on linkage disequilibrium (LD). In chapter 3, I introduce the SNP-SNP Relationship Matrix (SSRM), which describes the pairwise relationships between SNPs. This relationship matrix can be considered a more advanced and differentiated notion than the Genomic Relationship Matrix (GRM), which is central to Genomic Best Linear Unbiased Prediction (G-BLUP). GRM captures the relationships between individuals, a crucial concept in mixed models and BLUP, and it is one of the requisites for handling random effects effectively in BLUP. SSRM is a novel concept, although it builds on the multivariate normal distribution (MVN) and GRM. The two differ in the level at which relationships are defined: between individuals for GRM and between SNPs for SSRM. SSRM is certainly more difficult to work with and not easily validated. Despite this, it can hold extensive bioinformatic information. I regard SSRM as the underlying hidden information, and GRM as a disguised or processed form of the SNP information. Using SSRM, I analyzed human height data with a mixed model. The Korean Association Resource Phase 3 (KARE3) Ansan-Ansung cohort data contain each individual's traits and SNP information. 
The main objective was to check the usefulness of SSRM in the mixed model and to compare SSRM-based SNP-GBLUP with SNP-BLUP (Single Nucleotide Polymorphism-Best Linear Unbiased Prediction), which assumes independent and identically distributed (IID) relationships between SNPs. First, I presented the theoretical derivation of SSRM based on the probability density function (PDF) of the model and linear algebra. Second, I compared SNP-GBLUP with SNP-BLUP and G-BLUP using human height and SNP data. Both the genetic values and the SNP effects differed markedly between SNP-GBLUP and SNP-BLUP. In chapter 4, I addressed the missing heritability problem in BLUP: detected associations cannot fully explain the heritability estimated from correlations between relatives. This matters for association-based methods such as GWAS and BLUP. BLUP deals with global genetic variants and complex traits. The traits were eight Berkshire pork quality traits (fat, carcass weight, shear force, Minolta color L and A, protein content, water holding capacity, backfat thickness). These are economically important traits in the pork production industry, and therefore their breeding values (BV) must be predicted accurately to support breeding strategies. First, using the GWA study, the putative quantitative trait loci (QTL) for the traits of interest were scanned at the SNP level. As the QTL criterion, I arbitrarily chose an unadjusted P-value below 0.01. Then I analyzed the Berkshire traits with the selected SNPs using BLUP. The heritability estimated from BLUP was close to the known heritability estimates. This approach gave better genomic estimated breeding values (GEBVs) and heritability estimates than using all SNPs (the original data). 
In chapter 5, the selection coefficient in the F1 generation (a term borrowed from genetics for the generation following the current one) was predicted using Fisher's fundamental theorem of natural selection and BLUP. Fisher's fundamental theorem of natural selection states: "The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time." Selection is one of the major forces driving changes in allele frequency. It is therefore important not only to reveal the history of selection but also to predict future selection trends. The statistical model was an additive linear model, as in BLUP. I calculated the additive genetic variance of each SNP from the SNP effects (obtained with SNP-GBLUP) and then used Fisher's theorem, which links genetic variance to the selection coefficient, to calculate the selection coefficient. The gene ontology of genes containing significant SNPs was then surveyed. The phenotypes were three Holstein milk-related traits (milk yield, fat content and protein content), which are crucial to dairy farmers. The heritability estimates from BLUP were reasonable (0.39, 0.45 and 0.40 for milk yield, fat content and protein content, respectively). The gene catalogue was retrieved from the Ensembl server (www.ensembl.org). The selection coefficient estimated here has three features: it refers to the next generation, it is expected, and it is relative. "Expected" means that the coefficient from this kind of approach is only a prediction; "relative" means that the predicted values were rescaled by the maximum value, because their magnitude depends on the units of the phenotypic values. The gene ontology terms associated with highly selected SNPs predicted from the milk protein trait included dendritic spine morphogenesis and nitric oxide biosynthetic process. Dendritic spine morphogenesis was the most significant term. 
The dendritic spine is the major site of excitatory synaptic transmission in the mammalian brain and is important in synaptic development and plasticity. Thus the genes related to dendritic spine morphogenesis are expected to be important targets of future artificial selection in Holstein cattle in Korea. No significant gene ontology terms were found for milk yield or fat. Contents: Literature Review (1.1 Overview of population genetics; 1.2 Effective population size estimation; 1.3 Best linear unbiased prediction (BLUP); 1.4 Genome-wide association study (GWA study)); Chapters 2 to 5 (each with Abstract, Introduction, Materials and Method, Results, Discussion); General Discussion and Conclusion; References; Abstract in Korean.
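
The abstract above describes estimating effective population size from linkage disequilibrium by fitting Sved's equation with nonlinear regression. The following is a minimal sketch of that idea, not the thesis's actual procedure. It assumes the basic form of Sved's relation, E[r²] ≈ 1 / (1 + 4·Ne·c), with c the genetic distance in Morgans, and fits Ne by one-parameter least squares using a golden-section search. The function names, search bounds, and synthetic data are all hypothetical; published analyses often add sample-size or mutation corrections that are omitted here.

```python
import math


def expected_r2(ne: float, c: float) -> float:
    """Basic Sved relation (assumed form): E[r^2] = 1 / (1 + 4 * Ne * c)."""
    return 1.0 / (1.0 + 4.0 * ne * c)


def fit_ne(distances, r2_values, lo=1.0, hi=1e6, iterations=200):
    """Least-squares estimate of Ne via golden-section search on [lo, hi]."""

    def sse(ne: float) -> float:
        # Sum of squared residuals between observed and predicted r^2.
        return sum((r2 - expected_r2(ne, c)) ** 2
                   for c, r2 in zip(distances, r2_values))

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = sse(x1), sse(x2)
    for _ in range(iterations):
        if f1 < f2:
            # Minimum lies in [a, x2]; reuse x1 as the new upper probe.
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = sse(x1)
        else:
            # Minimum lies in [x1, b]; reuse x2 as the new lower probe.
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = sse(x2)
    return (a + b) / 2.0


# Synthetic, noise-free illustration (NOT data from the thesis).
true_ne = 79.0
distances = [0.001, 0.005, 0.01, 0.05, 0.1]  # genetic distances in Morgans
observed = [expected_r2(true_ne, c) for c in distances]
estimate = fit_ne(distances, observed)
print(round(estimate, 2))
```

Golden-section search keeps the sketch dependency-free. With real, noisy r² values a library routine for nonlinear least squares would be the more usual choice.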

    The characterization of the caleosin gene family in Triticeae and their role in G-protein signalling and Identification and characterization of rye genes silenced in allohexaploid triticale: A bioinformatic study

    Hala Badr Abdel-Sadek Khalil, Ph.D., Concordia University, 2013. The caleosin genes encode proteins with a single conserved EF-hand calcium-binding domain and comprise small gene families found in a wide range of plant species. In this study, Clo3, a member of the caleosin family in hexaploid wheat (Triticum aestivum), was shown by both in vivo and in vitro analyses to play an important role in signaling through its interaction with Gα, the alpha subunit of the heterotrimeric GTP-binding protein. This interaction increased the GTPase activity of Gα by approximately 25%. Eleven paralogous groups of caleosins, which comprise a total of thirty-four caleosin genes, were assembled and identified using the T. aestivum GenBank EST database, and ten gene family members were assembled from Secale cereale 454-cDNA sequences. Caleosin gene expression was assayed by RNA-Seq analysis of 454 sequence sets, and members of the gene family were found to have diverse patterns of gene expression in the nine tissues that were sampled in rye and in triticale, a synthetic polyploid species derived from durum wheat (Triticum turgidum) and rye (Secale cereale). The impact of the polyploidization event on rye genes in the triticale background was investigated using both caleosin genes and whole-transcriptome comparisons. The high-throughput cDNA sequence comparison between the diploid rye and the hexaploid triticale detected suppression of expression of approximately 2% of the rye genes surveyed in the triticale. 
The expression of 23,503 rye cDNA contigs was analyzed in 454-cDNA libraries obtained from anther, root and stem from both triticale and rye, as well as in five 454-cDNA data sets created from ovary, pollen, seed, seedling shoot and stigma from triticale. Among these, 112 rye cDNA contigs were found to be totally suppressed in all triticale tissues, although their expression was relatively high in rye tissues. Suppressed rye genes were found to have strikingly low similarity to their closest BLASTN matches in a current draft of the wheat genome available through the International Wheat Genome Sequencing Consortium (IWGSC). The comparison of rye silenced genes to the wheat database revealed that 89% of rye genes silenced in triticale do not have a best match in T. aestivum with sequence identity higher than 90%, whereas 59% of random rye contigs had a best hit of 90% or higher in T. aestivum. Comparisons to the draft genomes of Triticum urartu and Aegilops tauschii, the A and D genome donors to T. aestivum, respectively, support the previous observation. PCR assays found that 6 out of 10 candidate suppressed genes were deleted from the triticale genome.