64 research outputs found

    If the Current Clique Algorithms are Optimal, so is Valiant's Parser

    The CFG recognition problem is: given a context-free grammar $\mathcal{G}$ and a string $w$ of length $n$, decide if $w$ can be obtained from $\mathcal{G}$. This is the most basic parsing question and is a core computer science problem. Valiant's parser from 1975 solves the problem in $O(n^{\omega})$ time, where $\omega<2.373$ is the matrix multiplication exponent. Dozens of parsing algorithms have been proposed over the years, yet Valiant's upper bound remains unbeaten. The best combinatorial algorithms have mildly subcubic $O(n^3/\log^3{n})$ complexity. Lee (JACM'01) provided evidence that fast matrix multiplication is needed for CFG parsing, and that very efficient and practical algorithms might be hard or even impossible to obtain. Lee showed that any algorithm for a more general parsing problem with running time $O(|\mathcal{G}|\cdot n^{3-\varepsilon})$ can be converted into a surprising subcubic algorithm for Boolean Matrix Multiplication. Unfortunately, Lee's hardness result required that the grammar size be $|\mathcal{G}|=\Omega(n^6)$. Nothing was known for the more relevant case of constant size grammars. In this work, we prove that any improvement on Valiant's algorithm, even for constant size grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the $k$-Clique problem: given a graph on $n$ nodes, decide if there are $k$ that form a clique. Besides classifying the complexity of a fundamental problem, our reduction has led us to similar lower bounds for more modern and well-studied cubic time problems for which faster algorithms are highly desirable in practice: RNA Folding, a central problem in computational biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS'14).
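
    To make the recognition problem defined above concrete, here is a minimal sketch of the textbook CYK dynamic program, which decides membership in cubic time for a grammar in Chomsky normal form. It is not Valiant's parser and is not taken from the paper; it only illustrates the kind of cubic combinatorial baseline the abstract contrasts with matrix-multiplication-based parsing. The function name, the rule encoding and the balanced-parentheses grammar are all hypothetical choices made for this illustration.

```python
def cyk_recognize(word, terminal_rules, binary_rules, start="S"):
    """Decide whether `word` is derivable from `start`.

    Assumes a grammar in Chomsky normal form, encoded (hypothetically) as:
      terminal_rules: list of (A, a) for rules A -> a
      binary_rules:   list of (A, B, C) for rules A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # the empty string is not handled in this sketch

    # table[i][j] holds the nonterminals that derive word[i..j] (inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Substrings of length 1: apply the terminal rules
    for i, ch in enumerate(word):
        table[i][i] = {A for A, a in terminal_rules if a == ch}

    # Longer substrings: try every split point k and every binary rule
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                left, right = table[i][k], table[k + 1][j]
                for A, B, C in binary_rules:
                    if B in left and C in right:
                        table[i][j].add(A)

    return start in table[0][n - 1]


# Hypothetical CNF grammar for non-empty balanced parentheses:
#   S -> S S | L R | L X,   X -> S R,   L -> "(",   R -> ")"
TERMINAL_RULES = [("L", "("), ("R", ")")]
BINARY_RULES = [("S", "S", "S"), ("S", "L", "R"), ("S", "L", "X"), ("X", "S", "R")]
```

    The three nested loops over substring length, start position and split point give the $n^3$ factor, and the inner loop over rules gives a factor of $|\mathcal{G}|$; for example, `cyk_recognize("(())", TERMINAL_RULES, BINARY_RULES)` accepts, while the same call on `"(()"` rejects.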

    The Cycas genome and the early evolution of seed plants

    Cycads represent one of the most ancient lineages of living seed plants. Identifying genomic features uniquely shared by cycads and other extant seed plants, but not non-seed-producing plants, may shed light on the origin of key innovations, as well as the early diversification of seed plants. Here, we report the 10.5-Gb reference genome of Cycas panzhihuaensis, complemented by the transcriptomes of 339 cycad species. Nuclear and plastid phylogenomic analyses strongly suggest that cycads and Ginkgo form a clade sister to all other living gymnosperms, in contrast to mitochondrial data, which place cycads alone in this position. We found evidence for an ancient whole-genome duplication in the common ancestor of extant gymnosperms. The Cycas genome contains four homologues of the fitD gene family that were likely acquired via horizontal gene transfer from fungi, and these genes confer herbivore resistance in cycads. The male-specific region of the Y chromosome of C. panzhihuaensis contains a MADS-box transcription factor expressed exclusively in male cones that is similar to a system reported in Ginkgo, suggesting that a sex determination mechanism controlled by MADS-box genes may have originated in the common ancestor of cycads and Ginkgo. The C. panzhihuaensis genome provides an important new resource of broad utility for biologists

    Gene Family Histories: Theory and Algorithms

    Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs), which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. Firstly, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.

    The Cycas Genome and the Early Evolution of Seed Plants


    Avian genomics: insight into bitter taste receptors

    Master's dissertation in Bioinformatics. The detection of bitter taste is of major importance for animal survival since it provides an early evaluation of which food resources are safe, avoiding the ingestion of toxic compounds and regulating feeding behavior. The taste receptor type 2 (T2R) family of G protein-coupled receptors (GPCRs) is responsible for bitter taste perception, and its study is relevant to better understand the evolution of the sense of taste. Additionally, birds are considered good models for evolutionary studies due to their abundance, high species diversity and global distribution across varied ecological conditions. Phylogenetic reconstructions and selection analyses offer a good approach to understanding the evolutionary history and diversification of avian T2Rs. Additionally, comparative methodologies can assess the selective pressures acting on these genes. This work aims to assess the evolutionary genomics of the animal taste receptor type 2 (Tas2r) gene family in 245 bird species, distributed across 14 orders, and, through a set of bioinformatics and genomic tools, to clarify their genomic representation, selective pressures and phylogenetic relationships. The results reveal an acceleration of selective pressure on Tas2rs in the order Passeriformes. In addition, it was previously reported that diet has an influence on the Tas2r repertoire. Therefore, we studied the effect of additional ecological traits such as habitat and migration. Our results indicate that Tas2r genes are conserved in water birds and under stronger evolutionary pressure in non-migratory birds.
    This research was partially supported by the Strategic Funding UIDB/04423/2020 and UIDP/04423/2020 through national funds provided by the Fundação para a Ciência e a Tecnologia (FCT) and the European Regional Development Fund (ERDF) in the framework of the program PT2020, by the European Structural and Investment Funds (ESIF) through the Competitiveness and Internationalization Operational Program - COMPETE 2020 and by National Funds through the FCT under the project PTDC/AAG-GLO/6887/2014 (POCI-01-0124-FEDER-016845) and PTDC/CTA-AMB/31774/2017 (POCI-01-0145-FEDER/031774/2017).

    Identification, organisation and visualisation of complete proteomes in UniProt throughout all taxonomic ranks: archaea, bacteria, eukaryote and virus

    Users of uniprot.org want to be able to query, retrieve and download proteome sets for an organism of their choice. They expect the data to be easily accessed, complete and up to date based on current available knowledge. UniProt release 2012_01 (25th Jan 2012) contains the proteomes of 2,923 organisms; 50% of which are bacteria, 38% viruses, 8% eukaryota and 4% archaea. Note that the term 'organism' is used in a broad sense to include subspecies, strains and isolates. Each completely sequenced organism is processed as an independent organism, hence the availability of 38 strain-specific proteomes of Escherichia coli that are accessible for download. There is a project within UniProt dedicated to the mammoth task of maintaining the “Proteomes database”. This active resource is essential for UniProt to continually provide high quality proteome sets to the users. Accurate identification and incorporation of new, publicly available proteomes, as well as the maintenance of existing proteomes, permits sustained growth of the proteomes project. This is a huge, complicated and vital task accomplished by the activities of both curators and programmers. This thesis explains the data input and output of the proteomes database: the flow of genome project data from the nucleotide database into the proteomes database, then, for each genome, how a proteome is identified, augmented and made visible to uniprot.org users. Along this journey of discovery many issues arose: puzzles concerning data gathering, data integrity and data visualisation. All were resolved, and the outcome is a well-documented, actively maintained database that strives to provide optimal proteome information to its users.