2 research outputs found

    Graph algorithms for bioinformatics

    Biological data are inherently interconnected: protein sequences are connected to their annotations, the annotations are structured into ontologies, and so on. While protein-protein interactions are already represented by graphs, in this work I present how a graph structure can be used to enrich the annotation of protein sequences by means of algorithms that analyze the graph topology. I also describe a novel solution that restricts the data generation needed to build such a graph, using constraints on the data and dynamic programming. The proposed algorithm ideally improves the generation time by a factor of 5. The graph representation is then exploited to build a comprehensive database, using the emerging technology of graph databases. While graph databases are widely used for other kinds of data, from Twitter tweets to recommendation systems, their application to bioinformatics is new. A graph database is proposed, with a structure that can be easily expanded and queried.

    Parameterized and Safe & Complete Graph Algorithms for Bioinformatics

    Given their versatility, graphs are nowadays a popular choice in Bioinformatics to model data, as in the case of pan-genomics, and to model problems, as in the case of multi-assembly. On the one hand, the increasing amount of data forming the pan-genome constrains the running time of solutions to problems posed on these data. On the other hand, there is a lack of theoretical tools intended to improve the quality of multi-assembly solutions, in contrast to their successful use in the classical genome-assembly problem. In this thesis, we develop faster and more sophisticated solutions to graph problems used in Bioinformatics. We obtain our results through the lens of parameterized algorithms and safe & complete algorithms. In the first two papers, we propose the first parameterized linear-time solutions for the problems of maximum antichain and minimum path cover. The algorithms use the width of the graph as the parameter, which has been observed to be small in pan-genomics. As such, for constant values of this parameter the running time of our solutions is optimal. In the last two papers, we apply the safe & complete framework to problems whose solution corresponds to a set of paths. Specifically, we present efficient safe & complete algorithms for the problems of path cover and flow decomposition, and provide proof-of-concept implementations showing the quality improvement obtained by our approach.
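
    The abstract names the width of a graph as its parameter without defining it. As a minimal sketch, assuming the usual definition for a directed acyclic graph (the size of a minimum path cover in which paths may share nodes, which by Dilworth's theorem equals the size of a maximum antichain), the width can be computed with the classical textbook reduction to bipartite matching on the transitive closure. This is not the parameterized linear-time algorithm of the thesis, and the function name `dag_width` is hypothetical.

```python
from collections import defaultdict


def dag_width(n, edges):
    """Width of a DAG on nodes 0..n-1 (assumed acyclic, not checked).

    Classical approach: width = n - (maximum matching in the bipartite
    graph built from the transitive closure). Illustrative only; it is
    far slower than a parameterized linear-time method.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    # Transitive closure: reach[u] is the set of nodes reachable from u.
    reach = []
    for s in range(n):
        seen = set()
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(seen)

    # Maximum bipartite matching with augmenting paths (Kuhn's algorithm).
    # match_right[v] is the left node currently matched to right node v.
    match_right = [-1] * n

    def try_augment(u, visited):
        for v in reach[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching


# Two sources and two sinks joined through one shared middle node:
# two paths that both pass through node 2 cover every node.
hub_edges = [(0, 2), (1, 2), (2, 3), (2, 4)]
hub_width = dag_width(5, hub_edges)
```

    In the hub example, a cover by node-disjoint paths would need three paths, while allowing shared nodes needs only two, which matches the maximum antichains {0, 1} and {3, 4}.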