
    On the String Consensus Problem and the Manhattan Sequence Consensus Problem

    In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$ integer sequences, each of length $l$, and we are to find an integer sequence $x$ of length $l$ (called a consensus sequence), such that the maximum Manhattan distance of $x$ from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string consensus problem (also called string center problem or closest string problem) is a special case of MSC. Our main result is a practically efficient $O(l)$-time algorithm solving MSC for $k \le 5$ sequences. Practicality of our algorithms has been verified experimentally. It improves upon the quadratic algorithm by Amir et al. (SPIRE 2012) for string consensus problem for $k = 5$ binary strings. Similarly as in Amir's algorithm we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, due to combinatorial properties of the MSC for $k \le 5$. We also show that for a general parameter $k$ any instance can be reduced in linear time to a kernel of size $k!$, so the problem is fixed-parameter tractable. Nevertheless, for $k \ge 4$ this is still too large for any naive solution to be feasible in practice. Comment: accepted to SPIRE 201
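To make the objective in this abstract concrete, the following is a minimal sketch of the MSC objective together with an exhaustive search that is only usable on tiny instances. It is not the linear-time algorithm of the paper, which the abstract does not spell out. The function names `manhattan`, `msc_radius`, and `brute_force_consensus` are hypothetical, chosen here for illustration.

```python
from itertools import product


def manhattan(a, b):
    """Manhattan distance between two equal-length integer sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def msc_radius(candidate, seqs):
    """Objective of MSC: the largest distance from the candidate to any input."""
    return max(manhattan(candidate, s) for s in seqs)


def brute_force_consensus(seqs):
    """Exhaustive search for a consensus sequence (illustration only).

    In each column the search is limited to [min, max] of the input values:
    moving a coordinate back into that range never increases its distance
    to any input, so some optimum lies inside the box.
    """
    ranges = [range(min(col), max(col) + 1) for col in zip(*seqs)]
    best_radius, best_seq = None, None
    for cand in product(*ranges):
        r = msc_radius(cand, seqs)
        if best_radius is None or r < best_radius:
            best_radius, best_seq = r, cand
    return best_radius, best_seq


if __name__ == "__main__":
    # Binary inputs: here Manhattan distance equals Hamming distance.
    print(brute_force_consensus([(0, 0, 0), (1, 1, 1)]))
```

The search space grows exponentially with the sequence length, which is why a column-based, linear-time method such as the one the abstract announces matters in practice.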

    Nearest constrained circular words

    In this paper, we study circular words arising in the development of equipment using shields in brachytherapy. This equipment has physical constraints that have to be taken into consideration. From an algorithmic point of view, the problem can be formulated as follows: given a circular word, find a constrained circular word of the same length such that the Manhattan distance between these two words is minimal. We show that we can solve this problem in pseudo-polynomial time (polynomial time in practice) using dynamic programming.

    DeepGene : gene finding based on upstream sequence data

    Genome annotation is the process of identifying functional elements along a genome. By correctly locating and finding the information stored within a sequence, knowledge about structural features and functional roles can be revealed. With the number of sequences doubling approximately every 18 months, there is a severe need for automatic annotation of genomes. Many annotation software tools are available today, but they produce far from perfect results. Here a new project, DeepGene, is presented. Using data from the RefSeq prokaryotic database we have started an effort to improve on the prokaryotic genome annotation process. This thesis presents the initial efforts of said improvement with a focus on discerning between coding and non-coding sequences using upstream sequence data from open reading frames. Using the 15 prokaryotic genomes available in the RefSeq database, upstream data was retrieved and processed into two datasets, which were then used to train several popular classification models. The performance of the models was compared with a standard annotation tool to create a general baseline for our model. The models created from the two datasets show many similarities in terms of metrics: the K-mer data had a mean precision of 0.22 and a mean recall of 0.74, and the sequential data had a mean precision of 0.30 and a mean recall of 0.77. Both datasets performed worse than our standard annotation software, which had a mean recall and precision of 0.83 and 0.82, respectively. As far as upstream sequences are concerned, the models managed to pull all the information available from both datasets. The initial results gave limited information in terms of classification and motif presence, indicating that other attributes surrounding the genome should be looked at for a possible improvement on the annotation problem.
An ideal step forward is to expand into a pipeline so that the complex false negative classifications may be explained.
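The abstract above mentions a K-mer dataset and reports precision and recall, without describing how either was computed. As a rough illustration only, the sketch below counts overlapping k-mers in a sequence and computes precision and recall from binary labels using the standard definitions. The function names, the choice of `k=3`, and the label encoding (1 for coding, 0 for non-coding) are assumptions made here, not details taken from the thesis.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a nucleotide sequence (k=3 is an assumed default)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(kmer_counts("ATGATG"))
    print(precision_recall([1, 1, 0, 0], [1, 0, 1, 1]))
```

A low precision paired with a fairly high recall, as reported for both datasets, corresponds to many false positives relative to true positives in the second function.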