
On the String Consensus Problem and the Manhattan Sequence Consensus Problem

Author: A. Amir
A. Frank
B. Ma
C. Boucher
G.D. Cohen
H.W. Lenstra Jr.
J. Gramm
J.J. Sylvester
K. Fischer
M. Frances
R. Kannan
R.L. Graham
Publication date: 01/01/2014

In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$ integer sequences, each of length $l$, and we are to find an integer sequence $x$ of length $l$ (called a consensus sequence) such that the maximum Manhattan distance of $x$ from each of the input sequences is minimized. For binary sequences, Manhattan distance coincides with Hamming distance; hence, in this case, the string consensus problem (also called the string center problem or the closest string problem) is a special case of MSC. Our main result is a practically efficient $O(l)$-time algorithm solving MSC for $k \le 5$ sequences. The practicality of our algorithms has been verified experimentally. The result improves upon the quadratic algorithm by Amir et al. (SPIRE 2012) for the string consensus problem for $k = 5$ binary strings. As in Amir's algorithm, we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, which arise from combinatorial properties of MSC for $k \le 5$. We also show that, for a general parameter $k$, any instance can be reduced in linear time to a kernel of size $k!$, so the problem is fixed-parameter tractable. Nevertheless, for $k \ge 4$ this is still too large for any naive solution to be feasible in practice.

Comment: accepted to SPIRE 201
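
To make the objective in the abstract concrete, the following is a minimal Python sketch of the MSC cost function together with an exhaustive search for tiny inputs. It is not the linear-time algorithm from the paper; the function names (`manhattan`, `msc_cost`, `brute_force_consensus`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length integer sequences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def msc_cost(x, seqs):
    """MSC objective: the largest Manhattan distance from x to any input sequence."""
    return max(manhattan(x, s) for s in seqs)


def brute_force_consensus(seqs):
    """Exhaustive search for a consensus sequence (illustration only).

    Each coordinate is searched between the minimum and maximum of its column:
    clamping a coordinate into that range never increases any distance.
    The running time is exponential in the sequence length.
    """
    columns = list(zip(*seqs))
    ranges = [range(min(c), max(c) + 1) for c in columns]
    best = min(product(*ranges), key=lambda x: msc_cost(x, seqs))
    return list(best), msc_cost(best, seqs)


if __name__ == "__main__":
    binary_inputs = [[0, 0, 0], [1, 1, 0], [0, 1, 1]]
    print(brute_force_consensus(binary_inputs))
```

For binary inputs such as the ones in the usage example, `manhattan` counts the positions where two sequences differ, which is exactly the Hamming distance mentioned in the abstract.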

arXiv.org e-Print Archive

Crossref

Nearest constrained circular words

Author: Blin Guillaume
Gasparoux Marie
Hamel Sylvie
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Annual Symposium on Combinatorial Pattern Matching (CPM 2018)
Publication date: 01/01/2018

In this paper, we study circular words arising in the development of equipment using shields in brachytherapy. This equipment has physical constraints that have to be taken into consideration. From an algorithmic point of view, the problem can be formulated as follows: given a circular word, find a constrained circular word of the same length such that the Manhattan distance between these two words is minimal. We show that this problem can be solved in pseudo-polynomial time (polynomial time in practice) using dynamic programming.
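
The abstract does not state the constraints, so the sketch below assumes a purely hypothetical one: letters are integers in `0..max_value`, and cyclically adjacent letters may differ by at most `max_step`. Under that assumption, a dynamic program that fixes the first letter to break the cycle finds a nearest constrained circular word in time polynomial in the word length and in `max_value`, i.e. pseudo-polynomial. This illustrates the general technique and is not the authors' algorithm.

```python
def nearest_constrained_circular(word, max_value, max_step):
    """Return (cost, best_word) under the ASSUMED constraint described above."""
    n = len(word)
    alphabet = range(max_value + 1)
    inf = float("inf")
    best_cost, best_word = inf, None

    for first in alphabet:  # fix the first letter to break the cycle
        # cost[v]: cheapest constrained prefix that ends in letter v
        cost = {v: (abs(word[0] - first) if v == first else inf) for v in alphabet}
        back = []  # back[i - 1][v]: letter chosen at position i - 1 when position i is v
        for i in range(1, n):
            new_cost, choice = {}, {}
            for v in alphabet:
                # cheapest compatible predecessor letter (u == v always qualifies)
                prev = min(
                    (u for u in alphabet if abs(u - v) <= max_step),
                    key=lambda u: cost[u],
                )
                new_cost[v] = cost[prev] + abs(word[i] - v)
                choice[v] = prev
            cost = new_cost
            back.append(choice)

        # close the cycle: the last letter must be compatible with the first
        for last in alphabet:
            if abs(last - first) <= max_step and cost[last] < best_cost:
                best_cost = cost[last]
                seq = [last]
                for choice in reversed(back):
                    seq.append(choice[seq[-1]])
                best_word = seq[::-1]

    return best_cost, best_word


if __name__ == "__main__":
    print(nearest_constrained_circular([0, 5, 0, 0], max_value=5, max_step=1))
```

Fixing the first letter costs an extra factor of `max_value + 1` but keeps the recurrence a plain left-to-right pass, which is a common way to handle circular inputs.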

Dagstuhl Research Online Publication Server

DeepGene : gene finding based on upstream sequence data

Author: Almestrand Trude Haug
Publication venue: Norwegian University of Life Sciences, Ås
Publication date: 01/01/2022

Genome annotation is the process of identifying functional elements along a genome. By correctly locating and finding the information stored within a sequence, knowledge about structural features and functional roles can be revealed. With the number of sequences doubling approximately every 18 months, there is a severe need for automatic annotation of genomes. Many different annotation software tools are available today; however, they produce far from perfect results. Here a new project, DeepGene, is presented. Using data from the RefSeq prokaryotic database, we have started an effort to improve the prokaryotic genome annotation process. This thesis presents the initial efforts of that improvement, with a focus on discerning between coding and non-coding sequences using upstream sequence data from open reading frames. Using the 15 prokaryotic genomes available in the RefSeq database, upstream data were retrieved and processed into two datasets, on which several popular classification models were then trained. The performance of the models was compared with a standard annotation tool to create a general baseline for our model. The models created from the two datasets show many similarities in terms of metrics: the K-mer data gave a mean precision of 0.22 and a mean recall of 0.74, and the sequential data gave a mean precision of 0.30 and a mean recall of 0.77. Both datasets performed worse than our standard annotation software, which had a mean recall of 0.83 and a mean precision of 0.82. As far as upstream sequences are concerned, the models managed to extract all the information available from both datasets. The initial results gave limited information in terms of classification and motif presence, indicating that other attributes surrounding the genome should be examined for a possible improvement on the annotation problem.
An ideal step forward is to expand the models into a pipeline so that the complex false negative classifications may be explained.
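
As a reading aid for the metrics quoted in the abstract, here is a minimal Python sketch of k-mer counting on a sequence and of precision and recall for binary labels. The thesis abstract does not describe how its K-mer dataset was built, so `kmer_counts` is only an assumed, generic featurization; both function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq (assumed featurization)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(kmer_counts("ATGATG", 3))
    print(precision_recall([1, 1, 0, 0], [1, 0, 1, 1]))
```

A low precision combined with a fairly high recall, as reported for both datasets, means that most true coding sequences are found but many non-coding sequences are also labeled as coding.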

Brage NMBU