55 research outputs found

Linear Time Construction of Indexable Founder Block Graphs

Author: Cazaux Bastien
Equi Massimo
Mäkinen Veli
Norri Tuukka
Tomescu Alexandru
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2020

Peer reviewed

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Author: Cazaux Bastien
Kosolobov Dmitry
Norri Tuukka
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b)=|{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over an earlier O(mn^2) bound. This improvement makes it possible to apply the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence on the practicality of the approach in this pan-genomic setting.
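The problem statement in this abstract can be made concrete with a small dynamic program. The sketch below is a naive brute force, not the O(mn) algorithm of the paper: it assumes the objective is to minimize the largest d(a,b) over the chosen segments (which matches the stated founder count max{d(a,b)}), and it recomputes each d(a,b) from scratch. The function name `min_segmentation` and its return format are hypothetical.

```python
def min_segmentation(R, L):
    """Naive sketch: partition [1, n] into segments of length >= L so that
    the largest number of distinct substrings in any segment is minimal.

    R is a list of equal-length strings; segments are returned as
    1-based inclusive pairs (a, b). Far slower than O(mn)."""
    n = len(R[0])
    if any(len(r) != n for r in R):
        raise ValueError("all strings must have the same length")
    if n < L:
        raise ValueError("no valid segmentation: n < L")

    INF = float("inf")
    # best[j]: smallest achievable max d over segmentations of the prefix [1, j]
    best = [INF] * (n + 1)
    # prev[j]: end position of the previous segment in an optimal solution
    prev = [-1] * (n + 1)
    best[0] = 0

    for j in range(L, n + 1):
        for i in range(0, j - L + 1):
            if best[i] == INF:
                continue  # prefix [1, i] cannot be segmented
            # d(i + 1, j): number of distinct substrings in columns i+1..j
            d = len({r[i:j] for r in R})
            cand = max(best[i], d)
            if cand < best[j]:
                best[j] = cand
                prev[j] = i

    # Walk back through prev to recover the segments.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i + 1, j))
        j = i
    segments.reverse()
    return best[n], segments
```

For example, `min_segmentation(["ACGT", "ACGA", "TCGT"], 2)` splits the columns into [1,2] and [3,4], each with two distinct substrings, whereas the single segment [1,4] would need three.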

Dagstuhl Research Online Publication Server

Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication venue: Schloss Dagstuhl Leibniz Center for Informatics
Publication date: 01/01/2018

Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Linear time minimum segmentation enables scalable founder reconstruction

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication venue: Springer Science and Business Media LLC
Publication date: 17/05/2019

Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1,n]$ into a set $P$ of disjoint segments such that each segment $[a,b] \in P$ has length at least $L$ and the number $d(a,b) = |\{R_i[a,b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a,b]$ is minimized over $[a,b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b) : [a,b] \in P\}$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ bound. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences
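To illustrate how founder blocks can be concatenated into founder sequences once a segmentation is known, here is a minimal sketch. It is an assumption-laden toy, not the method of the linked implementation: blocks are paired across segments simply by order of first appearance, and segments with fewer distinct blocks reuse their blocks cyclically. The name `founders_from_segmentation` is hypothetical.

```python
def founders_from_segmentation(R, segments):
    """Build max d(a, b) founder sequences from a given segmentation.

    R is a list of equal-length strings; segments is a list of 1-based
    inclusive (a, b) pairs that partition the columns of R.
    Blocks are joined in an arbitrary (first-seen) order."""
    blocks_per_segment = []
    for a, b in segments:
        # dict.fromkeys drops duplicates while keeping first-seen order
        blocks = list(dict.fromkeys(r[a - 1:b] for r in R))
        blocks_per_segment.append(blocks)

    # Number of founders: the largest count of distinct blocks in a segment.
    k = max(len(blocks) for blocks in blocks_per_segment)

    founders = []
    for row in range(k):
        # Segments with fewer than k blocks repeat their blocks cyclically.
        founders.append(
            "".join(blocks[row % len(blocks)] for blocks in blocks_per_segment)
        )
    return founders
```

With `R = ["ACGT", "ACGA", "TCGT"]` and segments `[(1, 2), (3, 4)]`, this yields the two founders `"ACGT"` and `"TCGA"`; each input string can then be spelled by switching between founders only at the boundary after column 2.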

Helsingin yliopiston digitaalinen arkisto

Linear time minimum segmentation enables scalable founder reconstruction

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication date: 01/01/2019

Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1,n]$ into a set $P$ of disjoint segments such that each segment $[a,b] \in P$ has length at least $L$ and the number $d(a,b) = |\{R_i[a,b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a,b]$ is minimized over $[a,b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b) : [a,b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries. Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ bound. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences. Peer reviewed

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Helsingin yliopiston digitaalinen arkisto

Founder reconstruction enables scalable and seamless pangenomic analysis

Author: Cazaux Bastien
Dönges Saska
Mäkinen Veli
Norri Tuukka
Valenzuela Daniel
Publication date: 15/12/2021

Motivation: Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration into existing workflows remains a challenge. Results: We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit, producing a founder multiple alignment, which is then indexed using a hybrid scheme that exploits general purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

Author: Daberdaku Sebastian
Ferrari Carlo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

Background: The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary considerably depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task. Results: In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI). Conclusions: The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class.

Directory of Open Access Journals

Archivio istituzionale della ricerca - Università di Padova

Lower and upper bounds for string matching in labelled graphs (Ala- ja ylärajoja merkkijonon etsinnälle verkosta)

Author: Equi Massimo
Publication venue: University of Helsinki Libraries
Publication date: 22/06/2022

String Matching in Labelled Graphs (SMLG) is a generalisation of the classic problem of finding a match for a string in a text. In SMLG, we are given a pattern string and a graph with node labels, and we want to find a path whose node labels match the pattern string. This problem has been studied since 1992, and it was initially intended to model the problem of finding a link in a hypertext. Recently, the problem has received attention due to its applications in bioinformatics, but all of the solutions, old and new, have failed to run in truly sub-quadratic time. In this work, based on four published papers, we study SMLG from different angles, first proving conditional lower bounds, and then proposing efficient algorithms for special classes of graphs. In the first paper, we unveil the reason behind the hardness of SMLG, showing a quadratic conditional lower bound based on the Orthogonal Vectors Hypothesis and the Strong Exponential Time Hypothesis. The techniques that we employ come from fine-grained complexity, and involve finding linear-time reductions from the Orthogonal Vectors problem to different variations of SMLG. In the second paper, we strengthen our findings by showing that an indexing data structure built in polynomial time is not enough to provide subquadratic time queries for SMLG. We devise a general framework for obtaining indexing lower bounds out of regular lower bounds, and we prove the indexing lower bound for SMLG as an application of this technique. In the third paper, we surpass the limitations of our lower bounds by identifying a class of graphs, called founder block graphs, which support linear time queries after subquadratic indexing. This class of graphs effectively represents collections of strings called multiple sequence alignments, if gap characters are not present. In the fourth paper, we significantly improve our previous results on efficiently indexable graphs. We propose elastic founder graphs, a superset of founder block graphs, that are able to represent multiple sequence alignments with gaps. Moreover, we propose algorithms for constructing elastic founder graphs, indexing them, and performing queries in linear time.
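As a concrete picture of the SMLG problem described in this abstract, the following is a minimal memoized brute-force sketch. It is not one of the thesis's algorithms and makes no attempt at the indexing results it describes. It assumes node labels are non-empty strings and that a match may start and end in the middle of a label; the name `occurs_in_graph` and the dictionary-based graph encoding are hypothetical.

```python
from functools import lru_cache


def occurs_in_graph(labels, edges, pattern):
    """Return True if pattern is a substring of the concatenated labels
    of some path in the graph.

    labels: dict mapping node -> non-empty label string
    edges:  dict mapping node -> list of successor nodes"""
    if not pattern:
        return True

    @lru_cache(maxsize=None)
    def match_from(node, offset, k):
        # Try to match pattern[k:] against labels[node][offset:] and onwards.
        label = labels[node]
        span = min(len(label) - offset, len(pattern) - k)
        if label[offset:offset + span] != pattern[k:k + span]:
            return False
        k += span
        if k == len(pattern):
            return True  # whole pattern consumed
        # Label exhausted: continue at the start of any successor's label.
        return any(match_from(nxt, 0, k) for nxt in edges.get(node, ()))

    # Try every starting position inside every node label.
    return any(
        match_from(node, offset, 0)
        for node in labels
        for offset in range(len(labels[node]))
    )
```

For instance, with `labels = {1: "AC", 2: "GT", 3: "GA"}` and `edges = {1: [2, 3]}`, the pattern `"CGA"` is found along the path 1 to 3, while `"TG"` is not found because node 2 has no outgoing edges.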

Helsingin yliopiston digitaalinen arkisto