Search CORE

3,487 research outputs found

Beyond Adjacency Maximization: Scaffold Filling for New String Distances

Author: Bulteau Laurent
Fertin Guillaume
Komusiewicz Christian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)
Publication date: 01/01/2017
Field of study

International audienceIn Genomic Scaffold Filling, one aims at polishing in silico a draft genome, called scaffold. The scaffold is given in the form of an ordered set of gene sequences, called contigs. This is done by confronting the scaffold to an already complete reference genome from a close species. More precisely, given a scaffold S, a reference genome G and a score function f () between two genomes, the aim is to complete S by adding the missing genes from G so that the obtained complete genome S * optimizes f (S * , G). In this paper, we extend a model of Jiang et al. [CPM 2016] (i) by allowing the insertions of strings instead of single characters (i.e., some groups of genes may be forced to be inserted together) and (ii) by considering two alternative score functions: the first generalizes the notion of common adjacencies by maximizing the number of common k-mers between S * and G (k-Mer Scaffold Filling), the second aims at minimizing the number of breakpoints between S * and G (Min-Breakpoint Scaffold Filling). We study these problems from the parameterized complexity point of view, providing fixed-parameter (FPT) algorithms for both problems. In particular, we show that k-Mer Scaffold Filling is FPT wrt. parameter , the number of additional k-mers realized by the completion of S—this answers an open question of Jiang et al. [CPM 2016]. We also show that Min-Breakpoint Scaffold Filling is FPT wrt. a parameter combining the number of missing genes, the number of gene repetitions and the target distance

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Heuristic algorithms for the Longest Filled Common Subsequence Problem

Author: Mincu Radu Stefan
Popa Alexandru
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/04/2019
Field of study

At CPM 2017, Castelli et al. define and study a new variant of the Longest Common Subsequence Problem, termed the Longest Filled Common Subsequence Problem (LFCS). For the LFCS problem, the input consists of two strings

A

and

B

and a multiset of characters

\mathcal{M}

. The goal is to insert the characters from

\mathcal{M}

into the string

B

, thus obtaining a new string

B^*

, such that the Longest Common Subsequence (LCS) between

A

and

B^*

is maximized. Casteli et al. show that the problem is NP-hard and provide a 3/5-approximation algorithm for the problem. In this paper we study the problem from the experimental point of view. We introduce, implement and test new heuristic algorithms and compare them with the approximation algorithm of Casteli et al. Moreover, we introduce an Integer Linear Program (ILP) model for the problem and we use the state of the art ILP solver, Gurobi, to obtain exact solution for moderate sized instances.Comment: Accepted and presented as a proceedings paper at SYNASC 201

arXiv.org e-Print Archive

Crossref

Genomic Scaffold Filling Revisited

Author: Fan Chenglin
Jiang Haitao
Yang Boting
Zhong Farong
Zhu Binhai
Zhu Daming
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)
Publication date: 01/01/2016
Field of study

The genomic scaffold filling problem has attracted a lot of attention recently. The problem is on filling an incomplete sequence (scaffold) I into I\u27, with respect to a complete reference genome G, such that the number of adjacencies between G and I\u27 is maximized. The problem is NP-complete and APX-hard, and admits a 1.2-approximation. However, the sequence input I is not quite practical and does not fit most of the real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case when, (1) a scaffold S is given, the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is to maximize the number of adjacencies between G and the filled S\u27 and (2) a scaffold S is given, a subset of the missing genes X\u27 subset X = c(G) - c(S) can only be inserted in between the contigs, and the objective is still to maximize the number of adjacencies between G and the filled S\u27\u27. For problem (1), we present a simple NP-completeness proof, we then present a factor-2 greedy approximation algorithm, and finally we show that the problem is FPT when each gene appears at most d times in G. For problem (2), we prove that the problem is W[1]-hard and then we present a factor-2 FPT-approximation for the case when each gene appears at most d times in G

Dagstuhl Research Online Publication Server

The Longest Filled Common Subsequence Problem

Author: Castelli Mauro
Dondi Riccardo
Mauri Giancarlo
Zoppis Italo
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)
Publication date: 01/01/2017
Field of study

Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e. it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences A and B, and a multiset M of symbols missing in B, asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. First, we investigate the computational and approximation complexity of the problem and we show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol. Then, we give a 3/5 approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in B that "match" symbols of A

Repositório da Universidade Nova de Lisboa

Dagstuhl Research Online Publication Server

Size-Dependent Tile Self-Assembly: Constant-Height Rectangles and Stability

Author: Andrew Winslow
Robert T. Schweller
Sándor P. Fekete
Tu Braunschweig
Publication venue
Publication date: 01/01/2015
Field of study

We introduce a new model of algorithmic tile self-assembly called size-dependent assembly. In previous models, supertiles are stable when the total strength of the bonds between any two halves exceeds some constant temperature. In this model, this constant temperature requirement is replaced by an nondecreasing temperature function

\tau : \mathbb{N} \rightarrow \mathbb{N}

that depends on the size of the smaller of the two halves. This generalization allows supertiles to become unstable and break apart, and captures the increased forces that large structures may place on the bonds holding them together. We demonstrate the power of this model in two ways. First, we give fixed tile sets that assemble constant-height rectangles and squares of arbitrary input size given an appropriate temperature function. Second, we prove that deciding whether a supertile is stable is coNP-complete. Both results contrast with known results for fixed temperature.Comment: In proceedings of ISAAC 201

arXiv.org e-Print Archive

CiteSeerX

DI-fusion

OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees

Author: Burton K. H. Chia
Denis Bertrand
Niranjan Nagarajan
Song Gao
Publication venue: Springer Nature
Publication date: 01/01/2016
Field of study

10.1186/s13059-016-0951-yGenome Biology17110

Springer - Publisher Connector

PubMed Central

ScholarBank@NUS

Bayesian multi-objective optimisation with mixed analytical and black-box functions: application to tissue engineering

Author: Calandra R
Deisenroth MP
Geris L
Mehrian M
Misener R
Olofsson SCW
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

Tissue engineering and regenerative medicine looks at improving or restoring biological tissue function in humans and animals. We consider optimising neotissue growth in a three-dimensional scaffold during dynamic perfusion bioreactor culture, in the context of bone tissue engineering. The goal is to choose design variables that optimise two conflicting objectives: (i) maximising neotissue growth and (ii) minimising operating cost. We make novel extensions to Bayesian multi-objective optimisation in the case of one analytical objective function and one black-box, i.e. simulation-based, objective function. The analytical objective represents operating cost while the black-box neotissue growth objective comes from simulating a system of partial differential equations. The resulting multi-objective optimisation method determines the trade-off in the variables between neotissue growth and operating cost. Our method outperforms the most common approach in literature, genetic algorithms, in terms of data efficiency, on both the tissue engineering example and standard test functions. The resulting method is highly applicable to real-world problems combining black-box models with easy-to-quantify objectives like cost

UCL Discovery

Spiral - Imperial College Digital Repository

Open Repository and Bibliography - Liège

Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

Author: Mishra Bud
Narzisi Giuseppe
Vezzi Francesco
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 03/10/2012
Field of study

In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. International competitions such as Assemblathons or GAGE tried to identify the best assembler(s) and their features. Some what circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often both difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCbam, which is capable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extend the FRCurve approach to cases where lay-out information may have been obscured, as is true in many deBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers -- thus, identifying higher-quality members of this group, their inter-relations as well as sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.Comment: Submitted to PLoS One. Supplementary material available at http://www.nada.kth.se/~vezzi/publications/supplementary.pdf and http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pd

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

FigShare