Beyond Adjacency Maximization: Scaffold Filling for New String Distances
In Genomic Scaffold Filling, one aims at polishing in silico a draft genome, called a scaffold. The scaffold is given in the form of an ordered set of gene sequences, called contigs. This is done by comparing the scaffold with an already complete reference genome from a close species. More precisely, given a scaffold S, a reference genome G and a score function f() between two genomes, the aim is to complete S by adding the missing genes from G so that the obtained complete genome S* optimizes f(S*, G). In this paper, we extend a model of Jiang et al. [CPM 2016] (i) by allowing the insertion of strings instead of single characters (i.e., some groups of genes may be forced to be inserted together) and (ii) by considering two alternative score functions: the first generalizes the notion of common adjacencies by maximizing the number of common k-mers between S* and G (k-Mer Scaffold Filling), the second aims at minimizing the number of breakpoints between S* and G (Min-Breakpoint Scaffold Filling). We study these problems from the parameterized complexity point of view, providing fixed-parameter tractable (FPT) algorithms for both problems. In particular, we show that k-Mer Scaffold Filling is FPT with respect to the number of additional k-mers realized by the completion of S; this answers an open question of Jiang et al. [CPM 2016]. We also show that Min-Breakpoint Scaffold Filling is FPT with respect to a parameter combining the number of missing genes, the number of gene repetitions and the target distance.
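The k-mer score described above can be illustrated with a minimal sketch. It assumes that a genome is a sequence of gene labels and that "common k-mers" means the multiset intersection of the length-k windows of the two genomes; the paper's exact definition (for example, its handling of gene orientation) may differ. The function names `kmers` and `common_kmers` are hypothetical.

```python
from collections import Counter

def kmers(genome, k):
    """Multiset of all length-k windows of a genome (assumed definition)."""
    return Counter(tuple(genome[i:i + k]) for i in range(len(genome) - k + 1))

def common_kmers(s, g, k):
    """Size of the multiset intersection of the k-mers of s and g."""
    return sum((kmers(s, k) & kmers(g, k)).values())

reference = "abcde"   # complete genome G
scaffold = "abde"     # draft genome S, missing gene "c"
filled = "abcde"      # one possible completion S*

print(common_kmers(scaffold, reference, 2))
print(common_kmers(filled, reference, 2))
```

Under these assumptions, the number of additional k-mers realized by a completion is the difference between the two printed scores.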
Heuristic algorithms for the Longest Filled Common Subsequence Problem
At CPM 2017, Castelli et al. define and study a new variant of the Longest
Common Subsequence Problem, termed the Longest Filled Common Subsequence
Problem (LFCS). For the LFCS problem, the input consists of two strings A and
B and a multiset of characters M. The goal is to insert the
characters from M into the string B, thus obtaining a new string
B*, such that the Longest Common Subsequence (LCS) between A and B* is
maximized. Castelli et al. show that the problem is NP-hard and provide a
3/5-approximation algorithm for the problem.
In this paper we study the problem from the experimental point of view. We
introduce, implement and test new heuristic algorithms and compare them with
the approximation algorithm of Castelli et al. Moreover, we introduce an Integer
Linear Program (ILP) model for the problem and we use the state-of-the-art ILP
solver, Gurobi, to obtain exact solutions for moderately sized instances.
Comment: Accepted and presented as a proceedings paper at SYNASC 201
Genomic Scaffold Filling Revisited
The genomic scaffold filling problem has attracted a lot of attention recently. The problem is to fill an incomplete sequence (scaffold) I into I', with respect to a complete reference genome G, such that the number of adjacencies between G and I' is maximized. The problem is NP-complete and APX-hard, and admits a 1.2-approximation. However, the sequence input I is not quite practical and does not fit most real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case when (1) a scaffold S is given, the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is to maximize the number of adjacencies between G and the filled S', and (2) a scaffold S is given, a subset X' of the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is still to maximize the number of adjacencies between G and the filled S''. For problem (1), we present a simple NP-completeness proof, we then present a factor-2 greedy approximation algorithm, and finally we show that the problem is FPT when each gene appears at most d times in G. For problem (2), we prove that the problem is W[1]-hard and then we present a factor-2 FPT-approximation for the case when each gene appears at most d times in G.
The Longest Filled Common Subsequence Problem
Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e. it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences A and B, and a multiset M of symbols missing in B, asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. First, we investigate the computational and approximation complexity of the problem and we show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol. Then, we give a 3/5-approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in B that "match" symbols of A.
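The problem statement above can be made concrete with a small exhaustive sketch. This is not the approximation or fixed-parameter algorithm from the paper; it simply tries every way of inserting all symbols of M into B and keeps the best LCS with A, so it is only usable on tiny inputs. The function names are hypothetical.

```python
from itertools import permutations

def lcs_length(a, b):
    """Classic dynamic program for the length of a longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def interleavings(b, m):
    """Yield every string that preserves the order of b and the order of m."""
    if not b or not m:
        yield b + m
        return
    for rest in interleavings(b[1:], m):
        yield b[0] + rest
    for rest in interleavings(b, m[1:]):
        yield m[0] + rest

def lfcs_brute_force(a, b, missing):
    """Return (best LCS length, one optimal filled string B*) by exhaustive search."""
    best = (-1, None)
    for perm in set(permutations(missing)):      # every ordering of the multiset M
        for candidate in interleavings(b, "".join(perm)):
            best = max(best, (lcs_length(a, candidate), candidate))
    return best

print(lfcs_brute_force("abcd", "ad", "bc"))
```

The search space grows exponentially with the sizes of B and M, which is consistent with the NP-hardness result and motivates the approximation and parameterized algorithms.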
Size-Dependent Tile Self-Assembly: Constant-Height Rectangles and Stability
We introduce a new model of algorithmic tile self-assembly called
size-dependent assembly. In previous models, supertiles are stable when the
total strength of the bonds between any two halves exceeds some constant
temperature. In this model, this constant temperature requirement is replaced
by a nondecreasing temperature function that depends on the size of the smaller of the two halves. This
generalization allows supertiles to become unstable and break apart, and
captures the increased forces that large structures may place on the bonds
holding them together.
We demonstrate the power of this model in two ways. First, we give fixed tile
sets that assemble constant-height rectangles and squares of arbitrary input
size given an appropriate temperature function. Second, we prove that deciding
whether a supertile is stable is coNP-complete. Both results contrast with
known results for fixed temperature.
Comment: In proceedings of ISAAC 201
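The stability condition of the size-dependent model can be sketched as a brute-force check. The sketch makes several assumptions that are not stated in the abstract: a supertile is modelled as a graph whose edges carry bond strengths, every split of the tiles into two nonempty parts counts as "two halves", and a cut is acceptable when its strength is at least the temperature (swap the comparison for a strict reading of "exceeds"). The name `is_stable` is hypothetical.

```python
from itertools import combinations

def is_stable(tiles, bonds, temperature):
    """Check every bipartition of the tiles against a size-dependent temperature.

    tiles: iterable of tile identifiers
    bonds: dict mapping (tile_u, tile_v) pairs to bond strength
    temperature: nondecreasing function of the size of the smaller half
    """
    tiles = list(tiles)
    # The smaller half of any bipartition has between 1 and n // 2 tiles.
    for size in range(1, len(tiles) // 2 + 1):
        for part in combinations(tiles, size):
            part = set(part)
            # Total strength of bonds with exactly one endpoint in the part.
            cut = sum(s for (u, v), s in bonds.items() if (u in part) != (v in part))
            if cut < temperature(size):
                return False
    return True

# A 1x4 row of tiles joined by strength-2 bonds.
row = {("a", "b"): 2, ("b", "c"): 2, ("c", "d"): 2}
print(is_stable("abcd", row, lambda n: 2))      # constant temperature
print(is_stable("abcd", row, lambda n: n + 1))  # temperature grows with size
```

The check enumerates exponentially many bipartitions, which matches the flavour of the coNP-completeness result: an unstable supertile has a short certificate (one bad cut), but no efficient general test is expected.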
OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees
10.1186/s13059-016-0951-y, Genome Biology, 17110
Bayesian multi-objective optimisation with mixed analytical and black-box functions: application to tissue engineering
Tissue engineering and regenerative medicine look at improving or restoring biological tissue function in humans and animals. We consider optimising neotissue growth in a three-dimensional scaffold during dynamic perfusion bioreactor culture, in the context of bone tissue engineering. The goal is to choose design variables that optimise two conflicting objectives: (i) maximising neotissue growth and (ii) minimising operating cost. We make novel extensions to Bayesian multi-objective optimisation in the case of one analytical objective function and one black-box, i.e. simulation-based, objective function. The analytical objective represents operating cost while the black-box neotissue growth objective comes from simulating a system of partial differential equations. The resulting multi-objective optimisation method determines the trade-off in the variables between neotissue growth and operating cost. Our method outperforms the most common approach in the literature, genetic algorithms, in terms of data efficiency, on both the tissue engineering example and standard test functions. The resulting method is highly applicable to real-world problems combining black-box models with easy-to-quantify objectives like cost.
Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons
In just the last decade, a multitude of bio-technologies and software
pipelines have emerged to revolutionize genomics. To further their central
goal, they aim to accelerate and improve the quality of de novo whole-genome
assembly starting from short DNA reads. However, the performance of each of
these tools is contingent on the length and quality of the sequencing data, the
structure and complexity of the genome sequence, and the resolution and quality
of long-range information. Furthermore, in the absence of any metric that
captures the most fundamental "features" of a high-quality assembly, there is
no obvious recipe for users to select the most desirable assembler/assembly.
International competitions such as Assemblathons or GAGE tried to identify the
best assembler(s) and their features. Somewhat circuitously, the only
available approach to gauge de novo assemblies and assemblers relies solely on
the availability of a high-quality fully assembled reference genome sequence.
Still worse, reference-guided evaluations are often difficult to analyze,
leading to conclusions that are difficult to interpret. In this paper, we
circumvent many of these issues by relying upon a tool, dubbed FRCbam, which is
capable of evaluating de novo assemblies from the read-layouts even when no
reference exists. We extend the FRCurve approach to cases where layout
information may have been obscured, as is true in many de Bruijn-graph-based
algorithms. As a by-product, FRCurve now expands its applicability to a much
wider class of assemblers -- thus, identifying higher-quality members of this
group, their inter-relations as well as sensitivity to carefully selected
features, with or without the support of a reference sequence or layout for the
reads. The paper concludes by reevaluating several recently conducted assembly
competitions and the datasets that have resulted from them.
Comment: Submitted to PLoS One. Supplementary material available at
http://www.nada.kth.se/~vezzi/publications/supplementary.pdf and
http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pd