Phylogenetic reconstruction from transpositions
Background: Because of the advent of high-throughput sequencing and the consequent reduction in the cost of sequencing, many organisms have been completely sequenced and most of their genes identified. It has thus become possible to represent whole genomes as ordered lists of gene identifiers and to study the rearrangement of these entities through computational means. As a result, genome rearrangement data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. The main events of genome rearrangements include inversions, transpositions and transversions. To date, GRAPPA and MGR are the most accurate methods for rearrangement phylogeny, both assuming inversion as the only event. However, due to the complexity of computing transposition distance, it is very difficult to analyze datasets when transpositions are dominant.
Results: We extend GRAPPA to handle transpositions. The new method is named GRAPPA-TP, with two major extensions: a heuristic method to estimate transposition distance, and a new transposition median solver for three genomes. Although GRAPPA-TP uses a greedy approach to compute the transposition distance, it is very accurate when genomes are relatively close. The new GRAPPA-TP is available from http://phylo.cse.sc.edu/
Conclusion: Our extensive testing using simulated datasets shows that GRAPPA-TP is very accurate in terms of ancestor genome inference and phylogenetic reconstruction. Simulation results also suggest that model match is critical in genome rearrangement analysis: it is not accurate to simulate transpositions with other events including inversions.
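To make the two event types concrete, here is a minimal Python sketch of an inversion and a transposition acting on a signed gene order, together with a simple breakpoint count. It is an illustration only and is not taken from GRAPPA-TP: the function names, the 0-based slice indices, and the convention of framing a linear genome with caps 0 and n+1 are all assumptions made for this example.

```python
def inversion(order, i, j):
    """Reverse the segment order[i:j] and flip the sign of every gene in it."""
    return order[:i] + [-g for g in reversed(order[i:j])] + order[j:]


def transposition(order, i, j, k):
    """Exchange the adjacent segments order[i:j] and order[j:k] (i < j < k)."""
    return order[:i] + order[j:k] + order[i:j] + order[k:]


def adjacencies(order):
    """List consecutive gene pairs of a linear order framed by caps 0 and n+1."""
    framed = [0] + list(order) + [len(order) + 1]
    return list(zip(framed, framed[1:]))


def breakpoints(order, reference):
    """Count adjacencies of `order` that do not occur in `reference`.

    An adjacency (a, b) is treated as equal to its reverse complement (-b, -a).
    Genes are assumed to be labelled 1..n (hypothetical convention).
    """
    ref = set(adjacencies(reference))
    ref |= {(-b, -a) for a, b in ref}
    return sum(1 for adj in adjacencies(order) if adj not in ref)


identity = [1, 2, 3, 4, 5, 6]
after_inversion = inversion(identity, 1, 3)             # [1, -3, -2, 4, 5, 6]
after_transposition = transposition(identity, 1, 3, 5)  # [1, 4, 5, 2, 3, 6]
print(breakpoints(after_inversion, identity))           # 2
print(breakpoints(after_transposition, identity))       # 3
```

In this example a single inversion creates two breakpoints while a single transposition creates three, which gives some intuition for why the two events are not interchangeable in a rearrangement model.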
Algorithmic and combinatorial study of the Kemeny-Young method and of ranking consensus (Étude algorithmique et combinatoire de la méthode de Kemeny-Young et du consensus de classements)
A permutation is a list that orders objects or candidates according to a preference or
a criterion. Some examples include results from a search engine on the internet, athlete
rankings, lists of genes related to a disease given by prediction methods or simply the
preference of activities for the next weekend. One might be interested in aggregating a set
of permutations to get a consensus permutation. This problem is well known in political
science and many methods exist that can aggregate permutations, each one having its own
mathematical properties. Among those methods, the Kemeny-Young method, also known
as the median of permutations, finds a consensus that minimizes the sum of distances
between that consensus and the set of permutations. This method holds many desirable
properties. On the other hand, this method is difficult to compute, thus opening the way
for numerous research works. A generalization of this problem considers rankings containing ties
between the ranked objects and rankings that might be incomplete by considering only
a subset of objects. In this thesis, we study the Kemeny-Young method under different
aspects:
— First, a search space reduction technique is proposed. It improves the running
time of exact approaches for the problem.
— Second, a well-parameterized heuristic is developed and used to guide
an exact branch-and-bound algorithm. This algorithm also uses a new search space
reduction technique.
— Third, the special case of the problem on three permutations is investigated. A
search space reduction technique based on graphs is presented for this case, followed
by a very tight lower bound. Two conjectures are stated that link this case
with the 3-Hitting Set problem.
— Finally, a generalization of the problem is proposed that allows us to extend our
work on search space reduction techniques to the rank aggregation problem.
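The definition of the median of permutations can be illustrated with a brute-force sketch in Python. It assumes the Kendall tau distance (the number of pairs ordered differently), which is the distance the Kemeny-Young method is built on, and it enumerates every candidate ordering, so it is only practical for a handful of objects. The function names are hypothetical, and this is not one of the thesis's algorithms, which aim precisely at avoiding such exhaustive search.

```python
from itertools import combinations, permutations


def kendall_tau(p, q):
    """Number of object pairs that p and q rank in opposite order."""
    pos_p = {x: i for i, x in enumerate(p)}
    pos_q = {x: i for i, x in enumerate(q)}
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def kemeny_medians(rankings):
    """Return all orderings minimizing the summed Kendall tau distance.

    Exhaustive search over all n! orderings; assumes every ranking is a
    complete permutation of the same objects, without ties.
    """
    objects = sorted(rankings[0])
    best, best_score = [], None
    for candidate in permutations(objects):
        score = sum(kendall_tau(candidate, r) for r in rankings)
        if best_score is None or score < best_score:
            best, best_score = [candidate], score
        elif score == best_score:
            best.append(candidate)
    return best, best_score


votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
medians, score = kemeny_medians(votes)
print(medians, score)  # [('a', 'b', 'c')] 2
```

The function returns a list because the median need not be unique; the factorial growth of the candidate set is what makes search space reduction and good lower bounds worthwhile.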
Gene order rearrangement methods for the reconstruction of phylogeny
The study of phylogeny, i.e. the evolutionary history of species, is a central problem in biology and a key for understanding characteristics of contemporary species. Many problems in this area can be formulated as combinatorial optimisation problems which makes it particularly
interesting for computer scientists. The reconstruction of the phylogeny of species can be based on various kinds of data, e.g. morphological properties or characteristics of the genetic information of the species. Maximum parsimony is a popular and widely used method for phylogenetic reconstruction aiming for an explanation of the observed data requiring the least evolutionary changes.
A certain property of the genetic information gained much interest for the reconstruction of phylogeny in recent time: the organisation of the genomes of species, i.e. the arrangement of the genes on the chromosomes. But the idea to reconstruct phylogenetic information
from gene arrangements has a long history. In Dobzhansky and Sturtevant (1938) it was already pointed out that “a comparison of the different gene arrangements in the same chromosome may, in certain cases, throw light on the historical relationships of these structures, and consequently on the history of the species as a whole”. This kind of data
is promising for the study of deep evolutionary relationships because gene arrangements are believed to evolve slowly (Rokas and Holland, 2000). This seems to be the case especially for mitochondrial genomes which are available for a wide range of species (Boore, 1999).
The development of methods for the reconstruction of phylogeny from gene arrangement data has made considerable progress during the last years. Prominent examples are the computation of parsimonious evolutionary scenarios, i.e. a shortest sequence of rearrangements
transforming one arrangement of genes into another or the length of such a minimal scenario (Hannenhalli and Pevzner, 1995b; Sankoff, 1992; Watterson et al., 1982); the reconstruction of parsimonious phylogenetic trees from gene arrangement data (Bader et al.,
2008; Bernt et al., 2007b; Bourque and Pevzner, 2002; Moret et al., 2002a); or the computation of the similarities of gene arrangements (Bergeron et al., 2008a; Heber et al., 2009).
1 Introduction
The central theme of this work is to provide efficient algorithms for modified versions of fundamental genome rearrangement problems using more plausible rearrangement models.
Two types of modified rearrangement models are explored.
The first type is to restrict the set of allowed rearrangements as follows. It can be observed that certain groups of genes are preserved during evolution. This may be caused by functional constraints which prevented the destruction (Lathe et al., 2000; Sémon and Duret, 2006; Xie et al., 2003), certain properties of the rearrangements which shaped the gene
orders (Eisen et al., 2000; Sankoff, 2002; Tillier and Collins, 2000), or just because no destructive rearrangement happened since the speciation of the gene orders. It can be assumed that gene groups, found in all studied gene orders, are not acquired independently.
Accordingly, these gene groups should be preserved in plausible reconstructions of the course of evolution, in particular the gene groups should be present in the reconstructed putative ancestral gene orders. This can be achieved by restricting the set of rearrangements, which
are allowed for the reconstruction, to those which preserve the gene groups of the given gene orders. Since it is difficult to determine functionally what a gene group is, it has been proposed to consider common combinatorial structures of the gene orders as gene groups
(Marcotte et al., 1999; Overbeek et al., 1999).
The second modification of the rearrangement model considered here extends the set of allowed rearrangement types. Different types of rearrangement operations have shuffled the gene orders during evolution. The reconstruction should attempt to use the same set of rearrangement
operations; otherwise, distorted or even wrong phylogenetic conclusions may be obtained in the worst case.
Both possibilities have been considered for certain rearrangement problems before. Restricted sets of allowed rearrangements have been used successfully for the computation of parsimonious rearrangement scenarios consisting of inversions only where the gene groups
are identified as common intervals (Bérard et al., 2007; Figeac and Varré, 2004). Extending the set of allowed rearrangement operations is a delicate task. On the one hand it is unknown which rearrangements have to be regarded because this is part of the phylogeny to
be discovered. On the other hand, efficient exact rearrangement methods including several operations are still rare, in particular when transpositions should be included. For example, the problem to compute shortest rearrangement scenarios including transpositions is still of
unknown computational complexity. Currently, only efficient approximation algorithms are known (e.g. Bader and Ohlebusch, 2007; Elias and Hartman, 2006).
Two problems have been studied with respect to one or even both of these possibilities in the scope of this work.
The first one is the inversion median problem. Given the gene orders of some taxa, this problem asks for potential ancestral gene orders such that the corresponding inversion scenario is parsimonious, i.e. has a minimum length. Solving this problem is an essential component of algorithms for computing phylogenetic trees from gene arrangements (Bourque and Pevzner, 2002; Moret et al., 2002a, 2001). The unconstrained inversion median problem is NP-hard (Caprara, 2003). In Chapter 3 the inversion median problem is studied under the additional constraint to preserve gene groups of the input gene orders. Common intervals, i.e. sets of genes that appear consecutively in the gene orders, are used for modelling gene groups. The problem of finding such ancestral gene orders is called the preserving inversion median problem. Already the problem of finding a shortest preserving inversion scenario for two gene orders is NP-hard (Figeac and Varré, 2004).
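The notion of a common interval can be made concrete with a short Python sketch. It uses a straightforward quadratic scan rather than any of the efficient algorithms from the literature, ignores gene orientation, and includes the trivial interval containing all genes; the function name and the example gene orders are hypothetical.

```python
def common_intervals(p, q):
    """Return gene sets of size >= 2 that are contiguous in both p and q.

    p and q are gene orders over the same genes; signs are ignored.
    A segment p[i..j] is a common interval when its genes occupy a
    contiguous block of positions in q, i.e. max_pos - min_pos == j - i.
    """
    pos_q = {abs(g): idx for idx, g in enumerate(q)}
    found = []
    for i in range(len(p)):
        lo = hi = pos_q[abs(p[i])]
        for j in range(i + 1, len(p)):
            k = pos_q[abs(p[j])]
            lo, hi = min(lo, k), max(hi, k)
            if hi - lo == j - i:
                found.append(frozenset(abs(g) for g in p[i:j + 1]))
    return found


p = [1, 2, 3, 4, 5]
q = [3, 1, 2, 5, 4]
for interval in common_intervals(p, q):
    print(sorted(interval))
```

For these two orders the sketch reports {1, 2}, {1, 2, 3}, {4, 5} and the full gene set. A preserving rearrangement scenario would not be allowed to break any of these groups apart.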
Mitochondrial gene orders are a rich source for phylogenetic investigations because they are known for more than 1 000 species. At least four rearrangement operations are reported in the literature to be relevant for the study of mitochondrial gene order evolution (Boore, 1999): inversions, transpositions, inverse transpositions, and tandem duplication random loss (TDRL). Efficient methods for a plausible reconstruction of genome rearrangements for mitochondrial gene orders using all four operations are presented in Chapter 4.
An important rearrangement operation, in particular for the study of mitochondrial gene orders, is the tandem duplication random loss operation (e.g. Boore, 2000; Mauro et al., 2006). This rearrangement duplicates a part of a gene order followed by the random loss of one of the redundant copies of each gene. The gene order is rearranged depending on which copy is lost. This rearrangement should be regarded for reconstructing phylogeny from gene order data. But the properties of this rearrangement operation have rarely been studied
(Bouvel and Rossin, 2009; Chaudhuri et al., 2006). The combinatorial properties of the TDRL operation are studied in Chapter 5. The enumeration and counting of sorting TDRLs, that is TDRL operations reducing the distance, is studied in particular. Closed formulas for computing the number of sorting TDRLs and methods for the enumeration are presented. Furthermore, TDRLs are one of the operations considered in Chapter 4. An interesting property of this rearrangement, distinguishing it from other rearrangements, is its
asymmetry. That is, the effect of a single TDRL can (in most cases) not be reversed with a single TDRL. The use of this property for phylogeny reconstruction is studied in Section 4.3.
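The TDRL operation and its asymmetry can be illustrated with a small Python sketch. It models a TDRL on unsigned gene orders by the set of genes whose first copy survives; duplicating only a part of the gene order is covered by treating the genes before that part as keeping their first copy and the genes after it as keeping their second copy. The function names are hypothetical, and the exhaustive reachability check is only meant for tiny examples, not as the enumeration method of Chapter 5.

```python
from itertools import combinations


def tdrl(order, keep_first):
    """Apply a tandem duplication random loss to a gene order.

    The order is duplicated in tandem; genes in `keep_first` lose their
    second copy, all other genes lose their first copy.
    """
    first_copy = [g for g in order if g in keep_first]
    second_copy = [g for g in order if g not in keep_first]
    return first_copy + second_copy


def reachable_by_one_tdrl(source, target):
    """Check by exhaustive search whether one TDRL turns source into target."""
    genes = list(source)
    for size in range(len(genes) + 1):
        for subset in combinations(genes, size):
            if tdrl(source, set(subset)) == list(target):
                return True
    return False


identity = [1, 2, 3, 4, 5, 6]
shuffled = tdrl(identity, {1, 3, 5})
print(shuffled)                                   # [1, 3, 5, 2, 4, 6]
print(reachable_by_one_tdrl(identity, shuffled))  # True
print(reachable_by_one_tdrl(shuffled, identity))  # False
```

In this example one TDRL leads from the identity to the shuffled order, but no single TDRL leads back, which is the asymmetry described above.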
This thesis is structured as follows. The existing approaches obeying similar types of modified rearrangement models as well as important concepts and computational methods to related problems are reviewed in Chapter 2. The combinatorial structures of gene orders that
have been proposed for identifying gene groups, in particular common intervals, as well as the computational approaches for their computation are reviewed in Section 2.2. Approaches for computing parsimonious pairwise rearrangement scenarios are outlined in Section 2.3.
Methods for the computation of genome rearrangement scenarios obeying biologically motivated constraints, as introduced above, are detailed in Section 2.4. The approaches for the inversion median problem are covered in Section 2.5. Methods for the reconstruction of phylogenetic
trees from gene arrangement data are briefly outlined in Section 2.6.
Chapter 3 introduces the new algorithms CIP, ECIP, and TCIP for solving the preserving inversion median problem. The efficiency of the algorithms is studied empirically on simulated as well as mitochondrial data. The description of algorithms CIP and ECIP is based on Bernt et al. (2006b). TCIP has been described in Bernt et al. (2007a, 2008b). But the
theoretical foundation of TCIP is extended significantly within this work in order to allow for more than three input permutations.
Gene order rearrangement methods that have been developed for the reconstruction of the phylogeny of mitochondrial gene orders are presented in the fourth chapter. The presented algorithm CREx computes rearrangement scenarios for pairs of gene orders. CREx regards the four types of rearrangement operations which are important for mitochondrial gene orders.
Based on CREx the algorithm TreeREx for assigning rearrangement events to a given tree is developed. The quality of the CREx reconstructions is analysed in a large empirical study for simulated gene orders. The results of TreeREx are analysed for several mitochondrial data sets. Algorithms CREx and TreeREx have been published in Bernt et al. (2008a, 2007c).
The analysis of the mitochondrial gene orders of Echinodermata was included in Perseke et al. (2008). Additionally, a new and simple method is presented to explore the potential of the CREx method. The new method is applied to the complete mitochondrial data set.
The problem of enumerating and counting sorting TDRLs is studied in Chapter 5. The theoretical results are covered to a large extent by Bernt et al. (2009b). The missing combinatorial explanation for some of the presented formulas is given here for the first time.
For this purpose, a new method for the enumeration and counting of sorting TDRLs has been developed (Bernt et al., 2009a).
Advances in Branch-and-Fix methods to solve the Hamiltonian cycle problem in manufacturing optimization
159 p. This thesis starts from the tool path optimization problem, to which it contributes a decision support system that generates optimal paths for Additive Manufacturing technology. This contribution serves as a starting point or inspiration for analyzing the Hamiltonian cycle problem (HCP). The HCP consists of visiting every vertex of a given graph exactly once, or determining that no such cycle exists. Many of the methods proposed in the literature apply to undirected graphs, and those that focus on directed graphs have been neither implemented nor tested. One method for solving the problem is Branch-and-Fix (BF), an exact method that uses the transformation of the HCP into a continuous problem. BF is a branching algorithm that builds a decision tree in which two linear programs are solved at each vertex. This method has been tested only on small graphs, and the limitations it may present have therefore not been studied in depth. For this reason, this thesis proposes four methodological contributions related to the HCP and BF: 1) improving the efficiency of BF in several respects, 2) proposing a global branching method, 3) proposing a collapsed BF method, and 4) extending the HCP to a multi-objective setting and proposing a method to solve it.
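As a point of reference for the problem statement, the following Python sketch decides the HCP on a small directed graph by plain depth-first backtracking. This is deliberately not Branch-and-Fix, which works on a continuous reformulation and solves linear programs at each node of its decision tree; the function name and the adjacency-dictionary input format are assumptions made for the illustration.

```python
def hamiltonian_cycle(graph):
    """Return a Hamiltonian cycle as a vertex list, or None if none exists.

    `graph` maps every vertex to an iterable of its successors (directed
    arcs). Plain backtracking, exponential in the worst case.
    """
    vertices = list(graph)
    if not vertices:
        return None
    start = vertices[0]
    path, visited = [start], {start}

    def extend():
        # All vertices used: a cycle exists if we can return to the start.
        if len(path) == len(vertices):
            return start in graph[path[-1]]
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                path.pop()
                visited.remove(nxt)
        return False

    return path + [start] if extend() else None


ring = {0: [1], 1: [2], 2: [3], 3: [0]}
chain = {0: [1], 1: [2], 2: [3], 3: []}
print(hamiltonian_cycle(ring))   # [0, 1, 2, 3, 0]
print(hamiltonian_cycle(chain))  # None
```

The second call shows the other half of the problem definition: when no cycle exists, the procedure has to certify that by exhausting the search.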
SAT and CP: Parallelisation and Applications
This thesis is concerned with the parallelisation of solvers which search for either an arbitrary, or an optimum, solution to a problem stated in some formal way. We discuss the parallelisation of two solvers, and their application in three chapters. In the first chapter, we consider SAT, the decision problem of propositional logic, and algorithms for showing the satisfiability or unsatisfiability of propositional formulas. We sketch some proof-theoretic foundations which are related to the strength of different algorithmic approaches. Furthermore, we discuss details of the implementations of SAT solvers, and show how to improve upon existing sequential solvers. Lastly, we discuss the parallelisation of these solvers with a focus on clause exchange, the communication of intermediate results within a parallel solver. The second chapter is concerned with Constraint Programming (CP) with learning. Contrary to classical Constraint Programming techniques, this incorporates learning mechanisms as they are used in the field of SAT solving. We present results from parallelising CHUFFED, a learning CP solver. As this is both a kind of CP and SAT solver, it is not clear which parallelisation approaches work best here. In the final chapter, we discuss sorting networks, which are data-oblivious sorting algorithms, i.e., the comparisons they perform do not depend on the input data. Their independence of the input data makes them well suited to parallel implementation. We consider the question of how many parallel sorting steps are needed to sort some inputs, and present both lower and upper bounds for several cases.
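The idea of a data-oblivious sorting algorithm organised in parallel steps can be sketched in Python. The example uses a well-known network for four inputs with three layers and checks it with the zero-one principle (a comparator network sorts all inputs if it sorts all 0/1 inputs). The names `LAYERS`, `apply_network` and `sorts_all_inputs` are hypothetical and not taken from the thesis.

```python
from itertools import product

# Each layer is a list of comparators on disjoint wires, so all comparators
# of one layer could run in parallel. Depth = number of layers = 3.
LAYERS = [
    [(0, 1), (2, 3)],
    [(0, 2), (1, 3)],
    [(1, 2)],
]


def apply_network(layers, values):
    """Run the comparator network; the comparison sequence is input-independent."""
    out = list(values)
    for layer in layers:
        for i, j in layer:
            if out[i] > out[j]:
                out[i], out[j] = out[j], out[i]
    return out


def sorts_all_inputs(layers, n):
    """Zero-one principle: test the network on all 2**n binary inputs."""
    return all(
        apply_network(layers, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


print(apply_network(LAYERS, [3, 1, 4, 2]))  # [1, 2, 3, 4]
print(sorts_all_inputs(LAYERS, 4))          # True
print(sorts_all_inputs(LAYERS[:2], 4))      # False: two layers are not enough here
```

Dropping the last layer already breaks this particular network, which hints at the kind of question studied in the thesis: how few parallel steps can suffice for a given number of inputs.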
Almost Symmetries and the Unit Commitment Problem
This thesis explores two main topics. The first is almost symmetry detection on graphs. The presence of symmetry in combinatorial optimization problems has long been considered an anathema, but in the past decade considerable progress has been made. Modern integer and constraint programming solvers have automatic symmetry detection built-in to either exploit or avoid symmetric regions of the search space. Automatic symmetry detection generally works by converting the input problem to a graph which is in exact correspondence with the problem formulation. Symmetry can then be detected on this graph using one of the excellent existing algorithms; these are also the symmetries of the problem formulation. The motivation for detecting almost symmetries on graphs is that almost symmetries in an integer program can force the solver to explore nearly symmetric regions of the search space. Because of the known correspondence between integer programming formulations and graphs, this is a first step toward detecting almost symmetries in integer programming formulations. Though we are only able to compute almost symmetries for graphs of modest size, the results indicate that almost symmetry is definitely present in some real-world combinatorial structures, and likely warrants further investigation. The second topic explored in this thesis is integer programming formulations for the unit commitment problem. The unit commitment problem involves scheduling power generators to meet anticipated energy demand while minimizing total system operation cost. Today, practitioners usually formulate and solve unit commitment as a large-scale mixed integer linear program. The original intent of this project was to bring the analysis of almost symmetries to the unit commitment problem. Two power generators are almost symmetric in the unit commitment problem if they have almost identical parameters.
Along the way, however, new formulations for power generators were discovered that warranted a thorough investigation of their own. Chapters 4 and 5 are a result of this research. Thus this work makes three contributions to the unit commitment problem: a convex hull description for a power generator accommodating many types of constraints, an improved formulation for time-dependent start-up costs, and an exact symmetry reduction technique via reformulation.
Learning optimization models in the presence of unknown relations
In a sequential auction with multiple bidding agents, it is highly
challenging to determine the ordering of the items to sell in order to maximize
the revenue due to the fact that the autonomy and private information of the
agents heavily influence the outcome of the auction.
The main contribution of this paper is two-fold. First, we demonstrate how to
apply machine learning techniques to solve the optimal ordering problem in
sequential auctions. We learn regression models from historical auctions, which
are subsequently used to predict the expected value of orderings for new
auctions. Given the learned models, we propose two types of optimization
methods: a black-box best-first search approach, and a novel white-box approach
that maps learned models to integer linear programs (ILP) which can then be
solved by any ILP-solver. Although the studied auction design problem is hard,
our proposed optimization methods obtain good orderings with high revenues.
Our second main contribution is the insight that the internal structure of
regression models can be efficiently evaluated inside an ILP solver for
optimization purposes. To this end, we provide efficient encodings of
regression trees and linear regression models as ILP constraints. This new way
of using learned models for optimization is promising. As the experimental
results show, it significantly outperforms the black-box best-first search in
nearly all settings. Comment: 37 pages. Working paper