Efficient mining of discriminative molecular fragments
Frequent pattern discovery in structured data is receiving increasing attention in many areas of science. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimate is available for task workloads, which follow a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset
Capacitated Trees, Capacitated Routing, and Associated Polyhedra
We study the polyhedral structure of two related core combinatorial problems: the subtree cardinality-constrained minimal spanning tree problem and the identical customer vehicle routing problem. For each of these problems, and for a forest relaxation of the minimal spanning tree problem, we introduce a number of new valid inequalities and specify conditions under which these inequalities are facets for the associated integer polyhedra. The inequalities are defined by one of several underlying support graphs: (i) a multistar, a "star" with a clique replacing the central vertex; (ii) a clique cluster, a collection of cliques intersecting at a single vertex, or more generally at a central clique; and (iii) a ladybug, consisting of a multistar as a head and a clique as a body. We also consider packing (generalized subtour elimination) constraints, as well as several variants of our basic inequalities, such as partial multistars, whose satellite vertices need not be connected to all of the central vertices. Our development highlights the relationship between the capacitated tree and capacitated forest polytopes and a so-called path-partitioning polytope, and shows how to use monotone polytopes and a set of simple exchange arguments to prove that valid inequalities are facets
The capacitated minimum spanning tree problem
In this thesis we focus on the Capacitated Minimum Spanning Tree (CMST), an extension of the minimum spanning tree (MST) that considers a central or root vertex which receives and sends commodities (information, goods, etc.) to a group of terminals. Such commodities flow through links whose capacities limit the total flow they can accommodate. These capacity constraints on the links are of interest because in many applications the capacity limits are inherent. The CMST finds applications in the same areas as the MST: telecommunications network design, facility location planning, and vehicle routing. The CMST arises in telecommunications network design when the presence of a central server is compulsory and the flow of information is limited by the capacity of either the server or the connection lines. Its study is also especially interesting in the context of the vehicle routing problem, because of the utility that spanning trees can have in constructive methods. By simply adding capacity constraints to the MST problem we move from a polynomially solvable problem to an NP-hard one.
In the first chapter we describe and define the problem, introduce some notation, and present a review of the existing literature. The review covers formulations and exact methods as well as the most relevant heuristic approaches. In the second chapter two basic formulations and the most commonly used valid inequalities are presented.
In the third chapter we present two new formulations for the CMST which are based on the identification of subroots (vertices directly connected to the root). One way of characterizing CMST solutions is by identifying the subroots and the vertices assigned to them. Both formulations use binary decision variables y to identify the subroots. Additional decision variables x are used to represent the elements (arcs) of the tree. In the second formulation the set of x variables is extended to indicate the depth of the arcs in the tree. For each formulation we present families of valid inequalities and address the separation problem in each case. Also a solution algorithm is proposed.
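The subroot characterization described above can be illustrated with a minimal sketch. It assumes a tree stored as a hypothetical `parent` map (vertex to predecessor, with the root as vertex 0) and unit demands; the function names are illustrative and are not taken from the thesis. Because all flow of a subtree passes through the arc joining its subroot to the root, checking the total demand per subroot is enough to check capacity feasibility under these assumptions.

```python
from collections import defaultdict

def subroot_clusters(parent, root=0):
    """Group every non-root vertex under its subroot, i.e. the vertex on its
    path to the root that is directly connected to the root."""
    clusters = defaultdict(list)
    for v in parent:
        u = v
        while parent[u] != root:   # climb until the predecessor is the root
            u = parent[u]
        clusters[u].append(v)
    return dict(clusters)

def is_capacity_feasible(parent, demand, capacity, root=0):
    """True if the total demand assigned to each subroot fits the capacity."""
    clusters = subroot_clusters(parent, root)
    return all(sum(demand[v] for v in members) <= capacity
               for members in clusters.values())

# Illustrative instance: vertices 1 and 4 are subroots of root 0.
parent = {1: 0, 2: 1, 3: 1, 4: 0, 5: 4}
demand = {v: 1 for v in parent}
clusters = subroot_clusters(parent)
feasible_q3 = is_capacity_feasible(parent, demand, capacity=3)
feasible_q2 = is_capacity_feasible(parent, demand, capacity=2)
```

Here vertices 1, 2 and 3 hang from subroot 1 and vertices 4 and 5 from subroot 4, so the tree is feasible for capacity 3 but not for capacity 2.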
In the fourth chapter we present a biased random-key genetic algorithm (BRKGA) for the CMST. BRKGA is a population-based metaheuristic that has been used for combinatorial optimization. Decoders, solution representations, and exploration strategies are presented and discussed. A final algorithm to obtain upper bounds for the CMST is proposed.
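To make the random-key idea concrete, the following is a rough sketch of a generic BRKGA loop with one possible decoder. It is not the decoder or the parameter setting used in the thesis: the decoder (sort terminals by key, cut the order into clusters of at most `capacity` unit-demand terminals, span each cluster together with the root), the function names, and all default parameters are assumptions made for illustration.

```python
import random

def prim_cost(nodes, cost):
    """Cost of a minimum spanning tree over `nodes` (simple Prim's algorithm)."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        w, v = min((cost[u][v], v) for u in in_tree
                   for v in nodes if v not in in_tree)
        in_tree.add(v)
        total += w
    return total

def decode(keys, cost, capacity, root=0):
    """Hypothetical decoder: random keys -> feasible capacitated tree cost."""
    order = sorted(range(1, len(keys) + 1), key=lambda v: keys[v - 1])
    clusters = [order[i:i + capacity] for i in range(0, len(order), capacity)]
    return sum(prim_cost([root] + c, cost) for c in clusters)

def brkga(cost, capacity, pop_size=30, elite_frac=0.2, mutant_frac=0.2,
          bias=0.7, generations=50, seed=0):
    rng = random.Random(seed)
    n = len(cost) - 1                      # terminals; vertex 0 is the root
    n_elite = max(1, int(elite_frac * pop_size))
    n_mutant = max(1, int(mutant_frac * pop_size))
    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda k: decode(k, cost, capacity))
        elite, rest = pop[:n_elite], pop[n_elite:]
        children = []
        for _ in range(pop_size - n_elite - n_mutant):
            e, o = rng.choice(elite), rng.choice(rest)
            # Biased uniform crossover: each key comes from the elite parent
            # with probability `bias`.
            children.append([a if rng.random() < bias else b
                             for a, b in zip(e, o)])
        mutants = [[rng.random() for _ in range(n)] for _ in range(n_mutant)]
        pop = elite + children + mutants   # elite keys survive unchanged
    best = min(pop, key=lambda k: decode(k, cost, capacity))
    return decode(best, cost, capacity), best

# Illustrative instance: root at the origin, Euclidean costs.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (0, 2), (1, 1), (2, 2)]
cost = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in points]
        for ax, ay in points]
best_cost, best_keys = brkga(cost, capacity=3)
mst_cost = prim_cost(range(len(points)), cost)         # unconstrained bound
star_cost = sum(cost[0][v] for v in range(1, len(points)))
```

Every decoded solution is a spanning tree whose subtrees at the root hold at most `capacity` terminals, so its cost is an upper bound for the instance; it can never be below the unconstrained MST cost or above the cost of connecting every terminal directly to the root.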
Numerical results for the BRKGA and two cutting plane algorithms based on the new formulations are presented, discussed, and analyzed in the fifth chapter.
The conclusions of this thesis are presented in the last chapter, which also identifies opportunity areas for future research.
In this thesis we focus on the Capacitated Minimum Spanning Tree problem (CMST), an extension of the minimum spanning tree problem (MST). The CMST considers a root vertex that acts as a central server and sends and receives goods (information, objects, etc.) to and from a set of vertices called terminals. Goods can flow between the server and the terminals only through links of limited capacity. These restrictions on the links make the problem relevant, since in many applications capacity constraints are of vital importance. The most important application areas of the CMST include telecommunication network design, vehicle routing, and location problems. In telecommunication network design, the CMST arises when a central server is considered whose transmission capacity is limited by the characteristics of the server's ports or of the transmission lines. In vehicle routing, the CMST is relevant because of the influence that trees can have on the process of constructing solutions. By the simple fact of adding capacity constraints, the problem goes from being solvable exactly in polynomial time with a greedy algorithm to being very hard to solve exactly.
The first chapter describes and defines the problem, introduces notation, and reviews the existing literature. The review covers formulations, exact methods, and the most important heuristic methods. The next chapter presents two existing binary formulations, as well as the valid inequalities most commonly used to solve the CMST. For each of the proposed formulations, a cutting plane algorithm is described.
Two new formulations for the CMST are presented in the third chapter. These formulations are based on identifying a special type of vertex called a subroot. Subroots are the vertices directly connected to the root. One way to characterize CMST solutions is to identify the subroot nodes and the nodes that depend on them. Both formulations use variables to identify the subroots and additional variables to identify the arcs that form part of the tree. In addition, the variables in the second formulation help identify the depth of those arcs relative to the root. For each formulation, valid inequalities are presented and procedures are proposed for solving the corresponding separation problem.
The fourth chapter presents a genetic algorithm called BRKGA for solving the CMST. BRKGA is based on populations generated from sequences of random numbers, which subsequently evolve. Different decoders, a local search method, search spaces, and exploration strategies are presented and analyzed. The chapter ends with a final algorithm that yields upper bounds for the CMST. Computational results for the BRKGA and the two cutting plane algorithms based on the proposed formulations are shown in the fifth chapter, where they are also analyzed and discussed.
The thesis ends with the conclusions drawn from the research, as well as the opportunity areas in which future research is possible
Dynamic load balancing for the distributed mining of molecular structures
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of
methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the
past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially
render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to
discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no
reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic
partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated
load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer
Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed
approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable
for large-scale, multi-domain, heterogeneous environments, such as computational grids
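As an illustration of receiver-initiated load balancing on an irregular search tree, the sketch below runs a small single-process simulation. It is not the algorithm of the paper: the donor selection rule (ask the most loaded peer), the "give away half of the pending tasks" policy, the task model, and all names and parameters are assumptions chosen to keep the example short.

```python
import random
from collections import deque

def simulate(num_workers=4, root_size=200, seed=1):
    """Simulate workers exploring a search tree with `root_size` nodes.
    A task is an integer: the number of nodes in the subtree it stands for."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root_size)            # all work starts on worker 0
    processed = [0] * num_workers
    steps = 0
    while any(queues):
        steps += 1
        for w in range(num_workers):
            if not queues[w]:
                # Receiver-initiated: the idle worker asks a peer for work.
                donor = max(range(num_workers), key=lambda d: len(queues[d]))
                if len(queues[donor]) > 1:
                    # The donor hands over half of its pending tasks,
                    # oldest (closest to the root) first.
                    for _ in range(len(queues[donor]) // 2):
                        queues[w].append(queues[donor].popleft())
                continue                   # the request costs this step
            size = queues[w].pop()         # depth-first: newest task first
            processed[w] += 1
            # Irregular expansion: split the rest of the subtree unevenly,
            # so the workload of a task cannot be predicted in advance.
            remaining = size - 1
            while remaining > 0:
                child = rng.randint(1, remaining)
                queues[w].append(child)
                remaining -= child
    return processed, steps

processed, steps = simulate()
```

In this model `sum(processed)` always equals `root_size`, and `steps` plays the role of parallel running time, so comparing `steps` for different values of `num_workers` gives a rough feel for how well the idle workers manage to pull work from busy ones.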
Structured Sparsity: Discrete and Convex approaches
Compressive sensing (CS) exploits sparsity to recover sparse or compressible
signals from dimensionality reducing, non-adaptive sensing mechanisms. Sparsity
is also used to enhance interpretability in machine learning and statistics
applications: While the ambient dimension is vast in modern data analysis
problems, the relevant information therein typically resides in a much lower
dimensional space. However, many solutions proposed nowadays do not leverage
the true underlying structure. Recent results in CS extend the simple sparsity
idea to more sophisticated structured sparsity models, which describe the
interdependency between the nonzero components of a signal, increasing the
interpretability of the results and leading to better recovery
performance. In order to better understand the impact of structured sparsity,
in this chapter we analyze the connections between the discrete models and
their convex relaxations, highlighting their relative advantages. We start with
the general group sparse model and then elaborate on two important special
cases: the dispersive and the hierarchical models. For each, we present the
models in their discrete nature, discuss how to solve the ensuing discrete
problems and then describe convex relaxations. We also consider more general
structures as defined by set functions and present their convex proxies.
Further, we discuss efficient optimization solutions for structured sparsity
problems and illustrate structured sparsity in action via three applications.
Comment: 30 pages, 18 figures
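A small example may help to show what a convex treatment of group sparsity looks like in practice. The sketch below applies group soft-thresholding, the proximal operator of the penalty lam * sum_g ||x_g||_2 for non-overlapping groups, which is a standard convex proxy for the group sparse model. The function name, the example vector, and the choice of this particular operator are illustrative assumptions and are not taken from the chapter.

```python
import math

def group_soft_threshold(x, groups, lam):
    """Shrink each group of coordinates toward zero by `lam` in Euclidean
    norm; groups whose norm is at most `lam` are set to zero as a whole."""
    out = list(x)
    for g in groups:
        norm = math.sqrt(sum(x[i] ** 2 for i in g))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in g:
            out[i] = scale * x[i]
    return out

# Two non-overlapping groups: one with a strong signal, one with small noise.
x = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
y = group_soft_threshold(x, groups, lam=1.0)
```

The first group has norm 5 and is only scaled down, while the second group has norm below the threshold and is removed entirely, which is the group-level selection behaviour that plain coordinate-wise sparsity does not provide.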