11 research outputs found
Replicable parallel branch and bound search
Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search order heuristics and lead to unexpected and highly variable parallel performance. The variability makes it hard to reason about the parallel performance of combinatorial searches.
This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable, with slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100x for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median Relative Standard Deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack
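To make the role of the search order heuristic concrete, here is a minimal sequential branch and bound sketch for 0/1 Knapsack. It is not the paper's Haskell skeleton and involves no parallelism; the function name `knapsack_bnb`, the value-density ordering and the fractional bound are illustrative assumptions. It only shows the sequential order that an ordered parallel scheme would try to follow.

```python
from typing import List, Tuple


def knapsack_bnb(items: List[Tuple[int, int]], capacity: int) -> int:
    """Best total value for 0/1 Knapsack; items are (value, weight), weight > 0."""
    # Search order heuristic (an assumption of this sketch): try items with the
    # highest value per unit weight first.
    order = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0  # incumbent: best value found so far

    def bound(i: int, value: int, room: int) -> float:
        # Optimistic estimate: fill the remaining room greedily and allow a
        # fraction of the first item that does not fit.
        for v, w in order[i:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    def search(i: int, value: int, room: int) -> None:
        nonlocal best
        best = max(best, value)
        # Prune when even the optimistic estimate cannot beat the incumbent.
        if i == len(order) or bound(i, value, room) <= best:
            return
        v, w = order[i]
        if w <= room:
            search(i + 1, value + v, room - w)  # heuristic-preferred branch first
        search(i + 1, value, room)              # then the branch without item i

    search(0, 0, capacity)
    return best
```

For example, `knapsack_bnb([(60, 10), (100, 20), (120, 30)], 50)` explores the "take" branches first and returns 220. Handing subtrees to workers in a different order would change how early a good incumbent is found, and therefore how much is pruned.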
Multiple graph matching and applications
In pattern recognition, graphs with attributes are, to a great extent, appropriate and advantageous. Usually, the vertices of a graph represent local parts of an object, while the edges represent relations between these local parts. However, these advantages come with a severe drawback: the distance between two graphs cannot be optimally computed in polynomial time. Given this characteristic, the use of graph prototypes becomes ubiquitous. Graph prototypes have a wide range of applications, the most common being clustering, classification, object recognition, object characterization and graph databases. The objective of a graph prototype is nevertheless the same in all of them: the representation of a set of graphs. To synthesize a prototype, all elements of the training set must be mutually labeled. This mutual labeling consists of identifying which nodes of which graphs represent the same information in the training set. Once this labeling is done, the local attributes can be combined to create a graph prototype. We call this initial labeling a common labeling. Up to now, all state-of-the-art algorithms for computing a common labeling have lacked either performance or a theoretical basis. In this thesis, we formally describe the common labeling problem and give a clear taxonomy of the types of algorithms. We describe six new algorithms, relying on different techniques, that compute a suboptimal solution to the common labeling problem. The performance of the proposed algorithms is evaluated on an artificial dataset and several real datasets. In addition, the algorithms have been evaluated on several real applications, including graph databases and group-wise image registration. In most of the tests and applications evaluated, the presented algorithms show a great improvement over state-of-the-art applications
Schema decision trees for heterogeneous JSON arrays
Due to the popularity of the JavaScript Object Notation (JSON), a need has arisen for the creation of schema documents for the purpose of validating the content of other JSON documents. Existing automatic schema generation tools, however, have not adequately considered the scenario of an array of JSON objects with different types of structures. These tools work off the assumption that all objects have the same structure, and thus, only generate a single schema combining them together. To address this problem, this thesis looks to improve upon schema generation for heterogeneous JSON arrays. We develop an algorithm to determine a set of keys that identifies what type of structure each element has. These keys are then used as the basis for a schema decision tree. The objective of this tree is to help in the validation process by allowing each element to be compared against a single, more tailored, schema
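The idea of finding keys that reveal an element's structure type can be sketched as follows. This is a simplified illustration, not the thesis's algorithm: it assumes each array element is a flat JSON object and treats the set of top-level key names as its structure. The function names `group_by_structure` and `discriminating_keys` are hypothetical.

```python
from typing import Any, Dict, FrozenSet, List

JsonObject = Dict[str, Any]


def group_by_structure(array: List[JsonObject]) -> Dict[FrozenSet[str], List[JsonObject]]:
    """Group array elements by their set of top-level keys."""
    groups: Dict[FrozenSet[str], List[JsonObject]] = {}
    for obj in array:
        groups.setdefault(frozenset(obj), []).append(obj)
    return groups


def discriminating_keys(array: List[JsonObject]) -> FrozenSet[str]:
    """Keys present in some structures but not all; only these can tell types apart."""
    shapes = [frozenset(obj) for obj in array]
    if not shapes:
        return frozenset()
    return frozenset.union(*shapes) - frozenset.intersection(*shapes)


events = [
    {"type": "click", "x": 1, "y": 2},
    {"type": "key", "code": "A"},
    {"type": "click", "x": 5, "y": 6},
]
```

Here `discriminating_keys(events)` yields `x`, `y` and `code`, while `type` is shared by every element. A decision tree could branch on the presence of such keys to route each element to one tailored schema instead of a single merged one.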
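A study on the polynomial-time solvability of the maximum clique problem
The maximum clique problem is a typical NP-complete problem, and it is strongly conjectured that it is almost impossible to solve in polynomial time. It is therefore an important task to clarify at least under what conditions this NP-complete problem can be solved in polynomial time. Polynomial-time solvability has been shown for several special classes of graphs, such as planar graphs and chordal graphs. For general graphs, however, no meaningful quantitative result had been published on conditions under which the maximum clique problem becomes polynomial-time solvable. In this study, we first establish a basic depth-first search algorithm for extracting a maximum clique, based on CLIQUES, an algorithm for enumerating all maximal cliques (E. Tomita, A. Tanaka, H. Takahashi: Theoretical Computer Science, 2006). By strengthening the operations that restrict the search space of this basic algorithm, and by carrying out a correspondingly more detailed case analysis, we successively relax the conditions under which the algorithm terminates in polynomial time and obtain the following quantitative conditions for polynomial-time solvability. First, for general graphs, with only the maximum degree Δ of the graph as the condition, we show the following: if the maximum degree Δ of a graph G = (V, E) with n vertices satisfies an upper bound determined by a constant d ≥ 0, then the maximum clique problem is solvable in polynomial time O(n^(1+d)). We then relax this condition, which is imposed on all vertices, and give the following extended result: if, for every connected induced subgraph G(C) (C ⊆ V) of size n0 ≥ 2, the minimum-degree vertex v in C satisfies an upper bound on deg(v) determined by a constant d ≥ 0, then the maximum clique problem is solvable in polynomial time O(n^max(2, 1+d)). This leaves every vertex of such a connected induced subgraph unconstrained except the one of minimum degree, and is a substantial relaxation of the restriction. In summary, this thesis gives a new framework for the polynomial-time solvability of the maximum clique problem. The University of Electro-Communications, 201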
Solving hard subgraph problems in parallel
This thesis improves the state of the art in exact, practical algorithms for finding subgraphs. We study maximum clique, subgraph isomorphism, and maximum common subgraph problems. These are widely applicable: within computing science, subgraph problems arise in document clustering, computer vision, the design of communication protocols, model checking, compiler code generation, malware detection, cryptography, and robotics; beyond, applications occur in biochemistry, electrical engineering, mathematics, law enforcement, fraud detection, fault diagnosis, manufacturing, and sociology. We therefore consider both the "pure" forms of these problems, and variants with labels and other domain-specific constraints.
Although subgraph-finding should theoretically be hard, the constraint-based search algorithms we discuss can easily solve real-world instances involving graphs with thousands of vertices, and millions of edges. We therefore ask: is it possible to generate "really hard" instances for these problems, and if so, what can we learn? By extending research into combinatorial phase transition phenomena, we develop a better understanding of branching heuristics, as well as highlighting a serious flaw in the design of graph database systems.
This thesis also demonstrates how to exploit two of the kinds of parallelism offered by current computer hardware. Bit parallelism allows us to carry out operations on whole sets of vertices in a single instruction; this is largely routine. Thread parallelism, to make use of the multiple cores offered by all modern processors, is more complex. We suggest three desirable performance characteristics that we would like when introducing thread parallelism: lack of risk (parallel cannot be exponentially slower than sequential), scalability (adding more processing cores cannot make runtimes worse), and reproducibility (the same instance on the same hardware will take roughly the same time every time it is run). We then detail the difficulties in guaranteeing these characteristics when using modern algorithmic techniques.
Besides ensuring that parallelism cannot make things worse, we also increase the likelihood of it making things better. We compare randomised work stealing to new tailored strategies, and perform experiments to identify the factors contributing to good speedups. We show that whilst load balancing is difficult, the primary factor influencing the results is the interaction between branching heuristics and parallelism. By using parallelism to explicitly offset the commitment made to weak early branching choices, we obtain parallel subgraph solvers which are substantially and consistently better than the best sequential algorithms
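The bit parallelism mentioned above can be illustrated with a small sequential maximum clique search that stores vertex sets as integers, so that intersecting a candidate set with a neighbourhood is a single bitwise AND. This is a sketch under stated assumptions, not the thesis's solver: the names `adjacency_masks` and `max_clique_size` are hypothetical, the pruning bound is a simple cardinality bound, and Python's arbitrary-precision integers stand in for fixed-width machine words.

```python
from typing import List, Tuple


def adjacency_masks(n: int, edges: List[Tuple[int, int]]) -> List[int]:
    """adj[v] has bit u set exactly when u and v are adjacent."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    return adj


def max_clique_size(adj: List[int]) -> int:
    """Size of a maximum clique, using integers as vertex bitsets."""
    best = 0

    def expand(size: int, candidates: int) -> None:
        nonlocal best
        best = max(best, size)
        while candidates:
            # Bound: even taking every remaining candidate cannot beat the incumbent.
            if size + bin(candidates).count("1") <= best:
                return
            v = candidates.bit_length() - 1  # pick the highest-numbered candidate
            candidates &= ~(1 << v)          # remove v from the candidate set
            # One AND restricts the whole candidate set to the neighbours of v.
            expand(size + 1, candidates & adj[v])

    expand(0, (1 << len(adj)) - 1)
    return best
```

For a triangle on vertices 0, 1, 2 with a pendant edge 2-3, `max_clique_size(adjacency_masks(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))` returns 3. The order in which `v` is chosen plays the role of the branching heuristic whose interaction with parallelism the abstract discusses.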
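A study on a hierarchy based on the declarative semantics of tree edit distance and its computation
Comparing tree-structured data represented as rooted labeled trees (hereafter simply trees), such as HTML and XML data on the Web or RNA and glycan data in bioinformatics, is an important research topic in data mining and machine learning from structured data. One of the best-known distances between such trees is the tree edit distance. It is formulated as the minimum cost of a sequence of edit operations (node deletion, insertion and substitution) needed to transform one rooted tree into the other. Because innumerable edit sequences exist between two trees, computing the tree edit distance by enumerating all of them is impractical. Tai therefore introduced the Tai mapping (hereafter simply mapping), which gives the tree edit distance a declarative semantics. A Tai mapping is a one-to-one correspondence between the nodes of two trees that preserves the ancestor-descendant relation (and, for ordered trees, the sibling relation), and the minimum cost of a Tai mapping coincides with the tree edit distance. Computing the tree edit distance takes O(n^3) time in the number of nodes n for ordered trees, but is MAX SNP-hard for unordered trees.
In glycan data, on the other hand, the connections between nodes are meaningful, so constraints that keep these connections intact are required; in XML data, a certain set of nodes near the root may be common to every tree, so a distance that places more weight on leaf nodes is required. For some targets the tree edit distance is thus overly general, and, with the further aim of improving computational efficiency, many variants of the tree edit distance have been studied by restricting the mapping that serves as its declarative semantics. In particular, the tree alignment distance, which is used in RNA analysis among other areas and is a tree edit distance in which insertions are performed before deletions, can be computed in O(n^4) time in the number of nodes n for ordered trees; for unordered trees it is MAX SNP-hard in general, but it can be computed in polynomial time when the degree of the trees is bounded. The alignment distance can be formulated as the minimum cost of an alignment tree, which is a supertree of the two trees, and it coincides with the minimum cost of a less-constrained mapping, a Tai mapping with additional restrictions.
In this thesis, we first regard the restrictions on mappings as a hierarchy of Tai mappings and re-examine this hierarchy from the viewpoint of common subforests, in particular the connections between nodes in a common subforest and the arrangement of its subtrees, in order to study what is essential in computing the variants of the tree edit distance. We also analyse the time complexity of the edit distance variants given by the minimum cost of the mappings newly introduced from these viewpoints.
For the tree alignment distance, the anchored alignment problem has been proposed with the aim of speeding up the construction of forest alignments. It takes as input a mapping called an anchor and constructs an alignment tree that preserves the correspondences in the anchor. The anchor is a Tai mapping, however, and no tree can be constructed when the input anchor is not a less-constrained mapping. In this thesis, we give an alternative, constructive proof that the declarative semantics of the tree alignment distance is the less-constrained mapping, and use this construction to formulate the output of the anchored alignment problem so that it returns "no" when no alignment tree can be constructed. Based on this, we formulate the anchored alignment distance and compare it with the alignment distance on real data. Furthermore, we propose cyclically ordered trees, which are more general than ordered trees and more restricted than unordered trees, and design an algorithm that computes the alignment distance between cyclically ordered trees.
Finally, as further topics on the tree edit distance, we design a dynamic A* algorithm for computing the unordered tree edit distance, extend the Tai mapping to unrooted trees, and design mapping kernels for cyclically ordered trees and degree-bounded unordered trees. For the unordered tree edit distance, an A* algorithm by Higuchi et al. using several lower-bound functions has already been introduced, but it contains redundant computation and leaves room for improvement. We introduce a dynamic A* algorithm that eliminates this redundancy by dynamic programming, and confirm the efficiency of the lower-bound functions experimentally. The Tai mapping for rooted trees is an important concept corresponding to the tree edit distance, but extending it to unrooted trees requires, in addition to injectivity, a condition that replaces the ancestor-descendant relation. We therefore focus on the center that Zhang et al. used when extending the LCA-preserving mapping to unrooted trees, and introduce mappings for unrooted trees. In particular, we turn the four-point and three-point conditions, which characterise the phylogenetic trees often represented as unrooted trees, into conditions that characterise the topology of a tree, and introduce mappings that preserve each of these conditions. Moreover, tree kernels, one of the basic methods for classifying trees with support vector machines, have been studied extensively for ordered trees, and most of them fall within the framework of mapping kernels, which count mappings between ordered trees. Kernels for unordered trees, by contrast, have hardly been studied because they are difficult to compute. We therefore design mapping kernels for cyclically ordered trees and for unordered trees whose degree is bounded by a constant D, and discuss their computation time.
Kyushu Institute of Technology doctoral thesis. Degree number: 情工博甲第332号. Date of conferral: 23 March 2018 (Heisei 30). Chapter 1: Introduction | Chapter 2: Tree edit distance and tree alignment distance | Chapter 3: A Tai mapping hierarchy based on common subforests | Chapter 4: Computing the tree alignment distance | Chapter 5: Various extensions | Chapter 6: Conclusions and future work. Kyushu Institute of Technology, Heisei 29 (2017)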
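The definition of the ordered tree edit distance as a minimum-cost edit sequence can be made concrete with a short memoised recursion over forests. This is a sketch for intuition only, assuming unit costs for deletion, insertion and relabeling. It is not one of the thesis's algorithms and does not achieve the O(n^3) bound quoted above. The tuple encoding of trees and the function names are hypothetical.

```python
from functools import lru_cache
from typing import Tuple

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees, ordered left to right.
Forest = Tuple


@lru_cache(maxsize=None)
def forest_distance(f: Forest, g: Forest) -> int:
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    return min(
        forest_distance(f[:-1] + f_kids, g) + 1,  # delete rightmost root of f
        forest_distance(f, g[:-1] + g_kids) + 1,  # insert rightmost root of g
        # Map the two rightmost roots onto each other (relabel if they differ).
        forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
        + (f_label != g_label),
    )


def tree_edit_distance(a: Tuple, b: Tuple) -> int:
    return forest_distance((a,), (b,))
```

For instance, with `a = ("a", (("b", ()), ("c", ())))` and `b = ("a", (("b", ()),))`, `tree_edit_distance(a, b)` is 1, corresponding to deleting the node labeled `c`. The third branch of the `min` is where a mapping between nodes is implicitly built, which is the object that a Tai mapping describes declaratively.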
Evolutionary genomics : statistical and computational methods
This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward