6 research outputs found
A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures
[Background] Measuring similarities between tree-structured data is important for the analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparing tree-structured data. However, computing the edit distance for rooted unordered trees is known to be NP-hard. Furthermore, there is almost no available software tool that can compute the exact edit distance for unordered trees. [Results] In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem, and efficient solvers for the maximum clique problem are then applied. We applied the proposed method to similar-structure search for glycan structures. The results suggest that the proposed method can efficiently compute the edit distance for moderate-size unordered trees. They also suggest that its accuracy is comparable to that of the edit distance for ordered trees and to that of an existing method for glycan search. [Conclusions] The proposed method is simple but useful for computing the edit distance between unordered trees. The object code is available upon request.
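The reduction described in this abstract can be illustrated with a small sketch. It is not the authors' implementation, and the paper's actual construction may differ. It assumes unit edit costs, trees written as hypothetical `(label, [children])` tuples, and a naive exhaustive clique search in place of an efficient solver. Each graph vertex is a candidate node pair `(u, v)`. Two pairs are adjacent when they can coexist in a one-to-one, ancestor-preserving mapping. A maximum-weight clique then yields the distance.

```python
from itertools import product

def preorder(tree):
    """Flatten a (label, [children]) tree into label and parent-index lists."""
    labels, parents = [], []
    def visit(node, parent):
        idx = len(labels)
        labels.append(node[0])
        parents.append(parent)
        for child in node[1]:
            visit(child, idx)
    visit(tree, -1)
    return labels, parents

def is_ancestor(parents, a, b):
    """True if node a is a proper ancestor of node b."""
    b = parents[b]
    while b != -1:
        if b == a:
            return True
        b = parents[b]
    return False

def unordered_edit_distance(t1, t2):
    """Unit-cost edit distance between small rooted unordered trees."""
    l1, p1 = preorder(t1)
    l2, p2 = preorder(t2)
    pairs = list(product(range(len(l1)), range(len(l2))))
    # Mapping a pair saves one deletion and one insertion (2 units),
    # minus 1 if the labels differ and a substitution is needed.
    weight = {(u, v): 2 - (l1[u] != l2[v]) for u, v in pairs}

    def compatible(a, b):
        (u1, v1), (u2, v2) = a, b
        if u1 == u2 or v1 == v2:  # the mapping must be one-to-one
            return False
        # Ancestor relations must agree in both directions.
        return (is_ancestor(p1, u1, u2) == is_ancestor(p2, v1, v2)
                and is_ancestor(p1, u2, u1) == is_ancestor(p2, v2, v1))

    best = 0
    def extend(candidates, total):
        """Enumerate all cliques (exponential; small inputs only)."""
        nonlocal best
        best = max(best, total)
        for i, p in enumerate(candidates):
            rest = [q for q in candidates[i + 1:] if compatible(p, q)]
            extend(rest, total + weight[p])
    extend(pairs, 0)
    return len(l1) + len(l2) - best

# Sibling order is ignored, so these two trees are at distance 0.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("c", []), ("b", [])])
print(unordered_edit_distance(t1, t2))
```

The exhaustive search is only practical for very small trees. A dedicated maximum clique solver, as the abstract describes, would replace `extend`.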
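A study on the polynomial-time solvability of the maximum clique problem
The maximum clique problem is a typical NP-complete problem, and it is strongly conjectured that solving it in polynomial time is almost impossible. It is therefore an important task to clarify, at the least, under what conditions this NP-complete problem can be solved in polynomial time. Polynomial-time solvability has been shown for several special classes of graphs, such as planar graphs and chordal graphs. For general graphs, however, no significant quantitative results had been published on the conditions under which the maximum clique problem becomes solvable in polynomial time. In this study, we first establish a basic depth-first search algorithm for extracting a maximum clique, based on CLIQUES, an algorithm for enumerating all maximal cliques (E. Tomita, A. Tanaka, H. Takahashi: Theoretical Computer Science, 2006). By further strengthening the operations that restrict the search space of this basic algorithm and carrying out a correspondingly more detailed case analysis, we successively relax the conditions under which the algorithm terminates in polynomial time and obtain the following quantitative conditions for polynomial-time solvability. First, for general graphs, we show the following result, whose only condition is on the maximum degree Δ of the graph: "If the maximum degree Δ of a graph G = (V, E) with n vertices satisfies the condition Δ […] (d: a constant), then the maximum clique problem is solvable in polynomial time O(n^{1+d})." This study further relaxes the above condition, which is imposed on all vertices, and gives the following extended result: "If, for every connected induced subgraph G(C) (C ⊆ V) of size n0 ≥ 2, the minimum-degree vertex v in C satisfies deg(v) […] (d: a constant), then the maximum clique problem is solvable in polynomial time O(n^{max(2,1+d)})." This removes the condition from all vertices of each connected induced subgraph of size n0 except the vertex of minimum degree, which is a substantial relaxation of the restriction. In summary, this thesis provides a new framework for the polynomial-time solvability of the maximum clique problem. The University of Electro-Communications, 201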
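As a rough illustration of a depth-first maximum clique search with a search-space restriction, the sketch below uses a simple size bound for pruning. It is a generic branch-and-bound written for this listing, not the CLIQUES algorithm or the thesis's algorithm. The function name `max_clique` and the adjacency-dictionary input format are assumptions.

```python
def max_clique(adj):
    """Return one maximum clique of a graph given as {vertex: set_of_neighbours}."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        while candidates:
            # Bound: even taking every remaining candidate cannot beat the best.
            if len(clique) + len(candidates) <= len(best):
                return
            v = candidates.pop()
            clique.append(v)
            # Only common neighbours of the current clique remain candidates.
            expand(clique, candidates & adj[v])
            clique.pop()

    expand([], set(adj))
    return best

# A triangle {1, 2, 3} with a pendant vertex 4 attached to vertex 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sorted(max_clique(graph)))
```

In this sketch, once a first vertex is chosen, the candidate set holds at most Δ vertices, so the search below it visits at most 2^Δ subsets. This naive observation hints at why a logarithmic bound on Δ gives polynomial running time. The thesis's analysis is considerably sharper.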
Efficient similarity computations on parallel machines using data shaping
Similarity computation is a fundamental operation on all forms of data. Big Data is typically characterized by attributes such as volume, velocity, variety, and veracity. In general, Big Data variety appears in structured, semi-structured, or unstructured forms. The volume of Big Data in general, and of semi-structured data in particular, is increasing at a phenomenal rate. The Big Data phenomenon poses a new set of challenges for similarity computation problems on semi-structured data.
Technology and processor architecture trends strongly suggest that future processors will have tens of thousands of cores (hardware threads). Another crucial trend is that the ratio of on-chip and off-chip memory to core count is decreasing. State-of-the-art parallel computing platforms such as General Purpose Graphics Processors (GPUs) and MICs are promising for high-performance as well as high-throughput computing. However, efficiently processing the semi-structured component of Big Data on parallel computing systems (e.g. GPUs) is challenging. The reason is that most emerging platforms (e.g. GPUs) are organized as Single Instruction Multiple Thread/Data machines, which are highly structured: several cores (streaming processors) operate in lock-step, or the platform requires a high degree of task-level parallelism.
We argue that effective and efficient solutions to key similarity computation problems need to operate in synergy with the underlying computing hardware. Moreover, semi-structured input data needs to be shaped or reorganized in order to exploit the enormous computing power of state-of-the-art highly threaded architectures such as GPUs. For example, shaping input data (via encoding) so that data dependence is minimal can facilitate flexible and concurrent computation on high-throughput accelerators and co-processors such as GPUs and MICs.
We consider various instances of traditional and futuristic problems at the intersection of semi-structured data and data analytics. Preprocessing is a common operation in the initial stages of data processing pipelines. Typically, preprocessing involves operations such as data extraction and data selection. In the context of semi-structured data, twig filtering is used to identify (and extract) data of interest. Duplicate detection and record linkage are useful in preprocessing tasks such as data cleaning and data fusion, and also in data mining, in order to find similar tree objects. Likewise, the tree edit distance is a fundamental metric for tree problems, and similarity computation between trees is another key problem in the context of Big Data.
This dissertation makes a case for platform-centric data shaping as a potent mechanism for tackling the data- and architecture-borne issues of semi-structured data processing on GPUs and GPU-like parallel architectures. We propose several data shaping techniques for tree matching problems on semi-structured data and experiment with real-world datasets. The experimental results show that the proposed platform-centric data shaping approach is effective for computing similarities between tree objects using GPGPUs. The proposed techniques yield performance gains of up to three orders of magnitude, depending on the problem and platform.
Multiple graph matching and applications
In pattern recognition, the use of graphs is, to a great extent, appropriate and advantageous. Usually, the vertices of a graph represent local parts of an object, while the edges represent relations between these local parts. However, these advantages come with a severe drawback: the distance between two graphs cannot be computed optimally in polynomial time. Given this characteristic, the use of graph prototypes becomes ubiquitous. Graph prototypes have a wide range of applications, the most common being clustering, classification, object characterization, and graph databases, to name a few. The objective of a graph prototype is nevertheless the same in all applications: to represent a set of graphs. To synthesize a prototype, all elements of the set must be mutually labeled. This mutual labeling consists of identifying which nodes of which graphs represent the same information in the training set. Once this mutual labeling is done, the set can be characterized and combined to create a graph prototype. We call this initial labeling a common labeling. Up to now, all state-of-the-art algorithms for computing a common labeling lack either performance or a theoretical basis. In this thesis, we formally describe the common labeling problem and give a clear taxonomy of the types of algorithms. Six new algorithms that rely on different techniques are described for computing a suboptimal solution to the common labeling problem. The performance of the proposed algorithms is evaluated on one artificial dataset and several real datasets. In addition, the algorithms have been evaluated on several real applications, including graph databases and group-wise image registration. In most of the tests and applications evaluated, the presented algorithms have shown a great improvement over state-of-the-art methods.
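A study on the hierarchy based on the declarative semantics of the tree edit distance and its computation
Comparing tree-structured data represented as rooted labeled trees (hereafter simply "trees"), such as HTML and XML data on the Web and RNA and glycan data in bioinformatics, is one of the important research topics in data mining and machine learning from structured data. One of the best-known distances between such trees is the tree edit distance. The tree edit distance is formulated as the minimum cost of a sequence of edit operations (node deletion, insertion, and substitution) needed to transform one rooted tree into the other. Because there are innumerable edit sequences between two trees, computing all of them to obtain the tree edit distance is not practical. Tai therefore introduced, as a means of computing the tree edit distance, the Tai mapping (hereafter also simply "mapping"), which gives the tree edit distance a declarative semantics. A Tai mapping is a one-to-one correspondence between the nodes of the trees that preserves the ancestor-descendant relation (and, for ordered trees, the sibling relation), and the minimum cost of a Tai mapping coincides with the tree edit distance. Computing the tree edit distance takes O(n^3) time in the number of nodes n for ordered trees, but is MAX SNP-hard for unordered trees. On the other hand, in glycan data the connections between nodes are meaningful, so constraints that do not break those connections are required; in XML data, a certain set of nodes from the root may be common to every tree, so a distance that places more weight on leaf nodes is required. Because the tree edit distance is thus overly general for some targets, and also with the aim of improving computational efficiency, many variants of the tree edit distance have been studied by restricting the mapping that serves as its declarative semantics. In particular, the tree alignment distance, which is used in RNA analysis and is a tree edit distance in which insertions are performed before deletions, can be computed in O(n^4) time in the number of nodes n for ordered trees; for unordered trees it is MAX SNP-hard in general, but it can be computed in polynomial time when the degree of the trees is bounded. The alignment distance can be formulated as the minimum cost of an alignment tree, which is a supertree of the two trees, and it coincides with the minimum cost of a less-constrained mapping, a restricted form of the Tai mapping.
This thesis first treats the restrictions on mappings as a hierarchy of Tai mappings and studies the essence of computing variants of the tree edit distance by reexamining this hierarchy from the viewpoint of common subforests, in particular the connections between nodes and the arrangement of subtrees within a common subforest. For the mappings newly introduced from these viewpoints, we analyze the time complexity of the edit distance variants given by their minimum costs.
For the tree alignment distance, the anchored alignment problem has been proposed with the aim of speeding up the construction of forest alignments. This problem takes a mapping called an anchor as input and constructs an alignment tree that preserves the correspondences in the anchor. However, the anchor is a Tai mapping, and when a mapping that is not a less-constrained mapping is given as the anchor, no alignment tree can be constructed. This thesis therefore gives an alternative, constructive proof that the declarative semantics of the tree alignment distance is the less-constrained mapping and, using this construction, reformulates the output of the anchored alignment problem so that it returns "no" when an alignment tree cannot be constructed. Based on this, we formulate the anchored alignment distance and compare it with the alignment distance on real data. Furthermore, we propose cyclically ordered trees, which are more general than ordered trees and more restricted than unordered trees, and design an algorithm for computing the alignment distance between cyclically ordered trees.
Finally, as further topics related to the tree edit distance, we design a dynamic A* algorithm for computing the unordered tree edit distance, extend the Tai mapping to unrooted trees, and design mapping kernels for cyclically ordered trees and degree-bounded unordered trees. As an algorithm for computing the unordered tree edit distance, the A* algorithm of Higuchi et al., which uses multiple lower-bound functions, has already been introduced, but it contains redundant computation and leaves room for improvement. This thesis introduces a dynamic A* algorithm that eliminates the redundant computation by dynamic programming, and confirms the efficiency of the lower-bound functions experimentally. The Tai mapping for rooted trees is an important concept corresponding to the tree edit distance, but extending it to unrooted trees requires, in addition to injectivity, a condition that replaces the ancestor-descendant relation. We therefore focus on the center that Zhang et al. used when extending the LCA-preserving mapping to unrooted trees, and introduce mappings for unrooted trees. In particular, we change the four-point and three-point conditions, which characterize phylogenetic trees (often represented as unrooted trees), into conditions that characterize the topology of a tree, and introduce mappings that preserve each of these conditions. Furthermore, tree kernels, one of the basic methods for classifying trees with support vector machines, have been studied extensively for ordered trees, and most of them fall within the framework of mapping kernels, which count mappings between ordered trees. Kernels for unordered trees, by contrast, have hardly been studied because of the difficulty of computing them. We therefore design mapping kernels for cyclically ordered trees and for unordered trees whose degree is bounded by a constant D, and discuss their computation time.
Kyushu Institute of Technology doctoral dissertation. Degree number: 情工博甲第332号. Date conferred: March 23, 2018 (Heisei 30). Contents: Chapter 1 Introduction | Chapter 2 Tree Edit Distance and Tree Alignment Distance | Chapter 3 A Tai Mapping Hierarchy Based on Common Subforests | Chapter 4 Computing the Tree Alignment Distance | Chapter 5 Various Extensions | Chapter 6 Conclusions and Future Work. Kyushu Institute of Technology, Heisei 29 (2017)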
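To make the definition of the tree edit distance as a minimum-cost sequence of deletions, insertions, and substitutions concrete, here is a minimal sketch for ordered trees with unit costs. It uses the textbook rightmost-root forest recursion with memoization, not any algorithm from the thesis. The nested-tuple encoding `(label, (children...))` and the function names are assumptions made for this example.

```python
from functools import lru_cache

def size(forest):
    """Number of nodes in a forest, given as a tuple of (label, children) trees."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        # Delete the rightmost root of f1; its children take its place.
        forest_distance(f1[:-1] + c1, f2) + 1,
        # Insert the rightmost root of f2.
        forest_distance(f1, f2[:-1] + c2) + 1,
        # Map the two rightmost roots onto each other (relabel if needed).
        forest_distance(c1, c2) + forest_distance(f1[:-1], f2[:-1]) + (l1 != l2),
    )

def tree_distance(t1, t2):
    """Edit distance between two ordered trees."""
    return forest_distance((t1,), (t2,))

# Swapping two differently labeled siblings costs 2 when order matters.
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("c", ()), ("b", ())))
print(tree_distance(t1, t2))
```

The third branch of the recursion corresponds to adding one pair to a mapping, which is why the minimum-cost mapping and the minimum-cost edit sequence agree. The plain memoized recursion is slower than the O(n^3) bound cited in the abstract.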