
A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures

Author: Akutsu Tatsuya
Fukagawa Daiji
Takasu Atsuhiro
Tamura Takeyuki
Tomita Etsuji
Publication venue: BioMed Central
Publication date: 01/01/2011

[Background] Measuring similarities between tree-structured data is important for the analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparing tree-structured data. However, computing the edit distance for rooted unordered trees is known to be NP-hard. Furthermore, almost no software tool is available that can compute the exact edit distance for unordered trees. [Results] In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem, and efficient solvers for the maximum clique problem are then applied. We applied the proposed method to similar-structure search for glycan structures. The results suggest that the proposed method can efficiently compute the edit distance for moderate-size unordered trees. They also suggest that its accuracy is comparable to that of the edit distance for ordered trees and of an existing method for glycan search. [Conclusions] The proposed method is simple but useful for computing the edit distance between unordered trees. The object code is available upon request.

Crossref

Springer - Publisher Connector

PubMed Central

Kyoto University Research Information Repository

A study on the polynomial-time solvability of the maximum clique problem (最大クリーク問題の多項式時間的可解性に関する研究)

Author: Hiroaki Nakanishi
中西裕陽
Publication date: 19/12/2016

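The so-called maximum clique problem is a typical NP-complete problem, and it is strongly conjectured that solving it in polynomial time is nearly impossible. It is therefore important to clarify at least under what conditions this NP-complete problem can be solved in polynomial time. Polynomial-time solvability has been shown for several special classes of graphs, such as planar graphs and chordal graphs. For general graphs, however, no meaningful quantitative result had been published on the conditions under which the maximum clique problem becomes polynomial-time solvable. In this study, we first established a basic depth-first search algorithm for finding a maximum clique, based on CLIQUES, an algorithm for enumerating all maximal cliques (E. Tomita, A. Tanaka, H. Takahashi: Theoretical Computer Science, 2006). By strengthening the search-space pruning of this basic algorithm and carrying out an analysis with correspondingly finer case distinctions, we successively relaxed the conditions under which the algorithm terminates in polynomial time and obtained the following quantitative conditions for polynomial-time solvability. First, for general graphs, we showed the following result, conditioned only on the maximum degree Δ of the graph: "If the maximum degree Δ of a graph G = (V, E) with n vertices satisfies [bound on Δ in terms of the constant d lost in extraction], the maximum clique problem is solvable in polynomial time O(n^(1+d))." We further give the following extension, which relaxes the above condition on all vertices: "If, for every connected induced subgraph G(C) (C ⊆ V) of size n0 ≥ 2, the minimum-degree vertex v in C satisfies [bound on deg(v) in terms of the constant d lost in extraction], the maximum clique problem is solvable in polynomial time O(n^max(2,1+d))." This imposes no condition on the connected induced subgraphs of size n0 other than on their minimum-degree vertex, and is thus a substantial relaxation of the restriction. In summary, this thesis gives a new framework for the polynomial-time solvability of the maximum clique problem. The University of Electro-Communications, 201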
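
The depth-first search over maximal cliques that this abstract builds on can be illustrated with a short sketch. What follows is a generic Bron-Kerbosch search with pivoting and a simple size bound, written under the assumption that the graph is an undirected adjacency dictionary without self-loops. It is not the thesis's algorithm and does not reproduce its pruning rules or its polynomial-time guarantees; the names `maximum_clique` and `expand` are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets, no self-loops


def maximum_clique(graph: Graph) -> List[int]:
    """Return one maximum clique, found by depth-first search with pivoting."""
    best: List[int] = []

    def expand(clique: List[int], candidates: Set[int], excluded: Set[int]) -> None:
        nonlocal best
        if not candidates and not excluded:
            # The current clique is maximal; keep it if it is the largest so far.
            if len(clique) > len(best):
                best = list(clique)
            return
        # Bound: even using every remaining candidate cannot beat the best clique.
        if len(clique) + len(candidates) <= len(best):
            return
        # Pivot: the vertex with the most neighbours among the candidates.
        pivot = max(candidates | excluded, key=lambda u: len(candidates & graph[u]))
        # Only vertices that are not neighbours of the pivot need to be branched on.
        for v in list(candidates - graph[pivot]):
            expand(clique + [v], candidates & graph[v], excluded & graph[v])
            candidates.remove(v)
            excluded.add(v)

    expand([], set(graph), set())
    return sorted(best)


if __name__ == "__main__":
    # A triangle {1, 2, 3} joined by the edge 3-4 to a 4-clique {4, 5, 6, 7}.
    g: Graph = {
        1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
        4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
    }
    print(maximum_clique(g))
```

On general graphs this search still takes exponential time in the worst case; the abstract's contribution concerns degree conditions under which a strengthened search of this kind terminates in polynomial time.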

Creative Repository of Electro-Communications

Efficient similarity computations on parallel machines using data shaping

Author: Shukla Parijat
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2017

Similarity computation is a fundamental operation on all forms of data. Big Data is typically characterized by attributes such as volume, velocity, variety, and veracity. In general, Big Data variety appears in structured, semi-structured, or unstructured forms. The volume of Big Data in general, and of semi-structured data in particular, is increasing at a phenomenal rate. The Big Data phenomenon poses a new set of challenges for similarity computation problems on semi-structured data. Technology and processor architecture trends strongly suggest that future processors will have tens of thousands of cores (hardware threads). Another crucial trend is that the ratio of on-chip and off-chip memory to core count is decreasing. State-of-the-art parallel computing platforms such as general-purpose graphics processors (GPUs) and MICs are promising for high-performance as well as high-throughput computing. However, processing the semi-structured component of Big Data efficiently on parallel computing systems (e.g. GPUs) is challenging. The reason is that most of the emerging platforms (e.g. GPUs) are organized as Single Instruction Multiple Thread/Data machines, which are highly structured, with several cores (streaming processors) operating in lock-step, or they require a high degree of task-level parallelism. We argue that effective and efficient solutions to key similarity computation problems need to operate synergistically with the underlying computing hardware. Moreover, semi-structured input data needs to be shaped or reorganized in order to exploit the enormous computing power of state-of-the-art highly threaded architectures such as GPUs. For example, shaping input data (via encoding) so that it has minimal data dependence can facilitate flexible and concurrent computation on high-throughput accelerators and co-processors such as GPUs and MICs.
We consider various instances of traditional and futuristic problems at the intersection of semi-structured data and data analytics. Preprocessing is a common operation in the initial stages of data processing pipelines and typically involves operations such as data extraction and data selection. In the context of semi-structured data, twig filtering is used to identify (and extract) data of interest. Duplicate detection and record linkage are useful in preprocessing tasks such as data cleaning and data fusion, and also in data mining, in order to find similar tree objects. Likewise, tree edit distance is a fundamental metric for tree problems, and similarity computation between trees is another key problem in the context of Big Data. This dissertation makes a case for platform-centric data shaping as a potent mechanism for tackling the data- and architecture-borne issues of semi-structured data processing on GPUs and GPU-like parallel architectures. We propose several data shaping techniques for tree matching problems on semi-structured data and experiment with real-world datasets. The experimental results show that the proposed platform-centric data shaping approach is effective for computing similarities between tree objects using GPGPUs. The proposed techniques yield performance gains of up to three orders of magnitude, depending on the problem and platform.

Digital Repository @ Iowa State University (ISU)

Multiple graph matching and applications

Author: Solé Ribalta Albert
Publication venue: Universitat Rovira i Virgili
Publication date: 01/01/2012

In pattern recognition, the use of graphs is, to a great extent, appropriate and advantageous. Usually, the vertices of a graph represent local parts of an object, while the edges represent relations between these local parts. However, these advantages come with a severe drawback: the distance between two graphs cannot be computed optimally in polynomial time. Given this characteristic, the use of graph prototypes becomes ubiquitous. Graph prototypes have a wide range of applications, the most common being clustering, classification, object recognition, object characterization, and graph databases. Regardless of the application, the objective of a graph prototype is the same: to represent a set of graphs. To synthesize a prototype, all elements of the set must be mutually labeled. This mutual labeling consists of identifying which nodes of which graphs represent the same information in the training set. Once this labeling is done, the set can be characterized and its local attributes combined to create a graph prototype. We call this initial labeling a common labeling. Up to now, all state-of-the-art algorithms for computing a common labeling have lacked either performance or a theoretical basis. In this thesis, we formally describe the common labeling problem and give a clear taxonomy of the types of algorithms. We describe six new algorithms, relying on different techniques, that compute a suboptimal solution to the common labeling problem. The performance of the proposed algorithms is evaluated on one artificial and several real datasets. In addition, the algorithms have been evaluated on several real applications, including graph databases and group-wise image registration. In most of the tests and applications evaluated, the presented algorithms showed a great improvement over state-of-the-art algorithms.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Repositori Institucional URV

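A study on a hierarchy based on the declarative semantics of the tree edit distance and its computation (木編集距離の宣言的意味に基づく階層とその計算に関する研究)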

Author: 芳野拓也
Publication venue: 平田, 耕一
Publication date: 13/06/2018

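Comparing tree-structured data represented as rooted labeled trees (hereafter, trees), such as HTML and XML data on the Web and RNA and glycan data in bioinformatics, is an important research topic in data mining and machine learning from structured data. One of the best-known distances between such trees is the tree edit distance. The tree edit distance is formulated as the minimum cost of a sequence of edit operations, consisting of node deletion, insertion, and substitution, needed to transform one rooted tree into the other. Because there are infinitely many edit sequences between two trees, computing the tree edit distance by enumerating all of them is impractical. Tai therefore introduced the Tai mapping (hereafter also simply a mapping), which gives the tree edit distance a declarative semantics, as a guide to computing it. A Tai mapping is a one-to-one correspondence between the nodes of the trees that preserves ancestor-descendant relations (and, for ordered trees, sibling relations), and the minimum cost of a Tai mapping coincides with the tree edit distance. Computing the tree edit distance takes O(n^3) time in the number of nodes n for ordered trees, but is MAX SNP-hard for unordered trees. On the other hand, in glycan data the connections between nodes are meaningful, so constraints that do not break those connections are required; in XML data, a certain set of nodes from the root may be common to all trees, so a distance that places more weight on leaf nodes is required. Because the tree edit distance is thus overly general for some targets, and also with the aim of improving computational efficiency, many variants of the tree edit distance have been studied by restricting the mapping that is its declarative semantics. In particular, the tree alignment distance, which is used in RNA analysis and elsewhere and is also a tree edit distance in which insertions are performed before deletions, can be computed in O(n^4) time in the number of nodes n for ordered trees; for unordered trees it is MAX SNP-hard in general, but it can be computed in polynomial time when the degree of the trees is bounded. This alignment distance can be formulated as the minimum cost of an alignment tree, which is a supertree of the two trees, and it coincides with the minimum cost of a less-constrained mapping, a restriction of the Tai mapping.
In this thesis, we first view the restrictions on mappings as a hierarchy of Tai mappings and re-examine this hierarchy from the viewpoint of common subforests, in particular the connectivity of nodes and the arrangement of subtrees in a common subforest, in order to study what is essential in computing variants of the tree edit distance. We also analyze the time complexity of the edit distance variants given by the minimum cost of the mappings newly introduced from these viewpoints. For the tree alignment distance, the anchored alignment problem has been proposed, which was introduced to speed up the construction of forest alignments. This problem takes as input a mapping called an anchor and constructs an alignment tree that preserves the correspondences in the anchor; however, the anchor is a Tai mapping, and if a mapping that is not a less-constrained mapping is given as the anchor, the tree cannot be constructed. We therefore give an alternative, constructive proof that the declarative semantics of the tree alignment distance is the less-constrained mapping and, using that construction, formulate the output of the anchored alignment problem so that it returns "no" when no alignment tree can be constructed. We also formulate the anchored alignment distance based on it and compare the anchored alignment distance with the alignment distance on real data. Furthermore, we propose cyclically ordered trees, which are more general than ordered trees and more restricted than unordered trees, and design an algorithm that computes the alignment distance between cyclically ordered trees.
Finally, as further topics related to the tree edit distance, we design a dynamic A* algorithm for computing the unordered tree edit distance, extend the Tai mapping to unrooted trees, and design mapping kernels for cyclically ordered trees and bounded-degree unordered trees. For computing the unordered tree edit distance, the A* algorithm of Higuchi et al., which uses multiple lower-bound functions, has already been introduced, but it contains redundant computations and leaves room for improvement. We introduce a dynamic A* algorithm that eliminates those redundant computations by dynamic programming, and we confirm the efficiency of the lower-bound functions experimentally. The Tai mapping for rooted trees is an important concept corresponding to the tree edit distance, but extending it to unrooted trees requires, in addition to injectivity, a condition that replaces the ancestor-descendant relation. We therefore focus on the center, which Zhang et al. used when extending the LCA-preserving mapping to unrooted trees, and introduce mappings for unrooted trees. In particular, we modify the four-point condition and the three-point condition, which characterize phylogenetic trees (often represented as unrooted trees), into conditions that characterize the topology of a tree, and introduce mappings that preserve each condition. Furthermore, tree kernels, one of the basic methods for classifying trees with support vector machines, have been studied extensively for ordered trees, and most of them fall within the framework of mapping kernels, which count mappings between ordered trees. Kernels for unordered trees, by contrast, have hardly been studied because they are difficult to compute. We therefore design mapping kernels for cyclically ordered trees and for unordered trees whose degree is bounded by a constant D, and discuss their computation time.
Kyushu Institute of Technology doctoral dissertation. Degree number: 情工博甲第332号. Date of conferral: March 23, 2018 (Heisei 30). Chapter 1: Introduction | Chapter 2: Tree edit distance and tree alignment distance | Chapter 3: A Tai mapping hierarchy based on common subforests | Chapter 4: Computing the tree alignment distance | Chapter 5: Various extensions | Chapter 6: Conclusions and future work. Kyushu Institute of Technology, Heisei 29 (2017).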
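
The Tai mapping that this abstract relies on is commonly stated as follows. This is a sketch of the usual textbook formulation, not notation taken from the thesis: the symbols $M$, $\gamma$, $\varepsilon$, $\le$ (ancestor-or-self order), $\preceq$ (left-to-right order), and $\tau$ are assumptions introduced here, and the final equality assumes that the cost function $\gamma$ is a metric.

```latex
% A set M \subseteq V(T_1) \times V(T_2) is a Tai mapping if, for all
% (u_1, v_1), (u_2, v_2) \in M:
\begin{align*}
  u_1 = u_2 &\iff v_1 = v_2
    && \text{(one-to-one)} \\
  u_1 \le u_2 &\iff v_1 \le v_2
    && \text{(ancestor-descendant relation preserved)} \\
  u_1 \preceq u_2 &\iff v_1 \preceq v_2
    && \text{(left-to-right order preserved; ordered trees only)}
\end{align*}
% Cost of a mapping: substitutions for mapped pairs, deletions for unmapped
% nodes of T_1, insertions for unmapped nodes of T_2.
\[
  \gamma(M) = \sum_{(u,v) \in M} \gamma(u \to v)
            + \sum_{u \in V(T_1) \setminus M|_1} \gamma(u \to \varepsilon)
            + \sum_{v \in V(T_2) \setminus M|_2} \gamma(\varepsilon \to v)
\]
% Tree edit distance as the minimum-cost Tai mapping.
\[
  \tau(T_1, T_2) = \min \{\, \gamma(M) \mid M \text{ is a Tai mapping between } T_1 \text{ and } T_2 \,\}
\]
```

Here $M|_1$ and $M|_2$ denote the sets of nodes of $T_1$ and $T_2$ that appear in $M$. The restricted mappings discussed in the abstract, such as the less-constrained mapping, add further conditions to the first block and leave the cost definition unchanged.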

Kyutacar : Kyushu Institute of Technology Academic Repository