98,008 research outputs found

Truveta Mapper: A Zero-shot Ontology Alignment Framework

Author: Amir Mariyam
Bahramali Alireza
Baruah Murchana
Ehsani Sina
Eslamialishah Mahsa
Naddaf-Sh Sadra
Zarandioon Saman
Publication venue
Publication date: 31/03/2023
Field of study

In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM) or Ontology Alignment (OA) by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationship between different ontologies via transfer learning, without requiring any explicit cross-ontology manually labeled data. This also enables the formulated framework to outperform existing solutions in both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpora and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches (Edit-Similarity, LogMap, AML, BERTMap, and the OM frameworks recently presented in the Ontology Alignment Evaluation Initiative, OAEI22), offers log-linear complexity in contrast to the quadratic complexity of existing end-to-end methods, and overall makes the OM task efficient and more straightforward, without much post-processing involving mapping extension or mapping repair.

arXiv.org e-Print Archive

PicXAA-R: Efficient structural alignment of multiple RNA sequences using a greedy approach

Author: Sayed Mohammad Ebrahim Sahraeian
Byung-Jun Yoon
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Background: Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has attracted growing attention as recent studies have unveiled the significance of ncRNAs in living organisms. Because Sankoff-style structural alignment algorithms cannot efficiently handle multiple sequences, progressive schemes are mostly used to reduce the complexity. However, this approach tends to propagate early-stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA, which constructs an accurate alignment in a non-progressive fashion. Results: Here, we propose PicXAA-R as an extension of PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently captures both the folding information within each sequence and the local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. Conclusions: Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and that it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.

Crossref

Directory of Open Access Journals

PubMed Central

Texas A&M Repository

PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text

Author: Bonizzoni Paola
Della Vedova Gianluca
Pirola Yuri
Rizzi Raffaella
Publication venue
Publication date: 01/01/2010
Field of study

Current computational methods for exon-intron structure prediction from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency necessary to process large clusters of more than 20,000 ESTs and genes longer than 1Mb. Guaranteeing both accuracy and efficiency seems to be a computational goal that is still far from being achieved, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions that are highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and of the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and PIntron software are available at http://www.algolab.eu/PIntron
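To make the notion of a maximal pairing concrete, the following is a minimal sketch only. It assumes that a maximal pairing of P and T is a triple (p, t, length) such that P[p:p+length] equals T[t:t+length] and the match cannot be extended to the left or to the right; this reading of the definition, the function name `maximal_pairings`, and the `min_len` parameter are all illustrative assumptions. The enumeration is deliberately naive and is not the algorithm of the paper, nor does it build the Embedding Graph.

```python
def maximal_pairings(pattern, text, min_len=1):
    """Naively list (p, t, length) triples where pattern[p:p+length] equals
    text[t:t+length] and the match cannot be extended left or right.

    Hypothetical helper for illustration; not the PIntron procedure.
    """
    pairings = []
    for p in range(len(pattern)):
        for t in range(len(text)):
            if pattern[p] != text[t]:
                continue
            # Skip starts that could be extended one character to the left.
            if p > 0 and t > 0 and pattern[p - 1] == text[t - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            length = 0
            while (p + length < len(pattern)
                   and t + length < len(text)
                   and pattern[p + length] == text[t + length]):
                length += 1
            if length >= min_len:
                pairings.append((p, t, length))
    return pairings
```

For instance, `maximal_pairings("ACGT", "TTACGTA", min_len=2)` reports the single pairing that starts at position 0 of the pattern and position 2 of the text and has length 4.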

arXiv.org e-Print Archive

CiteSeerX

Graph-based modeling of tandem repeats improves global multiple sequence alignment

Author: Anisimova Maria
Szalkowski Adam M.
Publication venue
Publication date: 02/08/2017
Field of study

Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

RERO DOC Digital Library

Fuse: Multiple Network Alignment via Data Fusion

Author: Gligorijević V
Malod-Dognin N
Pržulj N
Publication venue: Oxford University Press (OUP)
Publication date: 09/10/2015
Field of study

Spiral - Imperial College Digital Repository

If the Current Clique Algorithms are Optimal, so is Valiant's Parser

Author: Abboud Amir
Backurs Arturs
Williams Virginia Vassilevska
Publication venue
Publication date: 05/11/2015
Field of study

The CFG recognition problem is: given a context-free grammar $\mathcal{G}$ and a string $w$ of length $n$, decide if $w$ can be obtained from $\mathcal{G}$. This is the most basic parsing question and is a core computer science problem. Valiant's parser from 1975 solves the problem in $O(n^{\omega})$ time, where $\omega<2.373$ is the matrix multiplication exponent. Dozens of parsing algorithms have been proposed over the years, yet Valiant's upper bound remains unbeaten. The best combinatorial algorithms have mildly subcubic $O(n^3/\log^3{n})$ complexity. Lee (JACM'01) provided evidence that fast matrix multiplication is needed for CFG parsing, and that very efficient and practical algorithms might be hard or even impossible to obtain. Lee showed that any algorithm for a more general parsing problem with running time $O(|\mathcal{G}|\cdot n^{3-\varepsilon})$ can be converted into a surprising subcubic algorithm for Boolean Matrix Multiplication. Unfortunately, Lee's hardness result required that the grammar size be $|\mathcal{G}|=\Omega(n^6)$. Nothing was known for the more relevant case of constant size grammars. In this work, we prove that any improvement on Valiant's algorithm, even for constant size grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the $k$-Clique problem: given a graph on $n$ nodes, decide if there are $k$ that form a clique. Besides classifying the complexity of a fundamental problem, our reduction has led us to similar lower bounds for more modern and well-studied cubic time problems for which faster algorithms are highly desirable in practice: RNA Folding, a central problem in computational biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS'14).
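As a point of reference for the CFG recognition problem stated above, here is a minimal sketch of the textbook cubic-time CYK recognizer. It is not Valiant's parser and uses no fast matrix multiplication; it only illustrates the kind of cubic baseline that the subcubic bounds are measured against. The sketch assumes the grammar is already in Chomsky normal form, and the function name `cyk_recognize` and the dictionary encoding of the rules are illustrative choices, not taken from the paper.

```python
from itertools import product


def cyk_recognize(word, terminal_rules, binary_rules, start="S"):
    """Return True if `word` is derivable from `start` in a CNF grammar.

    terminal_rules: maps a terminal symbol a to the set of A with A -> a.
    binary_rules:   maps a pair (B, C) to the set of A with A -> B C.
    The empty string is not handled in this sketch.
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive word[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][i] = set(terminal_rules.get(symbol, ()))
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start index
            j = i + length - 1              # end index
            for k in range(i, j):           # split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary_rules.get((b, c), set())
    return start in table[0][n - 1]


# Example grammar (an assumption for illustration): non-empty balanced
# parentheses, i.e. S -> S S | L R | L X, X -> S R, L -> "(", R -> ")".
PAREN_TERMINALS = {"(": {"L"}, ")": {"R"}}
PAREN_BINARY = {
    ("S", "S"): {"S"},
    ("L", "R"): {"S"},
    ("L", "X"): {"S"},
    ("S", "R"): {"X"},
}
```

With this grammar, `cyk_recognize("(())()", PAREN_TERMINALS, PAREN_BINARY)` accepts, while `"(()"` is rejected. The three nested loops over `length`, `i` and `k` are what give the cubic dependence on $n$ for a grammar of constant size.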

arXiv.org e-Print Archive

Crossref

Tree decomposition and parameterized algorithms for RNA structure-sequence alignment including tertiary interactions and pseudoknots

Author: Barth Dominique
Denise Alain
Ponty Yann
Rinaudo Philippe
Publication venue
Publication date: 17/06/2012
Field of study

We present a general setting for structure-sequence comparison in a large class of RNA structures that unifies and generalizes a number of recent works on specific families of structures. Our approach is based on tree decomposition of structures and gives rise to a general parameterized algorithm, where the exponential part of the complexity depends on the family of structures. For each of the previously studied families, our algorithm has the same complexity as the specific algorithm that had been given before.

arXiv.org e-Print Archive

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Polytechnique

HAL UVSQ

HAL-Rennes 1