
13 research outputs found


Seedability: optimizing alignment parameters for sensitive sequence comparison

Author: Ayad LAK
Chikhi R
Pissis SP
Publication venue: Oxford University Press (OUP)
Publication date: 12/08/2023

Data availability: The data underlying this article are available either in https://github.com/lorrainea/Seedability or in the Ensembl database at https://www.ensembl.org, and can be accessed using the gene names ENSPTRG00000044036 and ENSG00000174236 or in the NCBI database at https://www.ncbi.nlm.nih.gov and can be found using the reference sequence NC_000001.11.

Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.

Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.

Availability and implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0).

R.C. was supported by ANR Full-RNA, SeqDigger, Inception, and PRAIRIE grants (ANR-22-CE45-0007, ANR-19-CE45-0008, PIA/ANR16-CONV-0005, ANR-19-P3IA-0001). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 872539 (PANGAIA) and 956229 (ALPACA).
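As a rough illustration of the seed idea described in the abstract above, the following sketch counts the exact k-mers shared by two sequences for several values of k. It is not Seedability's estimation procedure or Minimap2's seeding scheme; the function names `kmers` and `shared_seeds` and the two toy sequences are assumptions made purely for this example.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seeds(a: str, b: str, k: int) -> set:
    """Return the exact k-mers that occur in both sequences."""
    return kmers(a, k) & kmers(b, k)


# Two short, divergent toy sequences (hypothetical data).
a = "ACGTACGTGA"
b = "ACGAACGTGT"

for k in (3, 5, 7):
    # Larger k leaves fewer exact hits between divergent sequences.
    print(k, len(shared_seeds(a, b, k)))
```

On these toy inputs the number of shared seeds drops as k grows, which is the sensitivity trade-off the abstract refers to when it questions the use of a single fixed k.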

Brunel University Research Archive

String Sanitization: A Combinatorial Approach

Author: B Cazaux
CC Aggarwal
D Pissinger
J Gallant
M Crochemore
O Abul
R Grossi
SP Pissis
VS Verykios
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/04/2020

Crossref

CWI's Institutional Repository

Fast Indexes for Gapped Pattern Matching

Author: D Knuth
G Navarro
J Bader
K Fredriksson
M Crochemore
M Lewenstein
M Morgante
P Bille
R Saikkonen
SP Pissis
T Crawford
T Haapasalo
U Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/02/2020

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.

Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
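To make the definition of a VLG pattern in the abstract above concrete, here is a minimal sketch that translates subpatterns and gap constraints into a regular expression and scans the text with Python's standard `re` module. This is a plain scan, not the index-based approach the paper describes; the helper names `vlg_regex` and `vlg_starts` and the sample text are assumptions for illustration only.

```python
import re


def vlg_regex(subpatterns, gaps):
    """Build a regex for literal subpatterns separated by bounded gaps.

    gaps[i] = (lo, hi) bounds the number of characters allowed between
    subpatterns[i] and subpatterns[i + 1].
    """
    if len(gaps) != len(subpatterns) - 1:
        raise ValueError("need exactly one gap constraint per adjacent pair")
    parts = [re.escape(subpatterns[0])]
    for (lo, hi), sub in zip(gaps, subpatterns[1:]):
        parts.append(f".{{{lo},{hi}}}")  # e.g. ".{1,3}" for a gap of 1 to 3
        parts.append(re.escape(sub))
    # A lookahead reports every start position, including overlapping ones.
    return re.compile("(?=" + "".join(parts) + ")", re.DOTALL)


def vlg_starts(text, subpatterns, gaps):
    """Return all positions where an occurrence of the VLG pattern starts."""
    return [m.start() for m in vlg_regex(subpatterns, gaps).finditer(text)]


text = "abXXcd abcd abXXXXXcd"
print(vlg_starts(text, ["ab", "cd"], [(1, 3)]))
print(vlg_starts(text, ["ab", "cd"], [(0, 5)]))
```

With the gap bounded to 1 to 3 characters only the first occurrence qualifies; widening the bounds to 0 to 5 also admits the occurrences with no gap and with a five-character gap.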

arXiv.org e-Print Archive

Crossref


CNEFinder: Finding conserved non-coding elements in genomes

Author: Ayad LAK
Pissis SP
Polychronopoulos D
Publication venue: Oxford University Press
Publication date: 01/09/2018

Availability and implementation: Free software under the terms of the GNU GPL (https://github.com/lorrainea/CNEFinder).

Motivation: Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. Towards this direction, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes.

Results: We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool’s ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows-Wheeler Transform (BWT), which makes it flexible to use on a wide scale.

This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/M50788X/1].

Brunel University Research Archive


Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Author: Ayad LAK
Loukides G
Pissis SP
Verbeek H
Publication venue: Springer Nature
Publication date: 20/12/2023

A preprint version of this article is available at arXiv:2310.09023v1 [cs.DS] (https://arxiv.org/abs/2310.09023) under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). It has not been certified by peer review.

Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time, in the worst case, or in O(n) time, when the total number of suffixes with an LCP value greater than 2^(⌊log(n/b)⌋+1) − 1 is in O(b / log b), matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.

SPP and HV are supported by the PANGAIA project (GA 872539). SPP is supported by the ALPACA project (GA 956229). HV is supported by a Constance van Eeden Fellowship.
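The problem statement in the abstract above can be illustrated with a deliberately naive sketch: sort a chosen subset of suffix start positions by direct string comparison, then compute the longest common prefix (LCP) of each adjacent pair. This baseline is not either of the paper's algorithms and does not achieve their time or space bounds; the function names and the example string are assumptions for illustration.

```python
def sparse_suffix_array(s, positions):
    """Sort the given suffix start positions lexicographically (naive)."""
    return sorted(positions, key=lambda i: s[i:])


def lcp(s, i, j):
    """Length of the longest common prefix of suffixes s[i:] and s[j:]."""
    n = len(s)
    k = 0
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def sparse_lcp_array(s, ssa):
    """LCP of each suffix with its predecessor in sorted order (first is 0)."""
    if not ssa:
        return []
    return [0] + [lcp(s, ssa[r - 1], ssa[r]) for r in range(1, len(ssa))]


s = "banana"
ssa = sparse_suffix_array(s, [0, 2, 4])  # suffixes "banana", "nana", "na"
print(ssa, sparse_lcp_array(s, ssa))
```

Each comparison here may inspect up to n characters, so the sketch serves only to show what the two output arrays contain.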

Brunel University Research Archive


IsoXpressor: A tool to assess transcriptional activity within isochores

Author: Arhondakis S
Ayad LAK
Dourou A-M
Pissis SP
Publication venue: Oxford University Press on behalf of the Society for Molecular Biology and Evolution
Publication date: 08/08/2020

Data Availability: The data underlying this article are available in the article and in its Supplementary Material online at: https://academic.oup.com/gbe/article/12/9/1573/5898630#207856986.

Genomes are characterized by large regions of homogeneous base compositions known as isochores. The latter are divided into GC-poor and GC-rich classes linked to distinct functional and structural properties. Several studies have addressed how isochores shape function and structure. To aid in this important subject, we present IsoXpressor, a tool designed for the analysis of the functional property of transcription within isochores. IsoXpressor allows users to process RNA-Seq data in relation to the isochores, and it can be employed to investigate any biological question of interest for any species. The results presented herein as proof of concept are focused on the preimplantation process in Homo sapiens (human) and Macaca mulatta (rhesus monkey).

Brunel University Research Archive

Efficient Pattern Matching in Elastic-Degenerate Texts

Author: A Amir
A Dilthey
B Schieber
D Gusfield
DE Knuth
DM Church
E Ukkonen
EM McCreight
HT Harel
L Huang
MS Rahman
S Maciuca
SP Pissis
Y Liu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/02/2017

Crossref

King's Research Portal

Subframe Temporal Alignment of Non-Stationary Cameras

Author: B Cazaux
CC Aggarwal
D Pissinger
J Gallant
M Crochemore
O Abul
R Grossi
SP Pissis
VS Verykios
Publication venue
Publication date: 01/01/2008

This paper studies the problem of estimating the sub-frame temporal offset between unsynchronized, non-stationary cameras. Based on motion trajectory correspondences, the estimation is done in two steps. First, we propose an algorithm to robustly estimate the frame-accurate offset by analyzing the trajectories and matching their characteristic time patterns. Using this result, we then show how the estimation of the fundamental matrix between two cameras can be reformulated to yield the sub-frame-accurate offset from nine correspondences. We verify the robustness and performance of our approach on synthetic data as well as on real video sequences.

CiteSeerX

Crossref

CWI's Institutional Repository

King's Research Portal

String sanitization: a combinatorial approach

Author: B Cazaux
CC Aggarwal
D Pissinger
J Gallant
M Crochemore
O Abul
R Grossi
SP Pissis
VS Verykios
Publication venue: Springer LNCS
Publication date: 08/06/2019

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct

Crossref

CWI's Institutional Repository

University of Birmingham Research Portal

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

King's Research Portal

Application and Algorithm: Maximal Motif Discovery for Biological Data in a Sliding Window

Author: A-C Leonard
AM Carvalho
CS Iliopoulos
G Pavesi
J van Helden
M Meijer
MS Waterman
N Pisanti
R Grossi
RS Fuller
S Sinha
SP Pissis
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Crossref

King's Research Portal