13 research outputs found
Recommended from our members
Seedability: optimizing alignment parameters for sensitive sequence comparison
Data availability:
The data underlying this article are available either in https://github.com/lorrainea/Seedability or in the Ensembl database at https://www.ensembl.org, where they can be accessed using the gene names ENSPTRG00000044036 and ENSG00000174236, or in the NCBI database at https://www.ncbi.nlm.nih.gov, where they can be found using the reference sequence NC_000001.11.
Motivation:
Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
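To make the notion of seeds concrete, the following minimal Python sketch counts the distinct exact k-mers shared by two sequences for several values of k. It is only an illustration of why the choice of k matters for divergent sequences. It is not Seedability's estimation procedure and not Minimap2's seeding scheme, and the function names and toy sequences are hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seed_count(a: str, b: str, k: int) -> int:
    """Count distinct k-mers occurring in both a and b (candidate seeds)."""
    return len(kmers(a, k) & kmers(b, k))


# Two toy sequences that differ by a single substitution (assumed data).
a = "ACGTACGTGA"
b = "ACGTTCGTGA"

for k in (3, 5, 7):
    # Larger k leaves fewer shared seeds between divergent sequences.
    print(k, shared_seed_count(a, b, k))
# prints:
# 3 4
# 5 1
# 7 0
```

In this toy case a single mismatch already removes every shared 7-mer, which is the kind of situation in which a smaller k can recover an alignment that a fixed default misses.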
Results:
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
Availability and implementation:
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
R.C. was supported by ANR Full-RNA, SeqDigger, Inception, and PRAIRIE grants (ANR-22-CE45-0007, ANR-19-CE45-0008, PIA/ANR16-CONV-0005, ANR-19-P3IA-0001). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 872539 (PANGAIA) and 956229 (ALPACA).
String Sanitization: A Combinatorial Approach
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's loc
Fast Indexes for Gapped Pattern Matching
We describe indexes for searching large data sets for variable-length-gapped
(VLG) patterns. VLG patterns are composed of two or more subpatterns, between
each adjacent pair of which is a gap-constraint specifying upper and lower
bounds on the distance allowed between subpatterns. VLG patterns have numerous
applications in computational biology (motif search), information retrieval
(e.g., for language models, snippet generation, machine translation) and
capture a useful subclass of the regular expressions commonly used in practice
for searching source code. Our best approach provides search speeds several
times faster than prior art across a broad range of patterns and texts.
Comment: This research is supported by Academy of Finland through grant 319454
and has received funding from the European Union's Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie Actions
H2020-MSCA-RISE-2015 BIRDS GA No. 69094
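As an illustration of the variable-length-gapped (VLG) patterns described in the abstract above, the sketch below translates a list of subpatterns and gap constraints into a Python regular expression and reports every text position where a match starts. This is a plain regex scan used as a baseline for understanding the pattern class. It is not one of the indexes proposed in the paper, and the helper names are hypothetical.

```python
import re


def vlg_regex(subpatterns, gaps):
    """Build a regex for subpatterns separated by gaps given as (lower, upper) bounds."""
    if len(gaps) != len(subpatterns) - 1:
        raise ValueError("need exactly one gap constraint between adjacent subpatterns")
    parts = [re.escape(subpatterns[0])]
    for (lo, hi), sub in zip(gaps, subpatterns[1:]):
        parts.append(".{%d,%d}" % (lo, hi))  # any lo..hi characters
        parts.append(re.escape(sub))
    # A lookahead makes every match zero-width, so overlapping starts are reported.
    return re.compile("(?=" + "".join(parts) + ")", re.DOTALL)


def vlg_match_starts(text, subpatterns, gaps):
    """Return all start positions at which the VLG pattern matches in text."""
    return [m.start() for m in vlg_regex(subpatterns, gaps).finditer(text)]


text = "abXXcd_abcd_abXXXXXcd"
print(vlg_match_starts(text, ["ab", "cd"], [(1, 3)]))  # [0]
print(vlg_match_starts(text, ["ab", "cd"], [(0, 5)]))  # [0, 7, 12]
```

A scan like this takes time proportional to the text length for every query, which is the cost an index is meant to avoid on large data sets.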
CNEFinder: Finding conserved non-coding elements in genomes
Availability and implementation:
Free software under the terms of the GNU GPL (https://github.com/lorrainea/CNEFinder).
Motivation:
Conserved non-coding elements (CNEs) represent an enigmatic class of genomic elements which, despite being extremely conserved across evolution, do not encode for proteins. Their functions are still largely unknown. Thus, there exists a need to systematically investigate their roles in genomes. Towards this direction, identifying sets of CNEs in a wide range of organisms is an important first step. Currently, there are no tools published in the literature for systematically identifying CNEs in genomes.
Results:
We fill this gap by presenting CNEFinder, a tool for identifying CNEs between two given DNA sequences with user-defined criteria. The results presented here show the tool's ability to identify CNEs accurately and efficiently. CNEFinder is based on a k-mer technique for computing maximal exact matches. The tool thus does not require or compute whole-genome alignments or indexes, such as the suffix array or the Burrows-Wheeler Transform (BWT), which makes it flexible to use on a wide scale.
This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/M50788X/1].
Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast
A preprint version of this article is available at arXiv:2310.09023v1 [cs.DS] (https://arxiv.org/abs/2310.09023) under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). It has not been certified by peer review.
Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time, in the worst case, or in O(n) time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log (n/b) \rfloor + 1} - 1$ is in O(b / log b), matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.
SPP and HV are supported by the PANGAIA project (GA 872539). SPP is supported by the ALPACA project (GA 956229). HV is supported by a Constance van Eeden Fellowship.
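To clarify what the sparse suffix array and sparse LCP array contain, here is a deliberately naive Python sketch that sorts a chosen subset of suffixes by direct string comparison and then computes the longest common prefix of neighbouring suffixes. It only defines the output. It is not either of the algorithms from the article, and it does not achieve their O(n log b) time or 8b + o(b) space bounds. All function names are hypothetical.

```python
def sparse_suffix_array(text: str, positions: list) -> list:
    """Sort the given suffix start positions by the suffixes they denote (naive)."""
    return sorted(positions, key=lambda i: text[i:])


def lcp(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n, k = len(text), 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_lcp_array(text: str, ssa: list) -> list:
    """LCP of each suffix with its predecessor in sorted order; the first entry is 0."""
    if not ssa:
        return []
    return [0] + [lcp(text, ssa[r - 1], ssa[r]) for r in range(1, len(ssa))]


text = "banana"
ssa = sparse_suffix_array(text, [1, 3, 5])  # suffixes "anana", "ana", "a"
print(ssa)                          # [5, 3, 1]
print(sparse_lcp_array(text, ssa))  # [0, 1, 3]
```

Each comparison in the sort may inspect up to n characters, so this baseline can be much slower than the bounds stated in the abstract. Its only purpose is to show the expected output for a small input.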
IsoXpressor: A tool to assess transcriptional activity within isochores
Data Availability: The data underlying this article are available in the article and in its Supplementary Material online at: https://academic.oup.com/gbe/article/12/9/1573/5898630#207856986.
Genomes are characterized by large regions of homogeneous base compositions known as isochores. The latter are divided into GC-poor and GC-rich classes linked to distinct functional and structural properties. Several studies have addressed how isochores shape function and structure. To aid in this important subject, we present IsoXpressor, a tool designed for the analysis of the functional property of transcription within isochores. IsoXpressor allows users to process RNA-Seq data in relation to the isochores, and it can be employed to investigate any biological question of interest for any species. The results presented herein as proof of concept are focused on the preimplantation process in Homo sapiens (human) and Macaca mulatta (rhesus monkey).
Subframe Temporal Alignment of Non-Stationary Cameras
This paper studies the problem of estimating the sub-frame temporal offset between unsynchronized, non-stationary cameras. Based on motion trajectory correspondences, the estimation is done in two steps. First, we propose an algorithm to robustly estimate the frame-accurate offset by analyzing the trajectories and matching their characteristic time patterns. Using this result, we then show how the estimation of the fundamental matrix between two cameras can be reformulated to yield the sub-frame-accurate offset from nine correspondences. We verify the robustness and performance of our approach on synthetic data as well as on real video sequences.
String sanitization: a combinatorial approach
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct