430 research outputs found

A practical index for approximate dictionary matching with few mismatches

Author: Cisłak Aleksander
Grabowski Szymon
Publication date: 11/02/2016

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
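The Dirichlet (pigeonhole) idea behind the split index can be illustrated with a small sketch: if two equal-length words differ in at most one position and both are cut at the same place, at least one of the two parts must match exactly. The Python code below is only a toy illustration under that assumption (two halves, Hamming distance at most 1). The names `SplitIndex`, `search`, and `hamming` are hypothetical and are not taken from the authors' C++ implementation, which also focuses on data compaction that is not modeled here.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Toy split index for Hamming distance <= 1 (illustrative only)."""

    def __init__(self, words):
        # Key: (word length, which half, text of that half) -> words with that half.
        self.buckets = defaultdict(set)
        for w in words:
            mid = len(w) // 2
            self.buckets[(len(w), 0, w[:mid])].add(w)
            self.buckets[(len(w), 1, w[mid:])].add(w)

    def search(self, query, max_mismatches=1):
        # With at most one mismatch, at least one half matches exactly
        # (pigeonhole principle), so only two buckets need to be verified.
        if max_mismatches > 1:
            raise ValueError("this two-part sketch supports k <= 1 only")
        mid = len(query) // 2
        candidates = (self.buckets.get((len(query), 0, query[:mid]), set())
                      | self.buckets.get((len(query), 1, query[mid:]), set()))
        return sorted(w for w in candidates
                      if hamming(w, query) <= max_mismatches)
```

For example, after `index = SplitIndex(["hello", "help", "hallo", "jelly", "world"])`, the call `index.search("jello")` inspects only the words sharing the prefix `je` or the suffix `llo` and then verifies each candidate with a direct Hamming comparison.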

arXiv.org e-Print Archive

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Indexing large genome collections on a PC

Author: Danek Agnieszka
Deorowicz Sebastian
Grabowski Szymon
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/03/2014

Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) of genomes. Its unique feature is the small index size fitting in a standard computer with 16-32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Availability: Software and Supplementary material: http://sun.aei.polsl.pl/mugi

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

FigShare

Sparse and skew hashing of K-mers

Author: Pibiri G. E.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2022

Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge. Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
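To make the notion of a k-mer minimizer concrete, here is a minimal Python sketch. It assumes the minimizer of a k-mer is its lexicographically smallest substring of length m; practical tools often use a hash-based order instead, and the abstract does not say which order the paper uses. The function names and the parameter `m` are illustrative assumptions, and the minimal perfect hashing step itself is not shown.

```python
from collections import defaultdict


def minimizer(kmer, m):
    """Lexicographically smallest length-m substring of kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bucket_by_minimizer(sequence, k, m):
    """Group the distinct k-mers of `sequence` by their minimizer."""
    buckets = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        buckets[minimizer(kmer, m)].add(kmer)
    return buckets
```

Because overlapping k-mers frequently share a minimizer, grouping by minimizer tends to place neighbouring k-mers in the same bucket. For the sequence `ACGTACGT` with k = 5 and m = 3, three of the four k-mers fall into the bucket of `ACG`.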

PubMed Central

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries

Author: Fan Jason
Khan Jamshed
Patro Rob
Pibiri Giulio Ermanno
Singh Noor Pratap
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Publication date: 01/01/2023

The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space and a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.
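One way to picture a map that costs about one bit per unitig is sketched below in Python. It assumes, as an illustration only, that unitigs are numbered so that unitigs with the same color set are consecutive and that color-set ids are assigned 0, 1, 2, ... in that order. A bit vector then marks the last unitig of each run, and counting the set bits before a unitig (a rank query) yields its color-set id. The function names are hypothetical, and the linear-time `sum` stands in for the constant-time rank structure a succinct implementation would use; this is not Fulgor's actual code.

```python
def build_boundary_bits(unitig_colors):
    """One bit per unitig: 1 if the next unitig has a different color set.

    `unitig_colors[i]` is the color-set id of unitig i; unitigs with the
    same color set are assumed to be stored consecutively.
    """
    n = len(unitig_colors)
    return [1 if i + 1 == n or unitig_colors[i] != unitig_colors[i + 1] else 0
            for i in range(n)]


def color_id(bits, unitig_id):
    """Color-set id of a unitig = number of 1 bits strictly before it (rank)."""
    return sum(bits[:unitig_id])
```

With `unitig_colors = [0, 0, 0, 1, 2, 2]` the bit vector is `[0, 0, 1, 1, 0, 1]`, and `color_id` recovers the color-set id of every unitig from the bits alone, so only one inverted list per distinct color set needs to be stored.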

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication date: 01/01/2007

Edinburgh Research Explorer

Overview of BioCreative II gene normalization

Author: Cohen Aaron M
Cohen K Bretonnel
Divoli Anna
Fluck Juliane
Fundel Katrin
Hakenberg Jörg
Hirschman Lynette
Hsu Chun-Nan
Krauthammer Michael
Lau William W
Leaman Robert
Liu Heng-hui
Liu Hongfang
Lu Zhiyong
Morgan Alexander A
Ruch Patrick
Schuemie Martijn
Sun Chengjie
Torres Rafael
Wang Xinglong
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases
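The evaluation figures quoted above can be checked with a few lines of Python. The sketch assumes that "balanced F-measure" means the usual harmonic mean of precision and recall (F1); the function name `f_measure` is illustrative. It recomputes the pooled "maximum recall" from the reported counts (763 of 785 identifiers) and the F-measure implied by the reported precision of 0.23.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pooled "maximum recall" system: identifiers found / identifiers in the test set.
pooled_recall = 763 / 785
pooled_f = f_measure(0.23, pooled_recall)
```

Here `pooled_recall` rounds to 0.97, matching the abstract, while `pooled_f` comes out at roughly 0.37. High recall alone therefore does not give a competitive F-measure, which is consistent with the top individual systems scoring 0.80 to 0.81.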

Crossref

Springer - Publisher Connector

Fraunhofer-ePrints

PubMed Central

Open Access LMU

EUR Research Repository

Erasmus University Digital Repository

Archive ouverte UNIGE

Selected abstracts of “Bioinformatics: from Algorithms to Applications 2020” conference

Author: García Santamaría Fernando
Molina Mora José Arturo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

The document contains only the abstract of the conference presentation. UCR::Vicerrectoría de Investigación::Unidades de Investigación::Ciencias de la Salud::Centro de Investigación en Enfermedades Tropicales (CIET); UCR::Vicerrectoría de Docencia::Salud::Facultad de Microbiologí

Repositorio Institucional de la Universidad de Costa Rica