54 research outputs found
A resource-frugal probabilistic dictionary and applications in (meta)genomics
The genomic and metagenomic fields, which generate huge sets of short genomic
sequences, have brought their own share of high-performance computing problems. To extract
relevant pieces of information from the huge data sets generated by current
sequencing techniques, one must rely on extremely scalable methods and
solutions. Indexing billions of objects is a task considered too expensive
while being a fundamental need in this field. In this paper we propose a
straightforward indexing structure that scales to billions of elements, and we
propose two direct applications in genomics and metagenomics. We show that our
proposal solves problem instances for which no other known solution scales up.
We believe that many tools and applications could benefit from either the
fundamental data structure we provide or from the applications developed from
this structure.
Comment: Submitted to PSC 201
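The core idea of a resource-frugal probabilistic dictionary can be sketched as follows: instead of storing full keys, only a short hash fingerprint per key is kept, trading a small, controllable false-positive rate for a large memory saving. The class below is a hypothetical illustration of this principle, not the paper's exact structure; `ProbabilisticDictionary` and its parameters are names introduced here.

```python
import hashlib

class ProbabilisticDictionary:
    """Maps keys (e.g. k-mers) to values while storing only a short
    hash fingerprint per key: a key that was never inserted may
    collide with a present fingerprint and return a wrong value,
    with probability roughly 2**-fp_bits per stored key."""

    def __init__(self, fingerprint_bits=32):
        self.fp_bits = fingerprint_bits
        self.table = {}  # fingerprint -> value

    def _fingerprint(self, key: str) -> int:
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") & ((1 << self.fp_bits) - 1)

    def insert(self, key: str, value) -> None:
        self.table[self._fingerprint(key)] = value

    def query(self, key: str):
        # Never errs for an inserted key (unless two inserted keys
        # share a fingerprint); may return a false positive for an
        # absent key.
        return self.table.get(self._fingerprint(key))
```

Querying an inserted k-mer returns its value; querying an absent k-mer usually returns `None`, except on a rare fingerprint collision.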
Selected abstracts of “Bioinformatics: from Algorithms to Applications 2020” conference
This document contains only the abstract of the presentation. UCR::Vicerrectoría de Investigación::Unidades de Investigación::Ciencias de la Salud::Centro de Investigación en Enfermedades Tropicales (CIET) UCR::Vicerrectoría de Docencia::Salud::Facultad de Microbiología
Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large scale metagenomic projects aim to extract biodiversity
knowledge between different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomical or functional assignation rely on a small subset of the sequences
that can be associated to known organisms. On the other hand, de novo methods,
that compare the whole sets of sequences, either do not scale up on ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts by
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
parallel k-mer counting strategy on multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute in a few hours both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion reads).
We also demonstrate that k-mer-level analysis of metagenomes is highly
correlated with extremely precise de novo comparison techniques that rely on
all-versus-all sequence alignment or on taxonomic profiling.
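Replacing species counts with k-mer counts in a standard ecological distance, as Simka does, can be illustrated with a minimal Bray-Curtis computation. This is a toy single-sequence sketch; Simka's parallel multi-dataset k-mer counting is far more elaborate.

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer occurrence in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def bray_curtis(c1: Counter, c2: Counter) -> float:
    """Quantitative Bray-Curtis dissimilarity, 1 - 2C/(S1+S2),
    where C is the summed shared (minimum) abundance and S1, S2
    are the total counts of each sample. Here 'species' are k-mers."""
    shared = sum(min(c1[km], c2[km]) for km in c1.keys() & c2.keys())
    total = sum(c1.values()) + sum(c2.values())
    return 1.0 - 2.0 * shared / total
```

Two identical samples yield a distance of 0, two samples with no k-mer in common yield 1.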
Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching
The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets.
Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes.
By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings.
In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
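The fractional-coverage idea, keeping a uniform fraction of the k-mer space selected by a fixed hash threshold, can be sketched as below. This is a FracMinHash-style simplification introduced here for illustration; SuperSampler's super-k-mer encoding adds a space-efficient exact representation on top of such a selection.

```python
import hashlib

HASH_SPACE = 2 ** 64

def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def fractional_sketch(seq: str, k: int, fraction: float) -> set:
    """Keep exactly the k-mers whose hash lands in the first
    `fraction` of the hash space: every dataset selects the same
    subset of the k-mer space, so sketches stay comparable while
    their size scales with input diversity."""
    threshold = int(fraction * HASH_SPACE)
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if kmer_hash(seq[i:i + k]) < threshold}

def containment(sketch_a: set, sketch_b: set) -> float:
    """Estimated fraction of A's k-mers that are present in B."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```

Because the selection depends only on the hash value, two divergent datasets produce sketches of different sizes yet remain directly comparable via containment.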
Genotyping Structural Variations using Long Read data
National audience
Adoption of big data analytics and its impact on organizational performance in higher education mediated by knowledge management
Due to SARS-CoV-2 pandemic, higher education institutions are challenged to continue
providing quality teaching, consulting, and research production through virtual education
environments. In this context, a large volume of data is being generated, and technologies
such as big data analytics are needed to create opportunities for open innovation by obtaining
valuable knowledge. The purpose of this paper was to investigate the factors that influence the
adoption of big data analytics, as well as to evaluate the relationship it has with performance
and knowledge management, taking into consideration that this technology is in its initial
stages and that previous research has provided varied results depending on the sector in focus.
To address these challenges, a theoretical framework was developed to empirically test the
relationship of these variables. A total of 265 members of universities in Latin America were
surveyed and structural equation modeling was used for hypothesis testing. The findings
identified compatibility, an adequate organizational data environment, and external support as
factors required to adopt big data analytics and their positive relationship is tested with
knowledge management processes and organizational performance. This study provides
practical guidance for decision-makers involved in or in charge of defining the
implementation strategy of big data analytics in higher education institutions.
DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics
International audience
Restriction site Associated DNA Sequencing (RAD-Seq) is a technique characterized by the sequencing of specific loci along the genome. It is widely employed in evolutionary biology because it allows variant information (mainly Single Nucleotide Polymorphisms, SNPs) to be exploited from entire populations at a reduced cost. Common RAD-dedicated tools, such as STACKS or IPyRAD, are based on all-vs-all read alignments, which require considerable time and computing resources. We present an original method, DiscoSnp-RAD, that avoids this pitfall: variants are detected by exploiting specific parts of the assembly graph built from the reads, hence preventing all-vs-all read alignments. We tested the implementation on simulated datasets of increasing size, up to 1,000 samples, and on real RAD-Seq data from 259 specimens of Chiastocheta flies, morphologically assigned to seven species. All individuals were successfully assigned to their species using both STRUCTURE and Maximum Likelihood phylogenetic reconstruction. Moreover, the identified variants succeeded in revealing a within-species genetic structure linked to geographic distribution. Furthermore, our results show that DiscoSnp-RAD is significantly faster than state-of-the-art tools. Overall, DiscoSnp-RAD is suitable for identifying variants from RAD-Seq data, does not require time-consuming parameterization steps, and stands out from other tools due to its completely different principle, making it substantially faster, in particular on large datasets.
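Graph-based variant detection can be illustrated with a drastically simplified heuristic: group k-mers that are identical except at their middle base; groups of size greater than one share both flanks and correspond to candidate isolated SNPs. Note that DiscoSnp-RAD actually detects bubbles in a de Bruijn graph built from the reads; the masked-k-mer grouping below is only a hypothetical sketch of the underlying principle.

```python
from collections import defaultdict

def kmer_set(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def snp_candidates(kmers: set, k: int) -> list:
    """Group k-mers by their sequence with the middle base masked;
    k-mers in the same group differ only at the middle position,
    i.e. they form a candidate isolated-SNP 'bubble'."""
    mid = k // 2
    groups = defaultdict(list)
    for km in kmers:
        groups[km[:mid] + "*" + km[mid + 1:]].append(km)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

On two alleles of a locus differing by one substitution, the k-mers spanning the variant position fall into the same masked group, while k-mers from non-variant regions stay unpaired.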