54 research outputs found

    A resource-frugal probabilistic dictionary and applications in (meta)genomics

    The genomic and metagenomic fields, which generate huge sets of short genomic sequences, have brought their own share of high-performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements, and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit either from the fundamental data structure we provide or from the applications developed on top of it. Comment: Submitted to PSC 201
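    The abstract does not describe the structure itself, but the general idea of a resource-frugal probabilistic dictionary can be sketched as a table keyed by small hash fingerprints rather than full keys, trading a controlled false-positive rate for memory. The `ProbabilisticDict` class and its parameters below are illustrative assumptions, not the paper's actual design:

    ```python
    import hashlib

    def fingerprint(key: str, bits: int = 32) -> int:
        """Collapse a key to a small fingerprint; rare collisions cause false positives."""
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") & ((1 << bits) - 1)

    class ProbabilisticDict:
        """Stores fingerprints instead of full keys to save space.

        A query may return a wrong value with probability roughly
        n / 2**bits, where n is the number of inserted keys.
        """
        def __init__(self, bits: int = 32):
            self.bits = bits
            self.table = {}

        def insert(self, key: str, value):
            self.table[fingerprint(key, self.bits)] = value

        def get(self, key: str, default=None):
            return self.table.get(fingerprint(key, self.bits), default)

    d = ProbabilisticDict()
    d.insert("ACGTACGT", 1)
    d.insert("TTTTAAAA", 2)
    print(d.get("ACGTACGT"))  # 1, barring a fingerprint collision
    ```

    The space saving comes from never storing the keys themselves, which is why membership answers are only probabilistic.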

    Selected abstracts of “Bioinformatics: from Algorithms to Applications 2020” conference

    This document only contains the abstract of the presentation. UCR::Vicerrectoría de Investigación::Unidades de Investigación::Ciencias de la Salud::Centro de Investigación en Enfermedades Tropicales (CIET); UCR::Vicerrectoría de Docencia::Salud::Facultad de Microbiología

    Multiple Comparative Metagenomics using Multiset k-mer Counting

    Background. Large-scale metagenomic projects aim to extract biodiversity knowledge across different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute, in a few hours, both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques that rely on all-versus-all sequence alignment strategies or on taxonomic profiling.
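    Simka's core idea, computing standard ecological distances with species counts replaced by k-mer counts, can be shown in a few lines. The Bray-Curtis computation below is a toy, in-memory illustration of the principle, not Simka's parallel, disk-based implementation:

    ```python
    from collections import Counter

    def kmer_counts(seq: str, k: int = 4) -> Counter:
        """Count every overlapping k-mer in a sequence."""
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def bray_curtis(a: Counter, b: Counter) -> float:
        """Quantitative ecological distance, here on k-mer counts
        rather than the usual species counts."""
        shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
        total = sum(a.values()) + sum(b.values())
        return 1.0 - 2.0 * shared / total

    s1 = kmer_counts("ACGTACGTACGT")
    s2 = kmer_counts("ACGTTTTTACGT")
    print(round(bray_curtis(s1, s2), 3))  # 0.667
    ```

    Any abundance-based distance (Jaccard, Whittaker, etc.) can be plugged in the same way, since each sample is reduced to a k-mer count vector.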

    Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

    The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
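    Uniformly covering a chosen fraction of the k-mer space can be approximated with a simple hash-threshold filter, in the spirit of FracMinHash-style scalable sketching. The sketch below is a deliberately simplified toy, with no super-k-mer encoding as SuperSampler actually uses; it only shows how a fixed fraction of k-mer space, shared by all datasets, yields comparable sketches regardless of dataset size:

    ```python
    import hashlib

    def h(kmer: str) -> float:
        """Map a k-mer to a pseudo-uniform value in [0, 1)."""
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def fractional_sketch(seq: str, k: int = 21, fraction: float = 0.25) -> set:
        """Keep only k-mers whose hash falls below `fraction`: a uniform
        cover of that fraction of k-mer space, identical across datasets."""
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        return {m for m in kmers if h(m) < fraction}

    def containment(a: set, b: set) -> float:
        """Estimated containment of a in b, computed on the sketches."""
        return len(a & b) / len(a) if a else 0.0

    # A periodic sequence has only 4 distinct 21-mers, all kept at fraction=1.0.
    print(len(fractional_sketch("ACGT" * 20, k=21, fraction=1.0)))  # 4
    ```

    Because the same hash threshold selects the same k-mers in every dataset, two sketches remain directly comparable even when the underlying datasets differ widely in size, which is what fixed-size sketches struggle with.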

    Adoption of big data analytics and its impact on organizational performance in higher education mediated by knowledge management

    Due to the SARS-CoV-2 pandemic, higher education institutions are challenged to continue providing quality teaching, consulting, and research production through virtual education environments. In this context, a large volume of data is being generated, and technologies such as big data analytics are needed to create opportunities for open innovation by obtaining valuable knowledge. The purpose of this paper was to investigate the factors that influence the adoption of big data analytics, as well as to evaluate the relationship it has with performance and knowledge management, taking into consideration that this technology is in its initial stages and that previous research has provided varied results depending on the sector in focus. To address these challenges, a theoretical framework was developed to empirically test the relationship of these variables. A total of 265 members of universities in Latin America were surveyed and structural equation modeling was used for hypothesis testing. The findings identified compatibility, an adequate organizational data environment, and external support as factors required to adopt big data analytics, and their positive relationship with knowledge management processes and organizational performance was confirmed. This study provides practical guidance for decision-makers involved in or in charge of defining the implementation strategy of big data analytics in higher education institutions.

    DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics

    Restriction-site Associated DNA Sequencing (RAD-Seq) is a technique characterized by the sequencing of specific loci along the genome. It is widely employed in the field of evolutionary biology since it makes it possible to exploit variant information (mainly Single Nucleotide Polymorphisms, SNPs) from entire populations at a reduced cost. Common RAD-dedicated tools, such as STACKS or IPyRAD, are based on all-vs-all read alignments, which require considerable time and computing resources. We present an original method, DiscoSnp-RAD, that avoids this pitfall: variants are detected by exploiting specific parts of the assembly graph built from the reads, hence avoiding all-vs-all read alignments. We tested the implementation on simulated datasets of increasing size, up to 1,000 samples, and on real RAD-Seq data from 259 specimens of Chiastocheta flies, morphologically assigned to seven species. All individuals were successfully assigned to their species using both STRUCTURE and maximum-likelihood phylogenetic reconstruction. Moreover, the identified variants succeeded in revealing a within-species genetic structure linked to the geographic distribution. Furthermore, our results show that DiscoSnp-RAD is significantly faster than state-of-the-art tools. Overall, DiscoSnp-RAD is suitable for identifying variants from RAD-Seq data; it does not require time-consuming parameterization steps, and it stands out from other tools due to its completely different principle, making it substantially faster, in particular on large datasets.
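    The principle of detecting variants from the assembly graph rather than from read alignments can be illustrated with a minimal de Bruijn graph sketch: a SNP between two alleles appears as a "bubble", a node branching into two paths of equal length that reconverge. The code below is a simplified illustration of that idea only, with no reverse complements, sequencing-error filtering, or coverage thresholds; it is not the actual DiscoSnp-RAD algorithm:

    ```python
    from collections import defaultdict

    def build_dbg(reads, k=5):
        """De Bruijn graph: (k-1)-mer nodes linked by observed k-mers."""
        succ = defaultdict(set)
        for r in reads:
            for i in range(len(r) - k + 1):
                kmer = r[i:i + k]
                succ[kmer[:-1]].add(kmer[1:])
        return succ

    def find_snp_bubbles(succ, path_len):
        """A node branching into two non-branching paths that reconverge
        after `path_len` steps (k-1 for an isolated SNP) is a candidate bubble."""
        bubbles = []
        for node, outs in list(succ.items()):
            if len(outs) != 2:
                continue
            start_a, start_b = sorted(outs)
            pa, pb = start_a, start_b
            for _ in range(path_len):
                na, nb = succ.get(pa, set()), succ.get(pb, set())
                if len(na) != 1 or len(nb) != 1:
                    break  # path branches or dead-ends: not a simple bubble
                pa, pb = next(iter(na)), next(iter(nb))
            if pa == pb:
                bubbles.append((node, start_a, start_b, pa))
        return bubbles

    reads = ["ACGTTAGCATC", "ACGTTCGCATC"]  # two alleles differing by one SNP
    succ = build_dbg(reads, k=5)
    print(find_snp_bubbles(succ, path_len=4))  # [('CGTT', 'GTTA', 'GTTC', 'GCAT')]
    ```

    Scanning graph nodes for such bubbles touches each node a constant number of times, which is why a graph-based approach sidesteps the quadratic cost of all-vs-all read alignment.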