22 research outputs found

    Genozip - a universal extensible genomic data compressor

    We present Genozip, a universal and fully featured compression software for genomic data. Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities - universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility. Genozip delivers high-performance compression for widely-used genomic data formats in genomics research, namely FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe formats. Our test results show that Genozip is fast and achieves greatly improved compression ratios, even when the files are already compressed. Further, Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs. With this, we intend for Genozip to be a general-purpose compression platform where researchers can implement compression for additional file formats, as well as new codecs for data types or fields within files, in the future. We anticipate that this will ultimately increase the visibility and adoption of these algorithms by the user community, thereby accelerating further innovation in this space. Availability: Genozip is written in C. The code is open-source and available on GitHub (https://github.com/divonlan/genozip). The package is free for non-commercial use. It is distributed as a Docker container on DockerHub and through the conda package manager. Genozip is tested on Linux, Mac, and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.
    Divon Lan, Ray Tobler, Yassine Souilmi, Bastien Llamas

    Scalable and cost-effective NGS genotyping in the cloud

    Background: While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the tens of dollars. Results: We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Conclusions: Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.
    Yassine Souilmi, Alex K. Lancaster, Jae-Yoon Jung, Ettore Rizzo, Jared B. Hawkins, Ryan Powles, Saaïd Amzazi, Hassan Ghazal, Peter J. Tonellato and Dennis P. Wal

    The role of genetic selection and climatic factors in the dispersal of anatomically modern humans out of Africa.

    The evolutionarily recent dispersal of anatomically modern humans (AMH) out of Africa (OoA) and across Eurasia provides a unique opportunity to examine the impacts of genetic selection as humans adapted to multiple new environments. Analysis of ancient Eurasian genomic datasets (~1,000 to 45,000 y old) reveals signatures of strong selection, including at least 57 hard sweeps after the initial AMH movement OoA, which have been obscured in modern populations by extensive admixture during the Holocene. The spatiotemporal patterns of these hard sweeps provide a means to reconstruct early AMH population dispersals OoA. We identify a previously unsuspected extended period of genetic adaptation lasting ~30,000 y, potentially in the Arabian Peninsula area, prior to a major Neandertal genetic introgression and subsequent rapid dispersal across Eurasia as far as Australia. Consistent functional targets of selection initiated during this period, which we term the Arabian Standstill, include loci involved in the regulation of fat storage, neural development, skin physiology, and cilia function. Similar adaptive signatures are also evident in introgressed archaic hominin loci and modern Arctic human groups, and we suggest that this signal represents selection for cold adaptation. Surprisingly, many of the candidate selected loci across these groups appear to directly interact and coordinately regulate biological processes, with a number associated with major modern diseases including the ciliopathies, metabolic syndrome, and neurodegenerative disorders. This expands the potential for ancestral human adaptation to directly impact modern diseases, providing a platform for evolutionary medicine.
    Raymond Tobler, Yassine Souilmi, Christian D. Huber, and Alan Cooper

    Admixture has obscured signals of historical hard sweeps in humans (advance online)

    The role of natural selection in shaping biological diversity is an area of intense interest in modern biology. To date, studies of positive selection have primarily relied on genomic datasets from contemporary populations, which are susceptible to confounding factors associated with complex and often unknown aspects of population history. In particular, admixture between diverged populations can distort or hide prior selection events in modern genomes, though this process is not explicitly accounted for in most selection studies despite its apparent ubiquity in humans and other species. Through analyses of ancient and modern human genomes, we show that previously reported Holocene-era admixture has masked more than 50 historic hard sweeps in modern European genomes. Our results imply that this canonical mode of selection has probably been underappreciated in the evolutionary history of humans and suggest that our current understanding of the tempo and mode of selection in natural populations may be inaccurate.

    genozip: a fast and efficient compression tool for VCF files

    genozip is a new lossless compression tool for VCF (Variant Call Format) files. By applying field-specific algorithms and fully utilizing the available computational hardware, genozip achieves the highest compression ratios amongst existing lossless compression tools known to the authors, at speeds comparable with the fastest multi-threaded compressors. genozip is freely available to non-commercial users. It can be installed via conda-forge, Docker Hub, or downloaded from github.com/divonlan/genozip. Supplementary data are available at Bioinformatics online.
    Divon Lan, Raymond Tobler, Yassine Souilmi and Bastien Llamas

    Systematic benchmark of ancient DNA read mapping

    The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30-80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read mapping software packages (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
    Adrien Oliva, Raymond Tobler, Alan Cooper, Bastien Llamas and Yassine Souilmi

    Ancient DNA studies in pre-Columbian Mesoamerica

    Mesoamerica is a historically and culturally defined geographic area comprising current central and south Mexico, Belize, Guatemala, El Salvador, and border regions of Honduras, western Nicaragua, and northwestern Costa Rica. The permanent settling of Mesoamerica was accompanied by the development of agriculture and pottery manufacturing (2500 BCE–150 CE), which led to the rise of several cultures connected by commerce and farming. Hence, Mesoamericans probably carried an invaluable genetic diversity partly lost during the Spanish conquest and the subsequent colonial period. Mesoamerican ancient DNA (aDNA) research has mainly focused on the study of mitochondrial DNA in the Basin of Mexico and the Yucatán Peninsula and its nearby territories, particularly during the Postclassic period (900–1519 CE). Despite limitations associated with the poor preservation of samples in tropical areas, recent methodological improvements pave the way for a deeper analysis of Mesoamerica. Here, we review how aDNA research has helped discern population dynamics patterns in the pre-Columbian Mesoamerican context, how it supports archaeological, linguistic, and anthropological conclusions, and finally, how it offers new working hypotheses.
    Xavier Roca-Rada, Yassine Souilmi, João C. Teixeira and Bastien Llamas

    Epidemiological characteristics of childhood urolithiasis in Morocco

    Objectives: Given the rising worldwide incidence of pediatric urolithiasis and the scarcity of studies of this pathology in Morocco, we assessed whether the epidemiological characteristics of pediatric urolithiasis resemble those reported in developed countries. We also sought to estimate the prevalence of this pathology among children at Hassan II University-Hospital of Fez. Subjects and methods: Between January 2003 and November 2013, 104 pediatric patients with urolithiasis presented to Hassan II University-Hospital of Fez. Eighty-one were boys and 23 were girls. Patients were referred from different regions of Morocco. Results: Of the 104 children diagnosed with urolithiasis, 5 had a positive family history of renal stones, and 12 had recurrent stones (12%). Their ages ranged from 8 months to 15 years, with a mean age of 7.86 ± 4 years. The sex ratio was 3.5:1 boys to girls. Clinical presentations were dominated by micturition disorder (59%), abdominal or flank pain (28%), nephritic colic (22%), hematuria (22%) and urinary tract infection (13%). Stones were located in the upper urinary tract in 62.5% of cases. Stones were treated by surgery in 89 cases (89%), and with ESWL in only 2 cases (2%). Over the study period, the calculated prevalence of childhood urolithiasis was 0.83%. Conclusions: This preliminary study covers only one region of the country, so more epidemiological analyses should be done. Stone analysis should be performed more frequently, and patients should present at earlier stages, before any development of renal failure.