12 research outputs found

    Alignment-free Genomic Analysis via a Big Data Spark Platform

    Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignment for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature identifies the development of fast and scalable algorithms for computing them as a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, a Big Data platform for those tasks has not been developed, possibly because of its complexity. Results: We fill this gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions from a recent hallmark benchmarking study. FADE's development and potential impact comprise several novel aspects: (a) a considerable effort in distributed algorithm design, the most tangible result being much faster execution of reference methods such as MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) the ability to support data- and compute-intensive tasks. Regarding the last point, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings extend those of the benchmarking study: the functions that can really be used reduce to a handful of the eighteen included in FADE.

    Generating small mammals mitogenome reference dataset for Malaysian species

    Malaysia’s biological diversity is among the richest in the world but is rapidly declining due to human activities and climate change. Despite the continuing loss of biodiversity, most recently the death of the country's last Sumatran rhino in May 2019, Malaysia's biodiversity data remain poorly characterized and are not systematically documented. Most species data have limited visibility, with few publications, which are mostly restricted to morphological traits and lack genomic data. In line with the country’s National Policy on Biological Diversity 2016-2025 to combat biodiversity loss, and with the global effort to sequence all life by 2028 (Earth Biogenome Project), this study focused on generating a mitogenome reference dataset for Malaysian small mammals. Genomic DNA from two fresh tissue samples (Balionycteris maculata and Callosciurus notatus) was extracted, fragmented to 300 bp, and built into Illumina-compatible libraries using the BEST protocol. About 15 amplification cycles were used during library indexing to maximize data output with high complexity and low clonality, suggesting that rare variants could be easily detected. Prior to sequencing on the BGISEQ-500 platform, the three indexed libraries (including extraction blanks), of approximately 300-400 bp, were pooled to equimolar DNA (<12,000 pmol/L). Approximately 5 gigabases of raw sequence data were generated per sample, comprising whole-genome data (mitochondrial and nuclear DNA). The new high-quality mitogenomes were successfully assembled using MITOBIM and PALEOMIX, with an average size of 16-17 kbp and an average depth of coverage of 140.27x. The detailed pipeline and the challenges of mitogenome assembly for species with and without a reference genome in GenBank are discussed. The mitogenomes were further annotated with their 37 designated genes using MitoZ.
The robust pipeline for mitogenome sequence generation established in this work could be applied to generate more genomic data from the thousands of tissue samples available from local biodiversity key players such as Perbadanan Taman Negara Johor, FRIM and PERHILITAN. Further enrichment of the DNA reference database will strongly improve species detection in invertebrate-derived DNA (iDNA) research for biodiversity assessment, in wildlife forensics to monitor illegal trade in endangered species in this region, and in population genetic studies.

    SPIKEPIPE: A metagenomic pipeline for the accurate quantification of eukaryotic species occurrences and intraspecific abundance change using DNA barcodes or mitogenomes

    The accurate quantification of eukaryotic species abundances from bulk samples remains a key challenge for community ecology and environmental biomonitoring. We resolve this challenge by combining shotgun sequencing, mapping to reference DNA barcodes or to mitogenomes, and three correction factors: (a) a percent-coverage threshold to filter out false positives, (b) an internal-standard DNA spike-in to correct for stochasticity during sequencing, and (c) technical replicates to correct for stochasticity across sequencing runs. The SPIKEPIPE pipeline achieves a strikingly high accuracy of intraspecific abundance estimates (in terms of DNA mass) from samples of known composition (mapping to barcodes R² = .93, mitogenomes R² = .95) and a high repeatability across environmental-sample replicates (barcodes R² = .94, mitogenomes R² = .93). As proof of concept, we sequence arthropod samples from the High Arctic, systematically collected over 17 years, detecting changes in species richness, species-specific abundances, and phenology. SPIKEPIPE provides cost-efficient and reliable quantification of eukaryotic communities.
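
The three correction factors named in this abstract can be illustrated with a minimal sketch. It is not the SPIKEPIPE implementation: the function names, the 0.5 coverage threshold, and the use of a plain mean across replicates are all assumptions made for illustration.

```python
from statistics import mean


def spike_corrected_abundance(mapped_reads, covered_fraction, spike_reads,
                              min_coverage=0.5):
    """Spike-normalised read counts per species for one replicate.

    mapped_reads:     dict, species -> reads mapped to its reference
    covered_fraction: dict, species -> fraction of the reference covered (0..1)
    spike_reads:      reads mapped to the internal-standard spike-in
    min_coverage:     hypothetical threshold, not taken from the paper
    """
    if spike_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    corrected = {}
    for species, reads in mapped_reads.items():
        # (a) percent-coverage filter: drop likely false positives
        if covered_fraction.get(species, 0.0) < min_coverage:
            continue
        # (b) spike-in correction: express reads relative to the standard
        corrected[species] = reads / spike_reads
    return corrected


def average_replicates(replicates):
    """(c) Average technical replicates (assumed stand-in for the paper's
    replicate handling); a species absent from a replicate counts as 0."""
    species = set().union(*replicates)
    return {s: mean(r.get(s, 0.0) for r in replicates) for s in sorted(species)}


rep1 = spike_corrected_abundance({"A": 100, "B": 10}, {"A": 0.9, "B": 0.1}, 50)
rep2 = spike_corrected_abundance({"A": 200, "B": 5}, {"A": 0.8, "B": 0.2}, 50)
combined = average_replicates([rep1, rep2])
```

Here species B is dropped in both replicates because its reference is barely covered, while species A is reported relative to the spike-in and then averaged.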

    Comparing novel shotgun DNA sequencing and state-of-the-art proteomics approaches for authentication of fish species in mixed samples

    Replacement of high-value fish species with cheaper varieties, or mislabelling of food unfit for human consumption, is a global problem violating both consumers' rights and safety. For distinguishing fish species in pure samples, DNA approaches are available; however, authentication and quantification of fish species in mixtures remains a challenge. In the present study, a novel high-throughput shotgun DNA sequencing approach applying masked reference libraries was developed and used for authentication and abundance calculations of fish species in mixed samples. Results demonstrate that the analytical protocol presented here can discriminate and predict relative abundances of different fish species in mixed samples with high accuracy. In addition to DNA analyses, shotgun proteomics tools based on direct spectra comparisons were employed on the same mixture. As with the DNA approach, identifying individual fish species and estimating their relative abundances in a mixed sample were also feasible. Furthermore, the data indicated that DNA sequencing using masked libraries predicted the species composition of the fish mixture with higher specificity, while at the taxonomic family level, relative abundances of the different species in the fish mixture were predicted with slightly higher accuracy using proteomics tools. Taken together, the results demonstrate that both the DNA- and protein-based approaches presented here can be used to efficiently tackle current challenges in feed and food authentication analyses.

    High-Performance approaches for Phylogenetic Placement, and its application to species and diversity quantification

    In recent years, advances in high-throughput gene sequencing, combined with the continued exponential growth and availability of computing resources, have led to fundamentally new analytical approaches in biology. It is now possible to comprehensively sequence the genetic content of entire communities of organisms from individual environmental samples. Such methods are particularly relevant to microbiology, which was previously largely limited to studying those microbes that could be cultivated in the laboratory (i.e., in vitro), covering only a small part of the diversity found in nature. In contrast, high-throughput sequencing now allows the genetic sequences of a microbiome to be captured directly as it occurs in its natural environment (i.e., in situ). A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (query sequences). Phylogenetic methods are commonly used to determine detailed taxonomic relationships between query sequences and trusted reference sequences from already classified organisms. Owing to the high volume (10^6 to 10^9) of query sequences that can be generated from a microbiome sample by high-throughput sequencing, accurate phylogenetic tree reconstruction is no longer computationally feasible. Moreover, currently used sequencing technologies produce comparatively short sequences with limited phylogenetic signal, which makes the inference of phylogenies from these sequences unstable. Another typical goal of microbiome studies is to quantify the diversity within a sample or between several samples. Phylogenetic methods are commonly used for this as well.
These methods often require the inference of a phylogenetic tree comprising either all sequences or a clustered subset of them. As with taxonomic identification, analyses based on this kind of tree inference can yield inaccurate results and/or be computationally infeasible. In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that determines the phylogenetic context of a query sequence within an established reference tree. The procedure typically treats the reference tree as immutable, i.e., the reference tree is not changed before, during, or after the placement of a sequence. This allows a sequence to be placed in time linear in the size of the reference tree. Combined with taxonomic information about the reference sequences, phylogenetic placement thus enables the taxonomic identification of a sequence. In addition, phylogenetic placement enables a variety of further analyses, for example associating the composition of human microbiomes with clinical diagnostic properties. In this dissertation, I present my work on the design, implementation, and improvement of EPA-ng, a high-performance implementation of phylogenetic placement under the maximum-likelihood model. EPA-ng was designed to scale to billions of query sequences and to run on thousands of cores on shared- and distributed-memory systems. EPA-ng also accelerates single-core processing by up to 30-fold compared with its direct competitors.
We recently introduced an additional method for EPA-ng that enables placement in substantially larger reference trees. For this, we use an active memory management approach that trades reduced memory consumption for longer execution times. In addition, I present a massively parallel approach for quantifying the diversity of a sample based on phylogenetic placement results. This software, called SCRAPP, combines current methods for maximum-likelihood phylogenetic inference with methods for molecular species delimitation. The result is a distribution of species counts over the edges of a reference tree for a given sample. Furthermore, I describe a novel approach for clustering placement results, which lets the user reduce the computational effort.

    Skmer: assembly-free and alignment-free sample identification using genome skims

    The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets.
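
To make "alignment-free distance" concrete, the sketch below compares two sequences through their k-mer sets and converts the Jaccard index into a Mash-style distance. It is a generic illustration, not Skmer's estimator, which this abstract does not describe; the function names and the choice of k are assumptions.

```python
import math


def kmers(seq, k):
    """Set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        raise ValueError("both k-mer sets are empty")
    return len(a & b) / len(union)


def mash_distance(j, k):
    """Mash-style distance D = (1/k) * ln((1 + J) / (2J)).

    Returns 1.0 when the sets share no k-mers (J = 0), by convention.
    """
    if j <= 0.0:
        return 1.0
    return math.log((1.0 + j) / (2.0 * j)) / k


k = 3  # tiny k for the toy example; real analyses use much larger values
query = kmers("ACGTACGT", k)
reference = kmers("ACGTTCGT", k)
j = jaccard(query, reference)
d = mash_distance(j, k)
```

No alignment is computed at any point: the two sequences are compared only through the k-mers they share.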

    Proteomic Tools for Food and Feed Authentication

    Due to globally rising demands for food and feed, novel proteinaceous ingredients are being introduced into our food systems on an increasing scale. The introduction of novel ingredients and the circularity of the food system give rise to new challenges in detecting feed and food fraud and in determining feed and food authenticity. In this context, developing rapid, sensitive, and robust molecular methods, and implementing them more widely, is essential.
In the past, progress in applying such tools has been hampered by a general lack of well-annotated reference genomes for target species commonly used or newly introduced in feed or food preparations. This PhD focused on developing and implementing mass spectrometry-based approaches to identify, differentiate, and quantify proteinaceous ingredients of animal and plant origin in various food and feed mixes without using any genomic information. The work implemented bottom-up proteomic workflows using high-performance liquid chromatography (HPLC) tandem mass spectrometry (MS/MS). Data analyses were done using direct spectra comparison (compareMS2), spectra library matching (SLM), the Trans-Proteomics Pipeline (TPP), and MaxQuant software. All data generated and published during this PhD have been made available on public repositories for proteomics data, such as the Mass Spectrometry Interactive Virtual Environment (MassIVE), following the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. The untargeted proteomics SLM workflow successfully differentiated processed animal proteins (PAP) such as bovine milk and bovine blood. SLM was also used to identify and authenticate food- and feed-grade insect species and to detect whether black soldier fly (BSF) larvae had been fed prohibited PAP. Using the SLM workflow, it was also possible to quantify and authenticate the species in fish mixtures containing muscle tissue from three different fish species. It was also shown that untargeted proteomics can be used to identify common allergens in food-grade insect samples. In addition, the proteomic approach was successfully used to separate thirty-one ready-to-market soybean samples farmed organically, conventionally, and with genetic modifications (GM). Differential protein expression was detected between GM, conventionally farmed, and organically farmed soybean samples.
Additional bioinformatics analyses led to the detection of two novel peptide markers for the efficient tracing of GM crops in food and feed. The proteomic tools implemented during this PhD were capable of species- and tissue-specific identification of proteinaceous food and feed ingredients, including processed animal proteins and plant, mammalian, and fish proteins. Future work should focus on differentiating and detecting fraud in food and feed in the global food market. A web-based interface will be developed for food and feed authentication using the spectra libraries created during this PhD. Following proper quality testing, the web-based interface will be released publicly to provide research and regulatory laboratories with an easily accessible platform for authenticating and identifying protein ingredients in feed and food samples.