16 research outputs found
ReCoil - an algorithm for compression of extremely large datasets of DNA data
The growing volume of generated DNA sequencing data makes the problem of its long-term storage increasingly important. In this work we present ReCoil, an I/O-efficient external-memory algorithm designed for compressing very large collections of short-read DNA data. Typically each position of a DNA sequence is covered by multiple reads of a short-read dataset, and our algorithm uses the resulting redundancy to achieve a high compression rate.
EuroEXA - D2.6: Final ported application software
This document describes the software of the EuroEXA applications as ported to the single CRDB testbed, and discusses the experience gained from the porting and optimization activities that should be taken into account in future redesign and optimization. It accompanies the ported application software, found in the EuroEXA private repository (https://github.com/euroexa). In particular, this document describes the status of the software for each of the EuroEXA applications, sketches the redesign and optimization strategy for each application, and discusses the issues and difficulties faced during the porting activities and the related lessons learned. A few preliminary evaluation results are presented; the full evaluation will be discussed in deliverable 2.8.
ELGAR - a European Laboratory for Gravitation and Atom-interferometric Research
Gravitational waves (GWs) were observed for the first time in 2015, one century after Einstein predicted their existence. There is now growing interest in extending the detection bandwidth to low frequency. The scientific potential of multi-frequency GW astronomy is enormous, as it would provide a more complete picture of cosmic events and mechanisms. This is a unique and entirely new opportunity for the future of astronomy, the success of which depends upon the decisions being made on existing and new infrastructures. The prospect of combining observations from the future space-based instrument LISA together with third-generation ground-based detectors will open the way toward multi-band GW astronomy, but will leave the infrasound (0.1–10 Hz) band uncovered. GW detectors based on matter-wave interferometry promise to fill such a sensitivity gap. We propose the European Laboratory for Gravitation and Atom-interferometric Research (ELGAR), an underground infrastructure based on the latest progress in atomic physics, to study space-time and gravitation with the primary goal of detecting GWs in the infrasound band. ELGAR will directly inherit from large research facilities now being built in Europe for the study of large-scale atom interferometry and will drive new pan-European synergies from top research centers developing quantum sensors. ELGAR will measure GW radiation in the infrasound band, with peak strain sensitivity at 1.7 Hz. The antenna will have an impact on diverse fundamental and applied research fields beyond GW astronomy, including gravitation, general relativity, and geology.
AB acknowledges support from the ANR (project EOSBECMR), IdEx Bordeaux-LAPHIA (project OE-TWR), the QuantERA ERA-NET (project TAIOL) and the Aquitaine Region (projects IASIG3D and USOFF). XZ thanks the China Scholarships Council (No. 201806010364) program for financial support. JJ thanks "Association Nationale de la Recherche et de la Technologie" for financial support (No. 2018/1565). SvAb, NG, SL, EMR, DS, and CS gratefully acknowledge support by the German Space Agency (DLR) with funds provided by the Federal Ministry for Economic Affairs and Energy (BMWi) due to an enactment of the German Bundestag under Grants No. DLR 50WM1641 (PRIMUS-III), 50WM1952 (QUANTUS-V-Fallturm), and 50WP1700 (BECCAL), 50WM1861 (CAL), 50WM2060 (CARIOQA) as well as 50RK1957 (QGYRO). SvAb, NG, SL, EMR, DS, and CS gratefully acknowledge support by "Niedersächsisches Vorab" through the "Quantum- and Nano-Metrology (QUANOMET)" initiative within the project QT3, and through "Förderung von Wissenschaft und Technik in Forschung und Lehre" for the initial funding of research in the new DLR-SI Institute, the CRC 1227 DQ-mat within the projects A05 and B07. DS gratefully acknowledges funding by the Federal Ministry of Education and Research (BMBF) through the funding program Photonics Research Germany under contract number 13N14875. RG acknowledges Ville de Paris (Emergence programme HSENS-MWGRAV), ANR (project PIMAI) and the Fundamental Physics and Gravitational Waves (PhyFOG) programme of Observatoire de Paris for support. We also acknowledge networking support by the COST actions GWverse CA16104 and AtomQT CA16221 (Horizon 2020 Framework Programme of the European Union). The work was also supported by the German Space Agency (DLR) with funds provided by the Federal Ministry for Economic Affairs and Energy (BMWi) due to an enactment of the German Bundestag under Grant Nos. 50WM1556, 50WM1956 and 50WP1706 as well as through the DLR Institutes DLR-SI and DLR-QT. PA-S, MN, and CFS acknowledge support from contracts ESP2015-67234-P and ESP2017-90084-P from the Ministry of Economy and Business of Spain (MINECO), and from contract 2017-SGR-1469 from AGAUR (Catalan government). SvAb, NG, SL, EMR, DS, and CS gratefully acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2123 QuantumFrontiers – 390837967 (B2) and CRC 1227 "DQ-mat" within projects A05, B07 and B09. LAS thanks Sorbonne Universités (Emergence project LORINVACC) and Conseil Scientifique de l'Observatoire de Paris for funding. This work was realized with the financial support of the French State through the "Agence Nationale de la Recherche" (ANR) in the frame of the "MRSEI" program (Pre-ELGAR ANR-17-MRS5-0004-01) and the "Investissement d'Avenir" program (Equipex MIGA: ANR-11-EQPX-0028, IdEx Bordeaux-LAPHIA: ANR-10-IDEX-03-02). Peer Reviewed
Compressing and Querying the Human Genome
With high-throughput DNA sequencing costs dropping below $1000 for human genomes, data storage, retrieval and analysis are the major bottlenecks in biological studies. To address the large-data challenges in genomics, this thesis advocates: 1) highly efficient read-level compression of the data, achieved through reference-based compression by a tool called SLIMGENE, and 2) a clean separation between evidence collection and inference in variant calling, achieved through our Genome Query Language (GQL), which allows for the rapid collection of evidence needed for calling variants. The first contribution, SLIMGENE, introduces a set of domain-specific lossless compression schemes that achieve over 40x compression of the ASCII representation of short reads, outperforming bzip2 by over 6x. Including quality values, we show 5x compression using less running time than bzip2. Secondly, given the discrepancy between the compression factor obtained with and without quality values, we initiate the study of using lossy transformations of the quality values. Specifically, we show that a lossy quality value quantization results in 14x compression but has minimal impact on downstream applications, such as SNP calling, that use quality values. The second contribution, GQL, introduces a novel framework for querying large genomic datasets. We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of code and search large datasets (100GB) in only a few minutes on a cheap desktop computer. We show that GQL is faster and more concise than writing equivalent queries in existing frameworks such as GATK. We show that existing callers can be sped up by an order of magnitude by using GQL to retrieve evidence. We also show how GQL output can be visualized using the UCSC browser
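The lossy quality-value quantization mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not describe SLIMGENE's actual transformation, so the uniform binning used here, the function name `quantize_phred`, and the parameters `bin_width` and `offset` are all assumptions chosen for illustration, not the thesis's method.

```python
def quantize_phred(quality: str, bin_width: int = 8, offset: int = 33) -> str:
    """Lossy quantization of a Phred+33 quality string (illustrative only).

    Each score is rounded down to the lowest score of its bin, so the
    string uses fewer distinct symbols and becomes easier to compress.
    Uniform binning is an assumption, not SLIMGENE's documented scheme.
    """
    out = []
    for ch in quality:
        score = ord(ch) - offset                    # decode ASCII to Phred score
        binned = (score // bin_width) * bin_width   # round down to the bin floor
        out.append(chr(binned + offset))            # re-encode as ASCII
    return "".join(out)


# Hypothetical quality string for a 12-base read.
original = "IIHGFEDCB@5#"
quantized = quantize_phred(original)

# The length is preserved while the alphabet shrinks.
print(len(set(original)), "distinct symbols before")
print(len(set(quantized)), "distinct symbols after")
```

A smaller alphabet with longer runs of identical symbols is what makes the quantized stream more compressible; how much accuracy downstream tools lose depends on the bin boundaries, which this sketch does not attempt to tune.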
Prognostic significance of atypical leukemic cell morphology in chronic lymphocytic leukemia.
CHARLES UNIVERSITY IN PRAGUE Faculty of Pharmacy in Hradec Králové Department of Biological and Medical Sciences Study program: Health Care Bioanalytics Candidate: Bc. Nikola Fučíková Supervisor: doc. MUDr. Lukáš Smolej, Ph.D. Title of diploma thesis: Prognostic significance of atypical leukemic cell morphology in chronic lymphocytic leukemia The aim of this thesis is to evaluate the prognostic significance of atypical cell morphology and smudge cells in patients with untreated chronic lymphocytic leukemia. We performed differential leukocyte counts and classified lymphocytes as typical or atypical in a cohort of 101 patients (median age, 66 years; males, 69%; Rai III/IV stages, 18%). For atypical CLL, we used the 15% threshold, and 59% of patients were classified as atypical CLL (aCLL). For smudge cells, we chose the 30% threshold, and 33% of patients were classified as smudge cell positive. Patients in early clinical Rai stage (0) had a significantly higher number of smudge cells (p=0.04). We did not find a significant association of aCLL or smudge cells with modern prognostic indicators. We did not find a relationship between aCLL and the time to first-line therapy (p=0.394). However, patients with aCLL had a significantly shorter overall survival (p=0.0397). There was a trend toward shorter...
FPGA based architecture for DNA sequence comparison and database search
DNA sequence comparison is a computationally intensive problem, widely known since the competition for human DNA decryption. Database search for DNA sequence comparison is of great value to computational biologists. Several algorithms have been developed and implemented to solve this problem efficiently, but in terms of user base the BLAST algorithm is the most widely used. In this paper we present a new architecture for the BLAST algorithm. The new architecture was fully designed, placed and routed. The post-place-and-route cycle-accurate simulation, accounting for the I/O, shows better performance than a cluster of workstations running highly optimized code over identical datasets. The new architecture and detailed performance results are presented in this paper.
Using Genome Query Language (GQL) to uncover genetic variation
Motivation: With high-throughput DNA sequencing costs dropping below $1,000 for human genomes, data storage, retrieval, and analysis are the major bottlenecks in biological studies. In order to address the large-data challenges, we advocate a clean separation between the evidence collection and the inference in variant calling. We define and implement a Genome Query Language (GQL) that allows for the rapid collection of evidence needed for calling variants. Results: We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of high-level code, and search large data sets (100GB) in minutes. We also demonstrate its complementarity with other variant calling tools. Popular variant calling tools can achieve one order of magnitude speed-up by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple data sets. Separating the evidence from the inference in variant calling frees all variant detection tools from the data-intensive evidence collection and lets them focus on statistical inference. Availability: GQL can be downloaded fro
VineTalk: Simplifying Software Access and Sharing of FPGAs in Datacenters
FPGA-based accelerators are becoming first-class citizens in data centers. Adding FPGAs to data centers can lead to higher compute densities with improved energy efficiency for latency-critical workloads, such as financial applications. However, deployment of FPGAs in datacenters is hindered, as both developers and cloud providers face difficulties. Application writers need to deal with FPGA interfacing as well as application logic and algorithms. Cloud providers, on the other hand, are reluctant to deploy FPGAs at large scale, because FPGAs' lack of sharing results in reduced utilization and questionable ROI. In this paper, we introduce VineTalk, a framework that reduces the programming effort associated with FPGA-based accelerators and FPGA virtualization. We integrate VineTalk with the Xilinx SDAccel development framework and map it to the Kintex UltraScale FPGA. Our preliminary evaluation with a use case of financial applications shows that VineTalk can offer effective FPGA sharing while introducing less than 4% overhead to application execution time.