Representing variant calling format as directed acyclic graphs to enable the use of cloud computing for efficient and cost effective genome analysis

Authors: Aizad Sanna, Anjum Ashiq, Sakellariou Rizos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

Ever since the completion of the Human Genome Project in 2003, the human genome has been represented as a linear sequence of 3.2 billion base pairs and is referred to as the "Reference Genome". Since then, it has become easier to sequence the genomes of individuals due to rapid advancements in technology, which in turn has created a need to represent the new information using a different representation. Several attempts have been made to represent the genome sequence as a graph, albeit for different purposes. Here we take a look at the Variant Calling Format (VCF) file, which carries information about variations within genomes and is the primary format of choice for genome analysis tools. This short paper aims to motivate work in representing the VCF file as Directed Acyclic Graphs (DAGs) to run on a cloud in order to exploit the high performance capabilities provided by cloud computing.

Crossref

UDORA - University of Derby Online Research Archive

Graph data modelling for genomic variants

Authors: Aizad Sanna, Anjum Ashiq
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

Genome variant analysis is performed on Variant Call Format (VCF) files. It can take days to process these files for genome analytics due to challenges such as loading the files for each user query and processing them to answer questions of interest. As data sizes grow, timely processing of this data is putting enormous pressure on computational resources, leading to significant processing delays that may jeopardise the ultimate goal of bringing genomic discoveries to the masses. We believe this problem will not be solved until the underlying data structure used to organise and process these files undergoes a transformation. To overcome this problem, we have proposed a graph-based system to represent the data in VCF files. This allows the data to be loaded once into a graph model, which is then queried and processed numerous times without any additional computational and data access penalties. This helps reduce data access time by giving constant-time access to any node, and addresses performance and scalability challenges that have been a limiting factor for the mass-scale adoption of genome analytics. Accessing any data node in our graph model takes only 2 ms, and this access time remains constant for any number of nodes.

UDORA - University of Derby Online Research Archive
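
The "load once, query many times" idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' system: the `Node` class, the `build_graph` function, the key scheme, and the sample records are all hypothetical, and a plain dictionary stands in for a graph database. The sketch assumes records are sorted by position within each chromosome, so that chaining each variant to the next yields an acyclic graph. Dictionary lookup is constant time on average, which mirrors the constant-time node access described in the abstract but does not reproduce the reported 2 ms figure.

```python
from dataclasses import dataclass, field

# Tiny in-memory VCF excerpt (hypothetical records, for illustration only).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t100\trsA\tA\tG\t50\tPASS\t.
1\t250\trsB\tC\tT\t99\tPASS\t.
2\t75\trsC\tG\tA\t30\tPASS\t.
"""


@dataclass
class Node:
    key: str
    props: dict
    edges: list = field(default_factory=list)  # keys of successor nodes


def build_graph(vcf_text):
    """Parse VCF text once into a dict of nodes keyed by a string id.

    Each chromosome node points to its first variant, and each variant
    points to the next variant on the same chromosome (assumes the input
    is sorted by position, so the result has no cycles).
    """
    nodes = {}
    last_on_chrom = {}  # chromosome -> key of the most recent node in its chain
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        chrom_key = f"chrom:{chrom}"
        if chrom_key not in nodes:
            nodes[chrom_key] = Node(chrom_key, {"chrom": chrom})
        var_key = f"var:{chrom}:{pos}:{ref}:{alt}"
        nodes[var_key] = Node(
            var_key, {"id": vid, "pos": int(pos), "ref": ref, "alt": alt}
        )
        prev_key = last_on_chrom.get(chrom, chrom_key)
        nodes[prev_key].edges.append(var_key)
        last_on_chrom[chrom] = var_key
    return nodes


# Load once, then answer repeated queries by key without re-reading the file.
graph = build_graph(VCF_TEXT)
variant = graph["var:1:250:C:T"]
print(variant.props["id"], variant.props["pos"])
```

Keying each node by chromosome, position, and alleles is one possible choice; a real deployment would more likely use a graph database with its own indexing, which the abstract does not specify.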