11 research outputs found

Correction to: IonCRAM: a reference-based compression tool for ion torrent sequence files

Author: Shokrof Moustafa
Publication date: 25/10/2023

Ezid

IonCRAM: a reference-based compression tool for ion torrent sequence files

Author: Shokrof Moustafa
Publication date: 25/10/2023

Ezid

Large-Scale Analysis of Population Structural Variants Using Terabytes of SRA Sequencing Data

Author: Shokrof Moustafa
Publication venue: eScholarship, University of California
Publication date: 01/01/2024

Studying structural variants (SVs) in populations is crucial because they cause more diversity and have a greater effect on gene function than small variants. However, population-scale studies are challenging because SV discovery from inexpensive short-read sequencing (SRS) has a high false-positive rate. Conversely, long-read sequencing (LRS) methods are more accurate for SV discovery but are expensive at population scale. Here, I develop new unbiased techniques to study SVs in populations that are more scalable than the state of the art. I show their utility in creating an SV catalog for the cattle breed, augmented with their allele frequencies.

eScholarship - University of California

Correction to: IonCRAM: a reference-based compression tool for ion torrent sequence files

Author: M Shokrof
Mohamed Abouelhoda
Moustafa Shokrof
Publication venue: Springer Science and Business Media LLC

Crossref

IonCRAM: a reference-based compression tool for ion torrent sequence files

Author: Abouelhoda Mohamed
Shokrof Moustafa
Publication venue: eScholarship, University of California
Publication date: 01/12/2020

Background: Ion Torrent is one of the major next-generation sequencing (NGS) technologies and is frequently used in medical research and diagnosis. The built-in software for the Ion Torrent sequencing machines delivers the sequencing results in the BAM format. In addition to the usual SAM/BAM fields, the Ion Torrent BAM file includes technology-specific flow signal data. The flow signals occupy a large portion of the BAM file (about 75% for the human genome). Compressing SAM/BAM into the CRAM format significantly reduces the space needed to store the NGS results. However, the tools for generating the CRAM format are not designed to handle the flow signals. This missing feature has motivated us to develop a new program to improve the compression of Ion Torrent files for long-term archiving. Results: In this paper, we present IonCRAM, the first reference-based compression tool to compress Ion Torrent BAM files for long-term archiving. For the BAM files, IonCRAM could achieve a space saving of about 43%. This space saving is superior to what is achieved with the CRAM format by about 8-9%. Conclusions: Reducing the space consumption of NGS data reduces the cost of storage and data transfer. Therefore, developing efficient compression software for clinical NGS data goes beyond computational interest, as it ultimately contributes to the overall cost reduction of the clinical test. The space saving achieved by our tool is a practical step in this direction. The tool is open source and available at Code Ocean, GitHub, and http://ioncram.saudigenomeproject.com

eScholarship - University of California

Correction to: IonCRAM: a reference-based compression tool for ion torrent sequence files

Author: Abouelhoda Mohamed
Shokrof Moustafa
Publication venue: eScholarship, University of California
Publication date: 01/12/2020

An amendment to this paper has been published and can be accessed via the original article.

eScholarship - University of California

MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

Author: Brown C Titus
Mansour Tamer A
Shokrof Moustafa
Publication venue: eScholarship, University of California
Publication date: 01/12/2021

Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hash table, can efficiently store k-mers with a skewed distribution. Results: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are roughly tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high-throughput sequencing data.

eScholarship - University of California

Sequencing and assembly of the Egyptian buffalo genome.

Author: Abouelhoda Mohamed
Ageez Amr
El-Khishin Dina
Hassan Laila
Ibrahim Amr
Saad Mohamed
Shokrof Moustafa
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Water buffalo (Bubalus bubalis) is an important source of meat and milk in countries with relatively warm weather. Compared to the cattle genome, little has been done to reveal its genome structure and genomic traits. This is due to the complications stemming from the large genome size, the complexity of the genome, and the high repetitive content. In this paper, we introduce a high-quality draft assembly of the Egyptian water buffalo genome. The Egyptian breed is used as a dual-purpose animal (milk/meat). It is distinguished by its adaptability to the local environment and to changes in feed quality, as well as its high resistance to diseases. The genome assembly of the Egyptian water buffalo has been achieved using a reference-based assembly workflow. Our workflow significantly reduced the computational complexity of the assembly process and improved the assembly quality by integrating different public resources. We also compared our assembly to the currently available draft assemblies of water buffalo breeds. A total of 21,128 genes were identified in the produced assembly. Lists of milk virgin-related genes, milk pregnancy-related genes, milk lactation-related genes, milk involution-related genes, and milk mastitis-related genes were identified in the assembly. Our results will significantly contribute to a better understanding of the genetics of the Egyptian water buffalo, which will eventually support the ongoing breeding efforts and facilitate the future discovery of genes responsible for complex processes of dairy and meat production and disease resistance, among other significant traits.

eScholarship - University of California

Sequencing and assembly of the Egyptian buffalo genome.

Author: Amr Ageez
Amr Ibrahim
Dina A El-Khishin
Laila R Hassan
Mohamed E Saad
Mohamed I Abouelhoda
Moustafa Shokrof
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2020

Directory of Open Access Journals

eScholarship - University of California

Exploiting in-memory systems for genomic data analysis.

Author: Abouelhoda Mohamed
Aljafar Hussain
Alnakhli Yasser
Anjum Ashiq
El-Kalioby Mohamed
Faquih Tariq
Shah Zeeshan Ali
Shokrof Moustafa
Subhani Shazia
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

With the increasing adoption of next-generation sequencing technology in medical practice, there is an increasing demand for faster data processing to gain immediate insights from the patient’s genome. Due to the extensive amount of genomic information and its big data nature, data processing takes a long time and delays are often experienced. In this paper, we show how to exploit in-memory platforms for big genomic data analysis, with a focus on the variant analysis workflow. We determine where different in-memory techniques are used in the workflow and explore different memory-based strategies to speed up the analysis. Our experiments show promising results and encourage further research in this area, especially with the rapid advancement in memory and SSD technologies.

Crossref

UDORA - University of Derby Online Research Archive