43 research outputs found
VOE: automated analysis of variant epitopes of SARS-CoV-2 for the development of diagnostic tests or vaccines for COVID-19
Background The development of serodiagnostic tests and vaccines for COVID-19 depends on the identification of epitopes from the SARS-CoV-2 genome. An epitope is the specific part of an antigen that is recognized by the immune system and can elicit an immune response. However, when the genetic variants contained in epitopes are used to develop rapid antigen tests (Ag-RDTs) and DNA or RNA vaccines, test sensitivity and vaccine efficacy can be low. Methods Here, we developed a "variant on epitope (VOE)" software, a new Python script for identifying variants located on an epitope. Variant analysis and sensitivity calculation for seven recommended epitopes were processed by VOE. Variants in 1,011 Omicron SRA reads from two variant databases (BCFtools and SARS-CoV-2-Freebayes) were processed by VOE. Results A variant with HIGH or MODERATE impact was found on all epitopes from both variant databases except the epitopes KLNDLCFTNV, RVQPTES, LKPFERD, and ITLCFTLKRK on the S gene and ORF7a gene. All epitope variants from the BCFtools and SARS-CoV-2-Freebayes variant databases showed about 100% sensitivity except epitopes APGQTGK and DSKVGGNYN on the S gene, which showed respective sensitivities of 28.4866% and 6.8249%, and 87.7349% and 71.1177%. Conclusions Therefore, the epitopes KLNDLCFTNV, RVQPTES, LKPFERD, and ITLCFTLKRK may be useful for the development of an epitope-based peptide vaccine and GGDGKMKD on the N gene may be useful for the development of serodiagnostic tests. Moreover, VOE can also be used to analyze other epitopes, and a new variant database for VOE may be further established when a new variant of SARS-CoV-2 emerges.
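The core idea of locating variants on an epitope can be illustrated with a minimal Python sketch. It is not the VOE script itself: the `Variant` record, the function names, and the epitope coordinates used in any call are hypothetical, and the sensitivity formula (the percentage of samples carrying no HIGH or MODERATE impact variant inside the epitope) is an assumption, since the abstract does not state how sensitivity is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; not the VOE data model."""
    sample: str   # identifier of the sequenced sample
    pos: int      # 1-based genomic position of the variant
    impact: str   # impact label, e.g. "HIGH", "MODERATE", "LOW"


def variants_on_epitope(variants, start, end):
    """Return the variants whose position lies in the closed interval [start, end]."""
    return [v for v in variants if start <= v.pos <= end]


def sensitivity_percent(variants, start, end, all_samples,
                        severe=("HIGH", "MODERATE")):
    """Assumed definition: percentage of samples with no severe variant on the epitope."""
    samples = set(all_samples)
    affected = {v.sample
                for v in variants_on_epitope(variants, start, end)
                if v.impact in severe}
    unaffected = samples - affected
    return 100.0 * len(unaffected) / len(samples)
```

Under this assumed definition, an epitope on which no sample carries a HIGH or MODERATE variant scores 100%, and the score falls as more samples carry such a variant.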
Parallelization of logic regression analysis on SNP-SNP interactions of a Crohn's disease dataset model
SNP-SNP interactions are recognized as fundamentally important for understanding the genetic causes of complex disease traits. Logic regression is an effective method for identifying SNP-SNP interactions associated with the risk of complex disease. However, identifying SNP-SNP interactions is computationally challenging and may take hours, weeks, or even months to complete. Although parallel computing is a powerful way to reduce computing time, it is arduous for users to apply to logic regression analyses of SNP-SNP interactions because it requires advanced programming skills to correctly partition and distribute data, control and monitor tasks across multi-core CPUs or several computers, and merge output files. In this paper, we present a novel R library called SNPInt that automatically speeds up analyses of SNP-SNP interactions in genome-wide association (GWA) studies using parallel computing, without requiring advanced programming skills. The Crohn's disease GWA study dataset from the Wellcome Trust Case Control Consortium (WTCCC), which includes genotypes for 500,000 SNPs in 4,680 individuals, was analyzed using logic regression on a computer cluster to evaluate SNPInt performance. The results from SNPInt with any number of CPUs are the same as the results from the non-parallel approach, and the SNPInt library substantially accelerated the logic regression analysis. For instance, with two hundred genes and twenty permutation rounds, the computing time decreased from 7.3 days to only 0.9 days when SNPInt used eight CPUs. Executing analyses of SNP-SNP interactions using the SNPInt library is an effective way to boost performance and simplify the parallelization of such analyses.
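The partition, distribute, and merge steps named in the abstract can be sketched in Python as follows. This is an illustration of the general pattern only, not the SNPInt implementation (which is an R library): `score_gene` is a hypothetical, deterministic stand-in for a per-gene logic regression run, and a thread pool is used purely to keep the sketch self-contained, where a real CPU-bound analysis would more plausibly use separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def score_gene(gene, n_permutations):
    """Hypothetical placeholder for one gene's analysis; deterministic on purpose."""
    return sum((len(gene) * (p + 1)) % 7 for p in range(n_permutations))


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def analyse_chunk(chunk, n_permutations):
    """Analyse every gene in one chunk and return a partial result."""
    return {gene: score_gene(gene, n_permutations) for gene in chunk}


def run_serial(genes, n_permutations):
    """Non-parallel reference run over all genes."""
    return analyse_chunk(genes, n_permutations)


def run_parallel(genes, n_permutations, n_workers):
    """Distribute chunks to workers, then merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(analyse_chunk, chunk, n_permutations)
                   for chunk in partition(genes, n_workers)]
        for future in futures:
            merged.update(future.result())
    return merged
```

Because each gene is scored independently and deterministically, the merged result does not depend on the number of workers, which mirrors the abstract's observation that the parallel and non-parallel results agree.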
ParallABEL: an R library for generalized parallelization of genome-wide association studies
Background: Genome-Wide Association (GWA) analysis is a powerful method for identifying loci associated with complex traits and drug response. Parts of GWA analyses, especially those involving thousands of individuals and taking hours to months, will benefit from parallel computation. However, it is arduous to acquire the programming skills needed to correctly partition and distribute data, control and monitor tasks on clustered computers, and merge output files. Results: Most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs. The first group contains statistics computed for a particular Single Nucleotide Polymorphism (SNP) or trait, such as SNP characterization statistics or association test statistics. The input data of this group are the SNPs/traits. The second group concerns statistics characterizing an individual in a study, for example, the summary statistics of genotype quality for each sample. The input data of this group are individuals. The third group consists of pair-wise statistics derived from analyses between each pair of individuals in the study, for example genome-wide identity-by-state or genomic kinship analyses. The input data of this group are pairs of individuals. The final group concerns pair-wise statistics derived for pairs of SNPs, such as the linkage disequilibrium characterisation. The input data of this group are pairs of SNPs/traits. We developed the ParallABEL library, which utilizes the Rmpi library, to parallelize these four types of computations. The ParallABEL library is not only aimed at GenABEL, but may also be employed to parallelize various GWA packages in R. The data set from the North American Rheumatoid Arthritis Consortium (NARAC), which includes genotypes for 545,080 SNPs in 2,062 individuals, was used to measure ParallABEL performance. Almost perfect speed-up was achieved for many types of analyses. For example, the computing time for the identity-by-state matrix was linearly reduced from approximately eight hours to one hour when ParallABEL employed eight processors. Conclusions: Executing genome-wide association analysis using the ParallABEL library on a computer cluster is an effective way to boost performance and simplify the parallelization of GWA studies. ParallABEL is a user-friendly parallelization of GenABEL.
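The four groups of computations correspond to four kinds of independent work units (single SNPs, single individuals, pairs of individuals, and pairs of SNPs), which is what makes them straightforward to distribute. The Python sketch below enumerates those units and computes the usual speed-up and efficiency ratios. The function names are hypothetical and nothing here reflects the ParallABEL or Rmpi API.

```python
from itertools import combinations


def work_units(snps, individuals):
    """Enumerate the independent work units for each of the four groups."""
    return {
        "per_snp": list(snps),
        "per_individual": list(individuals),
        "individual_pairs": list(combinations(individuals, 2)),
        "snp_pairs": list(combinations(snps, 2)),
    }


def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel run time (same time unit for both)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Speed-up divided by the number of processors; 1.0 means linear scaling."""
    return speedup(t_serial, t_parallel) / n_processors
```

With the approximate figures quoted in the abstract, `speedup(8.0, 1.0)` is 8.0 and `efficiency(8.0, 1.0, 8)` is 1.0, consistent with the reported linear reduction. The pair-wise groups grow quadratically: 2,062 individuals yield 2,124,891 distinct pairs.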
Speeding up genome-wide association analysis by applying parallel computing
Thesis (Ph.D., Molecular Biology and Bioinformatics)--Prince of Songkla University, 201
Ovarian Transcriptome Analysis of Vitellogenic and Non-Vitellogenic Female Banana Shrimp (Fenneropenaeus merguiensis).
The banana shrimp (Fenneropenaeus merguiensis) is one of the most commercially important penaeid species in the world. Its numbers are declining in the wild, leading to a loss of broodstock for farmers of the shrimp and a need for more successful breeding programs. However, the molecular mechanism of the genes involved in this shrimp's ovarian maturation is still unclear. Consequently, we compared transcriptomic profiles of ovarian tissue from females in the vitellogenic and non-vitellogenic stages. Transcriptome libraries were prepared using RNA-Seq technology, and a total of 12,187,412 and 11,694,326 sequencing reads were acquired from the non-vitellogenic and vitellogenic stages, respectively. The analysis identified 1,025 genes that were significantly differentially expressed between the two stages, of which 694 were up-regulated and 331 down-regulated. Four genes putatively involved in the ovarian maturation pathway were chosen for validation by quantitative real-time PCR (RT-qPCR). The data from this study provide information about gene expression in ovarian tissue of the banana shrimp, which could be useful for a better understanding of the regulation of this species' reproductive cycle.
DisVar: an R library for identifying variants associated with diseases using large-scale personal genetic information
Background Genetic variants may be a contributing factor in the development of diseases. Several genetic disease databases are used in medical research and diagnosis, but the web applications used to search these databases for disease-associated variants have limitations. The application may not be able to search for large-scale genetic variants, the results of searches may be difficult to interpret, and variants mapped from the latest reference genome (GRCh38/hg38) may not be supported. Methods In this study, we developed a novel R library called "DisVar" to identify disease-associated genetic variants in large-scale individual genomic data. This R library is compatible with variants from the latest reference genome version. DisVar uses five databases of disease-associated variants. Over 100 million variants can be simultaneously searched for specific associated diseases. Results The package was evaluated using 24 Variant Call Format (VCF) files (215,054 to 11,346,899 sites) from the 1000 Genomes Project. Disease-associated variants were detected in 298,227 hits across all the VCF files, taking a total of 63.58 min to complete. The package was also tested on ClinVar's VCF file (2,120,558 variants), where 20,657 hits associated with diseases were identified with an estimated elapsed time of 45.98 s. Conclusions DisVar can overcome the limitations of existing tools and is a fast and effective diagnostic and preventive tool that identifies disease-associated variations from large-scale genetic variants against the latest reference genome.
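One common way to search many variants against a disease database is to index the database by variant key and look up each VCF record in constant time. The Python sketch below illustrates that approach under stated assumptions: the key format (chromosome, position, reference allele, alternate allele), the function names, and the disease labels used in any call are hypothetical, and the abstract does not say that DisVar (an R library) matches variants this way.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant into a hashable key; drops an optional 'chr' prefix."""
    chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (chrom, int(pos), ref.upper(), alt.upper())


def build_index(db_records):
    """Map each variant key to the set of diseases recorded for it."""
    index = {}
    for chrom, pos, ref, alt, disease in db_records:
        index.setdefault(variant_key(chrom, pos, ref, alt), set()).add(disease)
    return index


def parse_vcf_line(line):
    """Yield one key per ALT allele from a tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts = fields[:5]  # CHROM, POS, ID, REF, ALT
    for alt in alts.split(","):
        yield variant_key(chrom, pos, ref, alt)


def find_hits(vcf_lines, index):
    """Return (variant key, disease) pairs for every VCF variant found in the index."""
    hits = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        for key in parse_vcf_line(line):
            for disease in sorted(index.get(key, ())):
                hits.append((key, disease))
    return hits
```

Building the index once and streaming the VCF lines keeps memory use tied to the database size rather than to the size of the VCF file.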
Integrated Automatic Workflow for Phylogenetic Tree Analysis Using Public Access and Local Web Services
At present, many coding sequences (CDS) have been discovered, and larger CDS are being revealed frequently. Approaches and related tools have also been developed and upgraded concurrently, especially for phylogenetic tree analysis. This paper proposes an integrated automatic Taverna workflow for phylogenetic tree inference using public access web services at the European Bioinformatics Institute (EMBL-EBI) and the Swiss Institute of Bioinformatics (SIB), together with our own deployed local web services. The workflow input is a set of CDS in FASTA format. The workflow supports 1,000 to 20,000 bootstrap replicates. It performs tree inference using the Parsimony (PARS), Distance Matrix - Neighbor Joining (DIST-NJ), and Maximum Likelihood (ML) algorithms of the EMBOSS PHYLIPNEW package, based on our proposed Multiple Sequence Alignment (MSA) similarity score. The local web services are implemented and deployed in two ways, using Soaplab2 and Apache Axis2. These SOAP and Java Web Service (JWS) services provide WSDL endpoints to Taverna Workbench, a workflow manager. The workflow has been validated, its performance has been measured, and its results have been verified. Our workflow's execution time is less than ten minutes for inferring a tree with 10,000 bootstrap replicates. This new integrated automatic workflow will be beneficial to bioinformaticians with an intermediate level of knowledge and experience. All local services have been deployed at our portal http://bioservices.sci.psu.ac.t
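Bootstrap replication for tree inference is conventionally done by resampling the columns of a multiple sequence alignment with replacement, then inferring one tree per resampled alignment. The Python sketch below shows only that resampling step. It assumes the alignment is held as a dictionary of equal-length strings, the function names are hypothetical, and it does not reproduce the workflow's web services or the EMBOSS PHYLIPNEW programs.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping all rows in step."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate n_replicates resampled alignments from one seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate keeps the same sequence names and alignment length as the input, and a fixed seed makes the set of replicates reproducible.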
In silico analysis of protein toxin and bacteriocins from Lactobacillus paracasei SD1 genome and available online databases.
Lactobacillus paracasei SD1 is a potential probiotic strain due to its ability to survive several conditions in human dental cavities. To ascertain its safety for human use, we therefore performed a comprehensive bioinformatics analysis and characterization of the bacterial protein toxins produced by this strain. We report the complete genome of Lactobacillus paracasei SD1 and its comparison to other Lactobacillus genomes. Additionally, we identify and analyze its protein toxins and antimicrobial proteins using reliable online database resources and establish its phylogenetic relationship with other bacterial genomes. Our investigation suggests that this strain is safe for human use and contains several bacteriocins that confer health benefits to the host. An in silico analysis of protein-protein interactions between the target bacteriocins and the microbial proteins gtfB and luxS of Streptococcus mutans was performed and is discussed here