The First Kazakh Whole Genomes: The First Report of NGS Data
Introduction: The human genome sequence will underpin human biology and medicine in the next century, providing a single, essential reference to all genetic information. Extraordinary technological advances and decreases in the cost of DNA sequencing have made whole genome sequencing (WGS) feasible as a highly accessible test for numerous indications. The international project “Genetic architecture of Kazakh population” is well underway to determine the complete DNA sequence. Next generation sequencing is a powerful tool for genetic analysis, which will enable us to uncover associations between specific genomic loci and disease. The aim of this study was to introduce the first WGS data of 6 Kazakh individuals.
Methods: This pilot study is among the first WGS performed on 6 healthy Kazakh individuals, using the Illumina HiSeq 2000 next generation sequencing platform according to the manufacturer’s protocols. All generated *.bcl files were simultaneously converted and demultiplexed using the bcl2fastq application. Sequence reads were aligned with bwa-mem against the human hg19 reference genome. Sorting, removal of intermediate files, assembly of *.bam files, and marking of duplicates were performed using the Picard tools package. The GATK HaplotypeCaller tool was used for variant calling. The ClinVar, SNPedia, and COSMIC databases were processed to identify clinical genomic variants in the 6 Kazakh whole genomes. The Java Runtime Environment and R/Bioconductor packages were installed to perform raw data processing and run program scripts.
Results: Sequence alignment and mapping to the hg19 reference genome were completed for each of the 6 healthy Kazakh individuals. Between 87,308,581,400 and 107,526,741,301 total base pairs were sequenced, with an average coverage of 29.85×. Between 98.85% and 99.58% of base pairs were mapped, and on average 96.07% were properly paired. Het/Hom and Ti/Tv ratios for each whole genome ranged from 1.35 to 1.52 and from 2.07 to 2.08, respectively. We compared each genome against the existing clinical databases ClinVar, SNPedia, and COSMIC and found 20 to 25, 269 to 288, and 7 to 12 SNP records, respectively. The availability of reference Kazakh genome sequences provides the basis for studying the nature of sequence variation, particularly single nucleotide polymorphisms.
Conclusion: The first whole genome sequencing of Kazakhs was performed. In this pilot study, we identified SNPs associated with different conditions. Further WGS studies of the Kazakh population are needed to identify possible unique genetic variants in Kazakhs.
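The Ti/Tv and Het/Hom ratios reported in the Results can be illustrated with a minimal Python sketch. It is not the study's pipeline: the function names are hypothetical, the input is assumed to be already-called single-nucleotide variants, and "Hom" is assumed to mean homozygous for a non-reference allele.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Ti/Tv for an iterable of (ref, alt) single-nucleotide variants."""
    # Drop records where ref equals alt; they are not substitutions.
    snvs = [(r, a) for r, a in snvs if r.upper() != a.upper()]
    transitions = sum(is_transition(r, a) for r, a in snvs)
    transversions = len(snvs) - transitions
    return transitions / transversions


def het_hom_ratio(genotypes):
    """Het/Hom for VCF-style genotype strings such as '0/1' or '1|1'."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1  # assumption: only homozygous non-reference calls count
    return het / hom_alt
```

Both functions raise ZeroDivisionError when the denominator class is empty, which is unlikely for a whole genome but worth guarding against on small test inputs.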
Genetic Diversity of IFγ, IL1β, TLR2, and TLR8 Loci in Pulmonary Tuberculosis in Kazakhstan
Introduction. Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis (MTB), and according to the WHO, up to 30% of the world population is infected with latent TB. Pathogenesis of TB is multifactorial, and its development depends on environmental, social, microbial, and genetic factors of both the bacterium and the host. The number of TB cases in Kazakhstan has decreased in the past decade, but multidrug-resistant (MDR) TB cases are dramatically increasing. Polymorphisms in genes responsible for immune response have been associated with TB susceptibility. The objective of this study was to investigate the risk of developing pulmonary TB (PTB) associated with polymorphisms in several inflammatory pathway genes in the Kazakhstani population.
Methods. 703 participants from 3 regions of Kazakhstan were recruited for a case-control study. 251 participants had pulmonary TB (PTB), and 452 were healthy controls (HC). Males and females represented 42.39% and 57.61%, respectively. Of all participants, 67.4% were Kazakhs, 22.8% Russians, 3.4% Ukrainians, and 6.4% were of other origins. Clinical and epidemiological data were collected from medical records, interviews, and questionnaires. DNA samples were genotyped for 4 polymorphisms using TaqMan assays: IFNγ (rs2430561), IL1β (rs16944), TLR2 (rs5743708), and TLR8 (rs3764880). Statistical analysis was performed using SPSS 19.
Results. Genotypes of IFγ, IL1β, and TLR2 showed no significant association with PTB susceptibility (p > 0.05). TLR8 genotype A/G was significantly more frequent in females (F/M: 41.5%/1.3%) and G/G in males (M/F: 49%/20.7%) (χ² = 161.43, p < 0.001). A significantly increased risk of PTB development was observed for TLR8 A/G with an adjusted OR of 1.48 (95% CI: 0.96 - 2.28), and a protective feature was revealed for the TLR8 G/G genotype (OR: 0.81, 95% CI: 0.56 - 1.16, p = 0.024). Additional grouping by gender revealed that TLR8 G/G acts as a protective genotype (OR: 1.83, 95% CI: 1.18 - 2.83, p = 0.036) in males of the control group.
Conclusion. Results indicate that the heterozygous genotype A/G of TLR8 increases the risk of PTB development, while the G/G genotype may serve as a protection mechanism. The A/A genotype is strongly associated with susceptibility to PTB. To clarify the role of other polymorphisms in susceptibility to PTB in the Kazakhstani population, further investigations are needed.
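For readers unfamiliar with how an odds ratio and its confidence interval are derived from case-control genotype counts, the following Python sketch computes an unadjusted OR with a Wald (Woolf) 95% CI from a 2×2 table. It is an illustration only: the study reports adjusted ORs from SPSS, the function name is hypothetical, and the counts in the usage lines are invented for the example rather than taken from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: cases carrying the genotype      b: cases not carrying it
    c: controls carrying the genotype   d: controls not carrying it
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) by Woolf's method.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(10, 20, 5, 40)
print(f"OR = {or_value:.2f}, 95% CI: {ci_low:.2f} - {ci_high:.2f}")
```

An interval that excludes 1 corresponds to a two-sided p-value below 0.05 for this test; an interval that spans 1 does not.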
Meta-Analysis of Esophageal Cancer Transcriptomes Using Independent Component Analysis
Independent Component Analysis (ICA) is a matrix factorization method for data dimension reduction. ICA has been widely applied to transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze publicly available esophageal cancer (EC) data using ICA to identify and comprehensively analyze reproducible signaling pathways and molecular signatures involved in this cancer type. Four independent esophageal cancer transcriptomic datasets from the GEO database were used. The bioinformatics tool “BiODICA: Independent Component Analysis of Big Omics Data” was applied to compute independent components (ICs). Gene Set Enrichment Analysis (GSEA) and ToppGene uncovered the most significantly enriched pathways. Gene networks and graphs were constructed and visualized using Cytoscape and the HPRD database. The correlation graph between decompositions into 30 ICs was built from absolute correlation values exceeding 0.3. Clusters of components (pseudocliques) were observed in the structure of the correlation graph. The top 1,000 most contributing genes of each IC in the pseudocliques were mapped to the PPI network to construct associated signaling pathways. Some cliques were composed of densely interconnected nodes and included components common to most cancer types (such as cell cycle and extracellular matrix signals), while others were specific to EC. The results of this investigation may reveal potential biomarkers of esophageal carcinogenesis and functional subsystems dysregulated in tumor cells, and may be helpful in predicting the early development of a tumor.
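The correlation-graph step described above can be sketched in Python with NumPy. This is a simplified illustration, not the BiODICA implementation: the function name is hypothetical, and it assumes two component matrices (genes × components) already restricted to the same genes in the same order.

```python
import numpy as np


def correlation_edges(s1, s2, threshold=0.3):
    """Return (i, j, r) for component pairs with |Pearson r| above threshold.

    s1, s2: arrays of shape (n_genes, n_components) from two decompositions,
    with rows aligned to the same genes.
    """
    # Standardize each component so the cross-product gives Pearson r.
    z1 = (s1 - s1.mean(axis=0)) / s1.std(axis=0)
    z2 = (s2 - s2.mean(axis=0)) / s2.std(axis=0)
    r = z1.T @ z2 / s1.shape[0]
    return [
        (i, j, float(r[i, j]))
        for i in range(r.shape[0])
        for j in range(r.shape[1])
        if abs(r[i, j]) > threshold
    ]


# Synthetic demonstration: component 1 of the first decomposition
# reappears (with noise) as component 0 of the second.
rng = np.random.default_rng(0)
first = rng.normal(size=(500, 3))
second = rng.normal(size=(500, 3))
second[:, 0] = first[:, 1] + 0.1 * rng.normal(size=500)
print(correlation_edges(first, second))
```

The resulting edge list can be loaded into a graph tool such as Cytoscape, where groups of mutually correlated components appear as the pseudocliques mentioned in the abstract.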
Draft genome sequences of two clinical isolates of Mycobacterium tuberculosis from sputum of Kazakh patients
Here, we report the draft genome sequences of two clinical isolates of Mycobacterium tuberculosis (MTB-476 and MTB-489) isolated from sputum of Kazakh patients.
Draft genome sequence of Lactobacillus rhamnosus CLS17
The human gut microbiome is an organ that provides primary barrier protection against foreign agents. Most of the microorganisms are different strains of commensal bacteria that colonize the gut. Gut flora influence food metabolism and have an antagonistic effect on different pathogens as well as immunomodulatory properties (1). One of the main species of gut flora is in the genus Lactobacillus... This work was supported by grant 0113PK00783 from the Ministry of Education and Science of the Republic of Kazakhstan.
Determining the optimal number of independent components for reproducible transcriptomic data analysis
BACKGROUND: Independent Component Analysis (ICA) is a method that models gene expression data as an action of a set of statistically independent hidden factors. The output of ICA depends on a fundamental parameter: the number of components (factors) to compute. The optimal choice of this parameter, related to determining the effective data dimension, remains an open question in the application of blind source separation techniques to transcriptomic data.
RESULTS: Here we address the question of optimizing the number of statistically independent components in the analysis of transcriptomic data for reproducibility of the components in multiple runs of ICA (within the same or within varying effective dimensions) and in multiple independent datasets. To this end, we introduce a ranking of independent components based on their stability in multiple ICA computation runs and define a distinguished number of components (Most Stable Transcriptome Dimension, MSTD) corresponding to the point of qualitative change in the stability profile. Based on a large body of data, we demonstrate that a sufficient number of dimensions is required for biological interpretability of the ICA decomposition and that the most stable components with ranks below MSTD are more likely to be reproduced in independent studies than the less stable ones. At the same time, we show that a transcriptomics dataset can be reduced to a relatively high number of dimensions without losing the interpretability of ICA, even though higher dimensions give rise to components driven by small gene sets.
CONCLUSIONS: We suggest a protocol for applying ICA to transcriptomics data that allows components to be prioritized by their reproducibility, which strengthens the biological interpretation. Computing too few components (far fewer than MSTD) is not optimal for interpretability of the results. The components ranked within the MSTD range are more likely to be reproduced in independent studies.
Induction of Apoptosis in U937 Cells by Using a Combination of Bortezomib and Low-Intensity Ultrasound
Background: We scrutinized the feasibility of apoptosis induction in blood cancer cells by means of low-intensity ultrasound and the proteasome inhibitor bortezomib (Velcade).
Material/Methods: Human leukemic monocyte lymphoma U937 cells were subjected to ultrasound in the presence of bortezomib and the echo contrast agent Sonazoid. Two acoustic intensities (0.18 W/cm² and 0.05 W/cm²) were used for the experiments. Treated U937 cells were analyzed for viability and levels of early and late apoptosis. In addition, scanning electron microscopy analysis of treated cells was performed.
Results: The percentage of cells that underwent early apoptosis in the group treated with ultrasound and Sonazoid was 8.0±1.31% (intensity 0.18 W/cm²) and 7.0±1.69% (0.05 W/cm²). However, coupling of bortezomib and Sonazoid resulted in an increase in the percentage of cells in the early apoptosis phase, up to 32.50±3.59% (intensity 0.18 W/cm²) and 33.0±4.90% (0.05 W/cm²). The percentage of U937 cells in the late apoptosis stage was not significantly different from that in the group treated with bortezomib only.
Conclusions: Our findings indicate the feasibility of apoptosis induction in blood cancer cells by using a combination of bortezomib, ultrasound contrast agents, and low-intensity ultrasound.
A USER-FRIENDLY TOOL FOR SIMPLIFIED GENOMICS DATA MINING FROM LARGE VCF FILES
Introduction: High-throughput sequencing platforms generate massive amounts of high-dimensional
genomic data that are available for analysis. Modern and user-friendly bioinformatics tools for analysis
and interpretation of genomics data become essential during the analysis of sequencing data. Variant
Call Format (VCF) is a standard format containing genomic information and variants of sequenced
samples. Existing tools for processing VCF files usually lack an intuitive graphical interface and
instead offer only a command-line interface, which may be challenging to use for the broader biomedical
community interested in genomics data analysis. We present re-Searcher, a new bioinformatics application
with a user-friendly GUI developed to simplify genomic data mining from VCF files.
Methods: The re-Searcher application was written in Python 3. The Pandas library solves the problem of
analyzing large VCF files by not loading the whole file directly into RAM, but instead pre-processing it in
chunks. A simple and intuitive GUI was built using the Tkinter library.
Results: The generalized workflow of the re-Searcher consists of several steps: selecting an input file,
setting up necessary filtering parameters, data processing, and exporting a filtered output VCF file.
re-Searcher browses and opens VCF files with extensions .txt or .vcf, before performing the following
filtering and extraction options: header extraction, keyword search, sample extraction, and genotype
format conversion.
Conclusion: Exploring and analyzing VCF files generated after the bioinformatics processing of
sequencing data is one of the important steps performed by researchers during analysis and meta-analysis
of genotype/phenotype associations. We have developed and introduced an easy-to-use
bioinformatics tool, re-Searcher, with several unique features for mining big VCF files, implemented with
a simple graphical user interface that makes it easily available to clinicians and researchers without
any computational skills. The software is publicly available in the GitHub repository
(https://github.com/LabBandSB/re-Searcher).
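The chunked processing described in the Methods can be sketched with pandas as follows. This is a minimal illustration of the general technique, not re-Searcher's actual code: the function names, the keyword search on the INFO column, and the chunk size are all assumptions made for the example.

```python
import pandas as pd


def count_meta_lines(path):
    """Count leading '##' meta-information lines of a VCF file."""
    n = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break
            n += 1
    return n


def keyword_search(path, keyword, chunksize=100_000):
    """Return rows whose INFO field contains `keyword`, reading in chunks."""
    # Skip the '##' lines so the '#CHROM ...' line becomes the column header.
    reader = pd.read_csv(
        path,
        sep="\t",
        skiprows=count_meta_lines(path),
        dtype=str,            # keep every field as text; no type guessing
        chunksize=chunksize,  # only one chunk is held in memory at a time
    )
    hits = [
        chunk[chunk["INFO"].str.contains(keyword, regex=False, na=False)]
        for chunk in reader
    ]
    if not hits:
        return pd.DataFrame()
    return pd.concat(hits, ignore_index=True)
```

Only the matching rows of each chunk are retained, so peak memory depends on the chunk size and the number of hits rather than on the size of the input file.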