30 research outputs found

    Computational identification of transposable elements in the mouse genome

    Repeat sequences cover about 39 percent of the mouse genome, and completion of the mouse genome sequence [1] has enabled extensive research on the role of repeat sequences in mammalian genomics. This research covers the identification of transposable elements (TEs) within the mouse transcriptome, based on available sequence information on mouse cDNAs (complementary DNAs) from GenBank [28]. The transcripts are screened for repeats using RepeatMasker [23], whose results are filtered to retain only interspersed repeats (IRs). Using various bioinformatics software tools as well as tailor-made programs, the research establishes: (i) the absolute location coordinates of the TEs on the transcripts; (ii) the location of the IRs with respect to the 5’UTR, CDS and 3’UTR sequence features; (iii) the quality of alignment of each TE’s consensus sequence to the transcripts where it occurs; (iv) the frequencies and distributions of the TEs on the cDNAs; and (v) descriptions of the types and roles of transcripts containing TEs. This information has been collated and stored in a relational database (MTEDB) at http://warta.bio.psu.edu/htt_doc/MTEDB/homepage.htm
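Locating a repeat relative to the 5’UTR, CDS and 3’UTR comes down to comparing intervals on the transcript. The following is a minimal sketch of that comparison, not the study's own code: the function name `classify_repeat`, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
def classify_repeat(rep_start, rep_end, cds_start, cds_end):
    """Return the transcript regions that a repeat interval overlaps.

    All coordinates are assumed to be 1-based and inclusive on the
    transcript. Everything before cds_start is treated as 5'UTR and
    everything after cds_end as 3'UTR.
    """
    if rep_start > rep_end or cds_start > cds_end:
        raise ValueError("start coordinate must not exceed end coordinate")

    regions = []
    if rep_start < cds_start:
        # Some part of the repeat lies upstream of the coding sequence.
        regions.append("5'UTR")
    if rep_start <= cds_end and rep_end >= cds_start:
        # The repeat and the coding sequence share at least one base.
        regions.append("CDS")
    if rep_end > cds_end:
        # Some part of the repeat lies downstream of the coding sequence.
        regions.append("3'UTR")
    return regions


# Hypothetical transcript with a CDS spanning positions 100..400.
print(classify_repeat(90, 120, 100, 400))
```

A repeat that straddles a boundary is reported in both regions, which keeps the classification lossless; a real pipeline might instead assign the region with the largest overlap.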

    Effect of the Transposable Element Environment of Human Genes on Gene Length and Expression

    Independent lines of investigation have documented effects of both transposable elements (TEs) and gene length (GL) on gene expression. However, TE gene fractions are highly correlated with GL, suggesting that they cannot be considered independently. We evaluated the TE environment of human genes and GL jointly in an attempt to tease apart their relative effects. TE gene fractions and GL were compared with the overall level of gene expression and the breadth of expression across tissues. GL is strongly correlated with overall expression level but weakly correlated with the breadth of expression, confirming the selection hypothesis that attributes the compactness of highly expressed genes to selection for economy of transcription. However, TE gene fractions overall, and for the L1 family in particular, show stronger anticorrelations with expression level than GL, indicating that GL may not be the most important target of selection for transcriptional economy. These results suggest a specific mechanism, removal of TEs, by which highly expressed genes are selectively tuned for efficiency. MIR elements are the only family of TEs with gene fractions that show a positive correlation with tissue-specific expression, suggesting that they may provide regulatory sequences that help to control human gene expression. Consistent with this notion, MIR fractions are relatively enriched close to transcription start sites and associated with coexpression in specific sets of related tissues. Our results confirm the overall relevance of the TE environment to gene expression and point to distinct mechanisms by which different TE families may contribute to gene regulation

    Meta-analysis of African ancestry genome-wide association studies identified novel locus and validates multiple loci associated with kidney function

    Despite recent efforts to increase diversity in genome-wide association studies (GWASs), most loci currently associated with kidney function are still limited to European ancestry due to the underlying sample selection bias in available GWASs. We set out to identify susceptibility loci associated with estimated glomerular filtration rate (eGFRcrea) in 80,027 individuals of African ancestry from the UK Biobank (UKBB), Million Veteran Program (MVP), and Chronic Kidney Disease genetics (CKDGen) consortia. We identified 8 lead SNPs, 7 of which were previously associated with eGFR in other populations. We identified one novel variant, rs77408001, which is an intronic variant mapped to the ELN gene. We validated three previously reported loci at GATM-SPATA5L1, SLC15A5 and AGPAT3. Fine-mapping analysis identified variants rs77121243 and rs201602445 as having a 99.9% posterior probability of being causal. Our results warrant designing bigger studies within individuals of African ancestry to gain new insights into the pathogenesis of chronic kidney disease (CKD), and to identify genomic variants unique to this ancestry that may influence renal function and disease

    Molecular Dynamic Simulation Reveals Structure Differences in APOL1 Variants and Implication in Pathogenesis of Chronic Kidney Disease.

    BACKGROUND: According to observational studies, two polymorphisms in the apolipoprotein L1 (APOL1) gene have been linked to an increased risk of chronic kidney disease (CKD) in Africans. One polymorphism involves the substitution of two amino-acid residues (S342G and I384M; known as G1), while the other involves the deletion of two amino-acid residues in a row (N388 and Y389; termed G2). Despite the strong link between APOL1 polymorphisms and kidney disease, the molecular mechanisms via which these APOL1 mutations influence the onset and progression of CKD remain unknown. METHODS: To predict the active site and allosteric site on the APOL1 protein, we used the Computed Atlas of Surface Topography of Proteins (CASTp) and the Protein Allosteric Sites Server (PASSer). Using an extended molecular dynamics simulation, we investigated the characteristic structural perturbations in the 3D structures of APOL1 variants. RESULTS: According to CASTp's active site characterization, the topmost predicted site had a surface area of 964.892 Å² and a pocket volume of 900.792 Å³. For the top three allosteric pockets, the allostery probability was 52.44%, 46.30%, and 38.50%, respectively. The systems reached equilibrium in about 125 ns. From 0-100 ns, there was also significant structural instability. When compared to G1 and G2, the wildtype protein (G0) had overall high stability throughout the simulation. The root-mean-square fluctuation (RMSF) of wildtype and variant protein backbone Cα fluctuations revealed that the Cα of the variants had a large structural fluctuation when compared to the wildtype. CONCLUSION: Using a combination of different computational techniques, we identified binding sites within the APOL1 protein that could be an attractive site for potential inhibitors of APOL1. Furthermore, the G1 and G2 mutations reduced the structural stability of APOL1

    QuasiFlow: a Nextflow pipeline for analysis of NGS-based HIV-1 drug resistance data.

    SUMMARY: Next-generation sequencing (NGS) enables reliable detection of resistance mutations in minority variants of human immunodeficiency virus type 1 (HIV-1). There is a paucity of evidence for the association of minority resistance with treatment failure, and this requires evaluation. However, the tools for analyzing HIV-1 drug resistance (HIVDR) testing data are mostly web-based, which requires uploading data to webservers. This is a challenge for laboratories with internet connectivity issues and in settings with restricted data transfer across networks. We present QuasiFlow, a pipeline for reproducible analysis of NGS-based HIVDR testing data across different computing environments. Since QuasiFlow depends entirely on command-line tools and a local copy of the reference database, it eliminates challenges associated with uploading HIV-1 NGS data onto webservers. The pipeline takes raw sequence reads in FASTQ format as input and generates a user-friendly report in PDF/HTML format. The drug resistance scores obtained using QuasiFlow were 100% and 99.12% identical to those obtained using the web-based HIVdb program and HyDRA web, respectively, at a mutation detection threshold of 20%. AVAILABILITY AND IMPLEMENTATION: QuasiFlow and corresponding documentation are publicly available at https://github.com/AlfredUg/QuasiFlow. The pipeline is implemented in Nextflow and requires regular updating of the Stanford HIV drug resistance interpretation algorithm. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online
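A mutation detection threshold of 20% means that a mutation is reported only when it is present in at least that fraction of reads at its position. The sketch below illustrates the idea only; it is not QuasiFlow's code or interface. The function name `call_mutations`, the inclusive comparison, and the example frequencies are assumptions made for illustration.

```python
def call_mutations(frequencies, threshold=0.20):
    """Return the mutations whose read frequency meets the threshold.

    `frequencies` maps a mutation label to the fraction of reads that
    carry it (0.0 to 1.0). The comparison is inclusive here, which is
    an assumption; a given tool may use a strict comparison instead.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return sorted(m for m, f in frequencies.items() if f >= threshold)


# Hypothetical per-mutation read frequencies for one sample.
sample = {"K103N": 0.85, "M184V": 0.20, "K65R": 0.04}

print(call_mutations(sample))                  # majority-level calls
print(call_mutations(sample, threshold=0.01))  # includes minority variants
```

Lowering the threshold exposes minority variants such as the 4% entry above, which is exactly the class of mutation whose clinical relevance the abstract says still needs evaluation.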

    Transcriptional activity, chromosomal distribution and expression effects of transposable elements in Coffea genomes

    Plant genomes are massively invaded by transposable elements (TEs), many of which are located near host genes and can thus impact gene expression. In flowering plants, TE expression can be activated (de-repressed) under certain stressful conditions, both biotic and abiotic, as well as by genome stress caused by hybridization. In this study, we examined the effects of these stress agents on TE expression in two diploid species of coffee, Coffea canephora and C. eugenioides, and their allotetraploid hybrid C. arabica. We also explored the relationship of TE repression mechanisms to host gene regulation via the effects of exonized TE sequences. Similar to what has been seen for other plants, overall TE expression levels are low in Coffea plant cultivars, consistent with the existence of effective TE repression mechanisms. TE expression patterns are highly dynamic across the species and conditions assayed here and are unrelated to their classification at the level of TE class or family. In contrast to previous results, cell culture conditions per se do not lead to the de-repression of TE expression in C. arabica. Results obtained here indicate that differing plant drought stress levels relate strongly to TE repression mechanisms. TEs tend to be expressed at significantly higher levels in non-irrigated samples for the drought-tolerant cultivars, but in drought-sensitive cultivars the opposite pattern was seen, with irrigated samples showing significantly higher TE expression. Thus, TE genome repression mechanisms may be finely tuned to the ideal growth and/or regulatory conditions of the specific plant cultivars in which they are active. Analysis of TE expression levels in cell culture conditions underscored the importance of nonsense-mediated mRNA decay (NMD) pathways in the repression of Coffea TEs. These same NMD mechanisms can also regulate plant host gene expression via the repression of genes that bear exonized TE sequences.

    Phylogenomic analysis uncovers a 9-year variation of Uganda influenza type-A strains from the WHO-recommended vaccines and other Africa strains

    Genetic characterisation of circulating influenza viruses directs annual vaccine strain selection and mitigation of infection spread. We used next-generation sequencing to locally generate whole genomes from 116 A(H1N1)pdm09 and 118 A(H3N2) positive patient swabs collected across Uganda between 2010 and 2018. We recovered sequences from 92% (215/234) of the swabs, 90% (193/215) of which were whole genomes. The newly-generated sequences were genetically and phylogenetically compared to the WHO-recommended vaccines and other Africa strains sampled since 1994. Uganda strain hemagglutinin (n = 206), neuraminidase (n = 207), and matrix protein (MP, n = 213) sequences had 95.23–99.65%, 95.31–99.79%, and 95.46–100% amino acid similarity to the 2010–2020 season vaccines, respectively, with several mutated hemagglutinin antigenic, receptor binding, and N-linked glycosylation sites. Uganda influenza type-A virus strains sequenced before 2016 clustered uniquely while later strains mixed with other Africa and global strains. We are the first to report novel A(H1N1)pdm09 subclades 6B.1A.3, 6B.1A.5(a,b), and 6B.1A.6 (± T120A) that circulated in Eastern, Western, and Southern Africa in 2017–2019. Africa forms part of the global influenza ecology with high viral genetic diversity, progressive antigenic drift, and local transmissions. For a continent with inadequate health resources and where social distancing is unsustainable, vaccination is the best option. Hence, African stakeholders should prioritise routine genome sequencing and analysis to direct vaccine selection and virus control

    Genome-wide association analysis of cystatin-C kidney function in continental Africa

    BACKGROUND: Chronic kidney disease is becoming more prevalent in Africa, and its genetic determinants are poorly understood. Creatinine-based estimated glomerular filtration rate (eGFR) is commonly used to estimate kidney function, modelling the excretion of the endogenous biomarker (creatinine). However, eGFR based on creatinine has been shown to inadequately detect individuals with low kidney function in Sub-Saharan Africa, with eGFR based on cystatin-C (eGFRcys) exhibiting significantly superior performance. Therefore, we opted to conduct a GWAS for eGFRcys. METHODS: Using the Uganda Genomic Resource, we performed a genome-wide association study (GWAS) of eGFRcys in 5877 Ugandans and evaluated replication in independent studies. Subsequently, putative causal variants were screened through Bayesian fine-mapping. Functional annotation of the GWAS loci was performed using Functional Mapping and Annotation (FUMA). FINDINGS: Three independent lead single nucleotide polymorphisms (SNPs) (P-value 99%. The rs911119 SNP maps to the cystatin C gene and has been previously associated with eGFRcys among Europeans. With gene-set enrichment analyses of the olfactory receptor family 51 overlapping genes, we identified an association with the G-alpha-S signalling events. INTERPRETATION: Our study found two previously unreported associated SNPs for eGFRcys in continental Africans (rs59288815 and rs4277141) and validated a previously well-established SNP (rs911119) for eGFRcys. The identified gene-set enrichment for the G-protein signalling pathways relates to the capacity of the kidney to readily adapt to an ever-changing environment. Additional GWASs are required to represent the diverse regions in Africa. FUNDING: Wellcome (220740/Z/20/Z)

    Effects of repetitive DNA and epigenetics on human genome regulation

    The highly developed and specialized anatomical and physiological characteristics observed for eukaryotes in general and mammals in particular are underwritten by an elaborate and intricate process of genome regulation. This precise control of the location, timing and amplitude of gene expression is achieved by a variety of genetic and epigenetic tools and mechanisms. While several of these regulatory mechanisms have been extensively studied, our understanding of the complex and diverse associations between various epigenetic marks and genetic elements with genome regulatory systems has remained incomplete. However, the recent profound improvements in sequencing technologies have significantly improved the depth and breadth to which their functions and relationships can be understood. The objective of this thesis has been to apply bioinformatics, computational and statistical tools to analyze and interpret various recent high-throughput datasets from a combination of next-generation sequencing and chromatin immunoprecipitation (ChIP-seq) experiments. These datasets have been analyzed to further our understanding of the dynamics of gene regulation in humans, particularly as it relates to repetitive DNA, cis-regulation and DNA methylation. The thesis thus resides at the intersection of three major areas: transposable elements, cis-regulatory elements and epigenetics. It explores how those three aspects of regulation relate to gene expression and the functional implications of those interactions. From this analysis, the thesis provides new insights into: 1) the relationship between the transposable element environment of human genes and their expression, 2) the role of mammalian-wide interspersed repeats (MIRs) in the function of human enhancers and enhancement of tissue-specific functions, 3) the existence and function of composite cis-regulatory elements and 4) the dynamics and relationship between human gene-body DNA methylation and gene expression. (Ph.D. thesis)

    Generalizability of machine learning in predicting antimicrobial resistance in E. coli: a multi-country case study in Africa

    Background: Antimicrobial resistance (AMR) remains a significant global health threat, particularly impacting low- and middle-income countries (LMICs). These regions often grapple with limited healthcare resources and access to advanced diagnostic tools. Consequently, there is a pressing need for innovative approaches that can enhance AMR surveillance and management. Machine learning (ML), though underutilized in these settings, presents a promising avenue. This study leverages ML models trained on whole-genome sequencing data from England, where such data is more readily available, to predict AMR in E. coli, targeting key antibiotics such as ciprofloxacin, ampicillin, and cefotaxime. A crucial part of our work involved the validation of these models using an independent dataset from Africa, specifically from Uganda, Nigeria, and Tanzania, to ascertain their applicability and effectiveness in LMICs. Results: Model performance varied across antibiotics. The Support Vector Machine excelled in predicting ciprofloxacin resistance (87% accuracy, F1 Score: 0.57), Light Gradient Boosting Machine for cefotaxime (92% accuracy, F1 Score: 0.42), and Gradient Boosting for ampicillin (58% accuracy, F1 Score: 0.66). In validation with data from Africa, Logistic Regression showed high accuracy for ampicillin (94%, F1 Score: 0.97), while Random Forest and Light Gradient Boosting Machine were effective for ciprofloxacin (50% accuracy, F1 Score: 0.56) and cefotaxime (45% accuracy, F1 Score: 0.54), respectively. Key mutations associated with AMR were identified for these antibiotics. Conclusion: As the threat of AMR continues to rise, the successful application of these models, particularly on genomic datasets from LMICs, signals a promising avenue for improving AMR prediction to support large AMR surveillance programs. This work thus not only expands our current understanding of the genetic underpinnings of AMR but also provides a robust methodological framework that can guide future research and applications in the fight against AMR
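Accuracy and F1 score can diverge sharply when resistant and susceptible isolates are unevenly represented, which is one reason both are reported. The following is a minimal sketch of how the two metrics are computed from a confusion matrix. The counts are hypothetical and are not taken from the study; the function names are chosen for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class.

    Equivalent to 2*TP / (2*TP + FP + FN); true negatives play no role.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced test set: 7 resistant, 93 susceptible isolates.
tp, fp, fn, tn = 4, 3, 3, 90

print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")
print(f"F1 score = {f1_score(tp, fp, fn):.2f}")
```

Because F1 ignores true negatives, a model can score high accuracy by correctly labelling the majority class while still performing modestly on the resistant class, and the reverse can hold when resistant isolates dominate.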