27 research outputs found

    Management of biological sequences using suffix trees

    Get PDF
    The amount of available biological sequences, represented as strings over the DNA and protein alphabets, grows at phenomenal rate. Supporting various search tasks over such data efficiently requires development of sophisticated indexing techniques. Recently, suffix tree (ST) and suffix array (SA) received considerable attention as suitable data structures in this context. However, existing solutions often focus on either efficiency or scalability, but not both. Further, some of the solutions require advanced computational resources or are tailored towards a specific application. We investigate, both theoretically and experimentally, ways to improve efficiency and scalability in management of biological sequence data. Our goal is to develop an indexing technique that is reasonable in construction time and space utilization, and supports efficiently versatile search applications in biological sequences of various sizes, running on a typical desktop computer. The contributions of this research include development of a ST based indexing technique, called HST, together with exact and approximate search algorithms that use the index. The results of our experiments indicate that the index construction cost is comparable to other ST based techniques, such as TDD and Trellis, in terms of construction time and main memory requirement. While HST exhibits slower construction time than Vmatch, the best known SA based solution, with the same amount of main memory HST can handle sequences that are an order of magnitude longer. In terms of the index size, HST is comparable to TDD and Vmatch, which is half of the Trellis index size. We also develop efficient and scalable search applications using HST, including exact match, k-mismatch, and structured motif search. Our experiments using real-life sequences indicated that for short sequences (e.g., human chromosomes), our exact match search is comparable to Vmatch, about 3 times faster than TDD, and more than 10 times faster than Trellis. Further, HST can be used to search directly in longer DNA sequences, as opposed to partitioning such a sequence and search in the parts - the only option to follow with Vmatch. We found that a direct exact match search using HST is twice faster when searching in the entire human genome, compared to using Vmatch on parts. Compared to Trellis, which can handle direct search in human genome, HST was more than 20 times faster. To further compare performance of HST and Vmatch, we considered k-mismatch search. Our results indicated significant improvement of the HST based solution over Vmatch, ranging from 2 to 9 times faster k-mismatch search on average, for short and long sequences, respectively. For structured motif search, HST was about 6 times faster than SMOTIF1, the best known structured motif search tool

    Defining bacterial species in the genomic era : insights from the genus Acinetobacter

    Get PDF
    Background: Microbial taxonomy remains a conservative discipline, relying on phenotypic information derived from growth in pure culture and techniques that are time-consuming and difficult to standardize, particularly when compared to the ease of modern high-throughput genome sequencing. Here, drawing on the genus Acinetobacter as a test case, we examine whether bacterial taxonomy could abandon phenotypic approaches and DNA-DNA hybridization and, instead, rely exclusively on analyses of genome sequence data. Results: In pursuit of this goal, we generated a set of thirteen new draft genome sequences, representing ten species, combined them with other publically available genome sequences and analyzed these 38 strains belonging to the genus. We found that analyses based on 16S rRNA gene sequences were not capable of delineating accepted species. However, a core genome phylogenetic tree proved consistent with the currently accepted taxonomy of the genus, while also identifying three misclassifications of strains in collections or databases. Among rapid distance-based methods, we found average-nucleotide identity (ANI) analyses delivered results consistent with traditional and phylogenetic classifications, whereas gene content based approaches appear to be too strongly influenced by the effects of horizontal gene transfer to agree with previously accepted species. Conclusion: We believe a combination of core genome phylogenetic analysis and ANI provides an appropriate method for bacterial species delineation, whereby bacterial species are defined as monophyletic groups of isolates with genomes that exhibit at least 95% pair-wise ANI. The proposed method is backwards compatible; it provides a scalable and uniform approach that works for both culturable and non-culturable species; is faster and cheaper than traditional taxonomic methods; is easily replicable and transferable among research institutions; and lastly, falls in line with Darwin’s vision of classification becoming, as far as is possible, genealogical

    Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

    Get PDF
    Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/

    Increased ultra-rare variant load in an isolated Scottish population impacts exonic and regulatory regions

    Get PDF
    Human population isolates provide a snapshot of the impact of historical demographic processes on population genetics. Such data facilitate studies of the functional impact of rare sequence variants on biomedical phenotypes, as strong genetic drift can result in higher frequencies of variants that are otherwise rare. We present the first whole genome sequencing (WGS) study of the VIKING cohort, a representative collection of samples from the isolated Shetland population in northern Scotland, and explore how its genetic characteristics compare to a mainland Scottish population. Our analyses reveal the strong contributions played by the founder effect and genetic drift in shaping genomic variation in the VIKING cohort. About one tenth of all high-quality variants discovered are unique to the VIKING cohort or are seen at frequencies at least ten fold higher than in more cosmopolitan control populations. Multiple lines of evidence also suggest relaxation of purifying selection during the evolutionary history of the Shetland isolate. We demonstrate enrichment of ultra-rare VIKING variants in exonic regions and for the first time we also show that ultra-rare variants are enriched within regulatory regions, particularly promoters, suggesting that gene expression patterns may diverge relatively rapidly in human isolates

    Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP.

    Get PDF
    We aimed to develop an efficient, flexible and scalable approach to diagnostic genome-wide sequence analysis of genetically heterogeneous clinical presentations. Here we present G2P ( www.ebi.ac.uk/gene2phenotype ) as an online system to establish, curate and distribute datasets for diagnostic variant filtering via association of allelic requirement and mutational consequence at a defined locus with phenotypic terms, confidence level and evidence links. An extension to Ensembl Variant Effect Predictor (VEP), VEP-G2P was used to filter both disease-associated and control whole exome sequence (WES) with Developmental Disorders G2P (G2PDD; 2044 entries). VEP-G2PDD shows a sensitivity/precision of 97.3%/33% for de novo and 81.6%/22.7% for inherited pathogenic genotypes respectively. Many of the missing genotypes are likely false-positive pathogenic assignments. The expected number and discriminative features of background genotypes are defined using control WES. Using only human genetic data VEP-G2P performs well compared to other freely-available diagnostic systems and future phenotypic matching capabilities should further enhance performance

    Identification and functional modelling of plausibly causative cis-regulatory variants in a highly-selected cohort with X-linked intellectual disability.

    Get PDF
    Identifying causative variants in cis-regulatory elements (CRE) in neurodevelopmental disorders has proven challenging. We have used in vivo functional analyses to categorize rigorously filtered CRE variants in a clinical cohort that is plausibly enriched for causative CRE mutations: 48 unrelated males with a family history consistent with X-linked intellectual disability (XLID) in whom no detectable cause could be identified in the coding regions of the X chromosome (chrX). Targeted sequencing of all chrX CRE identified six rare variants in five affected individuals that altered conserved bases in CRE targeting known XLID genes and segregated appropriately in families. Two of these variants, FMR1CRE and TENM1CRE, showed consistent site- and stage-specific differences of enhancer function in the developing zebrafish brain using dual-color fluorescent reporter assay. Mouse models were created for both variants. In male mice Fmr1CRE induced alterations in neurodevelopmental Fmr1 expression, olfactory behavior and neurophysiological indicators of FMRP function. The absence of another likely causative variant on whole genome sequencing further supported FMR1CRE as the likely basis of the XLID in this family. Tenm1CRE mice showed no phenotypic anomalies. Following the release of gnomAD 2.1, reanalysis showed that TENM1CRE exceeded the maximum plausible population frequency of a XLID causative allele. Assigning causative status to any ultra-rare CRE variant remains problematic and requires disease-relevant in vivo functional data from multiple sources. The sequential and bespoke nature of such analyses renders them time-consuming and challenging to scale for routine clinical use

    A recurrent de novo mutation in ACTG1 causes isolated ocular coloboma

    Get PDF
    Ocular coloboma (OC) is a defect in optic fissure closure and is a common cause of severe congenital visual impairment. Bilateral OC is primarily genetically determined and shows marked locus heterogeneity. Whole-exome sequencing (WES) was used to analyze 12 trios (child affected with OC and both unaffected parents). This identified de novo mutations in 10 different genes in eight probands. Three of these genes encoded proteins associated with actin cytoskeleton dynamics: ACTG1, TWF1, and LCP1. Proband-only WES identified a second unrelated individual with isolated OC carrying the same ACTG1 allele, encoding p.(Pro70Leu). Both individuals have normal neurodevelopment with no extra-ocular signs of Baraitser–Winter syndrome. We found this mutant protein to be incapable of incorporation into F-actin. The LCP1 and TWF1 variants each resulted in only minor disturbance of actin interactions, and no further plausibly causative variants were identified in these genes on resequencing 380 unrelated individuals with OC

    An actionable KCNH2 Long QT Syndrome variant detected by sequence and haplotype analysis in a population research cohort

    Get PDF
    The Viking Health Study Shetland is a population-based research cohort of 2,122 volunteer participants with ancestry from the Shetland Isles in northern Scotland. The high kinship and detailed phenotype data support a range of approaches for associating rare genetic variants, enriched in this isolate population, with quantitative traits and diseases. As an exemplar, the c.1750G > A; p.Gly584Ser variant within the coding sequence of the KCNH2 gene implicated in Long QT Syndrome (LQTS), which occurred once in 500 whole genome sequences from this population, was investigated. Targeted sequencing of the KCNH2 gene in family members of the initial participant confirmed the presence of the sequence variant and identified two further members of the same family pedigree who shared the variant. Investigation of these three related participants for whom single nucleotide polymorphism (SNP) array genotypes were available allowed a unique shared haplotype of 1.22 Mb to be defined around this locus. Searching across the full cohort for this haplotype uncovered two additional apparently unrelated individuals with no known genealogical connection to the original kindred. All five participants with the defined haplotype were shown to share the rare variant by targeted Sanger sequencing. If this result were verified in a healthcare setting, it would be considered clinically actionable, and has been actioned in relatives ascertained independently through clinical presentation. The General Practitioners of four study participants with the rare variant were alerted to the research findings by letters outlining the phenotype (prolonged electrocardiographic QTc interval). A lack of detectable haplotype sharing between c.1750G > A; p.Gly584Ser chromosomes from previously reported individuals from Finland and those in this study from Shetland suggests that this mutation has arisen more than once in human history. This study showcases the potential value of isolate population-based research resources for genomic medicine. It also illustrates some challenges around communication of actionable findings in research participants in this context.Peer reviewe

    Genomic epidemiology of a protracted hospital outbreak caused by multidrug-resistant Acinetobacter baumannii in Birmingham, England

    Get PDF
    BACKGROUND: Multidrug-resistant Acinetobacter baumannii commonly causes hospital outbreaks. However, within an outbreak, it can be difficult to identify the routes of cross-infection rapidly and accurately enough to inform infection control. Here, we describe a protracted hospital outbreak of multidrug-resistant A. baumannii, in which whole-genome sequencing (WGS) was used to obtain a high-resolution view of the relationships between isolates. METHODS: To delineate and investigate the outbreak, we attempted to genome-sequence 114 isolates that had been assigned to the A. baumannii complex by the Vitek2 system and obtained informative draft genome sequences from 102 of them. Genomes were mapped against an outbreak reference sequence to identify single nucleotide variants (SNVs). RESULTS: We found that the pulsotype 27 outbreak strain was distinct from all other genome-sequenced strains. Seventy-four isolates from 49 patients could be assigned to the pulsotype 27 outbreak on the basis of genomic similarity, while WGS allowed 18 isolates to be ruled out of the outbreak. Among the pulsotype 27 outbreak isolates, we identified 31 SNVs and seven major genotypic clusters. In two patients, we documented within-host diversity, including mixtures of unrelated strains and within-strain clouds of SNV diversity. By combining WGS and epidemiological data, we reconstructed potential transmission events that linked all but 10 of the patients and confirmed links between clinical and environmental isolates. Identification of a contaminated bed and a burns theatre as sources of transmission led to enhanced environmental decontamination procedures. CONCLUSIONS: WGS is now poised to make an impact on hospital infection prevention and control, delivering cost-effective identification of routes of infection within a clinically relevant timeframe and allowing infection control teams to track, and even prevent, the spread of drug-resistant hospital pathogens
    corecore