21 research outputs found

    De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space

    Get PDF
    Sequencing technologies allow for an in-depth analysis of biological species but the size of the generated datasets introduce a number of analytical challenges. Recently, we demonstrated the application of numerical sequence representations and data transformations for the alignment of short reads to a reference genome. Here, we expand out approach for de novo assembly of short reads. Our results demonstrate that highly compressed data can encapsulate the signal suffi- ciently to accurately assemble reads to big contigs or complete genomes

    A Method for Comparing Multivariate Time Series with Different Dimensions

    Get PDF
    In many situations it is desirable to compare dynamical systems based on their behavior. Similarity of behavior often implies similarity of internal mechanisms or dependency on common extrinsic factors. While there are widely used methods for comparing univariate time series, most dynamical systems are characterized by multivariate time series. Yet, comparison of multivariate time series has been limited to cases where they share a common dimensionality. A semi-metric is a distance function that has the properties of non-negativity, symmetry and reflexivity, but not sub-additivity. Here we develop a semi-metric – SMETS – that can be used for comparing groups of time series that may have different dimensions. To demonstrate its utility, the method is applied to dynamic models of biochemical networks and to portfolios of shares. The former is an example of a case where the dependencies between system variables are known, while in the latter the system is treated (and behaves) as a black box

    Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data

    Get PDF
    Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated using several simulated and a real metagenomic datasets

    Marginalised stack denoising autoencoders for metagenomic data binning

    Get PDF
    Shotgun sequencing has facilitated the analysis of complex microbial communities. Recently we have shown how local binary patterns (LBP) from image processing can be used to analyse the sequenced samples. LBP codes represent the data in a sparse high dimensional space. To improve the performance of our pipeline, marginalised stacked autoencoders are used here to learn frequent LBP codes and map the high dimensional space to a lower dimension dense space. We demonstrate its performance using both low and high complexity simulated metagenomic data and compare the performance of our method with several existing techniques including principal component analysis (PCA) in the dimension reduction step and fc-mer frequency in feature extraction step

    The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences.

    Get PDF
    Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data

    Investigation of Salmonella Phage–Bacteria Infection Profiles: Network Structure Reveals a Gradient of Target-Range from Generalist to Specialist Phage Clones in Nested Subsets

    Get PDF
    From MDPI via Jisc Publications RouterHistory: accepted 2021-06-23, pub-electronic 2021-06-28Publication status: PublishedFunder: Horizon 2020 Framework Programme; Grant(s): 767015Funder: The International Science and Technology Center; Grant(s): ISTC grant A-2140Bacteriophages that lyse Salmonella enterica are potential tools to target and control Salmonella infections. Investigating the host range of Salmonella phages is a key to understand their impact on bacterial ecology, coevolution and inform their use in intervention strategies. Virus–host infection networks have been used to characterize the “predator–prey” interactions between phages and bacteria and provide insights into host range and specificity. Here, we characterize the target-range and infection profiles of 13 Salmonella phage clones against a diverse set of 141 Salmonella strains. The environmental source and taxonomy contributed to the observed infection profiles, and genetically proximal phages shared similar infection profiles. Using in vitro infection data, we analyzed the structure of the Salmonella phage–bacteria infection network. The network has a non-random nested organization and weak modularity suggesting a gradient of target-range from generalist to specialist species with nested subsets, which are also observed within and across the different phage infection profile groups. Our results have implications for our understanding of the coevolutionary mechanisms shaping the ecological interactions between Salmonella phages and their bacterial hosts and can inform strategies for targeting Salmonella enterica with specific phage preparations

    Whole-genome analysis of Nigerian patients with breast cancer reveals ethnic-driven somatic evolution and distinct genomic subtypes

    Get PDF
    Black women across the African diaspora experience more aggressive breast cancer with higher mortality rates than white women of European ancestry. Although inter-ethnic germline variation is known, differential somatic evolution has not been investigated in detail. Analysis of deep whole genomes of 97 breast cancers, with RNA-seq in a subset, from women in Nigeria in comparison with The Cancer Genome Atlas (n = 76) reveal a higher rate of genomic instability and increased intra-tumoral heterogeneity as well as a unique genomic subtype defined by early clonal GATA3 mutations with a 10.5-year younger age at diagnosis. We also find non-coding mutations in bona fide drivers (ZNF217 and SYPL1) and a previously unreported INDEL signature strongly associated with African ancestry proportion, underscoring the need to expand inclusion of diverse populations in biomedical research. Finally, we demonstrate that characterizing tumors for homologous recombination deficiency has significant clinical relevance in stratifying patients for potentially life-saving therapies

    Local binary patterns as a feature descriptor in alignment-free visualisation of metagenomic data

    No full text

    A signal processing method for alignment-free metagenomic binning: multi-resolution genomic binary patterns

    Get PDF
    Algorithms in bioinformatics use textual representations of genetic information, sequences of the characters A, T, G and C represented computationally as strings or sub-strings. Signal and related image processing methods offer a rich source of alternative descriptors as they are designed to work in the presence of noisy data without the need for exact matching. Here we introduce a method, multi-resolution local binary patterns (MLBP) adapted from image processing to extract local 'texture' changes from nucleotide sequence data. We apply this feature space to the alignment-free binning of metagenomic data. The effectiveness of MLBP is demonstrated using both simulated and real human gut microbial communities. Sequence reads or contigs can be represented as vectors and their 'texture' compared efficiently using machine learning algorithms to perform dimensionality reduction to capture eigengenome information and perform clustering (here using randomized singular value decomposition and BH-tSNE). The intuition behind our method is the MLBP feature vectors permit sequence comparisons without the need for explicit pairwise matching. We demonstrate this approach outperforms existing methods based on k-mer frequencies. The signal processing method, MLBP, thus offers a viable alternative feature space to textual representations of sequence data. The source code for our Multi-resolution Genomic Binary Patterns method can be found at https://github.com/skouchaki/MrGBP

    Distance matrices for the primary commodity prices.

    No full text
    <p>Distance values were measured using the average and SMETS distances and are encoded in grayscale.</p
    corecore