301 research outputs found

    TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

    Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely-applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling, and the highly accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall 1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, 2) maintains high accuracy in downstream analysis, 3) precisely filters out up to 94.71% of off-target reads, and 4) achieves better performance, sensitivity, and generality compared to prior works. We freely open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall
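
The two-stage idea in this abstract (a cheap noisy basecall, then a similarity check that decides whether the expensive basecaller runs at all) can be illustrated with a toy sketch. Everything named here is an assumption for illustration: `light_call` stands in for LightCall, the k-mer overlap test stands in for Similarity Check, and the `k` and `min_fraction` defaults are arbitrary. It is not the TargetCall implementation.

```python
from typing import Callable, Iterable, Iterator, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_on_target(noisy_read: str, reference_kmers: Set[str],
                 k: int, min_fraction: float) -> bool:
    """Label a noisy read on-target if enough of its k-mers occur in the reference.

    Hypothetical stand-in for a similarity check; the threshold is illustrative.
    """
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return False
    shared = len(read_kmers & reference_kmers)
    return shared / len(read_kmers) >= min_fraction


def filter_signals(signals: Iterable[str],
                   light_call: Callable[[str], str],
                   reference: str,
                   k: int = 5,
                   min_fraction: float = 0.5) -> Iterator[str]:
    """Yield only the raw signals whose noisy reads look on-target.

    Only the signals yielded here would be passed on to the slow,
    accurate basecaller; the rest are discarded beforehand.
    """
    reference_kmers = kmers(reference, k)
    for signal in signals:
        noisy_read = light_call(signal)  # cheap, low-accuracy basecall
        if is_on_target(noisy_read, reference_kmers, k, min_fraction):
            yield signal
```

In this sketch a "signal" is just a string and `light_call` can be the identity function, which is enough to show the control flow: the filter decides per read, before any costly computation, whether the read is kept.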

    Identification of Phage Viral Proteins With Hybrid Sequence Features

    The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by the feature information representation algorithm based on sequence information (CCPA) with those extracted by the feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, with the strongest correlation to the class labels and low redundancy between features. Because the random forest classification algorithm randomly selects both the training samples and the candidate features at each node, a random forest is employed with 10-fold cross-validation to classify bacteriophage proteins. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
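
The 10-fold cross-validation mentioned in this abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the `fit` and `predict` callables are hypothetical placeholders for any classifier (the paper uses a random forest, which is not implemented here), and the split is a simple shuffled partition without stratification.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, n_folds: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def cross_validated_accuracy(features: Sequence, labels: Sequence,
                             fit: Callable, predict: Callable,
                             n_folds: int = 10) -> float:
    """Average held-out accuracy over the folds.

    fit(train_features, train_labels) returns a model;
    predict(model, one_feature_vector) returns a label.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), n_folds):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Each sample is held out exactly once, so the reported accuracy is an average over predictions made on data the model did not see during training.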

    Exploring The Interactions Between SARS-CoV-2 and Host Proteins.

    The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the current pandemic, Coronavirus Disease 2019 (COVID-19). SARS-CoV-2 is considered to be of zoonotic origin; it originated in non-human animals and was transmitted to humans. Since the early stage of the pandemic, however, the evidence of transmissions from humans to animals (reverse zoonoses) has been found in multiple animal species including mink, white-tailed deer, and pet and zoo animals. Furthermore, secondary zoonotic events of SARS-CoV-2, transmissions from animals to humans, have been also reported. It is suggested that non-human hosts can act as SARS-CoV-2 reservoirs where accumulated mutations in viral proteins could change the transmissibility and/or pathogenicity of the virus when it is spilled over again to human populations. Our goal, therefore, is to examine the SARS-CoV-2 genomic changes in non-human hosts and to identify the changes responsible for the adaptation of the virus in non-human hosts. Changes in the physicochemical properties of viral proteins potentially affect and influence their functions. Therefore, in this study, we compared SARS-CoV-2 proteins among human and non-human hosts and analyzed the differences in their physicochemical properties using the principal component analysis. In addition to the viral proteins from bat and pangolin, those from white-tailed deer and mink showed larger differences in the properties. Van der Waals volume, isoelectric point, charge, and thermostability index were found to be the main contributing factors. We next performed the comparisons of protein-protein interaction (PPI) prediction methods that use different features including physicochemical properties and those based on natural language processing. It showed that the Cross-attention PHV had slightly better performance scores than InterSPPI-HVPPI and LGCA-VHPPI. 
Finally, to examine how changes in the physicochemical properties of viral proteins affect their interactions with host proteins, PPI prediction was performed using the Cross-attention PHV between viral proteins from different SARS-CoV-2 variants and host proteins. The prediction scores between the different variants and host proteins from human and white-tailed deer were highly similar. The results showed that analyzing the physicochemical properties of viral proteins helps to understand how these properties affect viral-host PPIs and how viral proteins evolve to adapt to different host cell environments.
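
Charge is one of the physicochemical properties this abstract identifies as a main contributing factor. As a rough illustration of how such a property can be compared between two versions of a protein, the sketch below uses a common simplification: at roughly neutral pH, lysine (K) and arginine (R) count as +1, aspartate (D) and glutamate (E) as -1, and histidine and the chain termini are ignored. The function names and the toy peptides are hypothetical and are not taken from the study.

```python
# Residues treated as charged at roughly neutral pH (simplifying assumption:
# histidine and the N-/C-termini are ignored).
POSITIVE = frozenset("KR")
NEGATIVE = frozenset("DE")


def net_charge(protein: str) -> int:
    """Approximate net charge of a protein from its one-letter sequence."""
    seq = protein.upper()
    positives = sum(residue in POSITIVE for residue in seq)
    negatives = sum(residue in NEGATIVE for residue in seq)
    return positives - negatives


def charge_shift(reference: str, variant: str) -> int:
    """Change in approximate net charge going from reference to variant."""
    return net_charge(variant) - net_charge(reference)


# Toy peptides (not real viral sequences): one D -> K substitution.
print(charge_shift("MKRDE", "MKRKE"))
```

A single substitution from an acidic to a basic residue shifts the approximate net charge by two units, which is the kind of difference a property-based comparison across hosts would pick up.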

    Designing Methods for Representation Learning of Molecular Sequences and its Application in Analysis Tasks

    Molecular sequence analysis serves as a fundamental process for elucidating the intricate functions, structures, and behaviors inherent in sequences. Its application extends to characterizing associated organisms, such as viruses, facilitating the development of preventive measures to mitigate their dissemination and influence. Given the potential of viruses to trigger epidemics with global ramifications, comprehensive sequence analysis is pivotal in understanding and managing their impact effectively. The rapid expansion of bio-sequence data has surpassed the computational capabilities of traditional analytical techniques, such as the phylogenetic approach, due to their high computational costs. Consequently, clustering and classification have emerged as compelling alternatives, with machine learning (ML) and deep learning (DL) algorithms capable of effectively implementing these methods. Although ML/DL models are known for their high analytical capabilities, they typically require the inputs to be in either numerical or image form. Therefore, efficient and effective mechanisms are needed to transform bio-sequences into ML/DL-compatible inputs, and this research intends to devise such techniques. In this regard, alignment-free and fast feature-engineering-based approaches and image-based approaches are put forward in this work to convert the bio-sequences into numerical and image form, respectively. The feature-engineering-based methods PSSMFreq2Vec and PSSM2Vec combine the power of k-mers and position weight matrix (PWM) to be scalable, alignment-free, and compact, while Hashing2Vec combines hashing and k-mers to achieve high embedding generation speed and to be alignment-free.
Furthermore, two of the image-based approaches follow the underlying concept of Chaos Game Representation (CGR) to map sequences to images while one uses Bezier function-based mapping of sequences into images, and they aim to enable the application of sophisticated vision DL analytical models on bio-sequences. The representations gained from both feature-engineering-based and image-based methods are passed on to ML/DL models to perform classification tasks and their results illustrate high predictive performance as compared to the respective baseline models.
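
For readers unfamiliar with Chaos Game Representation, the sketch below shows the classic nucleotide form of the idea: each base pulls the current point halfway toward its corner of the unit square, and the visited points are binned into a small image. The corner assignment, the grid size, and the function names are assumptions made for illustration; the thesis's own CGR-based variants are not reproduced here.

```python
from typing import Dict, List, Tuple

# One common corner assignment for the four nucleotides (an assumption here).
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence: str) -> List[Tuple[float, float]]:
    """Return the CGR trajectory of a nucleotide sequence.

    Starting at the centre of the unit square, each base moves the point
    halfway toward that base's corner. Unknown symbols are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:  # e.g. ambiguity code 'N'
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def cgr_image(sequence: str, size: int = 8) -> List[List[int]]:
    """Bin the CGR points into a size x size grid of visit counts."""
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(sequence):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

The resulting grid is a fixed-size numeric array regardless of sequence length, which is what makes this kind of mapping usable as input to image-based models.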

    Application of Software Engineering Principles to Synthetic Biology and Emerging Regulatory Concerns

    As the science of synthetic biology matures, engineers have begun to deliver real-world applications which are the beginning of what could radically transform our lives. Recent progress indicates synthetic biology will produce transformative breakthroughs. Examples include: 1) synthesizing chemicals for medicines which are expensive and difficult to produce; 2) producing protein alternatives; 3) altering genomes to combat deadly diseases; 4) killing antibiotic-resistant pathogens; and 5) speeding up vaccine production. Although synthetic biology promises great benefits, many stakeholders have expressed concerns over safety and security risks from creating biological behavior never seen before in nature. As with any emerging technology, there is the risk of malicious use known as the dual-use problem. The technology is becoming democratized and de-skilled, and people in do-it-yourself communities can tinker with genetic code, similar to how programming has become prevalent through the ease of using macros in spreadsheets. While easy to program, it may be non-trivial to validate novel biological behavior. Nevertheless, we must be able to certify synthetically engineered organisms behave as expected, and be confident they will not harm natural life or the environment. Synthetic biology is an interdisciplinary engineering domain, and interdisciplinary problems require interdisciplinary solutions. Using an interdisciplinary approach, this dissertation lays foundations for verifying, validating, and certifying safety and security of synthetic biology applications through traditional software engineering concepts about safety, security, and reliability of systems. These techniques can help stakeholders navigate what is currently a confusing regulatory process. 
The contributions of this dissertation are: 1) creation of domain-specific patterns to help synthetic biologists develop assurance cases using evidence and arguments to validate safety and security of designs; 2) application of software product lines and feature models to the modular DNA parts of synthetic biology commonly known as BioBricks, making it easier to find safety features during design; 3) a technique for analyzing DNA sequence motifs to help characterize proteins as toxins or non-toxins; 4) a legal investigation regarding what makes regulating synthetic biology challenging; and 5) a repeatable workflow for leveraging safety and security artifacts to develop assurance cases for synthetic biology systems. Advisers: Myra B. Cohen and Brittany A. Dunca

    Leveraging Machine Learning for the Analysis and Prediction of Influenza A Virus

    Influenza, commonly known as flu, is a respiratory disease that poses a significant challenge to global public health due to its high prevalence and potential for serious health complications. The disease is caused by influenza viruses, among which influenza A viruses are of particular concern. These viruses are known for their rapid transmission, potential to cause severe health issues, and frequent mutations, which underscore the need for ongoing research and surveillance. A key aspect of managing influenza outbreaks includes understanding host origins, antigenic properties, and the ability of influenza A viruses to transmit between species, as this knowledge is critical in forecasting outbreaks and developing effective vaccines. Traditional approaches, such as hemagglutination inhibition assays for antigenicity assessment and phylogenetic analysis to determine genetic relationships, host origins and subtypes, have been fundamental in understanding influenza viruses. These methods, while informative, often face limitations in terms of time, resources, and the ability to keep pace with the rapid evolutionary changes of viruses. To mitigate these limitations, this thesis uses advanced machine learning techniques to analyse critical protein sequence data from influenza A viruses, offering an alternative perspective for unravelling the complexities of influenza, and potentially opening new avenues for analysis without strict reliance on prior biological knowledge. The core of the thesis is the application and refinement of predictive models to determine host origins, subtypes, and antigenic relationships of influenza A viruses. These models are evaluated comprehensively, considering factors such as the impact of incomplete sequences, performance across various host taxonomies and individual hosts, as well as the influence of reference databases on model performance. 
This evaluation illuminates the potential of machine learning to enhance our understanding of influenza A viruses in real-world scenarios, underscoring the ongoing importance of this research in public health.