
    EMMA: Adding Sequences into a Constraint Alignment with High Accuracy and Scalability (Abstract)

    Multiple sequence alignment (MSA) is a crucial precursor to many downstream biological analyses, such as phylogeny estimation [Morrison, 2006], RNA structure prediction [Shapiro et al., 2007], protein structure prediction [Jumper et al., 2021], etc. Obtaining an accurate MSA can be challenging, especially when the dataset is large (i.e., more than 1000 sequences). A key technique for large-scale MSA estimation is to add sequences into an existing alignment. For example, biological knowledge can be used to form a reference alignment on a subset of the sequences, and then the remaining sequences can be added to the reference alignment. Another case where adding sequences into an existing alignment occurs is when new sequences or genomes are added to databases, leading to the opportunity to add the new sequences for each gene in the genome into a growing alignment. A third case is for de novo multiple sequence alignment, where a subset of the sequences is selected and aligned, and then the remaining sequences are added into this "backbone alignment" [Nguyen et al., 2015; Park et al., 2023; Shen et al., 2022; Liu and Warnow, 2023; Park and Warnow, 2023; Yamada et al., 2016]. Thus, adding sequences into existing alignments is a natural problem with multiple applications to biological sequence analysis. A few methods have been developed to add sequences into an existing alignment, with MAFFT--add [Katoh and Frith, 2012] perhaps the most well-known. However, several multiple sequence alignment methods that operate in two steps (first extract and align the backbone sequences and then add the remaining sequences into this backbone alignment) also provide utilities for adding sequences into a user-provided alignment. We present EMMA, a new approach for adding "query" sequences into an existing "constraint" alignment. 
By construction, EMMA never changes the constraint alignment, except through the introduction of additional sites to represent homologies between the query sequences. EMMA uses a divide-and-conquer technique combined with MAFFT--add (using the most accurate setting, MAFFT-linsi--add) to add sequences into a user-provided alignment. We evaluate EMMA by comparing it to MAFFT-linsi--add, MAFFT--add (the default setting), and WITCH-ng-add. We include a range of biological and simulated datasets (nucleotides and proteins) ranging in size from 1000 to almost 200,000 sequences and evaluate alignment accuracy and scalability. MAFFT-linsi--add was the slowest and least scalable method, only able to run on datasets with at most 1000 sequences in this study, but had excellent accuracy (often the best) on those datasets. We also see that EMMA has better recall than WITCH-ng-add and MAFFT--add on large datasets, especially when the backbone alignment is small or clade-based.
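The guarantee that the constraint alignment is left unchanged, apart from newly introduced sites, can be checked mechanically. The following is a minimal sketch and not part of EMMA: the function names and the dictionary representation (sequence name mapped to its gapped row, with `-` as the gap character) are assumptions made for illustration. It restricts the extended alignment to the constraint sequences, drops columns that are all gaps in those rows, and compares the result with the original constraint alignment.

```python
def restrict_to_rows(alignment, names):
    """Keep only the rows in `names`, then drop columns that are all gaps in them."""
    rows = [alignment[name] for name in names]
    # zip(*rows) iterates over alignment columns.
    kept = [col for col in zip(*rows) if any(ch != "-" for ch in col)]
    # Transpose back to one gapped string per sequence.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


def preserves_constraint(constraint, extended):
    """True if `extended` induces the same alignment on the constraint sequences."""
    names = list(constraint)
    if any(name not in extended for name in names):
        return False
    # Normalize both sides so all-gap columns in the input do not matter.
    return restrict_to_rows(extended, names) == restrict_to_rows(constraint, names)


constraint = {"a": "AC-GT", "b": "ACCGT"}
# Query "q" introduces one new column (index 3) that is a gap in both constraint rows.
extended_ok = {"a": "AC--GT", "b": "ACC-GT", "q": "ACCTGT"}
# Here the row for "a" has been realigned, which changes the constraint alignment.
extended_bad = {"a": "ACG-T", "b": "ACCGT", "q": "ACCGT"}

print(preserves_constraint(constraint, extended_ok))
print(preserves_constraint(constraint, extended_bad))
```

A check of this kind only verifies the output; it says nothing about how the query sequences were placed.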

    Contrastive Masked Autoencoders are Stronger Vision Learners

    Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows that there is still room to build a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views and a feature decoder for complementing features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves the representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves 85.3% top-1 accuracy on ImageNet and 52.5% mIoU on ADE20k, surpassing previous best results by 0.7% and 1.8% respectively. The source code is publicly accessible at https://github.com/ZhichengHuang/CMAE. Comment: Accepted by TPAM
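The abstract describes the momentum branch only as a "momentum updated encoder" and does not give the update rule. A common choice in momentum-encoder methods is an exponential moving average of the online encoder's parameters, and the sketch below assumes that rule. The function name, the flat list of floats standing in for encoder weights, and the coefficient `m = 0.99` are all hypothetical and are not taken from the CMAE code.

```python
def momentum_update(momentum_params, online_params, m=0.99):
    """Exponential moving average: new = m * momentum + (1 - m) * online.

    Assumed update rule; parameters are plain floats here for illustration.
    """
    if len(momentum_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [m * pm + (1.0 - m) * po for pm, po in zip(momentum_params, online_params)]


# Toy example: the momentum parameters drift slowly towards the online parameters.
momentum = [1.0, 0.0]
online = [0.0, 1.0]
for _ in range(3):
    momentum = momentum_update(momentum, online, m=0.9)
print(momentum)
```

With `m` close to 1 the momentum encoder changes slowly, which is the usual motivation for this kind of update; whether CMAE uses exactly this form is not stated in the abstract.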

    BATCH-SCAMPP: Scaling Phylogenetic Placement Methods to Place Many Sequences (Abstract)


    EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment

    Background: Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. Results: We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA. Conclusions: EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment.

    Large scale sequence alignment via efficient inference in generative models

    Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity, especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long reads from Pacific Biosciences sequencers to genome sequences.
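The brute-force baseline mentioned in the abstract can be illustrated with a deliberately simplified model. The sketch below is an assumption made for illustration and is not the paper's model: it handles substitutions only (no insertions or deletions), requires reads of equal length, draws bases uniformly, and uses a hypothetical mismatch rate `eps`. Under the joint model a pair of bases has probability `(1 - eps) / 4` if they match and `eps / 12` otherwise; under the independent model every pair has probability `1 / 16`. The bucketing strategy itself is not shown.

```python
import math


def log_likelihood_ratio(reference, query, eps=0.1):
    """Log of P(joint) / P(independent) under a substitution-only toy model."""
    if len(reference) != len(query):
        raise ValueError("this toy model requires equal-length reads")
    match = math.log(4 * (1 - eps))   # log(((1 - eps) / 4) / (1 / 16))
    mismatch = math.log(4 * eps / 3)  # log((eps / 12) / (1 / 16))
    return sum(match if r == q else mismatch for r, q in zip(reference, query))


def best_match(references, query, eps=0.1):
    """Brute force: score every reference, so the cost grows linearly with the database."""
    return max(references, key=lambda ref: log_likelihood_ratio(ref, query, eps))


references = ["ACGTACGT", "TTTTACGT", "ACGTTTTT"]
query = "ACGAACGT"  # one substitution away from the first reference
print(best_match(references, query))
```

Each call to `best_match` scans the whole database, which is the linear cost that a bucketing scheme aims to avoid.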

    Image_2_StRAB4 gene is required for filamentous growth, conidial development, and pathogenicity in Setosphaeria turcica.pdf

    Setosphaeria turcica, the fungal pathogen responsible for northern corn leaf blight in maize, forms specialized infectious structures called appressoria that are critical for fungal penetration of maize epidermal cells. The Rab family of proteins plays a crucial role in the growth, development, and pathogenesis of many eukaryotic species. Rab4, in particular, is a key regulator of endocytosis and vesicle trafficking, essential for filamentous growth and successful infection by other fungal pathogens. In this study, we silenced StRAB4 in S. turcica to gain a better understanding of the function of Rab4 in this plant pathogen. Phenotypically, the mutants exhibited a reduced growth rate, a significant decline in conidia production, and an abnormal conidial morphology. These phenotypes indicate that StRab4 plays an instrumental role in regulating mycelial growth and conidial development in S. turcica. Further investigations revealed that StRab4 is a positive regulator of cell wall integrity and melanin secretion. Functional enrichment analysis of differentially expressed genes highlighted primary enrichments in peroxisome pathways, oxidoreductase and catalytic activities, membrane components, and cell wall organization processes. Collectively, our findings emphasize the significant role of StRab4 in S. turcica infection and pathogenicity in maize and provide valuable insights into fungal behavior and disease mechanisms.