8 research outputs found
EMMA: Adding Sequences into a Constraint Alignment with High Accuracy and Scalability (Abstract)
Multiple sequence alignment (MSA) is a crucial precursor to many downstream biological analyses, such as phylogeny estimation [Morrison, 2006], RNA structure prediction [Shapiro et al., 2007], protein structure prediction [Jumper et al., 2021], etc. Obtaining an accurate MSA can be challenging, especially when the dataset is large (i.e., more than 1000 sequences). A key technique for large-scale MSA estimation is to add sequences into an existing alignment. For example, biological knowledge can be used to form a reference alignment on a subset of the sequences, and then the remaining sequences can be added to the reference alignment. Another case where adding sequences into an existing alignment occurs is when new sequences or genomes are added to databases, leading to the opportunity to add the new sequences for each gene in the genome into a growing alignment. A third case is for de novo multiple sequence alignment, where a subset of the sequences is selected and aligned, and then the remaining sequences are added into this "backbone alignment" [Nguyen et al., 2015; Park et al., 2023; Shen et al., 2022; Liu and Warnow, 2023; Park and Warnow, 2023; Yamada et al., 2016]. Thus, adding sequences into existing alignments is a natural problem with multiple applications to biological sequence analysis.
A few methods have been developed to add sequences into an existing alignment, with MAFFT--add [Katoh and Frith, 2012] perhaps the most well-known. However, several multiple sequence alignment methods that operate in two steps (first extract and align the backbone sequences and then add the remaining sequences into this backbone alignment) also provide utilities for adding sequences into a user-provided alignment. We present EMMA, a new approach for adding "query" sequences into an existing "constraint" alignment. By construction, EMMA never changes the constraint alignment, except through the introduction of additional sites to represent homologies between the query sequences. EMMA uses a divide-and-conquer technique combined with MAFFT--add (using the most accurate setting, MAFFT-linsi--add) to add sequences into a user-provided alignment. We evaluate EMMA by comparing it to MAFFT-linsi--add, MAFFT--add (the default setting), and WITCH-ng-add. We include a range of biological and simulated datasets (nucleotides and proteins) ranging in size from 1000 to almost 200,000 sequences and evaluate alignment accuracy and scalability. MAFFT-linsi--add was the slowest and least scalable method, only able to run on datasets with at most 1000 sequences in this study, but had excellent accuracy (often the best) on those datasets. We also see that EMMA has better recall than WITCH-ng-add and MAFFT--add on large datasets, especially when the backbone alignment is small or clade-based
Contrastive Masked Autoencoders are Stronger Vision Learners
Masked image modeling (MIM) has achieved promising results on various vision
tasks. However, the limited discriminability of learned representation
manifests there is still plenty to go for making a stronger vision learner.
Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new
self-supervised pre-training method for learning more comprehensive and capable
vision representations. By elaboratively unifying contrastive learning (CL) and
masked image model (MIM) through novel designs, CMAE leverages their respective
advantages and learns representations with both strong instance
discriminability and local perceptibility. Specifically, CMAE consists of two
branches where the online branch is an asymmetric encoder-decoder and the
momentum branch is a momentum updated encoder. During training, the online
encoder reconstructs original images from latent representations of masked
images to learn holistic features. The momentum encoder, fed with the full
images, enhances the feature discriminability via contrastive learning with its
online counterpart. To make CL compatible with MIM, CMAE introduces two new
components, i.e. pixel shifting for generating plausible positive views and
feature decoder for complementing features of contrastive pairs. Thanks to
these novel designs, CMAE effectively improves the representation quality and
transfer performance over its MIM counterpart. CMAE achieves the
state-of-the-art performance on highly competitive benchmarks of image
classification, semantic segmentation and object detection. Notably, CMAE-Base
achieves top-1 accuracy on ImageNet and mIoU on ADE20k,
surpassing previous best results by and respectively. The
source code is publicly accessible at
\url{https://github.com/ZhichengHuang/CMAE}.Comment: Accepted by TPAM
EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment
Abstract Background Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. Results We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA . Conclusions EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment
Large scale sequence alignment via efficient inference in generative models
Abstract Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long-reads from Pacific Bioscience sequencers to genome sequences
Image_2_StRAB4 gene is required for filamentous growth, conidial development, and pathogenicity in Setosphaeria turcica.pdf
Setosphaeria turcica, the fungal pathogen responsible for northern corn leaf blight in maize, forms specialized infectious structures called appressoria that are critical for fungal penetration of maize epidermal cells. The Rab family of proteins play a crucial role in the growth, development, and pathogenesis of many eukaryotic species. Rab4, in particular, is a key regulator of endocytosis and vesicle trafficking, essential for filamentous growth and successful infection by other fungal pathogens. In this study, we silenced StRAB4 in S. turcica to gain a better understanding the function of Rab4 in this plant pathogen. Phenotypically, the mutants exhibited a reduced growth rate, a significant decline in conidia production, and an abnormal conidial morphology. These phenotypes indicate that StRab4 plays an instrumental role in regulating mycelial growth and conidial development in S. turcica. Further investigations revealed that StRab4 is a positive regulator of cell wall integrity and melanin secretion. Functional enrichment analysis of differentially expressed genes highlighted primary enrichments in peroxisome pathways, oxidoreductase and catalytic activities, membrane components, and cell wall organization processes. Collectively, our findings emphasize the significant role of StRab4 in S. turcica infection and pathogenicity in maize and provide valuable insights into fungal behavior and disease mechanisms.</p