18 research outputs found

    The mean F1 scores on the TR (tan) and TS (blue) sets by SeqFold2D and selected DL and traditional models in two different dataset setups.

    (A) Both TR and TS sets are from Stralign NR100. (B) TR from Stralign NR100 and TS from ArchiveII NR100. The models are sorted by their TS F1 scores. The names of the learning-based models are appended with the number of parameters, and a trailing asterisk indicates the use of post-processing. The F1 value is shown at the end of each bar. All learning-based models except SPOT-RNA are re-trained.
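The caption above reports mean F1 scores without restating how F1 is computed. A minimal sketch follows, assuming the standard definition (harmonic mean of precision and recall over exactly matching base pairs) and an unweighted per-sequence average. The function names `pair_f1` and `mean_f1` are hypothetical and are not taken from the SeqFold2D code.

```python
def pair_f1(predicted, reference):
    """F1 score between two collections of base pairs given as (i, j) tuples."""
    # Normalize so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0  # also covers an empty prediction or an empty reference
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(per_sequence_scores):
    """Unweighted mean of per-sequence F1 scores (assumed averaging scheme)."""
    return sum(per_sequence_scores) / len(per_sequence_scores)
```

Under these assumptions, each sequence contributes equally to the set-level mean regardless of its length; a pair-weighted average would be a different, equally plausible convention.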

    Illustrations of the TR (left, tan) vs. TS (right, blue) performances for selected learning and physics-based models at the cross-family level with the Strive-NR80 dataset.

    For each cross-family study, one RNA family is held out as the TS set and the remaining eight families are used for model development (TR and VL). Each panel/row here shows one such study labelled by the TS family name (B-E), while the first panel, (A) [Baseline], shows a baseline study with random splits of all families into the TR, VL, and TS subsets. Panel A is thus de facto a cross-cluster study with all subsets derived from the same parent dataset. For each panel, the average TR and TS scores are shown at the top and highlighted for the learning-based model with the highest TS score (physics-based models excluded). All learning-based models are retrained, with the numbers of parameters shown after their names. It should be noted that, despite our best re-training efforts, the scores of MXfold2 and Ufold should be viewed as guides only, as we are unable to match their reported performances when using the same datasets. Still, given the inverse correlation between TR and TS performances, their TR-TS gaps are expected to be under-estimates.

    Illustrations of the SeqFold2D network and the Stralign dataset.

    (A) The two-module architecture of the SeqFold2D models. An input RNA sequence of length L is first embedded via one-hot encoding and feed-forward layers to yield an L×C tensor. The first module consists of N blocks of either bidirectional Long Short-Term Memory (LSTM) or transformer encoders. The resultant L×C tensor is then transformed into the L×L×C pair representation via outer-product, before being fed to the second module of N blocks of residual 2D convolutional layers. The output block is made up of three feed-forward layers and predicts the PPM of dimension L×L. (B) The population distributions of eight RNA families at different sequence similarity levels for the Stralign dataset. The abbreviations are, rRNA: ribosomal RNA, tRNA: transfer RNA, Intron I: group I intron, tmRNA: transfer messenger RNA, SRP: signal recognition particle, and TERC: telomerase RNA component. The innermost ring shows the original Stralign dataset with a total of 37,149 sequences, noting that the five under-represented families (counter-clockwise from Intron I to TERC) are scaled up for visibility and the multiplier N is shown as “N×” in the label (see Fig A in S1 Text for the unscaled version). The L600 ring is after removing sequences longer than 600; the NR100 ring shows the cross-sequence level; and the NR80 ring shows the cross-cluster level. Note that the 16S rRNA NR80 has only 50 sequences and is barely visible.
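The step from the L×C sequence tensor to the L×L×C pair representation can be illustrated with a small sketch. It assumes a channel-wise outer product, pair[i][j][c] = x[i][c] · x[j][c], which is one plausible reading of "outer-product" in the caption. The caption does not specify the exact operation, and the one-hot input here stands in for the learned embedding. The names `one_hot` and `pair_representation` are hypothetical.

```python
def one_hot(seq, alphabet="ACGU"):
    """Encode an RNA sequence as L rows of C = len(alphabet) floats."""
    return [[1.0 if ch == a else 0.0 for a in alphabet] for ch in seq]


def pair_representation(x):
    """Map an L x C nested list to L x L x C by a channel-wise outer product.

    Assumption: pair[i][j][c] = x[i][c] * x[j][c]; the actual model may
    combine the two positions differently.
    """
    return [
        [[xi_c * xj_c for xi_c, xj_c in zip(xi, xj)] for xj in x]
        for xi in x
    ]


x = one_hot("GCAU")            # shape 4 x 4 (L = 4, C = 4)
pair = pair_representation(x)  # shape 4 x 4 x 4
```

With this construction the pair tensor is symmetric in its first two indices, so any asymmetry between positions i and j would have to come from later layers.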

    Illustrations of the F1-unseen vs. F1-seen correlations of the SeqFold2D-960K model.

    Each PSA or PSSA program is shown in the same color in all panels. (A) The F1-unseen over F1-seen ratio as a function of the PSI or PSSI threshold. The horizontal dashed line marks the F1 ratio between the entire unseen and seen datasets. (B) The PCC value as a function of the identity threshold. (C) The distributions of the F1-unseen and F1-seen scores at the nominal PSI or PSSI threshold of 50%. (D) The distributions of the F1-unseen and F1-seen scores at the identity threshold of 80%. It is common to find no seen sequences above high thresholds for an unseen sequence, leading to many null F1-seen values that are absent in (D).
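The two summary quantities in panels (A) and (B) can be sketched as follows, assuming that PCC denotes the Pearson correlation coefficient, that the F1 ratio is a ratio of mean scores, and that unseen sequences with a null F1-seen value are dropped before correlating. These are illustrative assumptions rather than the authors' stated procedure, and all function names are hypothetical.

```python
import math


def drop_null_pairs(f1_unseen, f1_seen):
    """Keep only (unseen, seen) pairs whose F1-seen value is not None."""
    return [(u, s) for u, s in zip(f1_unseen, f1_seen) if s is not None]


def ratio_of_means(f1_unseen, f1_seen):
    """Mean F1-unseen divided by mean F1-seen (assumed definition of the ratio)."""
    mean_unseen = sum(f1_unseen) / len(f1_unseen)
    mean_seen = sum(f1_seen) / len(f1_seen)
    return mean_unseen / mean_seen


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

A typical use would filter the paired scores at a given identity threshold with `drop_null_pairs`, then pass the two resulting columns to `ratio_of_means` and `pearson`.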

    Illustrations of the TR-TS gaps of the SeqFold2D-1.4M model.

    (A) Stral-NR100 as TR and Archi-NR100 as TS. (B) Stral-NR80 as TR and Archi-Stral-NR80 as TS. The first pair of violins shows the F1 scores for the entire TR (left, tan) and TS (right, blue) set and the following pairs show the scores for each RNA family. Averaged scores are shown as dashed lines (white) and at the very top. The parentheses above show the sequence counts in numbers (for the entire set or families with 1% share). The families existing in one set only are shown as “nan” for the other set, e.g., 23S rRNA in Archi-NR100 only.

    SDS-PAGE of emptied phages incubated at room temperature (RT), 72, and 92°C.

    (a) With addition of DNase I. (b) Without the addition of DNase I. Gel conditions and lane annotations are as in Fig. 2.

    Illustrations of the cross-family F1 scores and the PGscore distributions for all studies.

    (A) The TS vs. TR F1 scores of the baseline cross-cluster study ([Baseline]) and all nine cross-family studies (labelled by the TS family name) with Strive-NR80. Four zones (I-IV) are delineated for easy reference. The diagonal line in zone III denotes the line of zero TR-TS gap, i.e., TR = TS. The dashed line in zone IV is a guide to the eye only. The cross-family TS scores of the three groups of models (physics-based, ML, and DL) are shown in three respective boxplots as annotated. (B) Boxplots of the PGscores from all learning-based models for each study at the specific TR-TS similarity level. The studies are, XSeq-I: the cross-sequence study with Stral-NR100, XSeq-II: cross-sequence with Stral-NR100 and Archi-NR100, XCls-I: cross-cluster with Strive-NR80, XCls-II: cross-cluster with Stral-NR80 and Archi-Stral-NR80, XCls-III: cross-cluster with bpRNA, XFam: all nine cross-family studies with Strive-NR80. The learning-based models included for each study are shown in Figs T-U in S1 Text.

    S1 Text -

    Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved performance far superior to that of traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacities and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as the sequence similarity decreases. The same trends are observed from several recent DL and machine learning models. Moreover, an inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantitate how generalizability depends on sequence and structure identity scores via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice, and various pathways for future advances are discussed.

    SDS-PAGE of intact cI60 phages incubated at room temperature (RT), 72, and 92°C.

    (a) With addition of DNase I. (b) Without the addition of DNase I. The three fractions incubated at each temperature are shown in the order of the as-is, supernatant, and pellet fractions (see text for details). NuPAGE 12% Bis-Tris gels (Invitrogen) were run at 120 V for 2 hours and 45 minutes in the MOPS SDS buffer. Lane #1 is the PageRuler unstained protein ladder with 10 kDa to 200 kDa range.

    Nanoscale Structure and Interaction of Condensed Phases of DNA–Carbon Nanotube Hybrids

    Condensation of DNA–carbon nanotube (CNT) hybrids dispersed in aqueous solutions can be induced by elevated hybrid concentrations, salts, or crowding agents. DNA–CNT condensates exhibit either nematic ordering or amorphous aggregates, depending on the nature of interhybrid interactions. This study employed X-ray diffraction (XRD) to determine nanoscale structures of the condensates, including the presence of positional ordering, interaxial distances, and the range of ordered domains. To probe the effects of DNA sequence, two types of CNT hybrids, dispersed by genomic DNA of random sequence and synthetic oligonucleotides respectively, were studied under identical conditions. The osmotic stress method was further used to quantify force–distance dependencies of the DNA–CNT hybrids to elucidate the relation between interhybrid interactions and condensate structures. We observed that, independent of DNA sequence, lyotropic DNA–CNT phases showed weak positional ordering with long interhybrid distances, salt-induced condensates were amorphous, crowding-condensed DNA–CNTs were the most ordered with pronounced XRD peaks, and interhybrid interactions were defined by short-range hydration repulsion and long-range electrostatic repulsion. Conversely, the effects of DNA sequence became evident in their quantitative force–distance relationships. Genomic DNA of random sequence consistently gave longer interhybrid distances than synthetic oligonucleotides, which we attribute to the likely differences in their hybrid diameters.