Illustrations of the SeqFold2D network and the Stralign dataset.

Abstract

(A) The two-module architecture of the SeqFold2D models. An input RNA sequence of length L is first embedded via one-hot encoding and feed-forward layers to yield an L×C tensor. The first module consists of N blocks of either bidirectional Long Short-Term Memory (LSTM) or transformer encoders. The resulting L×C tensor is then transformed into the L×L×C pair representation via an outer product before being fed to the second module, which consists of N blocks of residual 2D convolutional layers. The output block is made up of three feed-forward layers and predicts the PPM of dimension L×L. (B) The population distributions of eight RNA families at different sequence similarity levels for the Stralign dataset. The abbreviations are, rRNA: ribosomal RNA, tRNA: transfer RNA, Intron I: group I intron, tmRNA: transfer messenger RNA, SRP: signal recognition particle, and TERC: telomerase RNA component. The innermost ring shows the original Stralign dataset with a total of 37,149 sequences. The five under-represented families (counter-clockwise from Intron I to TERC) are scaled up for visibility, and the multiplier N is shown as “N×” in the label (see Fig A in S1 Text for the unscaled version). The L600 ring shows the dataset after removing sequences longer than 600 nucleotides; the NR100 ring shows the cross-sequence level; and the NR80 ring shows the cross-cluster level. Note that the 16S rRNA family at NR80 has only 50 sequences and is barely visible.
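The tensor shapes described in panel (A) can be traced with a minimal NumPy sketch. It is not the SeqFold2D implementation: the helper names (`one_hot`, `pair_representation`), the random embedding matrix `W`, the channel count `C = 8`, and the choice of a channel-wise outer product are all assumptions made for illustration, and the LSTM/transformer and convolutional blocks are omitted.

```python
import numpy as np

def one_hot(seq, alphabet="ACGU"):
    """Encode an RNA sequence as an (L, 4) one-hot matrix."""
    idx = np.array([alphabet.index(base) for base in seq])
    return np.eye(len(alphabet))[idx]

def pair_representation(x):
    """Map an (L, C) tensor to (L, L, C) with a channel-wise outer product.

    Assumption for illustration: pair[i, j, c] = x[i, c] * x[j, c].
    The caption only says "outer product", so the real operation may differ.
    """
    return np.einsum("ic,jc->ijc", x, x)

rng = np.random.default_rng(0)
seq = "GGGAAACCC"                  # toy sequence, L = 9
C = 8                              # hypothetical channel count
W = rng.normal(size=(4, C))        # stand-in for the feed-forward embedding
x = one_hot(seq) @ W               # sequence representation, shape (L, C)
pair = pair_representation(x)      # pair representation, shape (L, L, C)
print(x.shape, pair.shape)
```

With this particular choice the pair tensor is symmetric in its first two axes, which matches the symmetry of base pairing; a concatenation-based pairing would be an equally plausible reading of the caption.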
