Repository landing page

We are not able to resolve this OAI Identifier to the repository landing page. If you are the repository manager for this record, please head to the Dashboard and adjust the settings.

Illustrations of the SeqFold2D network and the Stralign dataset.

Abstract

(A) The two-module architecture of the SeqFold2D models. An input RNA sequence of length L is first embedded via one-hot encoding and feed-forward layers to yield an L×C tensor. The first module consists of N blocks of either bidirectional Long-Short-Term-Memory (LSTM) or transformer encoders. The resultant L×C tensor is then transformed into the L×L×C pair representation via outer-product, before being fed to the second module of N blocks of residual 2D convolutional layers. The output block is made up of three feed-forward layers and predicts the PPM of dimension L×L. (B) The population distributions of eight RNA families at different sequence similarity levels for the Stralign dataset. The abbreviations are, rRNA: ribosomal RNA, tRNA: transfer RNA, Intron I: group I intron, tmRNA: transfer messenger RNA, SRP: signal recognition particle, and TERC: telomerase RNA component. The innermost ring shows the original Stralign dataset with a total of 37,149 sequences, noting that the five under-represented families (counter-clockwise from Intron I to TERC) are scaled up for visibility and the multiplier N is shown as “N×” in the label (see Fig A in S1 Text for the unscaled version). The L600 ring is after removing sequences longer than 600; the NR100 ring shows the cross-sequence level; and the NR80 ring shows the cross-cluster level. Note that the 16S rRNA NR80 has only 50 sequences and is barely visible.</p

Similar works

Full text

thumbnail-image

FigShare

redirect
Last time updated on 20/04/2023

This paper was published in FigShare.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.