125 research outputs found
miRcode: A map of putative microRNA target sites in the long non-coding transcriptome
Summary: Although small non-coding RNAs, such as microRNAs, have well-established functions in the cell, long non-coding RNAs (lncRNAs) have only recently started to emerge as abundant regulators of cell physiology, and their functions may be diverse. A small number of studies describe interactions between small RNAs and lncRNAs, with lncRNAs acting either as inhibitory decoys or as regulatory targets of microRNAs, but such interactions are still poorly explored. To facilitate the study of microRNA–lncRNA interactions, we implemented miRcode: a comprehensive searchable map of putative microRNA target sites across the complete GENCODE annotated transcriptome, including 10 419 lncRNA genes in the current version.
mRNA turnover rate limits siRNA and microRNA efficacy
Based on a simple model of the mRNA life cycle, we predict that mRNAs with high turnover rates in the cell are more difficult to perturb with RNAi. We test this hypothesis using a luciferase reporter system and obtain additional evidence from a variety of large-scale data sets, including microRNA overexpression experiments and RT–qPCR-based efficacy measurements for thousands of siRNAs. Our results suggest that mRNA half-lives will influence how mRNAs are differentially perturbed whenever small RNA levels change in the cell, not only after transfection but also during differentiation, pathogenesis and normal cell physiology.
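The prediction in the abstract above can be illustrated with a minimal sketch. It assumes a one-compartment model that the abstract does not spell out: mRNA is produced at a constant rate, decays at a first-order rate `decay_rate`, and RNAi adds a further first-order decay term `rnai_rate`. Under those assumptions the steady-state level with RNAi, relative to the level without it, is `decay_rate / (decay_rate + rnai_rate)`. The function names and the example half-lives are hypothetical and are not taken from the paper.

```python
import math


def half_life_to_rate(half_life_hours):
    """Convert an mRNA half-life to a first-order decay rate (per hour)."""
    if half_life_hours <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / half_life_hours


def knockdown_fold_change(decay_rate, rnai_rate):
    """Steady-state mRNA level with RNAi relative to the level without it.

    Assumed model (illustrative only): dm/dt = s - (decay_rate + rnai_rate) * m.
    The production rate s cancels in the ratio, so it is not a parameter here.
    """
    if decay_rate <= 0 or rnai_rate < 0:
        raise ValueError("decay_rate must be > 0 and rnai_rate must be >= 0")
    return decay_rate / (decay_rate + rnai_rate)


if __name__ == "__main__":
    rnai_rate = 0.5  # hypothetical RNAi-induced decay rate, per hour
    for half_life in (1.0, 4.0, 16.0):  # hypothetical half-lives, in hours
        ratio = knockdown_fold_change(half_life_to_rate(half_life), rnai_rate)
        print(f"half-life {half_life:5.1f} h -> remaining fraction {ratio:.2f}")
```

In this sketch, a shorter half-life (higher turnover) leaves a larger fraction of the mRNA in place for the same `rnai_rate`, which matches the direction of the effect the abstract describes.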
Biological Sequence Kernels with Guaranteed Flexibility
Applying machine learning to biological sequences - DNA, RNA and protein -
has enormous potential to advance human health, environmental sustainability,
and fundamental biological understanding. However, many existing machine
learning methods are ineffective or unreliable in this problem domain. We study
these challenges theoretically, through the lens of kernels. Methods based on
kernels are ubiquitous: they are used to predict molecular phenotypes, design
novel proteins, compare sequence distributions, and more. Many methods that do
not use kernels explicitly still rely on them implicitly, including a wide
variety of both deep learning and physics-based techniques. While kernels for
other types of data are well-studied theoretically, the structure of biological
sequence space (discrete, variable length sequences), as well as biological
notions of sequence similarity, present unique mathematical challenges. We
formally analyze how well kernels for biological sequences can approximate
arbitrary functions on sequence space and how well they can distinguish
different sequence distributions. In particular, we establish conditions under
which biological sequence kernels are universal, characteristic and metrize the
space of distributions. We show that a large number of existing kernel-based
machine learning methods for biological sequences fail to meet our conditions
and can as a consequence fail severely. We develop straightforward and
computationally tractable ways of modifying existing kernels to satisfy our
conditions, imbuing them with strong guarantees on accuracy and reliability.
Our proof techniques build on and extend the theory of kernels with discrete
masses. We illustrate our theoretical results in simulation and on real
biological data sets.
RITA: a Study on Scaling Up Generative Protein Sequence Models
In this work we introduce RITA: a suite of autoregressive generative models
for protein sequences, with up to 1.2 billion parameters, trained on over 280
million protein sequences belonging to the UniRef-100 database. Such generative
models hold the promise of greatly accelerating protein design. We conduct the
first systematic study of how capabilities evolve with model size for
autoregressive transformers in the protein domain: we evaluate RITA models in
next amino acid prediction, zero-shot fitness, and enzyme function prediction,
showing benefits from increased scale. We release the RITA models openly, to
the benefit of the research community.
Computational Analysis of Mouse piRNA Sequence and Biogenesis
The recent discovery of a new class of 30-nucleotide-long RNAs in mammalian testes, called PIWI-interacting RNAs (piRNAs), with similarities to microRNAs and repeat-associated small interfering RNAs (rasiRNAs), has raised puzzling questions regarding their biogenesis and function. We report a comparative analysis of currently available piRNA sequence data from the pachytene stage of mouse spermatogenesis that sheds light on their sequence diversity and mechanism of biogenesis. We conclude that (i) there are at least four times as many piRNAs in mouse testes as currently known; (ii) piRNAs, which originate from long precursor transcripts, are generated by quasi-random enzymatic processing that is guided by a weak sequence signature at the piRNA 5′ ends, resulting in a large number of distinct sequences; and (iii) many of the piRNA clusters contain inverted repeat segments capable of forming double-stranded RNA fold-back segments that may initiate piRNA processing analogous to transposon silencing.
FreeContact: fast and free software for protein contact prediction from residue co-evolution
Background: 20 years of improved technology and growing numbers of sequences now render residue-residue contact constraints in large protein families through correlated mutations accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. In addition, EVfold-mfDCA depends on proprietary software. Results: Here, we present FreeContact, a fast, open source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library “libfreecontact”, complete with command line tool “freecontact”, as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability. Conclusions: FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud).