Search CORE

125 research outputs found

Recommended from our members

miRcode: A map of putative microRNA target sites in the long non-coding transcriptome

Author: Jeggari Ashwini
Larsson Erik
Marks Debora
Publication venue: 'Oxford University Press (OUP)'
Publication date: 19/03/2013
Field of study

Summary: Although small non-coding RNAs, such as microRNAs, have well-established functions in the cell, long non-coding RNAs (lncRNAs) have only recently started to emerge as abundant regulators of cell physiology, and their functions may be diverse. A small number of studies describe interactions between small and lncRNAs, with lncRNAs acting either as inhibitory decoys or as regulatory targets of microRNAs, but such interactions are still poorly explored. To facilitate the study of microRNA–lncRNA interactions, we implemented miRcode: a comprehensive searchable map of putative microRNA target sites across the complete GENCODE annotated transcriptome, including 10 419 lncRNA genes in the current version

Harvard University - DASH

mRNA turnover rate limits siRNA and microRNA efficacy

Author: Larsson Erik
Marks Debora
Sander Chris
Publication venue: Nature Publishing Group
Publication date: 01/01/2010
Field of study

Based on a simple model of the mRNA life cycle, we predict that mRNAs with high turnover rates in the cell are more difficult to perturb with RNAi. We test this hypothesis using a luciferase reporter system and obtain additional evidence from a variety of large-scale data sets, including microRNA overexpression experiments and RT–qPCR-based efficacy measurements for thousands of siRNAs. Our results suggest that mRNA half-lives will influence how mRNAs are differentially perturbed whenever small RNA levels change in the cell, not only after transfection but also during differentiation, pathogenesis and normal cell physiology

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

Biological Sequence Kernels with Guaranteed Flexibility

Author: Amin Alan Nawzad
Marks Debora Susan
Weinstein Eli Nathan
Publication venue
Publication date: 06/04/2023
Field of study

Applying machine learning to biological sequences - DNA, RNA and protein - has enormous potential to advance human health, environmental sustainability, and fundamental biological understanding. However, many existing machine learning methods are ineffective or unreliable in this problem domain. We study these challenges theoretically, through the lens of kernels. Methods based on kernels are ubiquitous: they are used to predict molecular phenotypes, design novel proteins, compare sequence distributions, and more. Many methods that do not use kernels explicitly still rely on them implicitly, including a wide variety of both deep learning and physics-based techniques. While kernels for other types of data are well-studied theoretically, the structure of biological sequence space (discrete, variable length sequences), as well as biological notions of sequence similarity, present unique mathematical challenges. We formally analyze how well kernels for biological sequences can approximate arbitrary functions on sequence space and how well they can distinguish different sequence distributions. In particular, we establish conditions under which biological sequence kernels are universal, characteristic and metrize the space of distributions. We show that a large number of existing kernel-based machine learning methods for biological sequences fail to meet our conditions and can as a consequence fail severely. We develop straightforward and computationally tractable ways of modifying existing kernels to satisfy our conditions, imbuing them with strong guarantees on accuracy and reliability. Our proof techniques build on and extend the theory of kernels with discrete masses. We illustrate our theoretical results in simulation and on real biological data sets

arXiv.org e-Print Archive

RITA: a Study on Scaling Up Generative Protein Sequence Models

Author: Hesslow Daniel
Marks Debora
Notin Pascal
Poli Iacopo
Zanichelli Niccoló
Publication venue
Publication date: 14/07/2022
Field of study

In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community

arXiv.org e-Print Archive

Computational Analysis of Mouse piRNA Sequence and Biogenesis

Author: Chris Sander
Debora S Marks
Doron Betel
Narry Kim
Robert Sheridan
Publication venue: Public Library of Science
Publication date: 01/11/2007
Field of study

The recent discovery of a new class of 30-nucleotide long RNAs in mammalian testes, called PIWI-interacting RNA (piRNA), with similarities to microRNAs and repeat-associated small interfering RNAs (rasiRNAs), has raised puzzling questions regarding their biogenesis and function. We report a comparative analysis of currently available piRNA sequence data from the pachytene stage of mouse spermatogenesis that sheds light on their sequence diversity and mechanism of biogenesis. We conclude that (i) there are at least four times as many piRNAs in mouse testes than currently known; (ii) piRNAs, which originate from long precursor transcripts, are generated by quasi-random enzymatic processing that is guided by a weak sequence signature at the piRNA 5′ends resulting in a large number of distinct sequences; and (iii) many of the piRNA clusters contain inverted repeats segments capable of forming double-strand RNA fold-back segments that may initiate piRNA processing analogous to transposon silencing

Public Library of Science (PLOS)

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

FreeContact: fast and free software for protein contact prediction from residue co-evolution

Author: Hopf Thomas A
Kaján László
Kalaš Matúš
Marks Debora S
Rost Burkhard
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

Background: 20 years of improved technology and growing sequences now renders residue-residue contact constraints in large protein families through correlated mutations accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. On top, EVfold-mfDCA depends on proprietary software. Results: Here, we present FreeContact, a fast, open source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library “libfreecontact”, complete with command line tool “freecontact”, as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability. Conclusions: FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud)

Bergen Open Research Archive (Univ. of Bergen)

Crossref

Harvard University - DASH

Springer - Publisher Connector

PubMed Central

NORA - Norwegian Open Research Archives