A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as PU learning,
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to a ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems by converting them into a
series of supervised binary classification problems that discriminate the
known positive examples from random subsamples of the unlabeled set. We
empirically demonstrate the relevance of the method on simulated and real
data, where it performs at least as well as existing methods while being faster.
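As a rough sketch of the bagging scheme just described, assuming scikit-learn's SVC as the base classifier (the function name, subsample size, and out-of-bag averaging rule below are illustrative choices, not the authors' reference implementation):

```python
import numpy as np
from sklearn.svm import SVC

def bagging_pu_scores(X_pos, X_unl, n_rounds=100, subsample=None, seed=0):
    """Score unlabeled points by repeatedly training positives vs. a random
    subsample of the unlabeled set and averaging out-of-bag decision values."""
    rng = np.random.default_rng(seed)
    n_unl = len(X_unl)
    k = subsample or len(X_pos)  # subsample size: match the positives (assumption)
    y = np.r_[np.ones(len(X_pos)), np.zeros(k)]
    scores, counts = np.zeros(n_unl), np.zeros(n_unl)
    for _ in range(n_rounds):
        idx = rng.choice(n_unl, size=k, replace=False)   # draw pseudo-negatives
        clf = SVC(kernel="rbf").fit(np.r_[X_pos, X_unl[idx]], y)
        oob = np.setdiff1d(np.arange(n_unl), idx)        # out-of-bag unlabeled points
        scores[oob] += clf.decision_function(X_unl[oob])
        counts[oob] += 1
    return scores / np.maximum(counts, 1)                # high score = likely positive
```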
Improving the Efficiency of a Multicast File Transfer Tool based on ALC
This work describes several techniques that we used to design a multicast file transfer tool on top of ALC, the Asynchronous Layered Coding protocol proposed by the RMT IETF working group. More specifically, we analyze several object and symbol ordering schemes that improve transmission efficiency, and we show how the Application Level Framing (ALF) paradigm can help to reduce memory requirements and enable processing to be hidden behind communications. Because of its popularity and availability we use a Reed-Solomon FEC code, yet most of our results can be applied to other FEC codes. A strength of this work is that all the techniques introduced have actually been implemented and their benefits quantified.
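As a toy illustration of the symbol-ordering idea, the sketch below contrasts a sequential baseline with a round-robin interleaving of FEC-encoded symbols across objects; the function names and two-object example are hypothetical, and the paper evaluates richer orderings on top of ALC:

```python
from itertools import zip_longest

def sequential_order(objects):
    """Baseline: send all of object 0's symbols, then all of object 1's, ..."""
    return [s for obj in objects for s in obj]

def interleaved_order(objects):
    """Round-robin across objects: one symbol per object per pass, so a
    receiver joining mid-session still collects useful symbols quickly."""
    return [s for group in zip_longest(*objects) for s in group if s is not None]

# Two objects, each already FEC-encoded into (object_id, symbol_id) symbols.
objs = [[(0, i) for i in range(4)], [(1, i) for i in range(4)]]
print(interleaved_order(objs))
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```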
Bayesian nonparametric discovery of isoforms and individual specific quantification
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur at different frequencies across tissues and samples. Here, we develop BIISQ, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. BIISQ does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall on simulations compared to state-of-the-art isoform reconstruction methods. BIISQ shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios.
SIRENE: Supervised Inference of Regulatory Networks
Living cells are the product of gene expression programs that involve the
regulated transcription of thousands of genes. The elucidation of
transcriptional regulatory networks is thus needed to understand the cell's
working mechanism, and can for example be useful for the discovery of novel
therapeutic targets. Although several methods have been proposed to infer gene
regulatory networks from gene expression data, a recent comparison on a
large-scale benchmark experiment revealed that most current methods only
predict a limited number of known regulations at a reasonable precision level.
We propose SIRENE, a new method for the inference of gene regulatory networks
from a compendium of expression data. The method decomposes the problem of gene
regulatory network inference into a large number of local binary classification
problems, each of which focuses on separating the target genes of one
transcription factor (TF) from non-targets.
SIRENE is thus conceptually simple and computationally efficient. We test it on
a benchmark experiment aimed at predicting regulations in E. coli, and show
that it retrieves on the order of six times more known regulations than other
state-of-the-art inference methods.
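A minimal sketch of this local decomposition, assuming expression profiles as features and scikit-learn's SVC as the local classifier (the names and the direct rescoring of training genes are simplifications; in practice known examples are scored by cross-validation):

```python
import numpy as np
from sklearn.svm import SVC

def sirene_scores(expr, known_targets):
    """expr: (n_genes, n_experiments) expression matrix, one row per gene.
    known_targets: dict mapping TF name -> set of known target gene indices
    (each TF is assumed to have at least one known target).
    Returns a dict TF -> per-gene score (higher = more likely regulated)."""
    n_genes = expr.shape[0]
    scores = {}
    for tf, targets in known_targets.items():
        y = np.fromiter((g in targets for g in range(n_genes)), dtype=int)
        clf = SVC(kernel="rbf").fit(expr, y)   # one local classifier per TF
        scores[tf] = clf.decision_function(expr)
    return scores
```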
Chemokine transport across human vascular endothelial cells
Leukocyte migration across vascular endothelium is mediated by chemokines that are either synthesized by the endothelium or transferred across the endothelium from the tissue. The mechanism of transfer of two chemokines, CXCL10 (interferon gamma inducible protein [IP]-10) and CCL2 (macrophage chemotactic protein [MCP]-1), was compared across dermal and lung microvessel endothelium and saphenous vein endothelium. The rate of transfer depended on both the type of endothelium and the chemokine. The permeability coefficient (Pe) for CCL2 movement across saphenous vein was twice the value for dermal endothelium and four times that for lung endothelium. In contrast, the Pe value for CXCL10 was lower for saphenous vein endothelium than for the other endothelia. The differences in transfer rate between endothelia were not related to variation in paracellular permeability, as assessed with the paracellular tracer inulin. Immunoelectron microscopy showed that CXCL10 was transferred from the basal membrane in a vesicular compartment before being distributed to the apical membrane. Although all three endothelia expressed high levels of the receptor for CXCL10 (CXCR3), the transfer was not readily saturable and did not appear to be receptor dependent. After 30 min, the chemokine started to be reinternalized from the apical membrane in clathrin-coated vesicles. The data suggest a model for chemokine transcytosis, with a separate pathway for clearance of the apical surface.
Reverse Engineering Gene Networks with ANN: Variability in Network Inference Algorithms
Motivation: Reconstructing the topology of a gene regulatory network is one
of the key tasks in systems biology. Despite the wide variety of proposed
methods, very little work has been dedicated to assessing their stability
properties. Here we present a methodical comparison of the performance of a
novel method (RegnANN) for gene network inference, based on multilayer
perceptrons, with three reference algorithms (ARACNE, CLR, KELLER), focusing
our analysis on the prediction variability induced by both the intrinsic
structure of the network and the available data.
Results: The extensive evaluation on both synthetic data and a selection of
gene modules of Escherichia coli indicates that all the algorithms suffer from
instability and variability in the reconstruction of the network topology.
This instability makes it very hard to establish objectively which method
performs best. Nevertheless, RegnANN achieves MCC scores that compare very
favorably with those of all the other inference methods tested.
Availability: The software for the RegnANN inference algorithm is distributed
under GPL3 and is available on the corresponding author's home page
(http://mpba.fbk.eu/grimaldi/regnann-supmat).
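The MCC scores cited above compare an inferred topology against the true network edge by edge; for reference, a small self-contained version of the coefficient over binary adjacency matrices (variable names are illustrative):

```python
import numpy as np

def mcc(true_adj, pred_adj):
    """Matthews correlation coefficient between two binary adjacency
    matrices (1 = edge, 0 = no edge); returns 0 for degenerate cases."""
    t, p = np.asarray(true_adj).ravel(), np.asarray(pred_adj).ravel()
    tp = np.sum((t == 1) & (p == 1))
    tn = np.sum((t == 0) & (p == 0))
    fp = np.sum((t == 0) & (p == 1))
    fn = np.sum((t == 1) & (p == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```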
ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases.
Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help with the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.
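One common way to realize the "share information across diseases" part of such a multitask strategy is a product kernel over (gene, disease) pairs, so that similar genes for similar diseases receive similar scores. The sketch below illustrates that construction; it is an assumption for exposition, not the ProDiGe code itself:

```python
import numpy as np

def multitask_kernel(K_gene, K_disease, pairs_a, pairs_b):
    """K((g, d), (g', d')) = K_gene[g, g'] * K_disease[d, d'].
    pairs_a, pairs_b: lists of (gene_index, disease_index) tuples.
    Returns the kernel matrix between the two lists of pairs."""
    ga, da = zip(*pairs_a)
    gb, db = zip(*pairs_b)
    return K_gene[np.ix_(ga, gb)] * K_disease[np.ix_(da, db)]
```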
Supervised prediction of drug–target interactions using bipartite local models
Motivation: In silico prediction of drug–target interactions from heterogeneous biological data is critical in the search for drugs for known diseases. This problem is currently being attacked from many different points of view, a strong indication of its current importance. Indeed, being able to predict new drug–target interactions with both high precision and accuracy is the holy grail, a fundamental requirement for in silico methods to be useful in a biological setting. This, however, remains extremely challenging due to, amongst other things, the rarity of known drug–target interactions.
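The bipartite local models of the title can be sketched as two per-pair classifiers, one over targets and one over drugs, whose outputs are combined. The kernel-matrix interface and the averaging rule below are assumptions for illustration (a proper evaluation would also hold out the queried pair from training):

```python
import numpy as np
from sklearn.svm import SVC

def blm_score(K_drug, K_target, Y, d, t):
    """K_drug, K_target: precomputed similarity (kernel) matrices.
    Y: binary interaction matrix, Y[i, j] = 1 if drug i binds target j.
    Assumes drug d and target t each have both positive and negative labels."""
    # Local model 1: which targets does drug d bind? (trained over targets)
    clf_t = SVC(kernel="precomputed").fit(K_target, Y[d, :])
    s1 = clf_t.decision_function(K_target[t:t + 1, :])[0]
    # Local model 2: which drugs bind target t? (trained over drugs)
    clf_d = SVC(kernel="precomputed").fit(K_drug, Y[:, t])
    s2 = clf_d.decision_function(K_drug[d:d + 1, :])[0]
    return (s1 + s2) / 2.0   # combine the two local predictions
```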
Scuba: Scalable kernel-based gene prioritization
Background: The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge; however, their practical implementation is often precluded by their limited scalability.
Results: We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large-scale predictions are required. Importantly, it is able to deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba also integrates a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods.
Conclusions: Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. The method can be useful for prioritizing candidate genes, particularly when their number is large or when the input data are highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba.
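As a rough illustration of the kernel-integration step, the sketch below combines per-source kernel matrices with nonnegative weights and ranks genes by similarity to known disease genes. The uniform-weight fallback is a placeholder; Scuba itself learns the weights by optimizing the margin distribution, which this sketch does not do:

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """kernels: list of (n_genes, n_genes) PSD matrices, one per data source.
    Returns a single combined kernel with normalized nonnegative weights."""
    w = np.ones(len(kernels)) if weights is None else np.maximum(weights, 0)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

def prioritize(K, seed_genes):
    """Rank all genes by mean combined similarity to the known disease genes."""
    return np.argsort(-K[:, seed_genes].mean(axis=1))
```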
- …