14 research outputs found

    Optimal sampling in unbiased active learning

    A common belief in unbiased active learning is that, in order to capture the most informative instances, the sampling probabilities should be proportional to the uncertainty of the class labels. We argue that this produces suboptimal predictions and present sampling schemes for unbiased pool-based active learning that minimise the actual prediction error, and demonstrate a better predictive performance than competing methods on a number of benchmark datasets. In contrast, both probabilistic and deterministic uncertainty sampling performed worse than simple random sampling on some of the datasets

    Optimal subsampling designs

    Subsampling is commonly used to overcome computational and economical bottlenecks in the analysis of finite populations and massive datasets. Existing methods are often limited in scope and use optimality criteria (e.g., A-optimality) with well-known deficiencies, such as lack of invariance to the measurement-scale of the data and parameterisation of the model. A unified theory of optimal subsampling design is still lacking. We present a theory of optimal design for general data subsampling problems, including finite population inference, parametric density estimation, and regression modelling. Our theory encompasses and generalises most existing methods in the field of optimal subdata selection based on unequal probability sampling and inverse probability weighting. We derive optimality conditions for a general class of optimality criteria, and present corresponding algorithms for finding optimal sampling schemes under Poisson and multinomial sampling designs. We present a novel class of transformation- and parameterisation-invariant linear optimality criteria which enjoy the best of two worlds: the computational tractability of A-optimality and invariance properties similar to D-optimality. The methodology is illustrated on an application in the traffic safety domain. In our experiments, the proposed invariant linear optimality criteria achieve 92-99% D-efficiency with 90-95% lower computational demand. In contrast, the A-optimality criterion has only 46% and 60% D-efficiency on two of the examples
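    The abstract above builds on unequal probability sampling combined with inverse probability weighting under a Poisson sampling design. The following is a minimal sketch of that general idea only, not the optimal designs derived in the paper: the population values and inclusion probabilities are made-up illustrations, and the function names are hypothetical.

```python
import random


def poisson_sample(probs, rng):
    """Poisson sampling: include unit i independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def ipw_total(y, probs, sample):
    """Inverse-probability-weighted (Horvitz-Thompson) estimate of sum(y)."""
    return sum(y[i] / probs[i] for i in sample)


def demo(seed=0, reps=20000):
    """Average the weighted estimate over many subsamples of a toy population."""
    rng = random.Random(seed)
    y = [1.0, 2.0, 4.0, 8.0, 16.0]      # toy population values (assumed)
    probs = [0.2, 0.3, 0.5, 0.7, 0.9]   # arbitrary unequal inclusion probabilities (assumed)
    estimates = [ipw_total(y, probs, poisson_sample(probs, rng)) for _ in range(reps)]
    return sum(estimates) / reps, sum(y)


if __name__ == "__main__":
    est_mean, true_total = demo()
    print(f"mean of weighted estimates: {est_mean:.2f}, true total: {true_total:.2f}")
```

    Weighting each sampled value by the inverse of its inclusion probability keeps the estimator unbiased for any choice of positive probabilities; the choice of probabilities affects only the variance, which is what an optimality criterion is meant to control.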

    A comprehensive survey of integron-associated genes present in metagenomes

    Background: Integrons are genomic elements that mediate horizontal gene transfer by inserting and removing genetic material using site-specific recombination. Integrons are commonly found in bacterial genomes, where they maintain a large and diverse set of genes that plays an important role in adaptation and evolution. Previous studies have started to characterize the wide range of biological functions present in integrons. However, the efforts have so far mainly been limited to genomes from cultivable bacteria and amplicons generated by PCR, thus targeting only a small part of the total integron diversity. Metagenomic data, generated by direct sequencing of environmental and clinical samples, provides a more holistic and unbiased analysis of integron-associated genes. However, the fragmented nature of metagenomic data has previously made such analysis highly challenging. Results: Here, we present a systematic survey of integron-associated genes in metagenomic data. The analysis was based on a newly developed computational method where integron-associated genes were identified by detecting their associated recombination sites. By processing contiguous sequences assembled from more than 10 terabases of metagenomic data, we were able to identify 13,397 unique integron-associated genes. Metagenomes from marine microbial communities had the highest occurrence of integron-associated genes with levels more than 100-fold higher than in the human microbiome. The identified genes had a large functional diversity spanning over several functional classes. Genes associated with defense mechanisms and mobility facilitators were most overrepresented and more than five times as common in integrons compared to other bacterial genes. As many as two thirds of the genes were found to encode proteins of unknown function. Less than 1% of the genes were associated with antibiotic resistance, of which several were novel, previously undescribed, resistance gene variants. 
    Conclusions: Our results highlight the large functional diversity maintained by integrons present in unculturable bacteria and significantly expand the number of described integron-associated genes.

    Serine/Threonine protein kinases from bacteria, archaea and eukarya share a common evolutionary origin deeply rooted in the tree of life

    The main family of serine/threonine/tyrosine protein kinases present in eukarya was defined and described by Hanks et al. in 1988 (Science, 241, 42–52). It was initially believed that these kinases do not exist in bacteria, but extensive genome sequencing revealed their existence in many bacteria. For historical reasons, the term “eukaryotic-type kinases” propagated in the literature to describe bacterial members of this protein family. Here, we argue that this term should be abandoned as a misnomer, and we provide several lines of evidence to support this claim. Our comprehensive phylostratigraphic analysis suggests that Hanks-type kinases present in eukarya, bacteria and archaea all share a common evolutionary origin in the lineage leading to the last universal common ancestor (LUCA). We found no evidence to suggest substantial horizontal transfer of genes encoding Hanks-type kinases from eukarya to bacteria. Moreover, our systematic structural comparison suggests that bacterial Hanks-type kinases resemble their eukaryal counterparts very closely, while their structures appear to be dissimilar from other kinase families of bacterial origin. This indicates that a convergent evolution scenario, by which bacterial kinases could have evolved a kinase domain similar to that of eukaryal Hanks-type kinases, is not very likely. Overall, our results strongly support a monophyletic origin of all Hanks-type kinases, and we therefore propose that this term should be adopted as a universal name for this protein family

    Comparative Gene Finding: Models, Algorithms and Implementation

    Comparative genomics is an emerging field, which is being fed by an explosion in the number of possible biological sequences. This has led to an immense demand for faster, more efficient and more robust computer algorithms to analyze this large amount of data. This unique text/reference describes the state of the art in computational gene finding, with a particular focus on comparative approaches. Providing both an overview of the various methods that are applied in the field, and a concise guide on how computational gene finders are built, the book covers a broad range of topics from probability theory, statistics, information theory, optimization theory and numerical analysis. The text assumes the reader has some background in bioinformatics, especially in mathematics and mathematical statistics. A basic knowledge of analysis, probability theory and random processes would also aid the reader

    Gene finding in fungal genomes

    Conditional percolation on one-dimensional lattices

    Conditioning i.i.d. bond percolation with retention parameter p on a one-dimensional periodic lattice on the event of having a bi-infinite path from −∞ to ∞ is shown to make sense, and the resulting model exhibits a Markovian structure that facilitates its analysis. Stochastic monotonicity in p turns out to fail in general for this model, but a weaker monotonicity property does hold: the average edge density is increasing in p.

    Biased random walk in a one-dimensional percolation model

    We consider random walk with a nonzero bias to the right, on the infinite cluster in the following percolation model: take i.i.d. bond percolation with retention parameter p on the so-called infinite ladder, and condition on the event of having a bi-infinite path from −∞ to ∞. The random walk is shown to be transient, and to have an asymptotic speed to the right which is strictly positive or zero depending on whether the bias is below or above a certain critical value which we compute explicitly.
    Keywords: Percolation; Random walk; Asymptotic speed