273 research outputs found

    An efficient algorithm for learning to rank from preference graphs

    Get PDF
    In this paper, we introduce a framework for regularized least-squares (RLS) type of ranking cost functions and we propose three such cost functions. Further, we propose a kernel-based preference learning algorithm, which we call RankRLS, for minimizing these functions. It is shown that RankRLS has many computational advantages compared to the ranking algorithms that are based on minimizing other types of costs, such as the hinge cost. In particular, we present efficient algorithms for training, parameter selection, multiple output learning, cross-validation, and large-scale learning. Circumstances under which these computational benefits make RankRLS preferable to RankSVM are considered. We evaluate RankRLS on four different types of ranking tasks using RankSVM and the standard RLS regression as the baselines. RankRLS outperforms the standard RLS regression and its performance is very similar to that of RankSVM, while RankRLS has several computational benefits over RankSVM

    Comparative analysis of five protein-protein interaction corpora

    Get PDF
    Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.</p

    All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning

    Get PDF
    Background Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph kernel based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel has the capability to make use of full, general dependency graphs representing the sentence structure. Results We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus. Conclusion We show that the graph kernel approach performs on state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided. </div

    Domain randomization using synthetic electrocardiograms for training neural networks

    Get PDF
    We present a method for training neural networks with synthetic electrocardiograms that mimic signals produced by a wearable single lead electrocardiogram monitor. We use domain randomization where the synthetic signal properties such as the waveform shape, RR-intervals and noise are varied for every training example. Models trained with synthetic data are compared to their counterparts trained with real data. Detection of r-waves in electrocardiograms recorded during different physical activities and in atrial fibrillation is used to assess the performance. By allowing the randomization of the synthetic signals to increase beyond what is typically observed in the real-world data the performance is on par or superseding the performance of networks trained with real data. Experiments show robust model performance using different seeds and on different unseen test sets that were fully separated from the training phase. The ability of the model to generalize well to hidden test sets without any specific tuning provides a simple and explainable alternative to more complex adversarial domain adaptation methods for model generalization. This method opens up the possibility of extending the use of synthetic data towards domain insensitive cardiac disease classification when disease specific a priori information is used in the electrocardiogram generation. Additionally, the method provides training with free-to-collect data with accurate labels, control of the data distribution eliminating class imbalances that are typically observed in health-related data, and the generated data is inherently private

    Heavy metal concentrations in various aqueous and biotic media in Finnish Integrated Monitoring catchments

    Get PDF
    Concentrations of Cd, Cu, Ni, Pb and Zn in various aqueous and biotic media in four small forested catchments located throughout Finland (61–70°N) are presented. The data has been collected (1989–96) as part of the UN-ECE Integrated Monitoring programme. Aqueous media included: bulk (open) precipitation, throughfall, stemflow, soil water, groundwater, and lake and stream waters. The biotic media included: moss (Pleurozium schreberi), needles (Pinus sylvestris), litterfall, humus layer, red wood ants (Formica aquilonia and F. lugubris), and common shrew (Sorex areneus) liver. In comparison to temperate ecosystems, the concentrations of each metal in all media were low. Levels and bioaccumulation of Cu and Zn were clearly affected by their role as micro-nutrients and, in the case of Zn at one of the catchments, by local lithological sources. Humus layer and surface water 95th percentile concentration values were below lowest effect values. Concentrations of Pb have significantly declined over the study period

    Tournament leave-pair-out cross-validation for receiver operating characteristic analysis

    Get PDF
    Receiver operating characteristic analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating an area under receiver operating characteristic curve with standard cross-validation methods suffers from a large bias. The leave-pair-out cross-validation has been shown to correct this bias. However, while leave-pair-out produces an almost unbiased estimate of area under receiver operating characteristic curve, it does not provide a ranking of the data needed for plotting and analyzing the receiver operating characteristic curve. In this study, we propose a new method called tournament leave-pair-out cross-validation. This method extends leave-pair-out by creating a tournament from pair comparisons to produce a ranking for the data. Tournament leave-pair-out preserves the advantage of leave-pair-out for estimating area under receiver operating characteristic curve, while it also allows performing receiver operating characteristic analyses. We have shown using both synthetic and real-world data that tournament leave-pair-out is as reliable as leave-pair-out for area under receiver operating characteristic curve estimation and confirmed the bias in leave-one-out cross-validation on low-dimensional data. As a case study on receiver operating characteristic analysis, we also evaluate how reliably sensitivity and specificity can be estimated from tournament leave-pair-out receiver operating characteristic curves.</p

    Algebraic shortcuts for leave-one-out cross-validation in supervised network inference

    Get PDF
    Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to either predict new interactions in a given network or to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might dramatically differ between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings.In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models. The machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package: https://github.com/aatapa/RLScore
    corecore