157 research outputs found

    Combining active learning and semi-supervised learning techniques to extract protein interaction sentences

    Background: Protein-protein interaction (PPI) extraction has been a focal point of much biomedical research and many database curation tools. Both Active Learning (AL) and Semi-supervised SVMs have recently been applied to extract PPIs automatically. In this paper, we explore combining AL with semi-supervised learning (SSL) to improve the performance of the PPI task. Methods: We propose a novel PPI extraction technique called PPISpotter, which combines Deterministic Annealing-based SSL with an AL technique to extract protein-protein interactions. In addition, we extract a comprehensive set of features from MEDLINE records using Natural Language Processing (NLP) techniques, which further improve the SVM classifiers. Our feature selection technique incorporates syntactic, semantic, and lexical properties of the text, which boosts system performance significantly. Results: By conducting experiments with three different PPI corpora, we show that PPISpotter is superior in precision, recall, and F-measure to the other techniques incorporated into semi-supervised SVMs, such as Random Sampling, Clustering, and Transductive SVMs. Conclusions: Our system is a novel, state-of-the-art technique for efficiently extracting protein-protein interaction pairs.

    Mining association language patterns using a distributional semantic model for negative life event classification

    Purpose: Negative life events, such as the death of a family member, an argument with a spouse or the loss of a job, play an important role in triggering depressive episodes. Therefore, it is worthwhile to develop psychiatric services that can automatically identify such events. This study describes the use of association language patterns, i.e., meaningful combinations of words (e.g., <loss, job>), as features to classify sentences with negative life events into predefined categories (e.g., Family, Love, Work). Methods: This study proposes a framework that combines a supervised data mining algorithm and an unsupervised distributional semantic model to discover association language patterns. The data mining algorithm, called association rule mining, was used to generate a set of seed patterns by incrementally associating frequently co-occurring words from a small corpus of sentences labeled with negative life events. The distributional semantic model was then used to discover more patterns similar to the seed patterns from a large, unlabeled web corpus. Results: The experimental results showed that association language patterns were significant features for negative life event classification. Additionally, the unsupervised distributional semantic model was able not only to improve the level of performance but also to reduce the reliance of the classification process on the availability of a large, labeled corpus.
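
As a rough illustration of the seed-pattern step described in this abstract, the sketch below counts word pairs that co-occur in at least a minimum number of sentences. It is a simplified stand-in, not the authors' algorithm: the function name `mine_pair_patterns`, the `min_support` threshold, and the example sentences are all assumptions made for illustration, and only pairs are mined rather than incrementally grown patterns.

```python
from collections import Counter
from itertools import combinations


def mine_pair_patterns(sentences, min_support=2):
    """Return word pairs that co-occur in at least `min_support` sentences.

    Hypothetical helper: a pairs-only simplification of frequent-pattern mining.
    """
    counts = Counter()
    for sentence in sentences:
        # Use the set of distinct words so a pair counts once per sentence;
        # sorting makes each pair appear in a single canonical order.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Made-up example sentences, not data from the study.
sentences = [
    "i lost my job last month",
    "he lost his job yesterday",
    "we had an argument at home",
]
patterns = mine_pair_patterns(sentences)
print(patterns)
```

With these three sentences, the only pair that reaches the support threshold is the one linking "job" and "lost", which mirrors the <loss, job> example in the abstract.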

    A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature

    The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed: convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research.
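
To make the bag-level labelling concrete, here is a minimal sketch under the standard MIL assumption, which the abstract does not spell out: a bag is positive when at least one of its instances is positive. The function name `bag_label`, the 0.5 threshold, and the instance scores are hypothetical values chosen for illustration.

```python
def bag_label(instance_scores, threshold=0.5):
    """Label a bag positive (1) if any instance score reaches the threshold.

    Assumes the standard MIL assumption; other assumptions (e.g. a minimum
    fraction of positive instances) would need a different aggregation.
    """
    return int(max(instance_scores) >= threshold)


# Hypothetical per-instance scores from some instance-level classifier.
bags = {
    "bag_a": [0.1, 0.2, 0.9],  # one high-scoring instance makes the bag positive
    "bag_b": [0.1, 0.3],       # no instance reaches the threshold
}
labels = {name: bag_label(scores) for name, scores in bags.items()}
print(labels)
```

Taking the maximum over instances is only one aggregation choice; the survey's point that label ambiguity varies between problems is exactly why other aggregations are sometimes preferred.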

    Text Classification: A Review, Empirical, and Experimental Evaluation

    The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study also conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and categorie

    Splice site prediction using transfer learning

    One of the open problems in the field of bioinformatics is automatic gene prediction (a gene being a nucleotide sequence that encodes proteins). More specifically, researchers are trying to predict the positions that correspond to the beginning and the end of genes within a genome. These positions are known as splice sites. Several machine learning techniques have been used for this problem. Nevertheless, acquiring the annotated data needed to apply supervised learning techniques is a significant challenge, as the cost is very high. One approach to addressing this problem is transfer learning. The aim of this work is to study the representation of genes so that the sequence of nucleotides within a genome is taken into account, and the role of this representation in transfer learning methods.

    Positive and negative label propagation

    New kernel functions and learning methods for text and data mining

    Recent advances in machine learning methods increasingly enable the automatic construction of various types of computer-assisted methods that have been difficult or laborious for human experts to program. The tasks for which these kinds of tools are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. Machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight into the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, the incorporation of prior knowledge is often done by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to the task of information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and that novel advanced kernels and cost functions can be used in algorithms efficiently.
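
The thesis's own fast cross-validation algorithm is not given in this abstract. As a hedged illustration of why such shortcuts are possible for regularized least squares, the sketch below uses the well-known leave-one-out identity for ridge regression, restricted to a single feature with no intercept so that it stays in plain Python. The function names, the regularization value `lam`, and the data are assumptions for illustration only.

```python
def loo_residuals_fast(xs, ys, lam):
    """Leave-one-out residuals from one fit, via r_i / (1 - h_ii).

    For 1-D ridge regression without intercept, w = sum(x*y) / (sum(x*x) + lam)
    and the leverage of point i is h_ii = x_i**2 / (sum(x*x) + lam).
    Requires lam > 0 so that the denominators are never zero.
    """
    d = sum(x * x for x in xs) + lam
    w = sum(x * y for x, y in zip(xs, ys)) / d
    return [(y - w * x) / (1 - x * x / d) for x, y in zip(xs, ys)]


def loo_residuals_refit(xs, ys, lam):
    """Leave-one-out residuals by refitting n times (slow reference version)."""
    residuals = []
    for i in range(len(xs)):
        x_rest = xs[:i] + xs[i + 1:]
        y_rest = ys[:i] + ys[i + 1:]
        w = sum(x * y for x, y in zip(x_rest, y_rest)) / (
            sum(x * x for x in x_rest) + lam
        )
        residuals.append(ys[i] - w * xs[i])
    return residuals


# Made-up data, used only to compare the two computations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
fast = loo_residuals_fast(xs, ys, lam=0.5)
slow = loo_residuals_refit(xs, ys, lam=0.5)
print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))
```

The shortcut needs a single fit where the reference version needs one refit per held-out point, which is the kind of saving a fast cross-validation method aims for.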