3,632 research outputs found

    Efficient Regularized Least-Squares Algorithms for Conditional Ranking on Relational Data

    In domains like bioinformatics, information retrieval and social network analysis, one can find learning tasks where the goal consists of inferring a ranking of objects, conditioned on a particular target object. We present a general kernel framework for learning conditional rankings from various types of relational data, where rankings can be conditioned on unseen data objects. We propose efficient algorithms for conditional ranking by optimizing squared regression and ranking loss functions. We show theoretically that learning with the ranking loss is likely to generalize better than with the regression loss. Further, we prove that symmetry or reciprocity properties of relations can be efficiently enforced in the learned models. Experiments on synthetic and real-world data illustrate that the proposed methods deliver state-of-the-art performance in terms of predictive power and computational efficiency. Moreover, we also show empirically that incorporating symmetry or reciprocity properties can improve the generalization performance.
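
As a rough illustration of the squared-regression variant described in this abstract, the sketch below fits a kernel regularized least-squares model on (target, object) pairs and ranks objects for one target. It assumes a Kronecker product pairwise kernel and an RBF base kernel, and it forms the full pairwise kernel matrix naively, so it is not the efficient algorithm the paper proposes. The names `rbf_kernel`, `fit_pairwise_rls` and `rank_objects` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def fit_pairwise_rls(K_targets, K_objects, Y, lam=1.0):
    """Solve (K + lam * I) alpha = y for the Kronecker product pairwise kernel.

    Y[i, j] is the observed relation value for (target i, object j).
    Assumption: the full Kronecker matrix is built explicitly, which only
    works for small problems.
    """
    K = np.kron(K_targets, K_objects)
    y = Y.reshape(-1)  # row-major flattening matches the np.kron ordering
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def rank_objects(alpha, k_target, K_objects):
    """Rank the training objects for one target, best first.

    k_target holds base-kernel values between the query target and the
    training targets; the query target may be unseen.
    """
    A = alpha.reshape(len(k_target), K_objects.shape[0])
    scores = K_objects @ (A.T @ k_target)
    return np.argsort(-scores)

# Toy data: the relation value is the negative distance between target and object.
targets = np.array([[0.0], [1.0], [2.0]])
objects = np.array([[0.0], [1.5], [3.0]])
Y = -np.abs(targets - objects.T)

K_t = rbf_kernel(targets, targets)
K_o = rbf_kernel(objects, objects)
alpha = fit_pairwise_rls(K_t, K_o, Y, lam=1e-6)
print(rank_objects(alpha, K_t[1], K_o))  # ranking conditioned on the second target
```

Conditioning on an unseen target only requires computing `k_target` against the training targets, which is what makes the pairwise kernel formulation convenient here.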

    BioNLP Shared Task - The Bacteria Track

    Background: We present the BioNLP 2011 Shared Task Bacteria Track, the first Information Extraction challenge entirely dedicated to bacteria. It includes three tasks that cover different levels of biological knowledge. The Bacteria Gene Renaming supporting task is aimed at extracting gene renaming and gene name synonymy in PubMed abstracts. The Bacteria Gene Interaction is a gene/protein interaction extraction task from individual sentences. The interactions have been categorized into ten different sub-types, thus giving a detailed account of genetic regulations at the molecular level. Finally, the Bacteria Biotopes task focuses on the localization and environment of bacteria mentioned in textbook articles. We describe the process of creation for the three corpora, including document acquisition and manual annotation, as well as the metrics used to evaluate the participants' submissions. Results: Three teams submitted to the Bacteria Gene Renaming task; the best team achieved an F-score of 87%. For the Bacteria Gene Interaction task, the only participant reached a global F-score of 77%, although the system efficiency varies significantly from one sub-type to another. Three teams submitted to the Bacteria Biotopes task with very different approaches; the best team achieved an F-score of 45%. However, the detailed study of the participating systems' efficiency reveals the strengths and weaknesses of each participating system. Conclusions: The three tasks of the Bacteria Track offer participants a chance to address a wide range of issues in Information Extraction, including entity recognition, semantic typing and coreference resolution. We found common trends in the most efficient systems: the systematic use of syntactic dependencies and machine learning. Nevertheless, the originality of the Bacteria Biotopes task encouraged the use of interesting novel methods and techniques, such as term compositionality, scopes wider than the sentence.

    Development of a framework for the classification of antibiotics adjuvants

    Master's dissertation in Bioinformatics. Throughout the last decades, bacteria have become increasingly resistant to available antibiotics, leading to a growing need for new antibiotics and new drug development methodologies. There are no records of new antibiotics being developed in the last 40 years, which has begun to narrow the available alternatives. Therefore, finding new antibiotics and bringing them to market is increasingly challenging. One approach is finding compounds that restore or leverage the activity of existing antibiotics against biofilm bacteria. As the information in this field is very limited and there is no database regarding this theme, machine learning models were used to predict the relevance of the documents regarding adjuvants. In this project, the BIOFILMad - Catalog of antimicrobial adjuvants to tackle biofilms application was developed to help researchers save time in their daily research. This application was constructed using Django and Django REST Framework for the backend and React for the frontend. As for the backend, a database needed to be constructed since no database entirely focuses on this topic. For that, a machine learning model was trained to help us classify articles. Three different algorithms were used, Support-Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR), each combined with one of two feature counts, 945 and 1890. When analyzing all metrics, model LR-1 performed the best for classifying relevant documents with an accuracy score of 0.8461, a recall score of 0.6170, an f1-score of 0.6904, and a precision score of 0.7837. This model is the best at correctly predicting the relevant documents, as shown by its higher recall score compared to the other models. With this model, our database was populated with relevant information. Our backend has a unique feature, the aggregation feature constructed with Named Entity Recognition (NER).
The goal is to identify specific entity types; in our case, CHEMICAL and DISEASE. An association between these entities was made, thus giving the user the respective associations and saving researchers time. For example, a researcher can see with which compounds "pseudomonas aeruginosa" has already been tested thanks to this aggregation feature. The frontend was implemented so the user could access this aggregation feature, see the articles present in the database, use the machine learning models to classify new documents, and insert them in the database if they are relevant.
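
The metrics reported for model LR-1 can be cross-checked with a short calculation, since the f1-score is the harmonic mean of precision and recall. The sketch below uses only the precision and recall values quoted in the abstract; the helper name `f1_from_precision_recall` is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract for model LR-1.
reported_precision = 0.7837
reported_recall = 0.6170

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 4))  # agrees with the reported f1-score of 0.6904
```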

    A kernel-based framework for learning graded relations from data

    Driven by a large number of potential applications in areas like bioinformatics, information retrieval and social network analysis, the problem setting of inferring relations between pairs of data objects has recently been investigated quite intensively in the machine learning community. To this end, current approaches typically consider datasets containing crisp relations, so that standard classification methods can be adopted. However, relations between objects like similarities and preferences are often expressed in a graded manner in real-world applications. A general kernel-based framework for learning relations from data is introduced here. It extends existing approaches because both crisp and graded relations are considered, and it unifies existing approaches because different types of graded relations can be modeled, including symmetric and reciprocal relations. This framework establishes important links between recent developments in fuzzy set theory and machine learning. Its usefulness is demonstrated through various experiments on synthetic and real-world data. Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
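
To make the idea of symmetric and reciprocal relations concrete, the sketch below shows one common way to build pairwise kernels whose models satisfy these properties by construction. The abstract does not state the kernels used, so the symmetrized and antisymmetrized Kronecker forms here are an assumption, as are the function names and the logistic mapping from scores to graded relation values.

```python
import math

def base_kernel(x, y, gamma=1.0):
    """RBF kernel between two scalar objects (an assumed base kernel)."""
    return math.exp(-gamma * (x - y) ** 2)

def symmetric_kernel(p, q):
    """Pairwise kernel that is invariant to swapping the objects within a pair."""
    (a, b), (c, d) = p, q
    return base_kernel(a, c) * base_kernel(b, d) + base_kernel(a, d) * base_kernel(b, c)

def reciprocal_kernel(p, q):
    """Pairwise kernel that changes sign when the objects within a pair are swapped."""
    (a, b), (c, d) = p, q
    return base_kernel(a, c) * base_kernel(b, d) - base_kernel(a, d) * base_kernel(b, c)

def score(kernel, alphas, train_pairs, pair):
    """Kernel expansion h(pair) = sum_i alpha_i * kernel(pair, train_pair_i)."""
    return sum(alpha * kernel(pair, tp) for alpha, tp in zip(alphas, train_pairs))

def graded_relation(h):
    """Map a real-valued score to [0, 1]; antisymmetric scores give Q(a,b) + Q(b,a) = 1."""
    return 1.0 / (1.0 + math.exp(-h))

# Arbitrary coefficients and training pairs, purely for illustration.
train_pairs = [(0.0, 1.0), (2.0, 0.5), (1.5, 3.0)]
alphas = [0.7, -0.2, 1.1]

h_ab = score(reciprocal_kernel, alphas, train_pairs, (0.3, 1.2))
h_ba = score(reciprocal_kernel, alphas, train_pairs, (1.2, 0.3))
print(graded_relation(h_ab) + graded_relation(h_ba))  # close to 1.0
```

Because the property lives in the kernel, any coefficients produce a symmetric or reciprocal model, so no constraint has to be added to the learning problem itself.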

    Investigating Citation Linkage Between Research Articles

    In recent years, there has been a dramatic increase in scientific publications across the globe. To help navigate this overabundance of information, methods have been devised to find papers with related content, but they lack the ability to provide the specific information a researcher may need without having to read hundreds of linked papers. The search and browsing capabilities of online domain-specific scientific repositories are limited to finding a paper citing other papers, but do not point to the specific text that is being cited. Providing this capability to the research community will reduce the time researchers need to acquire the background information required for their research. In this thesis, we present our effort to develop a citation linkage framework for finding those sentences in a cited article that are the focus of a citation in a citing paper. This undertaking has involved the construction of datasets and corpora that are required to build models for focused information extraction, text classification and information retrieval. As the first part of this thesis, two preprocessing steps that are deemed to assist with the citation linkage task are explored: method mention extraction and rhetorical categorization of scientific discourse. In the second part of this thesis, two methodologies for achieving the citation linkage goal are investigated. Firstly, regression techniques have been used to predict the degree of similarity between citation sentences and their equivalent target sentences, with a medium Pearson correlation between predicted and expected values. The resulting learning models are then used to rank sentences in the cited paper based on their predicted scores. Secondly, search engine-like retrieval techniques have been used to rank sentences in the cited paper based on the words contained in the citation sentence.
Our experiments show that it is possible to find the set of sentences that a citation refers to in a cited paper with reasonable performance. Possible applications of this work include: creation of better science paper repository navigation tools, development of scientific argumentation across research articles, and multi-document summarization of science articles.
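
The retrieval-style approach mentioned in this abstract, ranking sentences in the cited paper by the words they share with the citation sentence, can be sketched with TF-IDF weighting and cosine similarity. The thesis's actual retrieval model is not described in the abstract, so the weighting scheme, the tokenizer and the function names below are assumptions, and the example sentences are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_sentences(citation_sentence, cited_sentences):
    """Return indices of cited_sentences, most similar to the citation first."""
    docs = [tokenize(s) for s in cited_sentences]
    n = len(docs)
    # Document frequency over the cited paper's sentences, with smoothed IDF.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms unseen in the cited paper carry no weight.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = vectorize(tokenize(citation_sentence))
    scored = [(cosine(query, vectorize(doc)), i) for i, doc in enumerate(docs)]
    # Sort by descending similarity; ties keep the original sentence order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

# Invented example sentences, for illustration only.
cited = [
    "We train a support vector machine on labeled abstracts.",
    "The corpus was collected from PubMed.",
    "Results are reported in Table 2.",
]
citation = "Smith et al. trained a support vector machine classifier."
print(rank_sentences(citation, cited))
```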

    Text Mining for Pathway Curation

    Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, 55.6% to 79.6% of PEDL+ extractions were found useful by curators. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension. Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.

    AUTOMATED KEYWORD EXTRACTION FROM BIO-MEDICAL LITERATURE WITH CONCENTRATION ON ANTIBIOTIC RESISTANCE

    The explosive growth of bio-medical literature makes it increasingly difficult and time-consuming to keep up with newly discovered and published information. The extraction of knowledge from papers is critical in enabling computational analysis of biological data. In the last decade, tremendous effort has been put into the development of automated and semi-automated tools for knowledge discovery and extraction from text, as an alternative to monotonous and time-consuming manual processing. This thesis research focused on determining whether minor human supervision can improve the process of automated bio-medical text annotation. One of the main outcomes of this study is a tool that requires minimal effort and time from scientists to reach high precision in semi-automated annotation. The task we targeted is the extraction of keywords related to antibiotic resistance in bacteria. The tool is based on a machine learning algorithm that is retrained several times to achieve the best accuracy.

    Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach

    Background: Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.