132,331 research outputs found

    Extracting protein-protein interactions from text using rich feature vectors and feature selection

    Get PDF
    Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich featurevectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applyingfeature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scalecross-dataset experiments, which offers a more realistic view on model performance. During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up ameaningful evaluation. Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntacticinformation to our method. The gained insight will be useful to develop better performing methods in this domain

    CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

    Full text link
    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manuallydefined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) - a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.Comment: Accepted at WWW 201
    corecore