18 research outputs found

    Kernel Methods for Minimally Supervised WSD

    We present a semi-supervised technique for word sense disambiguation that exploits external knowledge acquired in an unsupervised manner. In particular, we use a combination of basic kernel functions to independently estimate syntagmatic and domain similarity, building a set of word-expert classifiers that share a common domain model acquired from a large corpus of unlabeled data. The results show that the proposed approach achieves state-of-the-art performance on a wide range of lexical sample tasks and on the English all-words task of Senseval-3, although it uses a considerably smaller number of training examples than other methods.
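    The abstract's core idea, combining a syntagmatic kernel (local lexical context) with a domain kernel (topical similarity via a domain model learned from unlabeled data), can be sketched as below. This is a minimal illustration, not the authors' implementation: the tiny hand-made `DOMAIN_MODEL`, the example contexts, and the equal-weight combination are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def bow(tokens):
    """Bag-of-words vector of a context window."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

# Toy domain model (an assumption standing in for one acquired from a
# large unlabeled corpus): each word gets a weight per domain.
DOMAIN_MODEL = {
    "bank": {"finance": 0.9, "geography": 0.1},
    "money": {"finance": 1.0},
    "river": {"geography": 1.0},
    "water": {"geography": 0.9},
}

def domain_vector(tokens):
    """Project a context into domain space via the domain model."""
    vec = {}
    for t in tokens:
        for dom, w in DOMAIN_MODEL.get(t, {}).items():
            vec[dom] = vec.get(dom, 0.0) + w
    return vec

def syntagmatic_kernel(a, b):
    """Similarity of the local lexical contexts."""
    return cosine(bow(a), bow(b))

def domain_kernel(a, b):
    """Similarity of the domains evoked by the contexts."""
    return cosine(domain_vector(a), domain_vector(b))

def combined_kernel(a, b, alpha=0.5):
    """A convex combination of valid kernels is itself a valid kernel."""
    return alpha * syntagmatic_kernel(a, b) + (1 - alpha) * domain_kernel(a, b)
```

    In a real word-expert classifier the combined kernel would be handed to an SVM (e.g. as a precomputed Gram matrix); here it simply scores a finance context as closer to another finance context than to a geography one.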

    Cross language text categorization by acquiring multilingual domain models from comparable corpora

    In a multilingual scenario, the classical monolingual text categorization problem can be reformulated as a cross-language TC task, in which we have to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained using labeled examples in a source language (e.g. English), and it classifies documents in a different target language (e.g. Italian). In this paper we propose a novel approach to the cross-language text categorization problem based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way, without using any external knowledge source (e.g. bilingual dictionaries). These Multilingual Domain Models are exploited to define a generalized similarity function (i.e. a kernel function) among documents in different languages, which is used inside a Support Vector Machines classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
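    The generalized similarity function described above can be sketched as follows: documents in either language are projected into a shared domain space, where an ordinary inner product compares them. The tiny hand-made multilingual domain model below is an illustrative assumption, standing in for one acquired automatically from comparable corpora.

```python
# Toy multilingual domain model mapping English and Italian words into
# the same domain dimensions (an assumption for illustration only).
MDM = {
    # English terms
    "bank": {"finance": 1.0},
    "loan": {"finance": 1.0},
    "match": {"sport": 1.0},
    # Italian terms
    "banca": {"finance": 1.0},
    "prestito": {"finance": 1.0},
    "partita": {"sport": 1.0},
}

def project(tokens):
    """Map a document (token list, any language) into domain space."""
    vec = {}
    for t in tokens:
        for dom, w in MDM.get(t, {}).items():
            vec[dom] = vec.get(dom, 0.0) + w
    return vec

def cross_language_kernel(doc_a, doc_b):
    """Inner product in the shared domain space (a valid kernel)."""
    va, vb = project(doc_a), project(doc_b)
    return sum(va[d] * vb[d] for d in va if d in vb)
```

    An English finance document thus scores high against an Italian finance document and zero against an Italian sports document, even with no shared surface vocabulary; in the paper this kernel is plugged into an SVM classifier.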

    Instance Based Lexical Entailment for Ontology Population

    In this paper we propose an instance-based method for lexical entailment and apply it to automatic ontology population from text. The approach is fully unsupervised and based on kernel methods. We demonstrate the effectiveness of our technique, which largely surpasses both the random and most-frequent baselines and outperforms current state-of-the-art unsupervised approaches on a benchmark ontology available in the literature.

    Instance Pruning by Filtering Uninformative Words: an Information Extraction Case Study

    In this paper we present a novel instance pruning technique for Information Extraction (IE). In particular, our technique filters out uninformative words from texts, based on the assumption that very frequent words in the language provide no specific information about the text in which they appear; their expectation of being (part of) relevant entities is therefore very low. The experiments on two benchmark datasets show that the computation time can be significantly reduced without any significant decrease in prediction accuracy. We also report an improvement in accuracy for one task.
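    The filtering step described above amounts to dropping tokens whose corpus frequency exceeds a threshold before the (expensive) extraction model sees them. A minimal sketch, in which the frequency table, cutoff value, and example sentence are all toy assumptions:

```python
def prune_instance(tokens, corpus_freq, cutoff):
    """Keep only tokens rare enough to plausibly be (part of) an entity.

    Very frequent words are assumed uninformative and are dropped;
    unseen words (frequency 0) are always kept.
    """
    return [t for t in tokens if corpus_freq.get(t, 0) <= cutoff]

# Toy corpus frequencies (an assumption; a real system would count
# them over a large reference corpus).
CORPUS_FREQ = {"the": 120000, "of": 95000, "was": 80000,
               "acquired": 310, "Pixar": 12, "Disney": 45}

pruned = prune_instance(
    ["the", "Disney", "company", "acquired", "Pixar"],
    CORPUS_FREQ, cutoff=1000)
```

    The cutoff would in practice be tuned on held-out data to balance the speed gain against any loss in recall.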

    Domain kernels for word sense disambiguation

    In this paper we present a supervised Word Sense Disambiguation methodology that exploits kernel methods to model sense distinctions. In particular, a combination of kernel functions is adopted to estimate syntagmatic and domain similarity independently. We defined a kernel function, namely the Domain Kernel, that allowed us to plug "external knowledge" into the supervised learning process. External knowledge is acquired from unlabeled data in a totally unsupervised way, and it is represented by means of Domain Models. We evaluated our methodology on several lexical sample tasks in different languages, significantly outperforming the state of the art for each of them while reducing the amount of labeled training data required for learning.
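    The Domain Models mentioned above are acquired from unlabeled data; the paper derives them automatically, while the sketch below approximates the idea with a deliberately simple co-occurrence count against hand-picked seed terms per domain. The seed sets and example sentences are assumptions for illustration, not the authors' actual acquisition procedure.

```python
from collections import defaultdict

# Assumed seed terms per domain (a real Domain Model is induced fully
# automatically from a large unlabeled corpus, e.g. via LSA).
SEEDS = {"finance": {"bank", "money"}, "sport": {"ball", "team"}}

def acquire_domain_model(sentences):
    """Score each word by how often it co-occurs with domain seeds."""
    model = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        words = set(sent)
        for domain, seeds in SEEDS.items():
            hits = len(words & seeds)
            if hits:
                for w in words - seeds:
                    model[w][domain] += hits
    return {w: dict(doms) for w, doms in model.items()}

unlabeled = [
    ["the", "bank", "approved", "the", "loan"],
    ["money", "in", "the", "bank"],
    ["the", "team", "kicked", "the", "ball"],
]
dm = acquire_domain_model(unlabeled)
```

    The resulting word-to-domain weights are exactly the kind of table a Domain Kernel consumes: topical words ("loan", "kicked") pick up a clear domain signal, while function words ("the") spread across domains and contribute little contrast.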

    Bridging Languages by SuperSense Entity Tagging

    This paper explores a very basic linguistic phenomenon in multilingualism: the lexicalizations of entities are very often identical across different languages, while concepts are usually lexicalized differently. Since entities are commonly referred to by proper names in natural language, we measured their distribution in the lexical overlap of the terminologies extracted from comparable corpora. Results show that the lexical overlap is mostly composed of unambiguous words, which can be regarded as anchors to bridge languages: most terms with the same spelling refer to exactly the same entities. Thanks to this important feature of Named Entities, we developed a multilingual supersense tagging system capable of distinguishing between concepts and individuals. The individuals adopted for training were extracted both from YAGO and by a heuristic procedure. The overall F1 of the English tagger is over 76%, in line with the state of the art on supersense tagging despite the larger number of classes. Performance for Italian is slightly lower, but still accurate enough to yield effective results for knowledge acquisition.
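    The overlap measurement described above reduces to intersecting the two extracted terminologies and treating shared spellings (mostly proper names) as candidate cross-language anchors. A minimal sketch with toy term lists (the terms and the choice of the Dice coefficient as the overlap ratio are illustrative assumptions):

```python
def lexical_overlap(terms_a, terms_b):
    """Return the shared terms and their Dice overlap ratio."""
    shared = terms_a & terms_b
    dice = 2 * len(shared) / (len(terms_a) + len(terms_b))
    return shared, dice

# Toy terminologies extracted from comparable English/Italian corpora.
english = {"Rome", "inflation", "Barack Obama", "football"}
italian = {"Rome", "inflazione", "Barack Obama", "calcio"}

anchors, ratio = lexical_overlap(english, italian)
```

    As the paper observes, the shared portion is dominated by named entities ("Rome", "Barack Obama"), whereas common concepts ("inflation"/"inflazione", "football"/"calcio") are lexicalized differently and fall outside the overlap.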

    Unsupervised Part-Of-Speech Tagging Supporting Supervised Methods

    This paper investigates the utility of an unsupervised part-of-speech (PoS) system in a task-oriented way. We use PoS labels as features for different supervised NLP tasks: Word Sense Disambiguation, Named Entity Recognition and Chunking. Further, we explore how much supervised tagging can gain from unsupervised tagging. A comparative evaluation between variants of systems using standard PoS, unsupervised PoS and no PoS at all reveals that supervised tagging gains substantially from unsupervised tagging. In particular, unsupervised PoS tagging behaves similarly to supervised PoS in Word Sense Disambiguation and Named Entity Recognition, while only Chunking still benefits more from supervised PoS. Overall, the results indicate that unsupervised PoS tagging is useful for many applications and a viable low-cost alternative when little or no PoS training data is available for the target language or domain.
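    Using unsupervised PoS labels as features, as described above, simply means extending each token's feature set with the cluster ID an unsupervised tagger assigned to it. A minimal sketch; the tiny cluster map and feature names are assumptions (a real system induces the clusters from raw text):

```python
# Assumed output of an unsupervised tagger: word -> induced cluster ID.
UNSUP_POS = {"the": "C17", "dog": "C3", "barks": "C8"}

def token_features(tokens, i):
    """Feature dict for token i, extended with an unsupervised PoS label."""
    feats = {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<s>",
    }
    # The unsupervised cluster ID acts as a categorical PoS-like feature;
    # unseen words fall back to "UNK".
    feats["upos"] = UNSUP_POS.get(tokens[i], "UNK")
    return feats
```

    Any downstream supervised learner (for WSD, NER or chunking) can consume these dicts unchanged, which is what makes the unsupervised labels a drop-in, low-cost substitute for gold PoS tags.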