    A Word Sense-Oriented User Interface for Interactive Multilingual Text Retrieval

    In this paper we present an interface for supporting a user in an interactive cross-language search process using semantic classes. In order to enable users to access multilingual information, several problems have to be solved: disambiguating and translating the query words, as well as categorizing and presenting the results appropriately. We therefore first give a brief introduction to word sense disambiguation, cross-language text retrieval, and document categorization, and then describe recent achievements of our research towards an interactive multilingual retrieval system. We focus especially on the problem of browsing and navigating the different word senses in one source and possibly several target languages. In the last part of the paper, we discuss the developed user interface and its functionalities in more detail.
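
    To make sense-oriented, cross-language browsing concrete, the minimal sketch below enumerates the WordNet senses of a query term and lists candidate lemmas for each sense in two target languages via the Open Multilingual Wordnet in NLTK. The query word "bank" and the target languages are illustrative assumptions; this is not the interface described in the paper.

```python
# Minimal sketch of sense-oriented cross-language browsing, assuming NLTK
# with the 'wordnet' and 'omw-1.4' data packages installed. Illustrative
# only; not the authors' retrieval system.
from nltk.corpus import wordnet as wn

def sense_table(word, target_langs=("fra", "spa")):
    """One row per WordNet noun sense of `word`, with target-language lemmas."""
    rows = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        row = {"sense": synset.name(), "gloss": synset.definition()}
        for lang in target_langs:
            # Open Multilingual Wordnet lemmas attached to the same synset.
            row[lang] = synset.lemma_names(lang=lang)
        rows.append(row)
    return rows

for row in sense_table("bank"):
    print(row["sense"], "-", row["gloss"])
    print("   fr:", row["fra"], " es:", row["spa"])
```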

    Word vs. Class-Based Word Sense Disambiguation

    As empirically demonstrated by the Word Sense Disambiguation (WSD) tasks of the last SensEval/SemEval exercises, assigning the appropriate meaning to words in context has so far resisted all attempts to be addressed successfully. Many authors argue that one possible reason could be the use of inappropriate sets of word meanings. In particular, WordNet has been used as a de facto standard repository of word meanings in most of these tasks. Thus, instead of using the word senses defined in WordNet, some approaches have derived semantic classes representing groups of word senses. However, the meanings represented by WordNet have only been used for WSD at a very fine-grained sense level or at a very coarse-grained semantic class level (also called SuperSenses). We suspect that an appropriate level of abstraction could lie between these two levels. The contributions of this paper are manifold. First, we propose a simple method to automatically derive semantic classes at intermediate levels of abstraction covering all nominal and verbal WordNet meanings. Second, we empirically demonstrate that our automatically derived semantic classes outperform classical approaches based on word senses and on more coarse-grained sense groupings. Third, we also demonstrate that our supervised WSD system benefits from using these new semantic classes as additional semantic features while reducing the amount of training examples. Finally, we also demonstrate the robustness of our supervised semantic class-based WSD system when tested on an out-of-domain corpus. This work has been partially supported by the NewsReader project (ICT-2011-316404) and the Spanish project SKaTer (TIN2012-38584-C06-02).
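
    The abstract contrasts three granularities of meaning: WordNet senses, intermediate classes, and SuperSenses. The sketch below shows one way to read all three off WordNet with NLTK; using the hypernym ancestor at a fixed depth as the intermediate class is an illustrative stand-in for the paper's automatically derived classes, not the published method.

```python
# Three granularities of word meaning as WSD features, assuming NLTK
# WordNet data is installed. The fixed-depth hypernym rule is only an
# illustrative proxy for the paper's derived classes.
from nltk.corpus import wordnet as wn

def semantic_classes(synset, depth=4):
    path = synset.hypernym_paths()[0]              # root ... synset
    intermediate = path[min(depth, len(path) - 1)]
    return {
        "fine": synset.name(),                     # e.g. 'bank.n.01'
        "intermediate": intermediate.name(),       # ancestor at `depth`
        "coarse": synset.lexname(),                # SuperSense, e.g. 'noun.group'
    }

for synset in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(semantic_classes(synset))
```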

    This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities

    Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as "Figure 1" for inexplicit phrases like "in this figure" or "from these premises". We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.
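
    As a rough illustration of casting DE-reference detection as learning over labeled synsets, the sketch below trains a classifier on a handful of WordNet synsets with made-up binary labels. The labels, the features, and the use of first noun senses are illustrative assumptions, not the paper's corpus or feature set.

```python
# Toy classifier over WordNet synsets, assuming NLTK WordNet data and
# scikit-learn are installed. Labels and features are placeholders.
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def synset_features(synset):
    return {
        "lexname": synset.lexname(),                            # SuperSense
        "depth": min(len(p) for p in synset.hypernym_paths()),  # taxonomy depth
        "n_lemmas": len(synset.lemmas()),
    }

# Hypothetical labels: 1 = DE-relevant sense, 0 = not. The first noun sense
# is used as a simplification; the paper labels individual synsets.
labelled = [("figure", 1), ("section", 1), ("premise", 1),
            ("banana", 0), ("carpet", 0)]
X = [synset_features(wn.synsets(w, pos=wn.NOUN)[0]) for w, _ in labelled]
y = [label for _, label in labelled]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)
query = synset_features(wn.synsets("table", pos=wn.NOUN)[0])
print(clf.predict(vec.transform([query])))
```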

    Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction

    Keyphrases are single- or multi-word phrases that describe the essential content of a document. Keyphrase extraction methods often utilize an external knowledge source such as WordNet to obtain relational information about terms and thus improve their results, but a single knowledge source is often limited; we identify this as the coverage limitation problem. In this paper, we introduce SemCluster, a clustering-based unsupervised keyphrase extraction method that addresses the coverage limitation problem with an extensible approach that integrates an internal ontology (i.e., WordNet) with other knowledge sources to gain wider background knowledge. SemCluster is evaluated against three unsupervised methods, TextRank, ExpandRank, and KeyCluster, under the F1 measure. The evaluation results demonstrate that SemCluster has better accuracy and computational efficiency and is more robust when dealing with documents from different domains.
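
    To illustrate the kind of WordNet-based grouping a clustering approach of this sort relies on, the sketch below measures term relatedness with Wu-Palmer similarity and clusters candidate terms greedily. The candidate terms, the threshold, and the greedy rule are illustrative assumptions, not the published SemCluster algorithm.

```python
# WordNet-based greedy clustering of candidate terms, assuming NLTK
# WordNet data is installed. Illustrative only; not the SemCluster method.
from nltk.corpus import wordnet as wn

def term_similarity(a, b):
    """Max Wu-Palmer similarity over the first few noun senses of two terms."""
    sa, sb = wn.synsets(a, pos=wn.NOUN), wn.synsets(b, pos=wn.NOUN)
    scores = [x.wup_similarity(y) for x in sa[:3] for y in sb[:3]]
    return max((s for s in scores if s is not None), default=0.0)

def greedy_cluster(terms, threshold=0.7):
    clusters = []
    for term in terms:
        for cluster in clusters:
            if term_similarity(term, cluster[0]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters

print(greedy_cluster(["algorithm", "procedure", "keyboard", "mouse", "document"]))
```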

    Sharing Semantic Resources

    The Semantic Web is an extension of the current Web in which information, so far created for human consumption, becomes machine readable, “enabling computers and people to work in cooperation”. To turn this vision into reality, several challenges are still open, the most important of which is sharing meaning formally represented with ontologies or, more generally, with semantic resources. This long-term goal of the Semantic Web converges with activities in the field of Human Language Technology, and in particular with the development of Natural Language Processing applications, where there is a great need for multilingual lexical resources. For instance, one of the most important lexical resources, WordNet, is also commonly regarded and used as an ontology. Another important phenomenon today is the explosion of social collaboration: Wikipedia, the largest encyclopedia in the world, is the object of research as an up-to-date, comprehensive semantic resource. The main topic of this thesis is the collaborative management and exploitation of semantic resources, building on already available resources such as Wikipedia and WordNet. This work presents a general environment able to turn the vision of shared and distributed semantic resources into reality and describes a distributed three-layer architecture that enables rapid prototyping of cooperative applications for developing semantic resources.

    Semantic based Text Summarization for Single Document on Android Mobile Device

    The explosion of information on the World Wide Web is overwhelming readers with limitless content. Long internet articles or journal papers are often cumbersome to read and comprehend. More often than not, readers are immersed in a pool of information with limited time to assimilate all of the articles. This leads to information overload, in which readers must deal with more information than they can process. Hence, there is a clear need for an automatic text summarizer that produces summaries more quickly than humans can. Text summarization research on mobile platforms has been inspired by the paradigm shift towards accessing information ubiquitously, anytime and anywhere, on smartphones and smart devices. In this research, semantic and syntactic analysis is implemented in a text summarizer to address the overload problem whilst providing a more coherent summary. Additionally, WordNet is used as the lexical database for semantic analysis of the text, yielding a more efficient and accurate algorithm than the existing summarization system. The objective of the paper is to integrate WordNet into the proposed system, TextSumIt, which condenses lengthy documents into shorter summaries with higher readability for Android mobile users. The experiments use recall, precision, and F-score to evaluate the summary output against an existing automated summarizer. Human-generated summaries from the Document Understanding Conference (DUC) are taken as the reference summaries for the evaluation. The evaluation shows satisfactory results.
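
    As a rough sketch of how WordNet can support extractive summarization, the example below scores each sentence by how many of its words, after WordNet synonym expansion, are shared with other sentences, and keeps the top-scoring ones. This centrality-style scoring is an illustrative stand-in, not TextSumIt's semantic and syntactic analysis.

```python
# WordNet-assisted extractive summarization sketch, assuming NLTK with the
# 'punkt' and 'wordnet' data packages installed. Not the TextSumIt system.
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize, word_tokenize

def expand(word):
    """The word plus its WordNet synonyms, lower-cased."""
    synonyms = {word.lower()}
    for synset in wn.synsets(word):
        synonyms.update(lemma.name().lower() for lemma in synset.lemmas())
    return synonyms

def summarize(text, n_sentences=2):
    sentences = sent_tokenize(text)
    bags = []
    for sentence in sentences:
        words = {w.lower() for w in word_tokenize(sentence) if w.isalpha()}
        expanded = set()
        for w in words:
            expanded |= expand(w)
        bags.append((words, expanded))

    def centrality(i):
        # Overlap between this sentence's words and the synonym-expanded
        # vocabularies of the other sentences.
        words_i, _ = bags[i]
        return sum(len(words_i & exp_j)
                   for j, (_, exp_j) in enumerate(bags) if j != i)

    top = sorted(range(len(sentences)), key=centrality, reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # keep original order
```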

    Mining Hypernyms Semantic Relations from Stack Overflow

    Communication between a software development team and business partners is often challenging due to the different contexts of the terms used in the information exchange. The various contexts in which concepts are defined or used create slightly different semantic fields that can evolve into information and communication silos. Due to this silo effect, necessary information is often inadequately forwarded to developers, resulting in poorly specified software requirements or misinterpreted user feedback. Communication difficulties can be reduced by introducing a mapping between the semantic fields of the parties involved in the communication, based on the commonly used terminologies. Our research aims to obtain a suitable semantic database in the form of a semantic network built from the Stack Overflow corpus, which can be considered to encompass the common tacit knowledge of the software development community. Terminologies used in the business world can be assigned to our semantic network, so that software developers do not miss features that are not specific to their world but are relevant to their clients. We present an initial experiment in mining a semantic network from Stack Overflow and provide insights into the newly captured relations compared to WordNet.
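
    A tiny illustration of the general idea: lexico-syntactic patterns such as "X such as Y" can surface hypernym pairs from developer text, which can then be checked against WordNet. The single pattern, the sample sentences, and the lookup rule are illustrative assumptions; the paper's mining pipeline is not reproduced here.

```python
# Hearst-style hypernym mining sketch with a WordNet check, assuming NLTK
# WordNet data is installed. Pattern and sentences are illustrative only.
import re
from nltk.corpus import wordnet as wn

PATTERN = re.compile(r"(\w+)s such as (\w+)", re.IGNORECASE)

def mine_hypernyms(sentences):
    """Yield (hyponym, hypernym) pairs matched by the pattern."""
    for sentence in sentences:
        for hypernym, hyponym in PATTERN.findall(sentence):
            yield hyponym.lower(), hypernym.lower()

def in_wordnet(hyponym, hypernym):
    """True if WordNet already links the pair through a hypernym path."""
    for synset in wn.synsets(hyponym, pos=wn.NOUN):
        ancestors = {a.name().split(".")[0]
                     for path in synset.hypernym_paths() for a in path}
        if hypernym in ancestors:
            return True
    return False

corpus = ["Languages such as Python support list comprehensions.",
          "Databases such as PostgreSQL enforce constraints."]
for hypo, hyper in mine_hypernyms(corpus):
    print(hypo, "->", hyper, "| already in WordNet:", in_wordnet(hypo, hyper))
```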

    Medical WordNet: A new methodology for the construction and validation of information resources for consumer health

    A consumer health information system must be able to comprehend both expert and non-expert medical vocabulary and to map between the two. We describe an ongoing project to create a new lexical database called Medical WordNet (MWN), consisting of medically relevant terms used by and intelligible to non-expert subjects, supplemented by a corpus of natural-language sentences designed to provide medically validated contexts for MWN terms. The corpus derives primarily from online health information sources targeted at consumers and comprises two sub-corpora, called Medical FactNet (MFN) and Medical BeliefNet (MBN), respectively. The former consists of statements accredited as true on the basis of a rigorous process of validation, the latter of statements which non-experts believe to be true. We summarize the MWN/MFN/MBN project and describe some of its applications.
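
    To make the structure of the resource concrete, the sketch below models the kind of record such a lexicon might associate with a lay term: an expert-vocabulary counterpart plus validated statements (MFN) and lay beliefs (MBN). The field names and the example statements are purely illustrative assumptions, not content of the actual resource.

```python
# Illustrative data model for a consumer-health lexical entry; all field
# names and example statements are assumptions, not MWN content.
from dataclasses import dataclass, field

@dataclass
class ConsumerHealthEntry:
    lay_term: str                                    # non-expert vocabulary
    expert_term: str                                 # expert vocabulary
    fact_net: list = field(default_factory=list)     # validated statements (MFN)
    belief_net: list = field(default_factory=list)   # statements believed true (MBN)

entry = ConsumerHealthEntry(
    lay_term="high blood pressure",
    expert_term="hypertension",
    fact_net=["Hypertension increases the risk of stroke."],
    belief_net=["Stress alone causes high blood pressure."],
)
print(entry.lay_term, "->", entry.expert_term)
```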