5,920 research outputs found

    Using distributional similarity to organise biomedical terminology

    Get PDF
    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are dened for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of dierent measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy

    Query Expansion for Survey Question Retrieval in the Social Sciences

    Full text link
    In recent years, the importance of research data and the need to archive and to share it in the scientific community have increased enormously. This introduces a whole new set of challenges for digital libraries. In the social sciences typical research data sets consist of surveys and questionnaires. In this paper we focus on the use case of social science survey question reuse and on mechanisms to support users in the query formulation for data sets. We describe and evaluate thesaurus- and co-occurrence-based approaches for query expansion to improve retrieval quality in digital libraries and research data archives. The challenge here is to translate the information need and the underlying sociological phenomena into proper queries. As we can show retrieval quality can be improved by adding related terms to the queries. In a direct comparison automatically expanded queries using extracted co-occurring terms can provide better results than queries manually reformulated by a domain expert and better results than a keyword-based BM25 baseline.Comment: to appear in Proceedings of 19th International Conference on Theory and Practice of Digital Libraries 2015 (TPDL 2015

    Using ontology engineering for understanding needs and allocating resources in web-based industrial virtual collaboration systems

    Get PDF
    In many interactions in cross-industrial and inter-industrial collaboration, analysis and understanding of relative specialist and non-specialist language is one of the most pressing challenges when trying to build multi-party, multi-disciplinary collaboration system. Hence, identifying the scope of the language used and then understanding the relationships between the language entities are key problems. In computer science, ontologies are used to provide a common vocabulary for a domain of interest together with descriptions of the meaning of terms and relationships between them, like in an encyclopedia. These, however, often lack the fuzziness required for human orientated systems. This paper uses an engineering sector business collaboration system (www.wmccm.co.uk) as a case study to illustrate the issues. The purpose of this paper is to introduce a novel ontology engineering methodology, which generates structurally enriched cross domain ontologies economically, quickly and reliably. A semantic relationship analysis of the Google Search Engine Index was devised and evaluated. Using Semantic analysis seems to generate a viable list of subject terms. A social network analysis of the semantically derived terms was conducted to generate a decision support network with rich relationships between terms. The derived ontology was quicker to generate, provided richer internal relationships and relied far less on expert contribution. More importantly, it improved the collaboration matching capability of WMCCM

    Machine Learning of User Profiles: Representational Issues

    Full text link
    As more information becomes available electronically, tools for finding information of interest to users becomes increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interest with minimum user interaction. The research described here focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors as well as subject features corresponding to categories which could be drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.Comment: 6 page

    Issues for historical corpora: first catch your word

    Get PDF
    corecore