Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
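As an illustration of the general idea only, the following is a minimal sketch of distributional similarity over syntactic contexts. It assumes cosine similarity as the measure (the abstract mentions several measures without naming them), and the terms, the (relation, word) contexts, and their counts are all hypothetical stand-ins for parser output; nothing here is taken from the GENIA corpus or the Pro3Gres parser.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(term: str, contexts: dict) -> str:
    """Return the other term whose context vector is most similar to `term`."""
    return max(
        (t for t in contexts if t != term),
        key=lambda t: cosine(contexts[term], contexts[t]),
    )


# Hypothetical counts of (grammatical relation, governing word) contexts
# per term; a real system would collect these from parsed corpus text.
contexts = {
    "il-2": Counter({("subj_of", "activate"): 3, ("obj_of", "express"): 2}),
    "il-4": Counter({("subj_of", "activate"): 2, ("obj_of", "express"): 1}),
    "t-cell": Counter({("obj_of", "activate"): 4, ("subj_of", "proliferate"): 2}),
}

print(nearest("il-2", contexts))
```

Terms that occur in the same syntactic contexts receive a high score, so the nearest neighbour of a term could serve as a guess at its semantic type.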
Query Expansion for Survey Question Retrieval in the Social Sciences
In recent years, the importance of research data and the need to archive and
to share it in the scientific community have increased enormously. This
introduces a whole new set of challenges for digital libraries. In the social
sciences typical research data sets consist of surveys and questionnaires. In
this paper we focus on the use case of social science survey question reuse and
on mechanisms to support users in the query formulation for data sets. We
describe and evaluate thesaurus- and co-occurrence-based approaches for query
expansion to improve retrieval quality in digital libraries and research data
archives. The challenge here is to translate the information need and the
underlying sociological phenomena into proper queries. As we can show, retrieval
quality can be improved by adding related terms to the queries. In a direct
comparison, automatically expanded queries using extracted co-occurring terms
can provide better results than queries manually reformulated by a domain
expert and better results than a keyword-based BM25 baseline.
Comment: to appear in Proceedings of 19th International Conference on Theory
and Practice of Digital Libraries 2015 (TPDL 2015)
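A rough sketch of co-occurrence-based query expansion follows. It assumes raw co-occurrence counts within a single survey question as the association measure, which the abstract does not specify; the function name `expand_query`, the stopword list, and the sample questions are hypothetical and are not data from the paper.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"in", "and", "with", "of", "per"}


def expand_query(query_terms, documents, k=2):
    """Append the k terms that most often co-occur with the query terms."""
    query = set(query_terms)
    cooc = Counter()
    for doc in documents:
        tokens = set(doc.lower().split()) - STOPWORDS
        # Count candidates only in documents that contain a query term.
        if tokens & query:
            cooc.update(tokens - query)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(cooc.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:k]]


# Hypothetical survey questions standing in for an archive.
questions = [
    "trust in parliament and government",
    "trust in government institutions",
    "satisfaction with government",
    "hours of television per day",
]

print(expand_query(["trust"], questions, k=1))
```

The expanded term list could then be passed to any ranking function, such as the BM25 baseline mentioned in the abstract.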
Using ontology engineering for understanding needs and allocating resources in web-based industrial virtual collaboration systems
In many interactions in cross-industrial and inter-industrial collaboration, analysis and understanding of relative specialist and non-specialist language is one of the most pressing challenges when trying to build a multi-party, multi-disciplinary collaboration system. Hence, identifying the scope of the language used and then understanding the relationships between the language entities are key problems. In computer science, ontologies are used to provide a common vocabulary for a domain of interest together with descriptions of the meaning of terms and relationships between them, as in an encyclopedia. These, however, often lack the fuzziness required for human-orientated systems. This paper uses an engineering sector business collaboration system (www.wmccm.co.uk) as a case study to illustrate the issues. The purpose of this paper is to introduce a novel ontology engineering methodology, which generates structurally enriched cross-domain ontologies economically, quickly and reliably. A semantic relationship analysis of the Google Search Engine Index was devised and evaluated. Semantic analysis seems to generate a viable list of subject terms. A social network analysis of the semantically derived terms was conducted to generate a decision support network with rich relationships between terms. The derived ontology was quicker to generate, provided richer internal relationships and relied far less on expert contribution. More importantly, it improved the collaboration matching capability of WMCCM.
Machine Learning of User Profiles: Representational Issues
As more information becomes available electronically, tools for finding
information of interest to users become increasingly important. The goal of
the research described here is to build a system for generating comprehensible
user profiles that accurately capture user interest with minimum user
interaction. The research described here focuses on the importance of a
suitable generalization hierarchy and representation for learning profiles
which are predictively accurate and comprehensible. In our experiments we
evaluated both traditional features based on weighted term vectors as well as
subject features corresponding to categories which could be drawn from a
thesaurus. Our experiments, conducted in the context of a content-based
profiling system for on-line newspapers on the World Wide Web (the IDD News
Browser), demonstrate the importance of a generalization hierarchy and the
promise of combining natural language processing techniques with machine
learning (ML) to address an information retrieval (IR) problem.
Comment: 6 pages