24 research outputs found

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) the evaluation of NLP systems.

    Mixed-Language Arabic-English Information Retrieval

    This thesis addresses the problem of mixed querying in cross-language information retrieval (CLIR). It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents regardless of their languages. To achieve this goal, however, it is first essential to suppress the impact of most of the problems caused by the mixed-language character of both queries and documents, which would otherwise bias the final ranked list. A cross-lingual re-weighting model was therefore developed, in which the term frequency, document frequency and document length components of mixed queries are estimated and adjusted regardless of language, while the model also accounts for uniquely mixed-language features of queries and documents, such as terms co-occurring in two different languages. Furthermore, in mixed queries the non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of the technical terms (mostly those in English), because the latter have high document frequencies (and thus low weights) in their corresponding collection (mostly the English one); this phenomenon is caused by the dominance of English in scientific domains. Accordingly, the thesis also proposes a re-weighted Inverse Document Frequency (IDF) that moderates the effect of overweighted terms in mixed queries.
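
    The abstract above does not give the exact re-weighting formula. As a purely illustrative sketch, assuming a BM25-style smoothed IDF and a simple pooling of per-language document frequencies (both are assumptions, not the thesis's actual model), a cross-lingual IDF could be computed against the combined collections, so that English technical terms are not penalised by the size of the English collection alone:

        import math

        def cross_lingual_idf(term_df: dict[str, int], coll_size: dict[str, int]) -> float:
            """Illustrative pooled IDF for mixed-language queries.
            term_df:   the term's document frequency in each language collection
            coll_size: number of documents in each language collection
            (Hypothetical formulation, not the thesis's re-weighting model.)"""
            n = sum(coll_size.values())   # total documents across all languages
            df = sum(term_df.values())    # pooled document frequency
            return math.log((n - df + 0.5) / (df + 0.5) + 1)  # BM25-style smoothing

        # An English technical term frequent within the English collection is
        # weighted against the pooled collection rather than English alone.
        print(cross_lingual_idf({"en": 900, "ar": 5}, {"en": 100_000, "ar": 40_000}))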

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, scalable techniques are needed that require little or no manual intervention and that are robust to noise. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward these problems, we propose a general, prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it enables on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to the inconsistencies, sparseness, and noise inherent in community contributions, these approaches are error-prone and require substantial human intervention. Given the schema heterogeneity of Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
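
    The abstract does not detail the clustering step. As an illustrative sketch only, assuming Jaccard similarity over attribute sets, a fixed threshold, and greedy assignment (none of which are claimed to be the dissertation's actual algorithm), infoboxes could be grouped by attribute overlap, with each cluster's majority attributes taken as the inferred schema:

        from collections import Counter

        def jaccard(a: set[str], b: set[str]) -> float:
            """Overlap between two attribute sets."""
            return len(a & b) / len(a | b) if a | b else 0.0

        def cluster_infoboxes(infoboxes: list[set[str]], threshold: float = 0.5):
            """Greedy clustering of infoboxes by attribute-set similarity
            (hypothetical parameters, shown only to make the idea concrete)."""
            clusters: list[list[set[str]]] = []
            for box in infoboxes:
                best = max(clusters, default=None,
                           key=lambda c: jaccard(box, set().union(*c)))
                if best is not None and jaccard(box, set().union(*best)) >= threshold:
                    best.append(box)
                else:
                    clusters.append([box])
            # Infer one schema per cluster: attributes in a majority of members.
            schemata = []
            for c in clusters:
                counts = Counter(attr for box in c for attr in box)
                schemata.append({a for a, n in counts.items() if n > len(c) / 2})
            return clusters, schemata

        boxes = [{"name", "birth_date", "occupation"},
                 {"name", "birth_date", "spouse"},
                 {"name", "population", "area_km2"}]
        print(cluster_infoboxes(boxes)[1])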

    Interim research assessment 2003-2005 - Computer Science

    This report primarily serves as a source of information for the 2007 Interim Research Assessment Committee for Computer Science at the three technical universities in the Netherlands. It also provides information for others interested in our research activities.

    Collaborative Knowledge Visualisation for Cross-Community Knowledge Exchange

    The notion of communities as informal social networks based on shared interests or common practices has increasingly been used as a unit of analysis when considering the processes of cooperative creation and sharing of knowledge. While knowledge exchange within communities has been extensively researched, several studies have observed the importance of cross-community knowledge exchange for the creation of new knowledge and innovation in knowledge-intensive organizations. In knowledge management especially, the need to support cooperation and knowledge exchange between communities with highly specialized expertise and activities has become a critical problem. Although several studies discuss the importance and difficulties of knowledge sharing across community boundaries, the development of technological support incorporating these findings has received little attention. This work presents an approach to supporting cross-community knowledge exchange that uses knowledge visualisation to facilitate information access in unfamiliar community domains. The theoretical grounding and practical relevance of the proposed approach are ensured by defining a requirements model that integrates theoretical frameworks for cross-community knowledge exchange with the practical needs of typical knowledge management processes and of sensemaking tasks during information access in unfamiliar domains. This synthesis suggests that visualising the knowledge structures of communities, and supporting the discovery of relationships between them during access to community spaces, could provide valuable support for cross-community discovery and sharing of knowledge. This is the main hypothesis investigated in this thesis. Accordingly, a novel method is developed for eliciting and visualising the implicit knowledge structures of individuals and communities in the form of dynamic knowledge maps that make the elicited knowledge usable for semantic exploration and navigation of community spaces. The method allows unobtrusive construction of personal and community knowledge maps based on user interaction with information, and their use for dynamic classification of information from a specific point of view. The visualisation model combines Document Maps, which present main topics, document clusters and relationships between knowledge reflected in community spaces, with Concept Maps, which visualise personal and shared conceptual structures of community members. The technical realization integrates Kohonen's self-organizing maps with the extraction of word categories from texts, collaborative indexing and personalised classification based on user-induced templates. This is accompanied by intuitive visualisation of, and interaction with, complex information spaces based on multi-view navigation of document landscapes and concept networks. The developed method is prototypically implemented as an application framework, a concrete system and a visual information interface for multi-perspective access to community information spaces, the Knowledge Explorer. The application framework implements services for generating and using personal and community knowledge maps to support explicit and implicit knowledge exchange between members of different communities. The Knowledge Explorer allows simultaneous visualisation of different personal and community knowledge structures and enables their use for structuring, exploring and navigating community information spaces from different points of view.
    The empirical evaluation in a comparative laboratory study confirms the adequacy of the developed solutions with respect to the specific requirements of the cross-community problem and demonstrates substantially better quality of knowledge access than a standard information-seeking reference system. The developed evaluation framework and operational measures for the quality of knowledge access in cross-community contexts also provide a theoretically grounded and practically feasible method for further developing and evaluating new solutions to this important but little-investigated problem.
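
    The abstract names Kohonen's self-organizing maps as the basis of the Document Maps. As a minimal illustrative sketch, with the toy vectors, grid size and training schedule all assumed rather than taken from the thesis, a SOM places similar document vectors in nearby grid cells, which is the property a map-based document visualisation relies on:

        import numpy as np

        def train_som(docs: np.ndarray, grid: int = 4, epochs: int = 50, seed: int = 0):
            """Tiny self-organizing map: returns a (grid*grid, dim) matrix whose
            rows are map-cell prototypes (hypothetical parameters, illustration only)."""
            rng = np.random.default_rng(seed)
            w = rng.random((grid * grid, docs.shape[1]))
            coords = np.array([(i, j) for i in range(grid) for j in range(grid)])
            for t in range(epochs):
                lr = 0.5 * (1 - t / epochs)                   # decaying learning rate
                radius = max(grid / 2 * (1 - t / epochs), 1)  # shrinking neighbourhood
                for x in docs:
                    bmu = np.argmin(((w - x) ** 2).sum(axis=1))     # best-matching unit
                    d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)  # squared grid distance
                    h = np.exp(-d2 / (2 * radius ** 2))             # neighbourhood kernel
                    w += lr * h[:, None] * (x - w)                  # pull cells toward x
            return w

        # Documents sharing a topic end up in nearby (often identical) map cells.
        docs = np.array([[1.0, 0, 0], [0.9, 0.1, 0], [0, 0, 1.0], [0, 0.1, 0.9]])
        w = train_som(docs)
        print([int(np.argmin(((w - d) ** 2).sum(axis=1))) for d in docs])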

    Corpus Linguistics software: understanding their usage and delivering two new tools

    The increasing availability of computers to ordinary users over the last few decades has led to an exponential increase in the use of Corpus Linguistics (CL) methodologies. The people exploring this data come from a variety of backgrounds and, in many cases, are not proficient corpus linguists. Despite the ongoing development of new tools, there is still an immense gap between what CL can offer and what researchers currently do. This study has two outcomes. It (a) identifies the gap between potential and actual uses of CL methods and tools, and (b) enhances the usability of CL software, complementing its statistical applications with data visualization and user-friendly interfaces. The first outcome is achieved through (i) an investigation of how CL methods are reported in academic publications; (ii) systematic observation of users of CL software as they engage in routine tasks; and (iii) a review of four well-established pieces of software used for corpus exploration. Based on the findings, two new, highly usable statistical tools for CL studies were developed and implemented on an existing system, CQPweb. The Advanced Dispersion tool allows users to explore graphically how queries are distributed in a corpus, making it easier to understand the concept of dispersion; it also provides accurate dispersion measures. The Parlink Tool was designed with beginners interested in translation studies and second language education as its primary audience. Its main function is to let users see possible translations for corpus queries in parallel concordances, without the need for external resources such as translation memories.
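
    The abstract does not specify which dispersion measures the Advanced Dispersion tool implements. As an illustrative sketch of the underlying concept only, Gries's DP (deviation of proportions), a widely used dispersion measure in corpus linguistics, compares how a word's occurrences are spread over the parts of a corpus with how large those parts are (its use here is an assumption, not a claim about the tool):

        def gries_dp(freqs: list[int], part_sizes: list[int]) -> float:
            """Gries's DP (deviation of proportions): 0 means perfectly even
            dispersion across corpus parts; values near 1 mean highly clumped.
            freqs:      occurrences of the word in each corpus part
            part_sizes: token count of each corpus part"""
            total_f, total_s = sum(freqs), sum(part_sizes)
            return 0.5 * sum(abs(f / total_f - s / total_s)
                             for f, s in zip(freqs, part_sizes))

        # 100 hits concentrated in one of four equal-sized parts: poorly dispersed.
        print(gries_dp([100, 0, 0, 0], [25_000] * 4))    # 0.75
        print(gries_dp([25, 25, 25, 25], [25_000] * 4))  # 0.0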