970 research outputs found

    Knowledge-Based Techniques for Scholarly Data Access: Towards Automatic Curation

    Get PDF
    Accessing up-to-date and quality scientific literature is a critical preliminary step in any research activity. Identifying relevant scholarly literature for the extents of a given task or application is, however a complex and time consuming activity. Despite the large number of tools developed over the years to support scholars in their literature surveying activity, such as Google Scholar, Microsoft Academic search, and others, the best way to access quality papers remains asking a domain expert who is actively involved in the field and knows research trends and directions. State of the art systems, in fact, either do not allow exploratory search activity, such as identifying the active research directions within a given topic, or do not offer proactive features, such as content recommendation, which are both critical to researchers. To overcome these limitations, we strongly advocate a paradigm shift in the development of scholarly data access tools: moving from traditional information retrieval and filtering tools towards automated agents able to make sense of the textual content of published papers and therefore monitor the state of the art. Building such a system is however a complex task that implies tackling non trivial problems in the fields of Natural Language Processing, Big Data Analysis, User Modelling, and Information Filtering. In this work, we introduce the concept of Automatic Curator System and present its fundamental components.openDottorato di ricerca in InformaticaopenDe Nart, Dari

    Sharing Semantic Resources

    Get PDF
    The Semantic Web is an extension of the current Web in which information, so far created for human consumption, becomes machine readable, “enabling computers and people to work in cooperation”. To turn into reality this vision several challenges are still open among which the most important is to share meaning formally represented with ontologies or more generally with semantic resources. This Semantic Web long-term goal has many convergences with the activities in the field of Human Language Technology and in particular in the development of Natural Language Processing applications where there is a great need of multilingual lexical resources. For instance, one of the most important lexical resources, WordNet, is also commonly regarded and used as an ontology. Nowadays, another important phenomenon is represented by the explosion of social collaboration, and Wikipedia, the largest encyclopedia in the world, is object of research as an up to date omni comprehensive semantic resource. The main topic of this thesis is the management and exploitation of semantic resources in a collaborative way, trying to use the already available resources as Wikipedia and Wordnet. This work presents a general environment able to turn into reality the vision of shared and distributed semantic resources and describes a distributed three-layer architecture to enable a rapid prototyping of cooperative applications for developing semantic resources

    Implementation of a knowledge discovery and enhancement module from structured information gained from unstructured sources of information

    Get PDF
    Tese de mestrado integrado. Engenharia Informática e Computação. Faculdade de Engenharia. Universidade do Porto. 201

    Spatial and temporal-based query disambiguation for improving web search

    Get PDF
    Queries submitted to search engines are ambiguous in nature due to users’ irrelevant input which poses real challenges to web search engines both towards understanding a query and giving results. A lot of irrelevant and ambiguous information creates disappointment among users. Thus, this research proposes an ambiguity evolvement process followed by an integrated use of spatial and temporal features to alleviate the search results imprecision. To enhance the effectiveness of web information retrieval the study develops an enhanced Adaptive Disambiguation Approach for web search queries to overcome the problems caused by ambiguous queries. A query classification method was used to filter search results to overcome the imprecision. An algorithm was utilized for finding the similarity of the search results based on spatial and temporal features. Users’ selection based on web results facilitated recording of implicit feedback which was then utilized for web search improvement. Performance evaluation was conducted on data sets GISQC_DS, AMBIENT and MORESQUE comprising of ambiguous queries to certify the effectiveness of the proposed approach in comparison to a well-known temporal evaluation and two-box search methods. The implemented prototype is focused on ambiguous queries to be classified by spatial or temporal features. Spatial queries focus on targeting the location information whereas temporal queries target time in years. In conclusion, the study used search results in the context of Spatial Information Retrieval (S-IR) along with temporal information. Experiments results show that the use of spatial and temporal features in combination can significantly improve the performance in terms of precision (92%), accuracy (93%), recall (95%), and f-measure (93%). Moreover, the use of implicit feedback has a significant impact on the search results which has been demonstrated through experimental evaluation.SHAHID KAMA

    AuthCrowd: Author Name Disambiguation and Entity Matching using Crowdsourcing

    Get PDF
    Despite decades of research and development in named entity resolution, dealing with name ambiguity is still a a challenging issue for much bibliometric-enhanced information retrieval (IR) tasks. As new bibliographic datasets are created as a result of the upward growth of publication records worldwide, more problems arise when considering the effects of errors resulting from missing data fields, duplicate entities, misspellings, extra characters, etc. As these concerns tend to be of large-scale, both the general consistency and the quality of electronic data are largely affected. This paper presents an approach to handle these name ambiguity problems through the use of crowdsourcing as a complementary means to traditional unsupervised approaches. To this end, we present “AuthCrowd”, a crowdsourcing system with the ability to decompose named entity disambiguation and entity matching tasks. Experimental results on a real-world dataset of publicly available papers published in peer-reviewed venues demonstrate the potential of our proposed approach for improving author name disambiguation. The findings further highlight the importance of adopting hybrid crowd-algorithm collaboration strategies, especially for handling complexity and quantifying bias when working with large amounts of data

    Real Time Web Search Framework for Performing Efficient Retrieval of Data

    Get PDF
    With the rapidly growing amount of information on the internet, real-time system is one of the key strategies to cope with the information overload and to help users in finding highly relevant information. Real-time events and domain-specific information are important knowledge base references on the Web that frequently accessed by millions of users. Real-time system is a vital to product and a technique must resolve the context of challenges to be more reliable, e.g. short data life-cycles, heterogeneous user interests, strict time constraints, and context-dependent article relevance. Since real-time data have only a short time to live, real-time models have to be continuously adapted, ensuring that real-time data are always up-to-date. The focal point of this manuscript is for designing a real-time web search approach that aggregates several web search algorithms at query time to tune search results for relevancy. We learn a context-aware delegation algorithm that allows choosing the best real-time algorithms for each query request. The evaluation showed that the proposed approach outperforms the traditional models, in which it allows us to adapt the specific properties of the considered real-time resources. In the experiments, we found that it is highly relevant for most recently searched queries, consistent in its performance, and resilient to the drawbacks faced by other algorithms

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    Adaptive Semantic Annotation of Entity and Concept Mentions in Text

    Get PDF
    The recent years have seen an increase in interest for knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories so that the content in those documents is described in a vocabulary that is unambiguously understood across applications. Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts that have been mentioned in a particular passage and attempting to resolve eventual ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require the focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, is often unclear if an existing solution is applicable, leading those developers to trial-and-error and ad hoc usage of multiple systems in an attempt to achieve their objective. In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework that enables the elucidation of strengths and limitations of each technique, supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that have the ability of supporting document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions that we performed to the knowledge base DBpedia to support this implementation. I report evaluations of this tool on several well known data sets, and demonstrate applications to diverse use cases for further validation

    PersoNER: Persian named-entity recognition

    Full text link
    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network
    • …
    corecore