11,193 research outputs found

    Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation

    Get PDF
    Parsing Web information, namely parsing content to find relevant documents on the basis of a user’s query, represents a crucial step to guarantee fast and accurate Information Retrieval (IR). Generally, an automated approach to such task is considered faster and cheaper than manual systems. Nevertheless, results do not seem have a high level of accuracy, indeed, as also Hjorland (2007) states, using stochastic algorithms entails: • Low precision due to the indexing of common Atomic Linguistic Units (ALUs) or sentences. • Low recall caused by the presence of synonyms. • Generic results arising from the use of too broad or too narrow terms. Usually IR systems are based on invert text index, namely an index data structure storing a mapping from content to its locations in a database file, or in a document or a set of documents. In this paper we propose a system, by means of which we will develop a search engine able to process online documents, starting from a natural language query, and to return information to users. The proposed approach, based on the Lexicon-Grammar (LG) framework and its language formalization methodologies, aims at integrating a semantic annotation process for both query analysis and document retrieval

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Compositional Semantic Parsing on Semi-Structured Tables

    Full text link
    Two important aspects of semantic parsing for question answering are the breadth of the knowledge source and the depth of logical compositionality. While existing work trades off one aspect for another, this paper simultaneously makes progress on both fronts through a new task: answering complex questions on semi-structured tables using question-answer pairs as supervision. The central challenge arises from two compounding factors: the broader domain results in an open-ended set of relations, and the deeper compositionality results in a combinatorial explosion in the space of logical forms. We propose a logical-form driven parsing algorithm guided by strong typing constraints and show that it obtains significant improvements over natural baselines. For evaluation, we created a new dataset of 22,033 complex questions on Wikipedia tables, which is made publicly available
    • …
    corecore