Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation
Parsing Web information, namely parsing content to find relevant documents on the basis of a user's query, represents a crucial step to guarantee fast and accurate Information Retrieval (IR). Generally, an automated approach to such a task is considered faster and cheaper than manual systems. Nevertheless, the results do not seem to reach a high level of accuracy; indeed, as Hjorland (2007) also states, using stochastic algorithms entails:
• Low precision due to the indexing of common Atomic Linguistic Units (ALUs) or sentences.
• Low recall caused by the presence of synonyms.
• Generic results arising from the use of too broad or too narrow terms.
IR systems are usually based on an inverted text index, namely an index data structure storing a mapping from content to its locations in a database file, in a document, or in a set of documents. In this paper we propose a system by means of which we will develop a search engine able to process online documents, starting from a natural language query, and to return information to users. The proposed approach, based on the Lexicon-Grammar (LG) framework and its language formalization methodologies, aims at integrating a semantic annotation process for both query analysis and document retrieval.
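The inverted index mentioned in the abstract above can be illustrated with a minimal sketch. This is not the Lexicon-Grammar system the paper proposes; it only shows the generic data structure, a mapping from each term to the documents and token positions where it occurs. The names `tokenize`, `build_inverted_index`, and `search`, and the sample documents, are hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to a dict of {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # One set of document ids per query term; a missing term gives an empty set.
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample collection, for illustration only.
documents = {
    "doc1": "Parsing web information for information retrieval",
    "doc2": "Semantic annotation of web documents",
    "doc3": "Blog data extraction methodology",
}
index = build_inverted_index(documents)
print(search(index, "web information"))
print(index["information"])
```

A purely term-based lookup like this one matches surface forms only, so a query that uses a synonym of an indexed word retrieves nothing. That is the low-recall limitation listed in the abstract.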
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Compositional Semantic Parsing on Semi-Structured Tables
Two important aspects of semantic parsing for question answering are the
breadth of the knowledge source and the depth of logical compositionality.
While existing work trades off one aspect for another, this paper
simultaneously makes progress on both fronts through a new task: answering
complex questions on semi-structured tables using question-answer pairs as
supervision. The central challenge arises from two compounding factors: the
broader domain results in an open-ended set of relations, and the deeper
compositionality results in a combinatorial explosion in the space of logical
forms. We propose a logical-form driven parsing algorithm guided by strong
typing constraints and show that it obtains significant improvements over
natural baselines. For evaluation, we created a new dataset of 22,033 complex
questions on Wikipedia tables, which is made publicly available.