1,693 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Doctor of Philosophy

    Events are an important type of information found throughout text. Event extraction is an information extraction (IE) task that involves identifying entities and objects (mainly noun phrases) that represent important roles in events of a particular type. However, the extraction performance of current event extraction systems is limited because they mainly consider local context (mostly isolated sentences) when making each extraction decision. My research aims to improve both the coverage and the accuracy of event extraction by explicitly identifying event contexts before extracting individual facts. First, I introduce new event extraction architectures that incorporate discourse information across a document to seek out and validate pieces of event descriptions within the document. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a unified discourse-guided approach that includes a structured sentence classifier to sequentially read a story and determine which sentences contain event information based on both the local and preceding contexts. Experimental results on two distinct event domains show that, compared to previous event extraction systems, TIER can find more event information while maintaining good extraction accuracy, and LINKER can further improve extraction accuracy. Finding documents that describe a specific type of event is also highly challenging because of the wide variety and ambiguity of event expressions. In this dissertation, I present the multifaceted event recognition approach, which uses event-defining characteristics (facets), in addition to event expressions, to effectively resolve the complexity of event descriptions. I also present a novel bootstrapping algorithm that automatically learns event expressions as well as facets of events and requires minimal human supervision.
Experimental results show that the multifaceted event recognition approach can effectively identify documents that describe a particular type of event and make event extraction systems more precise.

    Report on shape analysis and matching and on semantic matching

    In GRAVITATE, two disparate specialities will come together in one working platform for the archaeologist: the fields of shape analysis and of metadata search. These fields are relatively disjoint at the moment, and the research and development challenge of GRAVITATE is precisely to merge them for our chosen tasks. As shown in chapter 7, the small amount of literature that already attempts to join 3D geometry and semantics is not related to the cultural heritage domain. Therefore, after the project is done, there should be a clear ‘before-GRAVITATE’ and ‘after-GRAVITATE’ split in how these two aspects of a cultural heritage artefact are treated. This state of the art report (SOTA) is ‘before-GRAVITATE’. Shape analysis and metadata description are described separately, as they currently are in the literature, and we end the report with common recommendations in chapter 8 on possible or plausible cross-connections that suggest themselves. These considerations will be refined for the Roadmap for Research deliverable. Within the project, a jargon is developing in which ‘geometry’ stands for the physical properties of an artefact (not only its shape, but also its colour and material) and ‘metadata’ is used as a general shorthand for the semantic description of the provenance, location, ownership, classification, use etc. of the artefact. As we proceed in the project, we will find a need to refine those broad divisions and to find intermediate classes (such as a semantic description of certain colour patterns), but for now the terminology is convenient – not least because it highlights the interesting area where both aspects meet. On the ‘geometry’ side, the GRAVITATE partners are UVA, Technion and CNR/IMATI; on the metadata side, IT Innovation, British Museum and Cyprus Institute, the latter two of course also playing the role of internal users and representatives of the Cultural Heritage (CH) data and target user group.
CNR/IMATI’s experience in shape analysis and similarity will be an important bridge between the two worlds of geometry and metadata. The authorship and styles of this SOTA reflect these specialisms: the first part (chapters 3 and 4) was written purely by the geometry partners (mostly IMATI and UVA) and the second part (chapters 5 and 6) by the metadata partners, especially IT Innovation, while the joint overview on 3D geometry and semantics is mainly by IT Innovation and IMATI. The common section on Perspectives was written with the contribution of all.