990 research outputs found

    Querying and updating treebanks: A critical survey and requirements analysis

    Get PDF
    Language technology makes extensive use of hierarchically annotated text and speech data. These databases are stored in flat files and manipulated using corpus-specific query tools or special-purpose scripts. While the size of these databases and the range of applications has grown rapidly in recent years, neither method for managing the data has led to reusable, scalable software. The formal properties of the query languages are not well understood. Hence established methods for indexing tree data and optimizing tree queries cannot be employed. We analyze a range of existing linguistic query languages, and adduce a set of requirements for a reusable, scalable linguistic query language.

    Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

    Full text link
    Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.Comment: 10 pages, 4 figure

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    Get PDF
    After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we have identified and analyzed gaps within European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases and socio- economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of functional breakdown of generic multimedia search engine, and secondly, a representative use-cases descriptions with the related discussion on requirement for technological challenges. Both studies have been carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal challenges

    A General Framework for Representing, Reasoning and Querying with Annotated Semantic Web Data

    Full text link
    We describe a generic framework for representing and reasoning with annotated Semantic Web data, a task becoming more important with the recent increased amount of inconsistent and non-reliable meta-data on the web. We formalise the annotated language, the corresponding deductive system and address the query answering problem. Previous contributions on specific RDF annotation domains are encompassed by our unified reasoning formalism as we show by instantiating it on (i) temporal, (ii) fuzzy, and (iii) provenance annotations. Moreover, we provide a generic method for combining multiple annotation domains allowing to represent, e.g. temporally-annotated fuzzy RDF. Furthermore, we address the development of a query language -- AnQL -- that is inspired by SPARQL, including several features of SPARQL 1.1 (subqueries, aggregates, assignment, solution modifiers) along with the formal definitions of their semantics

    Adaptive Semantic Annotation of Entity and Concept Mentions in Text

    Get PDF
    The recent years have seen an increase in interest for knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories so that the content in those documents is described in a vocabulary that is unambiguously understood across applications. Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts that have been mentioned in a particular passage and attempting to resolve eventual ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require the focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, is often unclear if an existing solution is applicable, leading those developers to trial-and-error and ad hoc usage of multiple systems in an attempt to achieve their objective. In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework that enables the elucidation of strengths and limitations of each technique, supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that have the ability of supporting document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions that we performed to the knowledge base DBpedia to support this implementation. I report evaluations of this tool on several well known data sets, and demonstrate applications to diverse use cases for further validation

    XML: aplicações e tecnologias associadas: 6th National Conference

    Get PDF
    This volume contains the papers presented at the Sixth Portuguese XML Conference, called XATA (XML, Aplicações e Tecnologias Associadas), held in Évora, Portugal, 14-15 February, 2008. The conference followed on from a successful series held throughout Portugal in the last years: XATA2003 was held in Braga, XATA2004 was held in Porto, XATA2005 was held in Braga, XATA2006 was held in Portalegre and XATA2007 was held in Lisboa. Dued to research evaluation criteria that are being used to evaluate researchers and research centers national conferences are becoming deserted. Many did not manage to gather enough submissions to proceed in this scenario. XATA made it through. However with a large decrease in the number of submissions. In this edition a special meeting will join the steering committee with some interested attendees to discuss XATA's future: internationalization, conference model, ... We think XATA is important in the national context. It has succeeded in gathering and identifying a comunity that shares the same research interests and has promoted some colaborations. We want to keep "the wheel spinning"... This edition has its program distributed by first day's afternoon and next day's morning. This way we are facilitating travel arrangements and we will have one night to meet

    Querying and Efficiently Searching Large, Temporal Text Corpora

    Get PDF
    corecore