1,369 research outputs found

    An Information Extraction Approach to Reorganizing and Summarizing Specifications

    Materials and Process Specifications are complex semi-structured documents containing numeric data, text, and images. This article describes a coarse-grain extraction technique to automatically reorganize and summarize spec content. Specifically, a strategy for semantic-markup, to capture content within a semantic ontology, relevant to semi-automatic extraction, has been developed and experimented with. The working prototypes were built in the context of Cohesia's existing software infrastructure, and use techniques from Information Extraction, XML technology, etc.

    Survey over Existing Query and Transformation Languages

    A widely acknowledged obstacle for realizing the vision of the Semantic Web is the inability of many current Semantic Web approaches to cope with data available in such diverging representation formalisms as XML, RDF, or Topic Maps. A common query language is the first step to allow transparent access to data in any of these formats. To further the understanding of the requirements and approaches proposed for query languages in the conventional as well as the Semantic Web, this report surveys a large number of query languages for accessing XML, RDF, or Topic Maps. This is the first systematic survey to consider query languages from all these areas. From the detailed survey of these query languages, a common classification scheme is derived that is useful for understanding and differentiating languages within and among all three areas.

    Integrating data warehouses with web data : a survey

    This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query, and retrieve Web data and their application to DWs. The paper reviews different DW distributed architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semistructured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources, and the XML extensions of OnLine Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities that the combination of the DW and Web fields offers, as well as to identify open research lines.

    Collaborative software agents support for the texpros document management system

    This dissertation investigates the use of active rules that are embedded in markup documents. Active rules are used in a markup representation by integrating Collaborative Software Agents with TEXPROS (abbreviation for TEXt PROcessing System) [Liu and Ng 1996] to create a powerful distributed document management system. Such markup documents with embedded active rules are called Active Documents. For fast retrieval, when a customized Internet folder organization needs to be generated, we first define the Folder Organization Query Language (FO-QL) to solve data categorization problems. FO-QL defines the folder organization query process that automatically retrieves links to documents deposited into folders and then constructs a folder organization in either a centralized document repository or multiple distributed document repositories. Traditional documents are stored as static data that do not provide any dynamic capabilities for accessing or interacting with the document environment. Markup data and markup rules, being dynamic and distributed, do not merely respond to requests for information but intelligently anticipate, adapt, and actively seek ways to support the computing processes. This feature overcomes the static nature of traditional documents. An Office Automation Definition Language (OADL) with active rules is defined for constructing TEXPROS's dual modeling approach and workflow events representation. Active Documents are such agent-supported OADL documents. With embedded rules and self-describing data features, Active Documents provide the capability for collaborative interactions with software agents. Data transformation and data integration are both data processing problems, but little research has focused on using markup documents to generate a versatile folder organization. Some of the research merely provides manual browsing in a document repository to find the right document. This browsing is time-consuming and unrealistic, especially in multiple document repositories. With FO-QL, one can create a customized folder organization on demand.

    CRIS-IR 2006

    The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge as well as towards supporting knowledge management applications. The challenge lies in how to extract and correlate entities in order to answer key knowledge management questions, such as: who works with whom, on which projects, with which customers, and in what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks, whose core is the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperforms other correlation methods. We also present an application to demonstrate the approach in knowledge management scenarios. Fundação para a Ciência e a Tecnologia (FCT); Denmark's Electronic Research Library

    Accelerating data retrieval steps in XML documents

    Schema Inference for Massive JSON Datasets

    In recent years JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision, and conciseness of inferred schemas, and scalability.

    XML: aplicações e tecnologias associadas: 6th National Conference

    This volume contains the papers presented at the Sixth Portuguese XML Conference, called XATA (XML, Aplicações e Tecnologias Associadas), held in Évora, Portugal, 14-15 February, 2008. The conference followed on from a successful series held throughout Portugal in recent years: XATA2003 was held in Braga, XATA2004 in Porto, XATA2005 in Braga, XATA2006 in Portalegre, and XATA2007 in Lisboa. Due to the criteria now used to evaluate researchers and research centers, national conferences are becoming deserted. Many did not manage to gather enough submissions to proceed in this scenario. XATA made it through, though with a large decrease in the number of submissions. In this edition a special meeting will bring the steering committee together with some interested attendees to discuss XATA's future: internationalization, conference model, ... We think XATA is important in the national context. It has succeeded in gathering and identifying a community that shares the same research interests and has promoted some collaborations. We want to keep "the wheel spinning"... This edition has its program spread over the first day's afternoon and the next day's morning. This way we are facilitating travel arrangements and we will have one night to meet.