575 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-Based Systems
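
    As a concrete illustration of the wrapper-style techniques such surveys cover, here is a minimal sketch of a hand-written extractor, assuming Python with the third-party requests and BeautifulSoup libraries; the URL and the CSS selectors (div.listing, div.item, h2.title, span.price) are hypothetical placeholders, not taken from the survey.

```python
# Minimal wrapper-style extractor: pull structured records out of a
# listing page. The URL and selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def extract_records(url: str) -> list[dict]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.listing > div.item"):  # repeated page region
        title = item.select_one("h2.title")
        price = item.select_one("span.price")
        if title and price:                              # skip malformed items
            records.append({"title": title.get_text(strip=True),
                            "price": price.get_text(strip=True)})
    return records

if __name__ == "__main__":
    for rec in extract_records("https://example.com/listing"):
        print(rec)
```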

    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    The variety and complexity of potentially-related data resources available for querying (webpages, databases, data warehouses) has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most current IE and II methods that could be applied to the problem of integration across sources require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input via weakly-supervised methods for IE and II. In particular, we argue that graph-based representations of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration.

    Within IE, we focus on the problem of assigning semantic classes to entities. First, we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases.

    Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. Finally, we demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.
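
    Adsorption and MAD themselves are not reproduced here, but the generic iterative label-propagation idea they build on can be sketched in a few lines of Python: seed nodes keep their labels clamped while the remaining nodes repeatedly take the weighted average of their neighbours' label distributions. The graph, seeds and weights below are illustrative assumptions, not data from the thesis.

```python
# Minimal iterative label propagation on an undirected weighted graph:
# seed nodes keep their class distribution clamped; the other nodes
# repeatedly average their neighbours' distributions until stable.
from collections import defaultdict

def propagate(edges, seeds, classes, iters=50):
    # edges: {(u, v): weight}; seeds: {node: class_label}
    graph = defaultdict(dict)
    for (u, v), w in edges.items():
        graph[u][v] = w
        graph[v][u] = w
    # initial distributions: one-hot for seeds, zero elsewhere
    scores = {n: {c: 0.0 for c in classes} for n in graph}
    for n, c in seeds.items():
        scores[n][c] = 1.0
    for _ in range(iters):
        new = {}
        for n in graph:
            if n in seeds:            # clamp seed labels
                new[n] = scores[n]
                continue
            total = sum(graph[n].values())
            new[n] = {c: sum(graph[n][m] * scores[m][c] for m in graph[n]) / total
                      for c in classes}
        scores = new
    return scores

# Toy graph: entity similarity edges; two seeded class instances.
edges = {("cat", "dog"): 1.0, ("dog", "wolf"): 1.0, ("wolf", "paris"): 0.1,
         ("paris", "london"): 1.0}
print(propagate(edges, {"cat": "animal", "london": "city"}, ["animal", "city"]))
```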

    Proceedings of Monterey Workshop 2001 Engineering Automation for Software Intensive System Integration

    The 2001 Monterey Workshop on Engineering Automation for Software Intensive System Integration was sponsored by the Office of Naval Research, the Air Force Office of Scientific Research, the Army Research Office and the Defense Advanced Research Projects Agency. It is our pleasure to thank the workshop advisory board and sponsors for their vision of a principled engineering solution for software and for their tireless, multi-year effort in supporting the series of workshops that brings everyone together. This workshop is the 8th in a series of international workshops. It was held at the Monterey Beach Hotel, Monterey, California, during June 18-22, 2001. The general theme of the series has been to present and discuss research that aims at increasing the practical impact of formal methods for software and systems engineering. The particular focus of this workshop was "Engineering Automation for Software Intensive System Integration". Previous workshops have focused on issues including "Real-time & Concurrent Systems", "Software Merging and Slicing", "Software Evolution", "Software Architecture", "Requirements Targeting Software" and "Modeling Software System Structures in a fastly moving scenario".
    Approved for public release; distribution unlimited.

    Content based dissemination of XML data

    Ph.D. (Doctor of Philosophy)

    Model driven language engineering

    Modeling is a centrally important exercise in software engineering and development, and one of the current practices is object-oriented (OO) modeling. The Object Management Group (OMG) has defined a standard object-oriented modeling language, the Unified Modeling Language (UML). The OMG is not only interested in modeling languages; its primary aim is to enable easy integration of software systems and components using vendor-neutral technologies. This thesis investigates the possibilities for designing and implementing modeling frameworks and transformation languages that operate on models, and explores the validation of source and target models. Specifically, we focus on OO models used in OMG's Model Driven Architecture (MDA), which can be expressed in UML terms (e.g. classes and associations). The thesis presents the Kent Modeling Framework (KMF), a modeling framework that we developed, and describes how this framework can be used to generate a modeling tool from a model. It then proceeds to describe the customization of the generated code, in particular the definition of methods that allow a rapid and repeatable instantiation of a model. Model validation should include not only checking well-formedness using OCL constraints, but also evaluating model quality. Software metrics are a useful means of evaluating the quality of both software development processes and software products. As models are used to drive the entire software development process, it is unlikely that high-quality software will be obtained from low-quality models. The thesis presents a methodology, supported by KMF, that uses the UML specification to compute design metrics at an early stage of software development. The thesis also presents a transformation language called YATL (Yet Another Transformation Language), which was designed and implemented to support the features required by OMG's Request For Proposals and the future QVT standard. YATL is a hybrid language (a mix of declarative and imperative constructs) designed to answer the Query/Views/Transformations Request For Proposals issued by OMG and to express model transformations as required by the Model Driven Architecture (MDA) approach. Several examples of model transformations, implemented using YATL and the support provided by KMF, are presented. These experiments cover different knowledge areas, such as programming languages, visual diagrams and distributed systems. YATL was used to implement the following transformations (a toy sketch of the first appears after the list):
    * UML to Java mapping
    * Spider diagrams to OCL mapping
    * EDOC to Web Services
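
    As a toy illustration of the first transformation in the list, the general model-to-text idea behind a UML-to-Java mapping can be sketched in Python: walk a miniature class model and emit Java source. The UMLClass/UMLAttribute metamodel and the emitted Java shape are illustrative assumptions, not KMF's generated code or YATL's actual syntax.

```python
# Toy model transformation in the spirit of a UML-to-Java mapping:
# walk a tiny class model and emit Java source. The metamodel here
# (UMLClass/UMLAttribute) is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class UMLAttribute:
    name: str
    type: str            # e.g. "String", "int"

@dataclass
class UMLClass:
    name: str
    attributes: list[UMLAttribute] = field(default_factory=list)

def to_java(cls: UMLClass) -> str:
    # Each UML attribute becomes a private Java field.
    fields = "\n".join(f"    private {a.type} {a.name};" for a in cls.attributes)
    return f"public class {cls.name} {{\n{fields}\n}}"

model = UMLClass("Person", [UMLAttribute("name", "String"),
                            UMLAttribute("age", "int")])
print(to_java(model))
```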

    Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources

    The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. The corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology, in combination with Semantic Web sources such as DBpedia and OpenCyc, allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach.
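
    The corpus-based step can be sketched as follows, assuming Python: each relation is represented by a vector of counts of verbs co-occurring with it, and an unlabeled relation receives the label of the most cosine-similar labeled relation. The toy verbs and counts are illustrative assumptions, not data from the thesis.

```python
# Corpus-based relation labelling sketch: each relation is a vector of
# co-occurring-verb counts; an unlabeled relation gets the label of the
# most cosine-similar labeled relation. Toy counts are illustrative.
import math

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

labeled = {  # label -> verb co-occurrence counts
    "locatedIn": {"lies": 4, "sits": 2, "borders": 1},
    "authorOf":  {"wrote": 5, "published": 3},
}
unlabeled = {"lies": 1, "borders": 2, "published": 1}

best = max(labeled, key=lambda lbl: cosine(labeled[lbl], unlabeled))
print("suggested label:", best)   # -> locatedIn
```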

    Concept Trees: Building Dynamic Concepts from Semi-Structured Data using Nature-Inspired Methods

    This paper describes a method for creating structure from heterogeneous sources, as part of an information database or, more specifically, a 'concept base'. Structures called 'concept trees' can grow from the semi-structured sources when consistent sequences of concepts are presented. They might be considered dynamic databases, possibly a variation on distributed Agent-Based or Cellular Automata models, or even related to Markov models. Semantic comparison of text is required, but the trees can be built mostly from automatic knowledge and statistical feedback. This reduced model might also be attractive for security or privacy reasons, as not all of the potential data is saved. The construction process maintains the key requirement of generality, allowing it to be used as part of a generic framework. The nature of the method also means that some level of optimisation or normalisation of the information will occur. This invites comparison with databases or knowledge bases, but a database system would first model its environment or datasets and then populate the database with instance values. The concept base deals with a more uncertain environment and therefore cannot fully model it beforehand; the model itself evolves over time. Like a database, it also needs a good indexing system, where the construction process provides memory and indexing structures. These allow more complex concepts to be automatically created, stored and retrieved, possibly as part of a more cognitive model. There are also some arguments, or more abstract ideas, for merging physical-world laws into these automatic processes. Comment: Pre-print
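
    A minimal sketch of the growth rule described above, assuming Python: a branch is only materialised once the same concept sequence has been presented consistently, here approximated by a simple count threshold. The node structure and the threshold value are illustrative assumptions, not the paper's actual algorithm.

```python
# Concept-tree growth sketch: a child branch is only materialised once the
# same concept continuation has been seen a threshold number of times, so
# the tree reinforces consistent sequences and ignores one-off noise.
class ConceptNode:
    def __init__(self, concept):
        self.concept = concept
        self.children = {}   # concept -> ConceptNode
        self.pending = {}    # concept -> times this continuation was seen

    def present(self, sequence, threshold=2):
        if not sequence:
            return
        head, rest = sequence[0], sequence[1:]
        if head not in self.children:
            self.pending[head] = self.pending.get(head, 0) + 1
            if self.pending[head] < threshold:
                return                    # not consistent enough to grow yet
            self.children[head] = ConceptNode(head)
        self.children[head].present(rest, threshold)

root = ConceptNode("<root>")
for _ in range(4):
    root.present(["animal", "dog", "bark"])
root.present(["animal", "cat"])                   # seen once: still pending
print(sorted(root.children))                      # ['animal']
print(sorted(root.children["animal"].children))   # ['dog']
```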

    Combining SOA and BPM Technologies for Cross-System Process Automation

    This paper summarizes the results of an industry case study that introduced a cross-system business process automation solution based on a combination of SOA and BPM standard technologies (i.e., BPMN, BPEL, WSDL). Besides discussing major weaknesses of the existing custom-built solution and comparing them against experiences with the developed prototype, the paper presents a course of action for transforming the current solution into the proposed one. This includes a general approach, consisting of four distinct steps, as well as specific action items to be performed at every step. The discussion also covers language and tool support and the challenges arising from the transformation.