575 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-Based Systems
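
    As a concrete illustration of the wrapper-style techniques such surveys cover, here is a minimal sketch of a hand-written extractor, assuming Python with the third-party requests and BeautifulSoup libraries; the URL and the CSS selectors (div.listing, div.item, h2.title, span.price) are hypothetical placeholders, not taken from the survey.

```python
# Minimal wrapper-style extractor: pull structured records out of a
# listing page. The URL and selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def extract_records(url: str) -> list[dict]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.listing > div.item"):  # repeated page region
        title = item.select_one("h2.title")
        price = item.select_one("span.price")
        if title and price:                              # skip malformed items
            records.append({"title": title.get_text(strip=True),
                            "price": price.get_text(strip=True)})
    return records

if __name__ == "__main__":
    for rec in extract_records("https://example.com/listing"):
        print(rec)
```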

    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    The variety and complexity of potentially-related data resources available for querying (webpages, databases, data warehouses) has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most current IE and II methods that could be applied to the problem of integration across sources require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input via weakly-supervised methods for IE and II. In particular, we argue that graph-based representations of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration.

    Within IE, we focus on the problem of assigning semantic classes to entities. First, we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases.

    Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. Finally, we demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.
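
    Adsorption and MAD themselves are not reproduced here, but the generic iterative label-propagation idea they build on can be sketched in a few lines of Python: seed nodes keep their labels clamped while the remaining nodes repeatedly take the weighted average of their neighbours' label distributions. The graph, seeds and weights below are illustrative assumptions, not data from the thesis.

```python
# Minimal iterative label propagation on an undirected weighted graph:
# seed nodes keep their class distribution clamped; the other nodes
# repeatedly average their neighbours' distributions until stable.
from collections import defaultdict

def propagate(edges, seeds, classes, iters=50):
    # edges: {(u, v): weight}; seeds: {node: class_label}
    graph = defaultdict(dict)
    for (u, v), w in edges.items():
        graph[u][v] = w
        graph[v][u] = w
    # initial distributions: one-hot for seeds, zero elsewhere
    scores = {n: {c: 0.0 for c in classes} for n in graph}
    for n, c in seeds.items():
        scores[n][c] = 1.0
    for _ in range(iters):
        new = {}
        for n in graph:
            if n in seeds:            # clamp seed labels
                new[n] = scores[n]
                continue
            total = sum(graph[n].values())
            new[n] = {c: sum(graph[n][m] * scores[m][c] for m in graph[n]) / total
                      for c in classes}
        scores = new
    return scores

# Toy graph: entity similarity edges; two seeded class instances.
edges = {("cat", "dog"): 1.0, ("dog", "wolf"): 1.0, ("wolf", "paris"): 0.1,
         ("paris", "london"): 1.0}
print(propagate(edges, {"cat": "animal", "london": "city"}, ["animal", "city"]))
```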

    Proceedings of Monterey Workshop 2001 Engineering Automation for Software Intensive System Integration

    The 2001 Monterey Workshop on Engineering Automation for Software Intensive System Integration was sponsored by the Office of Naval Research, the Air Force Office of Scientific Research, the Army Research Office and the Defense Advanced Research Projects Agency. It is our pleasure to thank the workshop advisory board and sponsors for their vision of a principled engineering solution for software and for their tireless, multi-year effort in supporting the series of workshops that brings everyone together. This workshop is the 8th in a series of international workshops. It was held at the Monterey Beach Hotel, Monterey, California, during June 18-22, 2001. The general theme of the series has been to present and discuss research that aims at increasing the practical impact of formal methods for software and systems engineering. The particular focus of this workshop was "Engineering Automation for Software Intensive System Integration". Previous workshops have focused on issues including "Real-time & Concurrent Systems", "Software Merging and Slicing", "Software Evolution", "Software Architecture", "Requirements Targeting Software" and "Modeling Software System Structures in a fastly moving scenario".
    Approved for public release; distribution unlimited.

    Content based dissemination of XML data

    Ph.D. (Doctor of Philosophy)

    Model driven language engineering

    Modeling is a centrally important exercise in software engineering and development, and one of the current practices is object-oriented (OO) modeling. The Object Management Group (OMG) has defined a standard object-oriented modeling language, the Unified Modeling Language (UML). The OMG is not only interested in modeling languages; its primary aim is to enable easy integration of software systems and components using vendor-neutral technologies. This thesis investigates the possibilities for designing and implementing modeling frameworks and transformation languages that operate on models, and explores the validation of source and target models. Specifically, we focus on OO models used in OMG's Model Driven Architecture (MDA), which can be expressed in UML terms (e.g. classes and associations). The thesis presents the Kent Modeling Framework (KMF), a modeling framework that we developed, and describes how this framework can be used to generate a modeling tool from a model. It then proceeds to describe the customization of the generated code, in particular the definition of methods that allow a rapid and repeatable instantiation of a model. Model validation should include not only checking well-formedness using OCL constraints, but also evaluating model quality. Software metrics are a useful means of evaluating the quality of both software development processes and software products. As models are used to drive the entire software development process, it is unlikely that high-quality software will be obtained from low-quality models. The thesis presents a methodology, supported by KMF, that uses the UML specification to compute design metrics at an early stage of software development. The thesis also presents a transformation language called YATL (Yet Another Transformation Language), which was designed and implemented to support the features required by OMG's Request For Proposals and the future QVT standard. YATL is a hybrid language (a mix of declarative and imperative constructs) designed to answer the Query/Views/Transformations Request For Proposals issued by OMG and to express model transformations as required by the Model Driven Architecture (MDA) approach. Several examples of model transformations, implemented using YATL and the support provided by KMF, are presented. These experiments cover different knowledge areas, such as programming languages, visual diagrams and distributed systems. YATL was used to implement the following transformations (a toy sketch of the first appears after the list):
    * UML to Java mapping
    * Spider diagrams to OCL mapping
    * EDOC to Web Services
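
    As a toy illustration of the first transformation in the list, the general model-to-text idea behind a UML-to-Java mapping can be sketched in Python: walk a miniature class model and emit Java source. The UMLClass/UMLAttribute metamodel and the emitted Java shape are illustrative assumptions, not KMF's generated code or YATL's actual syntax.

```python
# Toy model transformation in the spirit of a UML-to-Java mapping:
# walk a tiny class model and emit Java source. The metamodel here
# (UMLClass/UMLAttribute) is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class UMLAttribute:
    name: str
    type: str            # e.g. "String", "int"

@dataclass
class UMLClass:
    name: str
    attributes: list[UMLAttribute] = field(default_factory=list)

def to_java(cls: UMLClass) -> str:
    # Each UML attribute becomes a private Java field.
    fields = "\n".join(f"    private {a.type} {a.name};" for a in cls.attributes)
    return f"public class {cls.name} {{\n{fields}\n}}"

model = UMLClass("Person", [UMLAttribute("name", "String"),
                            UMLAttribute("age", "int")])
print(to_java(model))
```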

    Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources

    The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. The corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology, in combination with Semantic Web sources such as DBpedia and OpenCyc, allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach.
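
    The corpus-based step can be sketched as follows, assuming Python: each relation is represented by a vector of counts of verbs co-occurring with it, and an unlabeled relation receives the label of the most cosine-similar labeled relation. The toy verbs and counts are illustrative assumptions, not data from the thesis.

```python
# Corpus-based relation labelling sketch: each relation is a vector of
# co-occurring-verb counts; an unlabeled relation gets the label of the
# most cosine-similar labeled relation. Toy counts are illustrative.
import math

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

labeled = {  # label -> verb co-occurrence counts
    "locatedIn": {"lies": 4, "sits": 2, "borders": 1},
    "authorOf":  {"wrote": 5, "published": 3},
}
unlabeled = {"lies": 1, "borders": 2, "published": 1}

best = max(labeled, key=lambda lbl: cosine(labeled[lbl], unlabeled))
print("suggested label:", best)   # -> locatedIn
```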

    Concept Trees: Building Dynamic Concepts from Semi-Structured Data using Nature-Inspired Methods

    This paper describes a method for creating structure from heterogeneous sources, as part of an information database or, more specifically, a 'concept base'. Structures called 'concept trees' can grow from the semi-structured sources when consistent sequences of concepts are presented. They might be considered dynamic databases, possibly a variation on distributed Agent-Based or Cellular Automata models, or even related to Markov models. Semantic comparison of text is required, but the trees can be built mostly from automatic knowledge and statistical feedback. This reduced model might also be attractive for security or privacy reasons, as not all of the potential data is saved. The construction process maintains the key requirement of generality, allowing it to be used as part of a generic framework. The nature of the method also means that some level of optimisation or normalisation of the information will occur. This invites comparison with databases or knowledge bases, but a database system would first model its environment or datasets and then populate the database with instance values. The concept base deals with a more uncertain environment and therefore cannot fully model it beforehand; the model itself evolves over time. Like a database, it also needs a good indexing system, where the construction process provides memory and indexing structures. These allow more complex concepts to be automatically created, stored and retrieved, possibly as part of a more cognitive model. There are also some arguments, or more abstract ideas, for merging physical-world laws into these automatic processes. Comment: Pre-print
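
    A minimal sketch of the growth rule described above, assuming Python: a branch is only materialised once the same concept sequence has been presented consistently, here approximated by a simple count threshold. The node structure and the threshold value are illustrative assumptions, not the paper's actual algorithm.

```python
# Concept-tree growth sketch: a child branch is only materialised once the
# same concept continuation has been seen a threshold number of times, so
# the tree reinforces consistent sequences and ignores one-off noise.
class ConceptNode:
    def __init__(self, concept):
        self.concept = concept
        self.children = {}   # concept -> ConceptNode
        self.pending = {}    # concept -> times this continuation was seen

    def present(self, sequence, threshold=2):
        if not sequence:
            return
        head, rest = sequence[0], sequence[1:]
        if head not in self.children:
            self.pending[head] = self.pending.get(head, 0) + 1
            if self.pending[head] < threshold:
                return                    # not consistent enough to grow yet
            self.children[head] = ConceptNode(head)
        self.children[head].present(rest, threshold)

root = ConceptNode("<root>")
for _ in range(4):
    root.present(["animal", "dog", "bark"])
root.present(["animal", "cat"])                   # seen once: still pending
print(sorted(root.children))                      # ['animal']
print(sorted(root.children["animal"].children))   # ['dog']
```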

    Combining SOA and BPM Technologies for Cross-System Process Automation

    This paper summarizes the results of an industry case study that introduced a cross-system business process automation solution based on a combination of SOA and BPM standard technologies (i.e., BPMN, BPEL, WSDL). Besides discussing major weaknesses of the existing custom-built solution and comparing them against experiences with the developed prototype, the paper presents a course of action for transforming the current solution into the proposed one. This includes a general approach, consisting of four distinct steps, as well as specific action items to be performed at every step. The discussion also covers language and tool support and the challenges arising from the transformation.