Search CORE

14,343 research outputs found

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014
Field of study

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

Towards Comparative Web Content Mining using Object Oriented Model

Author: Mutsuddy Titas
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2010
Field of study

Web content data are heterogeneous in nature; usually composed of different types of contents and data structure. Thus, extraction and mining of web content data is a challenging branch of data mining. Traditional web content extraction and mining techniques are classified into three categories: programming language based wrappers, wrapper (data extraction program) induction techniques, and automatic wrapper generation techniques. First category constructs data extraction system by providing some specialized pattern specification languages, second category is a supervised learning, which learns data extraction rules and third category is automatic extraction process. All these data extraction techniques rely on web document presentation structures, which need complicated matching and tree alignment algorithms, routine maintenance, hard to unify for vast variety of websites and fail to catch heterogeneous data together. To catch more diversity of web documents, a feasible implementation of an automatic data extraction technique based on object oriented data model technique, 00Web, had been proposed in Annoni and Ezeife (2009). This thesis implements, materializes and extends the structured automatic data extraction technique. We developed a system (called WebOMiner) for extraction and mining of structured web contents based on object-oriented data model. Thesis extends the extraction algorithms proposed by Annoni and Ezeife (2009) and develops an automata based automatic wrapper generation algorithm for extraction and mining of structured web content data. Our algorithm identifies data blocks from flat array data structure and generates Non-Deterministic Finite Automata (NFA) pattern for different types of content data for extraction. Objective of this thesis is to extract and mine heterogeneous web content and relieve the hard effort of matching, tree alignment and routine maintenance. Experimental results show that our system is highly effective and it performs the mining task with 100% precision and 96.22% recall value

Scholarship at UWindsor

Structured Audio Podcasts via Web Text-to-Speech System

Author: Buzzi Maria Claudia
Buzzi Marina
Leporini Barbara
Mori Giulio
Publication venue: Sheridan Printing
Publication date: 01/01/2010
Field of study

Audio podcasting is increasingly present in the educational field and is especially appreciated as an ubiquitous/pervasive tool (?anywhere, anytime, at any pace?) for acquiring or expanding knowledge. We designed and implemented a Web-based Text To Speech (TTS) system for automatic generation of a set of structured audio podcasts from a single text document. The system receives a document in input (doc, rtf, or txt), and in output provides a set of audio files that reflect the document?s internal structure (one mp3 file for each document section), ready to be downloaded on portable mp3 players. Structured audio files are useful for everyone but are especially appreciated by blind users, who must explore content audially. Fully accessible for the blind, our system offers WAI-ARIA-based Web interfaces for easy navigation and interaction via screen reader and voice synthesizer, and produces a set of accessible audio files for Rockbox mp3 players (mp3 and talk files), allowing blind users to also listen to naturally spoken file names (instead of their spelled-out strings). In this demo, we will show how the system works when a user interacts via screen reader and voice synthesizer, showing the interaction with both our Web-based system and with an mp3 player

PUblication MAnagement

Mapping Text to Knowledge using Natural Language Processing

Author: Bodnari Andreea
Publication venue: Digital WPI
Publication date: 11/03/2010
Field of study

The goal of this project was to design and implement a system that analyzes text corpora. This system uses natural language processing techniques to extract knowledge from written text and represents this knowledge as a network. The system displays this network to the user and allows the user to interactively explore the network. The accuracy of the knowledge extraction process and the overall performance of the developed system were assessed. Possible applications are in social networks and text simplification

DigitalCommons@WPI

The Interpretation of Tables in Texts

Author: Hurst Matthew Francis
Publication venue: University of Edinburgh
Publication date: 01/01/2000
Field of study

Edinburgh Research Archive

Automatic document classification and extraction system (ADoCES)

Author: Li Xuhong
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1999
Field of study

Document processing is a critical element of office automation. Document image processing begins from the Optical Character Recognition (OCR) phase with complex processing for document classification and extraction. Document classification is a process that classifies an incoming document into a particular predefined document type. Document extraction is a process that extracts information pertinent to the users from the content of a document and assigns the information as the values of the “logical structure” of the document type. Therefore, after document classification and extraction, a paper document will be represented in its digital form instead of its original image file format, which is called a frame instance. A frame instance is an operable and efficient form that can be processed and manipulated during document filing and retrieval. This dissertation describes a system to support a complete procedure, which begins with the scanning of the paper document into the system and ends with the output of an effective digital form of the original document. This is a general-purpose system with “learning” ability and, therefore, it can be adapted easily to many application domains. In this dissertation, the “logical closeness” segmentation method is proposed. A novel representation of document layout structure - Labeled Directed Weighted Graph (LDWG) and a methodology of transforming document segmentation into LDWG representation are described. To find a match between two LDWGs, string representation matching is applied first instead of doing graph comparison directly, which reduces the time necessary to make the comparison. Applying artificial intelligence, the system is able to learn from experiences and build samples of LDWGs to represent each document type. In addition, the concept of frame templates is used for the document logical structure representation. The concept of Document Type Hierarchy (DTH) is also enhanced to express the hierarchical relation over the logical structures existing among the documents

Digital Commons @ New Jersey Institute of Technology (NJIT)