9 research outputs found
Intelligent Self-Repairable Web Wrappers
The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with malfunctioning or failures. On the other hand, the literature offers few solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctioning or the acquisition of corrupted data can be caused, for example, by structural modifications that the owners make to the data sources. Today, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures that extract data from Web sources -- the so-called Web wrappers -- which can cope with malfunctioning caused by modifications to the structure of the data source and can automatically repair themselves.
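The self-repair idea can be illustrated with a minimal sketch (the function names and the repair heuristic here are invented for illustration and are not the paper's actual algorithm): the wrapper remembers the last value it extracted, and when its rule stops matching after a site redesign, it scans the new page for the element holding that remembered value and rebuilds a rule from it.

```python
import xml.etree.ElementTree as ET

def extract(page_source, rule, last_value):
    """Apply a limited-XPath rule; if it breaks, try to repair it.

    Hypothetical sketch: `rule` is an XPath subset understood by
    ElementTree; `last_value` is the value extracted on a previous run.
    """
    root = ET.fromstring(page_source)
    hits = root.findall(rule)
    if hits:  # wrapper still works
        return (hits[0].text or "").strip(), rule
    # Repair: locate the element that now holds the previously seen
    # value and rebuild a rule from its tag and class attribute.
    for el in root.iter():
        if (el.text or "").strip() == last_value:
            cls = el.get("class")
            new_rule = f".//{el.tag}[@class='{cls}']" if cls else f".//{el.tag}"
            return (el.text or "").strip(), new_rule
    return None, rule  # repair failed

# Example: the site changed <span id='price'> into <div class='price'>.
old_page = "<html><body><span id='price'>19.99</span></body></html>"
new_page = "<html><body><div class='price'>19.99</div></body></html>"
value, rule = extract(old_page, ".//span[@id='price']", None)
value2, rule2 = extract(new_page, rule, value)  # old rule fails, gets repaired
```

A real system would of course use fuzzier matching than exact text equality, since the extracted values themselves change over time.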
Sample-based XPath Ranking for Web Information Extraction
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. It is a wrapper induction approach that uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
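The core ranking idea can be sketched as follows (a simplified illustration, not the paper's method: candidate generation and scoring here are deliberately naive, and all names are invented). Candidate rules are enumerated from the sample pages and each is scored by how many (page, expected value) samples it extracts correctly; the top-ranked rule becomes the wrapper.

```python
import xml.etree.ElementTree as ET

def candidate_rules(page_source):
    # Enumerate simple candidates: by tag, and by tag plus one attribute.
    root = ET.fromstring(page_source)
    rules = set()
    for el in root.iter():
        rules.add(f".//{el.tag}")
        for attr, val in el.attrib.items():
            rules.add(f".//{el.tag}[@{attr}='{val}']")
    return rules

def rank_rules(samples):
    # samples: list of (page_source, expected_attribute_value) pairs.
    scores = {}
    for page, expected in samples:
        root = ET.fromstring(page)
        for rule in candidate_rules(page):
            hits = root.findall(rule)
            ok = len(hits) == 1 and (hits[0].text or "").strip() == expected
            scores[rule] = scores.get(rule, 0) + int(ok)
    # Rank by score, breaking ties in favour of the more specific rule.
    return sorted(scores.items(), key=lambda kv: (-kv[1], -len(kv[0])))

samples = [
    ("<html><body><h1>A</h1><span class='price'>10</span></body></html>", "10"),
    ("<html><body><h1>B</h1><span class='price'>25</span></body></html>", "25"),
]
best_rule, score = rank_rules(samples)[0]
```

With only a handful of samples, several rules may tie; this is why the paper's observation that 20 to 25 samples suffice matters in practice.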
A Knowledge Management Approach: Business Intelligence in an Intranet Data Warehouse
For contemporary businesses to stay viable, business intelligence is mission critical. Although the importance of business intelligence is recognised, there is limited research on what information contributes to business intelligence and how business intelligence is sought for use in an organisational intranet. This research discusses how business intelligence is sought, captured and used by tapping into an intranet data warehouse as a knowledge management approach. It adopts a qualitative case study method using interview and observation techniques. A case study was conducted to examine how an intranet system was designed, how business intelligence was captured, and how it aided strategic planning and decision making in business operations. The respondents explained how structured business intelligence data were categorised and disseminated to users and how the information empowered staff in their work performance. The intranet design successfully retains staff knowledge within the organisation. It was also successful in drawing all internal resources together, capturing resources from external sources, and forming a common repository of organisational assets for use through organisational work procedures within the intranet.
Framework for a Hospitality Big Data Warehouse: The Implementation of an Efficient Hospitality Business Intelligence System
In order to increase a hotel's competitiveness, maximize its revenue, improve its online reputation and strengthen customer relationships, information about the hotel's business has to be managed by adequate information systems (IS). Those IS should be capable of deriving knowledge from a necessarily large quantity of information, anticipating and influencing consumer behaviour. One way to manage this information is to develop a Big Data Warehouse (BDW), which includes information from internal sources (e.g., a Data Warehouse) and external sources (e.g., the competitive set and customers' opinions). This paper presents a framework for a Hospitality Big Data Warehouse (HBDW). The framework includes (1) a Web crawler that periodically accesses targeted websites to automatically extract information from them, and (2) a data model to organize and consolidate the collected data into the HBDW. Additionally, the usefulness of this HBDW for the development of business analytical tools is discussed, keeping in mind the implementation of business intelligence (BI) concepts.
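The crawl-then-consolidate flow can be sketched in a few lines (table and column names here are invented for the example and are not taken from the HBDW framework): crawled competitive-set rates and review scores are loaded alongside internal booking figures into a single fact table, which analytical tools can then query uniformly.

```python
import sqlite3

def load_into_warehouse(conn, records):
    # One consolidated fact table for both internal and crawled data.
    conn.execute("""CREATE TABLE IF NOT EXISTS hotel_facts (
                        day TEXT, source TEXT, metric TEXT, value REAL)""")
    conn.executemany("INSERT INTO hotel_facts VALUES (?, ?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
internal = [("2024-05-01", "pms",     "rooms_sold",   42.0)]
external = [("2024-05-01", "crawler", "compset_rate", 119.0),
            ("2024-05-01", "crawler", "review_score", 4.3)]
load_into_warehouse(conn, internal + external)

# A BI query can now combine internal and external sources transparently.
avg_rate, = conn.execute(
    "SELECT AVG(value) FROM hotel_facts WHERE metric='compset_rate'"
).fetchone()
```

A production BDW would use a distributed store rather than SQLite, but the consolidation step, i.e. mapping heterogeneous crawled records into one shared schema, is the same in spirit.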
State-of-the-art web data extraction systems for online business intelligence
The success of a company hinges on identifying and responding to competitive pressures. The main objective of online business intelligence is to collect valuable information from many Web sources to support decision making and thus gain competitive advantage. However, online business intelligence presents non-trivial challenges to Web data extraction systems, which must deal with technologically sophisticated modern Web pages where traditional manual programming approaches often fail. In this paper, we review commercially available state-of-the-art Web data extraction systems and their technological advances in the context of online business intelligence.

Keywords: online business intelligence, Web data extraction, Web scraping

Tomas Grigalis, Antanas Čenys
Summary: The success of a modern business organisation depends on its ability to respond appropriately to a constantly changing competitive environment. The main goal of an online business intelligence system is to collect valuable information from a multitude of different online sources and thereby help the organisation make sound decisions and gain a competitive advantage. However, collecting information from online sources is a difficult problem, as the collecting systems must work well with highly sophisticated web pages. This article reviews state-of-the-art web data extraction systems in the context of business intelligence. Concrete scenarios in which data extraction systems can support business intelligence are also presented. The authors conclude by discussing recent technological advances that have the potential to lead to fully automatic web data extraction systems, further improving business intelligence and considerably reducing its cost.
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.
Theory and Applications for Advanced Text Mining
Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and we can assume that these data contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge from the data. Even though many important techniques have been developed, the text mining research field continues to expand to meet needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They cover a variety of techniques, from relation extraction to text mining for under-resourced languages. I believe that this book will bring new knowledge to the text mining field and help many readers open up new research fields.