
    Intelligent Self-Repairable Web Wrappers

    The amount of information available on the Web grows at an incredibly high rate. Systems and procedures for extracting these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with possible malfunctions or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data can be caused, for example, by structural modifications that owners make to their data sources. Today, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures that extract data from Web sources -- so-called Web wrappers -- which can cope with malfunctions caused by modifications to the structure of the data source and can automatically repair themselves.
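    The self-repair idea described above can be sketched in a few lines: when a wrapper's stored XPath stops matching (e.g., after a site redesign), fall back to searching the tree for a node whose content resembles previously extracted data. This is only an illustration of the concept, not the paper's algorithm; it uses `xml.etree.ElementTree`, which supports a limited XPath subset, and the function names are hypothetical.

    ```python
    import xml.etree.ElementTree as ET

    def extract(page_source, xpath, expected_sample):
        """Return the text at `xpath`, or fall back to a tree-wide
        content search when the XPath no longer matches (the "repair")."""
        tree = ET.fromstring(page_source)
        node = tree.find(xpath)
        if node is not None and node.text:      # wrapper still works
            return node.text.strip()
        # Repair step: relocate the datum by content similarity; a real
        # system would then re-derive and persist a fresh XPath from the
        # matched node instead of just returning its text.
        for elem in tree.iter():
            text = (elem.text or "").strip()
            if text and expected_sample in text:
                return text
        return None
    ```

    In practice the similarity test would be fuzzier than a substring check (token overlap, edit distance, or learned features), but the control flow -- try the stored locator, detect failure, relocalize from known samples -- is the essence of a self-repairing wrapper.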

    Sample-based XPath Ranking for Web Information Extraction

    Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach that uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice to find a suitable XPath for an attribute.
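    The core of sample-based ranking can be illustrated simply: score each candidate XPath by how many of the known sample values it actually extracts from the page, and prefer the highest overlap. This is a minimal sketch under that assumption, not the paper's scoring function; it uses the limited XPath subset of `xml.etree.ElementTree`, and `rank_xpaths` is a hypothetical name.

    ```python
    import xml.etree.ElementTree as ET

    def rank_xpaths(page_source, candidate_xpaths, samples):
        """Rank candidate XPaths by how many known sample values each one
        extracts from the page; higher overlap = better wrapper candidate."""
        tree = ET.fromstring(page_source)
        scores = []
        for xp in candidate_xpaths:
            extracted = {(n.text or "").strip() for n in tree.findall(xp)}
            scores.append((len(extracted & set(samples)), xp))
        # Best-scoring XPath first.
        return [xp for overlap, xp in sorted(scores, reverse=True)]
    ```

    A production ranker would also penalize XPaths that over-extract (match many non-sample values) and tolerate near-matches, but counting exact overlap already shows why a couple of dozen samples can be enough to separate good locators from bad ones.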

    A Knowledge Management Approach: Business Intelligence in an Intranet Data Warehouse

    For contemporary businesses to stay viable, business intelligence is mission critical. Although the importance of business intelligence is recognised, there is limited research on what information contributes to business intelligence and how business intelligence is sought for use in an organisational intranet. This research discusses how business intelligence is sought, captured and used by tapping into an intranet data warehouse as a knowledge management approach. It adopts a qualitative case study method using interview and observation techniques. A case study was conducted to examine how an intranet system was designed, how business intelligence was captured, and how it aided strategic planning and decision making in business operations. The respondents explained how structured business intelligence data was categorised and disseminated to users and how the information empowered staff in their work performance. The intranet design successfully retains staff knowledge within the organisation. It was also successful in drawing all internal resources together, capturing resources from external sources, and forming a common repository of organisational assets for use through organisational work procedures within the intranet.

    Framework for a Hospitality Big Data Warehouse: The Implementation of an Efficient Hospitality Business Intelligence System

    In order to increase the hotel's competitiveness, to maximize its revenue, to improve its online reputation and to strengthen customer relationships, the information about the hotel's business has to be managed by adequate information systems (IS). Those IS should be capable of deriving knowledge from a necessarily large quantity of information, anticipating and influencing consumer behaviour. One way to manage the information is to develop a Big Data Warehouse (BDW), which includes information from internal sources (e.g., Data Warehouse) and external sources (e.g., competitive set and customers' opinions). This paper presents a framework for a Hospitality Big Data Warehouse (HBDW). The framework includes (1) a Web crawler that periodically accesses targeted websites to automatically extract information from them, and (2) a data model to organize and consolidate the collected data into an HBDW. Additionally, the usefulness of this HBDW for the development of business analytical tools is discussed, keeping in mind the implementation of business intelligence (BI) concepts.
    Funding: SRM QREN IDT [38962]; FCT projects LARSyS [UID/EEA/50009/2013], CIAC [PEstOE/EAT/UI4019/2013], CEFAGE [PEst-C/EGE/UI4007/2013]; CEG-IST, Universidade de Lisboa.
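    The first component of the framework, a crawler that periodically visits targeted sites and stages records for the warehouse, can be sketched as below. This is a generic illustration, not the paper's implementation: `crawl` is a hypothetical name, and the fetch and parse steps are injected as functions so the sketch stays self-contained (in practice `fetch` would wrap `urllib.request` or similar and `parse` a page-specific extractor).

    ```python
    import json
    import time

    def crawl(targets, fetch, parse):
        """Visit each target URL, parse the response into records, and
        stamp each record with its source and collection time so the
        warehouse can later consolidate and version the data."""
        staged = []
        for url in targets:
            raw = fetch(url)            # network I/O in a real system
            for record in parse(raw):   # page-specific extraction logic
                record["source"] = url
                record["collected_at"] = time.time()
                staged.append(record)
        return staged
    ```

    Tagging every record with provenance and a timestamp at collection time is what lets the data model on the warehouse side reconcile internal sources with periodically re-crawled external ones (competitor rates, review scores) without losing history.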

    State-of-the-art web data extraction systems for online business intelligence

    The success of a company hinges on identifying and responding to competitive pressures. The main objective of online business intelligence is to collect valuable information from many Web sources to support decision making and thus gain competitive advantage. However, online business intelligence presents non-trivial challenges to Web data extraction systems, which must deal with technologically sophisticated modern Web pages where traditional manual programming approaches often fail. In this paper, we review commercially available state-of-the-art Web data extraction systems and their technological advances in the context of online business intelligence.
    Keywords: online business intelligence, Web data extraction, Web scraping
    Summary (translated from Lithuanian) — Tomas Grigalis, Antanas Čenys: The success of a modern business organisation depends on its ability to react appropriately to a constantly changing competitive environment. The main goal of an online business intelligence system is to collect valuable information from a multitude of different online sources and thereby help the organisation make sound decisions and gain a competitive advantage. However, collecting information from online sources is a difficult problem, as the collecting systems must work well with technologically highly sophisticated web pages. This article reviews the most advanced web data extraction systems in the context of business intelligence. It also presents concrete scenarios in which data extraction systems can support business analytics. Finally, the authors discuss recent technological advances that have the potential to lead to fully automatic web data extraction systems, further improving business intelligence while considerably reducing its costs.

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.
    Comment: Knowledge-based System

    Theory and Applications for Advanced Text Mining

    Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and we can expect that these data contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge from the data. Even though many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to under-resourced languages. I believe that this book will bring new knowledge to the text mining field and help many readers open up new research areas.